Tidyverse Exam v2.0

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Basic operations

Question 1

Read the file person.csv and store the result in a tibble called person.

person <- read_csv("https://education.rstudio.com/blog/2020/08/more-example-exams/person.csv")

## Parsed with column specification:
## cols(
##   person_id = col_character(),
##   personal_name = col_character(),
##   family_name = col_character()
## )

class(person)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Question 2

Create a tibble containing only family and personal names, in that order. You do not need to assign this tibble or any others to variables unless explicitly asked to do so. However, as noted in the introduction, you must use the pipe operator %>% and code that follows the tidyverse style guide.

# View(person)

person %>%
  select(family_name, personal_name)

Question 3

Create a new tibble containing only the rows in which family names come before the letter M. Your solution should work for tables with more rows than the example, i.e., you cannot rely on row numbers or select specific names.

person %>%
  arrange(family_name) %>%
  filter(family_name < "M")
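Comparing strings with `<` follows the collation rules of the current locale, so names that start with a lowercase letter may be handled differently from one machine to another. A minimal sketch of a variant that avoids this, assuming a made-up tibble `toy_person` (not the exam data), checks the uppercased first letter against A through L:

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical data for illustration only
toy_person <- tibble(
  family_name = c("Dyer", "pabodie", "lake", "Roerich", "Moss")
)

# Keep rows whose first letter, ignoring case, is one of A..L
before_m <- toy_person %>%
  filter(str_to_upper(str_sub(family_name, 1, 1)) %in% LETTERS[1:12])

before_m
```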

Question 4

Display all the rows in person sorted by family name length with the longest name first.

person %>%
  arrange(desc(str_length(family_name)))

Cleaning and counting

Question 1

Read the file measurements.csv to create a tibble called measurements. (The strings “rad”, “sal”, and “temp” in the quantity column stand for “radiation”, “salinity”, and “temperature” respectively.)

measurements <- read_csv("https://education.rstudio.com/blog/2020/08/more-example-exams/measurements.csv")

## Parsed with column specification:
## cols(
##   visit_id = col_double(),
##   visitor = col_character(),
##   quantity = col_character(),
##   reading = col_double()
## )

Question 2

Create a tibble containing only rows where none of the values are NA and save in a tibble called cleaned.

cleaned <-
  measurements %>%
  filter(!is.na(visit_id), !is.na(visitor), !is.na(quantity), !is.na(reading))

# other option: use na.omit(measurements)
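Another tidyverse option is `tidyr::drop_na()`, which drops every row with an NA in any column without listing the columns by hand. A small sketch, assuming a made-up tibble `toy_measurements` rather than the exam file:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical measurements for illustration only
toy_measurements <- tibble(
  visit_id = c(1, 2, NA, 4),
  visitor  = c("dyer", NA, "lake", "roe"),
  quantity = c("rad", "sal", "temp", "sal"),
  reading  = c(9.8, 0.1, -16, NA)
)

# drop_na() with no column arguments checks every column
toy_cleaned <- toy_measurements %>%
  drop_na()

nrow(toy_cleaned)
```

Only the first row has no missing values, so it is the only one kept.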

Question 3

Count the number of measurements of each type of quantity in cleaned. Your result should have one row for each quantity "rad", "sal", and "temp".

cleaned %>%
  group_by(quantity) %>%
  summarize(n())

## `summarise()` ungrouping output (override with `.groups` argument)

# other option: use count()
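As a sketch of the `count()` option, assuming a made-up tibble `toy_cleaned` (not the exam data): `count()` groups, tallies, and ungroups in one step and names the result column `n`.

```r
library(dplyr)
library(tibble)

# Hypothetical data for illustration only
toy_cleaned <- tibble(
  quantity = c("rad", "sal", "rad", "temp", "sal", "rad"),
  reading  = c(9.8, 0.1, 7.2, -16, 0.2, 4.3)
)

# One row per quantity, with the number of rows in column n
counts <- toy_cleaned %>%
  count(quantity)

counts
```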

Question 4

Display the minimum and maximum value of reading separately for each quantity in cleaned. Your result should have one row for each quantity "rad", "sal", and "temp".

cleaned %>%
  group_by(quantity) %>%
  summarize(min(reading), max(reading))

## `summarise()` ungrouping output (override with `.groups` argument)
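Unnamed summaries produce column names such as `min(reading)`, which are awkward to use in later steps. A minimal sketch of the same summary with explicit names, assuming a made-up tibble `toy_cleaned` and the hypothetical column names `min_reading` and `max_reading`:

```r
library(dplyr)
library(tibble)

# Hypothetical data for illustration only
toy_cleaned <- tibble(
  quantity = c("rad", "rad", "sal", "sal", "temp"),
  reading  = c(9.8, 1.5, 0.1, 0.3, -16)
)

# Name each summary so the result has predictable column names
ranges <- toy_cleaned %>%
  group_by(quantity) %>%
  summarize(min_reading = min(reading), max_reading = max(reading))

ranges
```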

Question 5

Create a tibble in which all salinity ("sal") readings greater than 1 are divided by 100. (This is needed because some people wrote percentages as numbers from 0.0 to 1.0, but others wrote them as 0.0 to 100.0.)

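measurements %>%
  filter(quantity == "sal") %>%
  mutate(new_reading = if_else(reading > 1, reading / 100, reading))

# other option: keep every row and rescale only the salinity readings above 1
measurements %>%
  mutate(reading = case_when(
    quantity == "sal" & reading > 1 ~ reading / 100,
    TRUE ~ reading
  ))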
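To see the rule from this question in action, here is a small sketch with made-up values (not the exam data). It rescales only the salinity readings above 1 and leaves every other row unchanged:

```r
library(dplyr)
library(tibble)

# Hypothetical data for illustration only
toy <- tibble(
  quantity = c("sal", "sal", "rad", "temp"),
  reading  = c(0.05, 41.6, 9.8, 35)
)

# Only "sal" rows with reading > 1 are divided by 100
rescaled <- toy %>%
  mutate(reading = case_when(
    quantity == "sal" & reading > 1 ~ reading / 100,
    TRUE ~ reading
  ))

rescaled
```

The second salinity reading becomes 0.416, while the radiation and temperature readings keep their original values even though they are greater than 1.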

Combining data

Question 1

Read visited.csv and drop rows containing any NAs, assigning the result to a new tibble called visited.

visited <-
  read_csv("https://education.rstudio.com/blog/2020/08/more-example-exams/visited.csv") %>%
  filter(!is.na(visit_id), !is.na(site_id), !is.na(visit_date))

## Parsed with column specification:
## cols(
##   visit_id = col_double(),
##   site_id = col_character(),
##   visit_date = col_date(format = "")
## )

Question 2

Use an inner join to combine visited with cleaned using the visit_id column for matches.

inner_join(visited, cleaned, by = "visit_id")
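An inner join keeps only the rows whose key appears in both tables. A minimal sketch, assuming two made-up tibbles `toy_visited` and `toy_cleaned` (not the exam data):

```r
library(dplyr)
library(tibble)

# Hypothetical tables for illustration only
toy_visited <- tibble(
  visit_id = c(1, 2, 3),
  site_id  = c("A", "A", "B")
)
toy_cleaned <- tibble(
  visit_id = c(1, 1, 3, 5),
  quantity = c("rad", "sal", "rad", "temp"),
  reading  = c(9.8, 0.1, 7.2, -16)
)

# visit 2 has no measurements and visit 5 has no site, so both drop out
joined <- toy_visited %>%
  inner_join(toy_cleaned, by = "visit_id")

joined
```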

Question 3

Find the highest radiation ("rad") reading at each site. (Sites are identified by values in the site_id column.)

inner_join(visited, cleaned, by = "visit_id") %>%
  filter(quantity == "rad") %>%
  group_by(site_id) %>%
  summarize(max(reading))

## `summarise()` ungrouping output (override with `.groups` argument)

Question 4

Find the date of the highest radiation reading at each site.

inner_join(visited, cleaned, by = "visit_id") %>%
  filter(quantity == "rad") %>%
  group_by(site_id) %>%
  filter(reading == max(reading)) %>%
  ungroup() %>%
  select(site_id, visit_date, reading)
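The same result can be sketched with `slice_max()`, which keeps the row with the largest value in each group. The example below assumes a made-up tibble `toy_joined` (not the exam data). It also shows why the readings need to be restricted to "rad" first: otherwise the large salinity value at site B would win.

```r
library(dplyr)
library(tibble)

# Hypothetical joined data for illustration only
toy_joined <- tibble(
  site_id    = c("A", "A", "B", "B"),
  visit_date = as.Date(c("1927-02-08", "1927-02-10", "1930-01-07", "1930-01-12")),
  quantity   = c("rad", "rad", "rad", "sal"),
  reading    = c(1.5, 7.2, 4.3, 99)
)

# Highest radiation reading per site, together with its date
top_rad <- toy_joined %>%
  filter(quantity == "rad") %>%
  group_by(site_id) %>%
  slice_max(reading, n = 1) %>%
  ungroup() %>%
  select(site_id, visit_date, reading)

top_rad
```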

Plotting

Question 1

The code below is supposed to read the file home-range-database.csv to create a tibble called hra_raw, but contains a bug. Describe and fix the problem. (There are several ways to fix it: please use whichever you prefer.)

hra_raw <- read_csv(here::here("data", "home-range-database.csv"))

From the documentation, here::here() builds a file path relative to the root of the current project, so this call looks for data/home-range-database.csv inside the project directory. My project has no data folder and no home-range-database.csv, so read_csv() fails because the file does not exist. One fix is to create a data folder in the project and save home-range-database.csv there. Below I read the CSV from the provided URL instead.

hra_raw <- read_csv("https://education.rstudio.com/blog/2020/08/more-example-exams/home-range-database.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   mean.mass.g = col_double(),
##   log10.mass = col_double(),
##   mean.hra.m2 = col_double(),
##   log10.hra = col_double(),
##   preymass = col_double(),
##   log10.preymass = col_double(),
##   PPMR = col_double()
## )

## See spec(...) for full column specifications.
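The two fixes can be combined: read the local copy when it exists and fall back to the URL otherwise. The sketch below uses base R only, and `resolve_data_source()` is a hypothetical helper that is not part of the exam. It builds the local path with `file.path()`, which is relative to the working directory, whereas here::here() resolves paths from the project root.

```r
# Hypothetical helper: prefer a local file, otherwise use the fallback URL
resolve_data_source <- function(local_path, fallback_url) {
  if (file.exists(local_path)) {
    local_path
  } else {
    fallback_url
  }
}

# Assumes the file would live in a "data" folder under the working directory
local_path <- file.path("data", "home-range-database.csv")
fallback_url <- "https://education.rstudio.com/blog/2020/08/more-example-exams/home-range-database.csv"

source_path <- resolve_data_source(local_path, fallback_url)
source_path
```

The resulting `source_path` can then be passed to read_csv() in place of the hard-coded location.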

Question 2

Convert the class column (which is text) to create a factor column class_fct and assign the result to a tibble hra. Use forcats to order the factor levels as:

mammalia
reptilia
aves
actinopterygii

hra <-
  hra_raw %>%
  mutate(class_fct = fct_relevel(class,
                                 "mammalia", "reptilia", "aves", "actinopterygii"))
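Since the question asks for forcats, one option is `fct_relevel()`, which accepts a character vector or a factor and moves the named levels to the front in the order given. A minimal sketch on a made-up vector `toy_class` (not the exam data):

```r
library(forcats)

# Hypothetical class labels for illustration only
toy_class <- c("aves", "mammalia", "actinopterygii", "reptilia", "aves")

# Levels follow the order of the arguments, not alphabetical order
class_fct <- fct_relevel(
  toy_class,
  "mammalia", "reptilia", "aves", "actinopterygii"
)

levels(class_fct)
```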

Question 3

Create a scatterplot showing the relationship between log10.mass and log10.hra in hra.

ggplot(hra, aes(x = log10.mass, y = log10.hra)) +
  geom_point()

Question 4

Colorize the points in the scatterplot by class_fct.

ggplot(hra, aes(x = log10.mass, y = log10.hra)) +
  geom_point(aes(color = class_fct))

Question 5

Display a scatterplot showing only data for birds (class aves) and fit a linear regression to that data using the lm function.

hra %>% 
  filter(class == "aves") %>%
  ggplot(aes(x = log10.mass, y = log10.hra)) +
  geom_point(aes(color = class_fct)) +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

Functional programming

Question 1

Write a function called summarize_table that takes a title string and a tibble as input and returns a string that says something like, “title has # rows and # columns”. For example, summarize_table('our table', person) should return the string "our table has 5 rows and 3 columns".

summarize_table <- function(title, tibble) {
  num_rows <- nrow(tibble)
  num_cols <- ncol(tibble)
  # return the string (the last expression) rather than printing it
  str_c(title, "has", num_rows,
        "rows and", num_cols, "columns", sep = " ")
}

summarize_table("HRA dataset", hra)

## [1] "HRA dataset has 566 rows and 25 columns"

Question 2

Write another function called show_columns that takes a string and a tibble as input and returns a string that says something like, “table has columns name, name, name”. For example, show_columns('person', person) should return the string "person has columns person_id, personal_name, family_name".

show_columns <- function(title, tibble) {
  col_names <- names(tibble)
  col_names_collapsed <- str_c(col_names, collapse = ", ")
  # return the string (the last expression) rather than printing it
  str_c(title, "has columns",
        col_names_collapsed, sep = " ")
}

show_columns("HRA", hra)

## [1] "HRA has columns taxon, common.name, class, order, family, genus, species, primarymethod, N, mean.mass.g, log10.mass, alternative.mass.reference, mean.hra.m2, log10.hra, hra.reference, realm, thermoregulation, locomotion, trophic.guild, dimension, preymass, log10.preymass, PPMR, prey.size.reference, class_fct"

Question 3

The function rows_from_file returns the first N rows from a table in a CSV file given the file’s name and the number of rows desired. Modify it so that if no value is specified for the number of rows, a default of 3 is used.

# give num_rows a default value of 3 so callers can leave it out
# other option: leave num_rows without a default and test it with missing(), see
# https://www.r-bloggers.com/2015/08/function-argument-lists-and-missing/

rows_from_file <- function(filename, num_rows = 3) {
  readr::read_csv(filename) %>%
    head(n = num_rows)
}

# should show 3 rows
rows_from_file("https://education.rstudio.com/blog/2020/08/more-example-exams/measurements.csv")

## Parsed with column specification:
## cols(
##   visit_id = col_double(),
##   visitor = col_character(),
##   quantity = col_character(),
##   reading = col_double()
## )
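A default argument value is the most direct way to get this behavior. The sketch below checks it against a made-up CSV written to a temporary file, so `toy_file` and its contents are assumptions and not part of the exam:

```r
library(readr)

# Sketch: num_rows falls back to 3 when the caller omits it
rows_from_file <- function(filename, num_rows = 3) {
  head(read_csv(filename), n = num_rows)
}

# Hypothetical 10-row CSV for illustration only
toy_file <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:10, value = (1:10) * 2), toy_file)

default_rows <- rows_from_file(toy_file)            # uses the default of 3
five_rows <- rows_from_file(toy_file, num_rows = 5) # overrides the default

nrow(default_rows)
nrow(five_rows)
```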

Question 4

The function long_name checks whether a string is longer than 4 characters. Use this function and a function from purrr to create a logical vector that contains the value TRUE where family names in the tibble person are longer than 4 characters, and FALSE where they are 4 characters or less.

    long_name <- function(name) {
      stringr::str_length(name) > 4
    }

person$family_name %>% map_lgl(long_name)

## [1] FALSE  TRUE FALSE  TRUE  TRUE
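Because `str_length()` is vectorized, `long_name()` also works on a whole vector at once. The question asks for purrr, so `map_lgl()` is the expected answer, but the sketch below, using made-up names rather than the exam data, suggests that both routes give the same logical vector:

```r
library(purrr)
library(stringr)

long_name <- function(name) {
  str_length(name) > 4
}

# Hypothetical names for illustration only
toy_names <- c("Ng", "Okafor", "Silva")

via_map <- map_lgl(toy_names, long_name)  # one call per element
via_vector <- long_name(toy_names)        # one vectorized call

identical(via_map, via_vector)
```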

Wrapping up

Modify the YAML header of this file so that a table of contents is automatically created each time this document is knit, and fix any errors that are preventing the document from knitting cleanly.

---
title: "Tidyverse Exam Version 2.0"
output:
html_document:
    theme: flatly
---

---
title: "Tidyverse Exam Version 2.0"
output:
  html_document: # this was indented
    theme: flatly
    toc: true    # this was added
---

Solutions for August 2020 Sample Exam

Silvia Canelón
