Lab 06: Intro to dplyr

The Inconvenience of Subsetting

Up until now, we’ve used base R to create and subset data. An example:

sw_males <- starwars[
    starwars$sex == "male",
    !(names(starwars) %in% c("sex", "gender"))
]

What does this do?

It’s reasonable to find this difficult to read.

In general, we prefer to be more expressive with our programming.

A Lack of Expression…

Suppose we wanted to evaluate whether some condition is true for any of the elements in a vector.

heights <- starwars$height
length(heights[heights > 150]) > 0
[1] TRUE

What about this instead?

sum(as.numeric(heights > 150), na.rm = TRUE) > 0
[1] TRUE

Both are awful: the burden of reading and writing them is too great for so trivial a task.

Importantly, the code doesn’t convey what it does. Instead, it’s the reader’s responsibility to make sense of the computations.

Expressiveness

Ideally, we’d prefer for our code to say what it does directly. This makes it easier to read and write.

any(heights > 150)
[1] TRUE
all(heights > 20, na.rm = TRUE)
[1] TRUE

In both cases, you know what it does immediately.

There’s no need to make sense of confusing computations.
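One detail worth keeping in mind is how any() and all() treat missing values, since starwars$height contains NAs. The sketch below uses a small made-up vector (toy_heights is hypothetical, not taken from starwars) to illustrate why only the all() call above needs na.rm = TRUE.

```r
# A made-up vector with one missing value (hypothetical data for illustration).
toy_heights <- c(172, 96, NA, 202)

# A single TRUE settles any(), so the NA doesn't matter here.
any_tall <- any(toy_heights > 150)

# No FALSE appears, but the NA could hide one, so all() returns NA.
all_positive <- all(toy_heights > 20)

# Dropping the NAs first gives a definite answer.
all_positive_known <- all(toy_heights > 20, na.rm = TRUE)

print(c(any_tall, all_positive, all_positive_known))
```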

Expressiveness with Data Frames

The package dplyr provides us with functions (“a grammar”) that make our data frame manipulations more expressive.

Let’s consider some common data frame tasks and what verbs dplyr provides to solve them.

slice

Suppose we want some subset of the rows. In base R,

starwars[seq(1, nrow(starwars), 2),]

dplyr provides us with the slice function.

slice(starwars, seq(1, nrow(starwars), 2))

We can provide multiple vectors.

slice(starwars, 1:5, 10:15, 17, 18)

Note that slice works only with indices. Unlike base R subsetting, you can’t use strings or logical values.
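As a rough sketch of what else row positions allow, the example below drops a row with a negative index and takes rows from either end with slice_head() and slice_tail(). It assumes dplyr 1.0.0 or later, where those two helpers are available; the variable names are only for illustration.

```r
library(dplyr)

# Negative indices drop rows instead of keeping them.
without_first <- slice(starwars, -1)

# slice_head() and slice_tail() take rows from the start and the end.
first_three <- slice_head(starwars, n = 3)
last_three <- slice_tail(starwars, n = 3)

nrow(starwars) - nrow(without_first)
first_three$name
```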

select

Suppose we want to select some of the columns. In base R,

starwars[,c("height", "mass")]

dplyr provides us with the select function.

select(starwars, c("height", "mass"))

Alternatively,

select(starwars, "height", "mass")

There’s no need to provide the columns as strings.

select(starwars, height, mass)
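Because select understands column names directly, a few other forms become possible. The following is a minimal sketch, assuming the unmodified starwars data that ships with dplyr; the result names are hypothetical.

```r
library(dplyr)

# A minus sign drops columns; everything else is kept.
no_sex_gender <- select(starwars, -c(sex, gender))

# A colon selects a run of adjacent columns, here name through mass.
name_to_mass <- select(starwars, name:mass)

# Columns can be renamed while they are selected.
renamed <- select(starwars, character = name, height_cm = height)

names(name_to_mass)
names(renamed)
```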

filter

Suppose we want to filter the rows by some condition. In base R,

starwars[starwars$height > 100,]

dplyr provides us with the filter function.

filter(starwars, height > 100)  # no need to specify starwars$height

We can easily combine conditions with the logical operators.

filter(starwars, height > 100 & mass > 100 & sex == "male")
filter(starwars, is.na(hair_color) | species == "Droid")
filter(starwars, !(eye_color == "brown" & skin_color == "blue"))
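One difference from base R subsetting is worth seeing once: when the condition evaluates to NA, base R returns a row of NAs, whereas filter keeps only the rows where the condition is TRUE. The sketch below uses a made-up data frame (toy is hypothetical) so the effect is easy to see.

```r
library(dplyr)

# A made-up data frame with one missing height (hypothetical data).
toy <- data.frame(name = c("A", "B", "C"), height = c(180, NA, 90))

# Base R: the NA in the condition produces an extra row filled with NAs.
base_result <- toy[toy$height > 100, ]

# dplyr: only rows where the condition is TRUE are kept.
dplyr_result <- filter(toy, height > 100)

nrow(base_result)   # 2
nrow(dplyr_result)  # 1
```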

mutate

Suppose we want to add new columns. In base R,

starwars$is_chosen_one <- starwars$name == "Darth Vader"

dplyr provides us with the function mutate.

mutate(starwars, is_chosen_one = name == "Darth Vader")

Wait, did that change starwars?

No, it returned a new, mutated data frame. To modify starwars itself, we must reassign the result.

starwars <- mutate(starwars, is_chosen_one = name == "Darth Vader")

If you assign to an existing column name, that column is replaced. Set a column to NULL to delete it.

starwars <- mutate(starwars, hair_color = NULL)
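mutate can also create several columns in one call, and later columns may refer to earlier ones. Here is a small sketch; height_m and bmi are hypothetical column names, and the result is stored under a new name so starwars is left untouched.

```r
library(dplyr)

with_bmi <- mutate(
    starwars,
    height_m = height / 100,    # height is recorded in centimetres
    bmi = mass / height_m^2     # uses height_m, created on the line above
)

head(select(with_bmi, name, height_m, bmi), 3)
```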

summarise/summarize

We can use summarise to obtain summary statistics on the data frame (in the form of a new data frame).

summarise(
    starwars,  # what we are summarising
    avg_height = mean(height, na.rm = TRUE),
    avg_mass = mean(mass, na.rm = TRUE),
    avg_birth_year = mean(birth_year, na.rm = TRUE)
)
# A tibble: 1 × 3
  avg_height avg_mass avg_birth_year
       <dbl>    <dbl>          <dbl>
1       174.     97.3           87.6

But what if we wanted to get these values for each group (say, each sex)?

Would we have to filter the data frame once per group, then call summarise on each piece?

group_by

We can use group_by to form a new table which is grouped by the values in particular columns. The functions you use will be applied to each group separately, after which dplyr combines the results into a single table.

summarise(
    group_by(starwars, sex),
    avg_height = mean(height, na.rm = TRUE),
    avg_mass = mean(mass, na.rm = TRUE),
    avg_birth_year = mean(birth_year, na.rm = TRUE)
)
# A tibble: 5 × 4
  sex            avg_height avg_mass avg_birth_year
  <chr>               <dbl>    <dbl>          <dbl>
1 female               169.     54.7           47.2
2 hermaphroditic       175    1358            600  
3 male                 179.     81.0           85.5
4 none                 131.     69.8           53.3
5 <NA>                 181.     48             62  
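Grouped summaries are also a convenient way to count rows. The sketch below uses n(), which returns the size of the current group, and compares it with count(), a shortcut for the same pattern; the column name n_known_height is hypothetical.

```r
library(dplyr)

by_sex <- summarise(
    group_by(starwars, sex),
    n = n(),                              # rows in each group
    n_known_height = sum(!is.na(height))  # rows with a recorded height
)

# count() wraps the group_by() + summarise(n = n()) pattern.
counts <- count(starwars, sex)

by_sex
```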

Much More

I’ve only introduced the basics of these functions. Enter ?<function> (for example, ?select) in the console to explore the rest.

One neat example is the selection helper ends_with:

head(select(starwars, ends_with("color")))
# A tibble: 6 × 2
  skin_color  eye_color
  <chr>       <chr>    
1 fair        blue     
2 gold        yellow   
3 white, blue red      
4 white       yellow   
5 light       brown    
6 light       blue     
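ends_with has several siblings. As a brief sketch, assuming dplyr 1.0.0 or later (needed for where()) and the unmodified starwars data, a few of them look like this:

```r
library(dplyr)

# starts_with() and contains() match column names by prefix and by substring.
s_cols <- names(select(starwars, starts_with("s")))
underscore_cols <- names(select(starwars, contains("_")))

# where() picks columns by a test on their contents rather than their names.
numeric_cols <- names(select(starwars, where(is.numeric)))

print(numeric_cols)
```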

An Aside: Piping

Piping will be explored in greater detail later.

Suppose we have this code:

f(g(h()))

In what order are the functions f(), g(), and h() called?

h(), then g(), then f(). We’ve written it backwards!

To make nested functions easier to read and write, we use the pipe operator.

a() %>% b() becomes b(a()).

That is, the left operand is passed as the first argument to the function call on the right.

3 %>% sum(2)
[1] 5
3 %>% sum(2) %>% sum(5)
[1] 10

This allows us to rewrite f(g(h())) as

h() %>% g() %>% f()

That is, in the logical order of the function calls.
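The same idea applies to the dplyr verbs, since each takes a data frame as its first argument. The following sketch writes one grouped summary both ways; it relies on dplyr re-exporting %>%, so loading dplyr is enough here.

```r
library(dplyr)  # dplyr re-exports %>% from magrittr

# Nested: read from the inside out.
nested <- summarise(
    group_by(filter(starwars, !is.na(height)), sex),
    avg_height = mean(height)
)

# Piped: read from top to bottom.
piped <- starwars %>%
    filter(!is.na(height)) %>%
    group_by(sex) %>%
    summarise(avg_height = mean(height))

all.equal(nested, piped)
```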

The %>% symbol comes from the package magrittr.

# To use...
install.packages("magrittr")
library(magrittr)

If you’re using R 4.1.0 or later, I’d recommend the native pipe operator built into R, |>, which needs no package.
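With the native pipe, the subsetting example from the start of the lab can be sketched as follows. This assumes R 4.1.0 or later and the unmodified starwars data. Note one difference: unlike the base R version, filter drops the rows where sex is NA.

```r
library(dplyr)

# Keep the male characters, then drop the sex and gender columns.
sw_males <- starwars |>
    filter(sex == "male") |>
    select(-c(sex, gender))

dim(sw_males)
```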

Piping in Action

A more elaborate example from the Algorithmic Problems page:

library(magrittr)  # equals(), multiply_by()
library(purrr)     # accumulate()

qaq <- function(string) {
  # Obtain the individual characters in `string`.
  characters <- strsplit(string, "")[[1]]
  
  # Form a prefix vector for the number of "Q"s at each index.
  q_count <- characters |>
    equals("Q") |>
    as.numeric() |>
    accumulate(`+`)
  
  # For each "A" we can form <"Q"s before "A"> * <"Q"s after "A"> "QAQ"s. We 
  # obtain this value for each "A" and sum them.
  characters |>
    equals("A") |>
    multiply_by(q_count) |>
    multiply_by(tail(q_count, 1) - q_count) |>
    sum()
}
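To check the idea on small inputs, here is a base R sketch of the same computation. The name qaq_base is hypothetical, and cumsum stands in for the accumulate step, so no packages are needed.

```r
# Base R variant of the same counting idea (qaq_base is a hypothetical name).
qaq_base <- function(string) {
    characters <- strsplit(string, "")[[1]]

    # Running count of "Q"s up to and including each position.
    q_count <- cumsum(characters == "Q")
    total_q <- sum(characters == "Q")

    # At each "A": <"Q"s before> * <"Q"s after>; other positions contribute 0.
    sum((characters == "A") * q_count * (total_q - q_count))
}

qaq_base("QAQ")             # 1
qaq_base("QAQAQYSYIOIWIN")  # 4
```

In the second call there are "A"s at positions 2 and 4. The first has one "Q" before it and two after, the second has two before and one after, giving 1 * 2 + 2 * 1 = 4.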