dplyr
Up until now, we’ve used base R to create and subset data. An example:
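As a stand-in for the kind of expression meant here, consider this sketch. It uses the built-in mtcars data frame; the columns and thresholds are chosen purely for illustration.

```r
# Hypothetical example: subset rows and columns of the built-in mtcars data.
efficient <- mtcars[mtcars$cyl == 4 & mtcars$mpg > 30, c("mpg", "hp")]
efficient
```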
What does this do?
It’s reasonable to find this difficult to read.
In general, we prefer to be more expressive with our programming.
Suppose we wanted to evaluate whether some condition is true for any of the elements in a vector.
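One hypothetical way to write this check by hand is an explicit loop. The vector x and the threshold 5 are made up for illustration.

```r
# Hypothetical data: is any element of x greater than 5?
x <- c(2, 7, 4)

found <- FALSE
for (value in x) {
  if (value > 5) {
    found <- TRUE
    break # stop at the first match
  }
}
found
```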
What about this instead?
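Another hypothetical variant, again with a made-up vector x, counts the matching elements and compares the count to zero:

```r
# Hypothetical data: is any element of x greater than 5?
x <- c(2, 7, 4)

# Count the elements satisfying the condition, then test the count.
found <- sum(x > 5) > 0
found
```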
These are both awful. The burden of reading and writing them is too great for so trivial a task.
Importantly, the code doesn’t convey what it does. Instead, it’s the reader’s responsibility to make sense of the computations.
Ideally, we’d prefer for our code to say what it does directly. This makes it easier to read and write.
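For the vector check described earlier, base R's any() is one function that states the intent directly. A minimal sketch, with a made-up vector x and threshold:

```r
# Hypothetical data: is any element of x greater than 5?
x <- c(2, 7, 4)

# any() returns TRUE if at least one element of the logical vector is TRUE.
has_large <- any(x > 5)
has_large
```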
The package dplyr provides us with functions (“a grammar”) that make our data frame manipulations more expressive.
Let’s consider some common data frame tasks and what verbs dplyr provides to solve them.
slice
Suppose we want some subset of the rows. In base R, we index the rows inside square brackets; dplyr provides us with the slice function.
Note that slice works only with integer row positions. Unlike base R subsetting, you can’t use strings or logical values.
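A minimal sketch comparing the two approaches on the starwars data that ships with dplyr. The choice of rows 1 to 3 is arbitrary.

```r
library(dplyr)

# Base R: row positions go before the comma.
base_rows <- starwars[1:3, ]

# dplyr: slice() takes the data frame first, then the row positions.
dplyr_rows <- slice(starwars, 1:3)

dplyr_rows
```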
select
Suppose we want to select some of the columns. In base R, we index the columns by name inside square brackets; dplyr provides us with the select function.
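A minimal sketch of both approaches, using the starwars data from dplyr. The three columns are picked only for illustration.

```r
library(dplyr)

# Base R: column names as a character vector after the comma.
base_cols <- starwars[, c("name", "height", "mass")]

# dplyr: select() takes bare column names.
dplyr_cols <- select(starwars, name, height, mass)

dplyr_cols
```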
filter
Suppose we want to filter the rows by some condition. In base R, we subset with a logical vector inside square brackets.
dplyr provides us with the filter function.
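A minimal sketch of both approaches on the starwars data. The condition height > 200 is an arbitrary example. Note that filter drops rows where the condition is NA, so the base R version needs an explicit is.na check to match.

```r
library(dplyr)

# Base R: build a logical vector, excluding missing heights explicitly.
base_tall <- starwars[!is.na(starwars$height) & starwars$height > 200, ]

# dplyr: filter() keeps rows where the condition is TRUE (NA rows are dropped).
dplyr_tall <- filter(starwars, height > 200)

dplyr_tall
```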
mutate
Suppose we want to add new columns. In base R, we assign to a new column name with $.
dplyr provides us with the function mutate.
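A minimal sketch of both approaches. The new column height_m (height converted from centimetres to metres) is a made-up example.

```r
library(dplyr)

# Base R: assign to a new column on a copy of the data.
base_sw <- starwars
base_sw$height_m <- base_sw$height / 100

# dplyr: mutate() returns a data frame with the new column added.
dplyr_sw <- mutate(starwars, height_m = height / 100)

select(dplyr_sw, name, height, height_m)
```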
Wait, did that change starwars?
No, it returned a new, mutated data frame. To keep the change, we must assign the result back to a name.
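A short sketch of the difference, working on a copy named sw (a hypothetical name) with a made-up column height_m:

```r
library(dplyr)

sw <- starwars # a working copy

# Calling mutate() without assigning leaves sw untouched.
mutate(sw, height_m = height / 100)
changed_without_assign <- "height_m" %in% names(sw)

# Reassigning stores the mutated data frame under the same name.
sw <- mutate(sw, height_m = height / 100)
changed_with_assign <- "height_m" %in% names(sw)

c(changed_without_assign, changed_with_assign)
```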
summarise/summarize
We can use summarise to obtain summary statistics on the data frame (in the form of a new data frame).
summarise(
  starwars, # what we are summarising
  avg_height = mean(height, na.rm = TRUE),
  avg_mass = mean(mass, na.rm = TRUE),
  avg_birth_year = mean(birth_year, na.rm = TRUE)
)
# A tibble: 1 × 3
  avg_height avg_mass avg_birth_year
       <dbl>    <dbl>          <dbl>
1       174.     97.3           87.6
But what if we wanted to get these values for each group (say, each sex)?
Would we have to filter the data frame once per group, then call summarise for each?
group_by
We can use group_by to form a new table which is grouped by the values in particular columns. The functions you use will be applied to each group separately, after which dplyr combines the results.
summarise(
  group_by(starwars, sex),
  avg_height = mean(height, na.rm = TRUE),
  avg_mass = mean(mass, na.rm = TRUE),
  avg_birth_year = mean(birth_year, na.rm = TRUE)
)
# A tibble: 5 × 4
  sex            avg_height avg_mass avg_birth_year
  <chr>               <dbl>    <dbl>          <dbl>
1 female               169.     54.7           47.2
2 hermaphroditic       175    1358            600
3 male                 179.     81.0           85.5
4 none                 131.     69.8           53.3
5 <NA>                 181.     48             62
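To check that the grouped summary really does replace the filter-then-summarise approach, here is a small sketch comparing the two for a single group. Only avg_height is computed, to keep it short.

```r
library(dplyr)

# Manual approach: filter one group, then summarise it.
manual <- summarise(
  filter(starwars, sex == "female"),
  avg_height = mean(height, na.rm = TRUE)
)

# Grouped approach: summarise every group at once, then pick the same group.
grouped <- summarise(
  group_by(starwars, sex),
  avg_height = mean(height, na.rm = TRUE)
)
grouped_female <- filter(grouped, sex == "female")

all.equal(manual$avg_height, grouped_female$avg_height)
```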
I’ve only introduced the most basic uses of these functions. Enter ?<function> (for example, ?mutate) at the console to explore additional functionality.
Piping will be explored in greater detail later.
Suppose we have this code:
f(g(h(x)))
In what order are the functions f(), g(), and h() applied to the data?
h(), then g(), then f()! We’ve written it backwards.
To make nested functions easier to read and write, we use the pipe operator.
a() %>% b() becomes b(a()).
That is, the left operand is passed as the first argument to the right operand.
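A minimal sketch of this equivalence, using the magrittr pipe with a made-up numeric vector:

```r
library(magrittr)

x <- c(1, 4, 9) # hypothetical input

# Nested form: read inside-out.
nested <- sum(sqrt(x))

# Piped form: read left to right.
piped <- x %>% sqrt() %>% sum()

identical(nested, piped)
```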
The %>% symbol comes from the package magrittr.
If you’re using R 4.1.0 or later, I’d recommend using the native pipe operator built into R, |>.
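As a sketch, here is a grouped summary written with the native pipe, next to its nested equivalent. It needs R 4.1.0 or later, and only avg_height is computed to keep it short.

```r
library(dplyr)

# Nested form.
nested <- summarise(
  group_by(starwars, sex),
  avg_height = mean(height, na.rm = TRUE)
)

# Piped form: each step receives the previous result as its first argument.
piped <- starwars |>
  group_by(sex) |>
  summarise(avg_height = mean(height, na.rm = TRUE))

piped
```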
A more elaborate example from the Algorithmic Problems page:
library(magrittr) # provides equals() and multiply_by()
library(purrr)    # provides accumulate()

qaq <- function(string) {
  # Obtain the individual characters in `string`.
  characters <- strsplit(string, "")[[1]]
  # Form a prefix vector for the number of "Q"s at each index.
  q_count <- characters |>
    equals("Q") |>
    as.numeric() |>
    accumulate(`+`)
  # For each "A" we can form <"Q"s before "A"> * <"Q"s after "A"> "QAQ"s. We
  # obtain this value for each "A" and sum them.
  characters |>
    equals("A") |>
    multiply_by(q_count) |>
    multiply_by(tail(q_count, 1) - q_count) |>
    sum()
}
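To make the prefix-count idea concrete, here is a worked sketch of the same arithmetic on the made-up input "QAQAQ". It uses base R's cumsum as a stand-in for the accumulate step, so it is an illustration of the technique rather than the function above.

```r
characters <- strsplit("QAQAQ", "")[[1]]

# Running count of "Q"s up to and including each position: 1 1 2 2 3
q_count <- cumsum(characters == "Q")

# Total number of "Q"s in the string: 3
total_q <- tail(q_count, 1)

# For each "A": (Qs before) * (Qs after); zero elsewhere: 0 2 0 2 0
per_a <- (characters == "A") * q_count * (total_q - q_count)

sum(per_a) # 4
```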