dplyr 1.1.0: `pick()`, `reframe()`, and `arrange()`
This article is originally published at https://www.tidyverse.org/blog/
In this final dplyr 1.1.0 post, we’ll take a look at two new verbs, pick() and reframe(), along with some changes to arrange() that improve both reproducibility and performance. If you missed our previous posts, you should definitely go back and check them out!
You can install it from CRAN with:
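The install command itself did not survive in this copy of the post. The usual CRAN call is install.packages("dplyr"); the sketch below wraps it in a version check so it only reaches out to CRAN when needed. The guard and the explicit mirror URL are additions for illustration, not part of the original post.

```r
# Install dplyr from CRAN only if it is missing or older than 1.1.0.
# The guard and the `repos` mirror are illustrative assumptions.
if (!requireNamespace("dplyr", quietly = TRUE) ||
    utils::packageVersion("dplyr") < "1.1.0") {
  install.packages("dplyr", repos = "https://cloud.r-project.org")
}

library(dplyr)
```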
One thing we noticed after dplyr 1.0.0 was released is that many people like to use across() for its column selection features while working inside a data-masking function like summarise(). This is typically useful if you have a function that takes data frames as inputs, or if you need to compute features about a specific subset of columns.
    library(dplyr)

    df <- tibble(
      x_1 = c(1, 3, 2, 1, 2),
      x_2 = 6:10,
      w_4 = 11:15,
      y_2 = c(5, 2, 4, 0, 6)
    )

    df |>
      summarise(
        n_x = ncol(across(starts_with("x"))),
        n_y = ncol(across(starts_with("y")))
      )
    #> # A tibble: 1 × 2
    #>     n_x   n_y
    #>   <int> <int>
    #> 1     2     1
across() is intended to apply a function to each of these columns, rather than just select them, which is why its name doesn’t feel natural for this operation. In dplyr 1.1.0 we’ve introduced pick(), a specialized column selection variant with a more natural name:
    df |>
      summarise(
        n_x = ncol(pick(starts_with("x"))),
        n_y = ncol(pick(starts_with("y")))
      )
    #> # A tibble: 1 × 2
    #>     n_x   n_y
    #>   <int> <int>
    #> 1     2     1
pick() is particularly useful in combination with ranking functions like dense_rank(), which have been upgraded in 1.1.0 to take data frames as inputs, serving as a way to jointly rank by multiple columns at once.
    df |>
      mutate(
        rank1 = dense_rank(x_1),
        rank2 = dense_rank(pick(x_1, y_2)) # Using `y_2` to break ties in `x_1`
      )
    #> # A tibble: 5 × 6
    #>     x_1   x_2   w_4   y_2 rank1 rank2
    #>   <dbl> <int> <int> <dbl> <int> <int>
    #> 1     1     6    11     5     1     2
    #> 2     3     7    12     2     3     5
    #> 3     2     8    13     4     2     3
    #> 4     1     9    14     0     1     1
    #> 5     2    10    15     6     2     4
As we mentioned in the coming soon blog post, in dplyr 1.1.0 we’ve decided to walk back the change we introduced to summarise() in dplyr 1.0.0 that allowed it to return per-group results of any length, rather than results of length 1. We think that the idea of multi-row results is extremely powerful, as it serves as a flexible way to apply arbitrary operations to each group, but we’ve realized that summarise() wasn’t the best home for it because it increases the chance for users to run into silent recycling bugs (thanks to Kirill Müller and David Robinson for bringing this to our attention).
As an example, here we’re computing the mean and standard deviation of x, grouped by g. Unfortunately, I accidentally forgot to use sd(x) and instead just typed x. Because of how tidyverse recycling rules work, the multi-row behavior silently recycled the size 1 mean values instead of erroring, so rather than 2 rows, we end up with 5.
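The data frame used in this example is not defined anywhere in this section. A minimal reconstruction, inferred from the printed results (x values 4, 3, 6 in group 1 and 2, 8 in group 2), might look like the following. Treat it as an assumption rather than the post’s exact definition.

```r
library(dplyr)

# Assumed input, reconstructed from the printed output rather than copied
# from the post: three rows in group 1, two rows in group 2
df <- tibble(
  g = c(1, 1, 1, 2, 2),
  x = c(4, 3, 6, 2, 8)
)
```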
    df |>
      summarise(
        x_average = mean(x),
        x_sd = x, # Oops
        .by = g
      )
    #> Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
    #> dplyr 1.1.0.
    #> ℹ Please use `reframe()` instead.
    #> ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
    #>   always returns an ungrouped data frame and adjust accordingly.
    #> # A tibble: 5 × 3
    #>       g x_average  x_sd
    #>   <dbl>     <dbl> <dbl>
    #> 1     1      4.33     4
    #> 2     1      4.33     3
    #> 3     1      4.33     6
    #> 4     2      5        2
    #> 5     2      5        8
summarise() now throws a warning when any group returns a result that isn’t length 1. We expect to upgrade this to an error in the future to revert summarise() back to its “safe” behavior of requiring 1 row per group.
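For contrast, here is a sketch of the call as it was presumably intended, with sd(x) in place of the bare x. The input data frame is an assumed reconstruction based on the printed output above. With every expression returning a single value per group, summarise() yields one row per group and no warning.

```r
library(dplyr)

# Assumed input, reconstructed from the printed output
df <- tibble(
  g = c(1, 1, 1, 2, 2),
  x = c(4, 3, 6, 2, 8)
)

# Each expression is length 1 per group, so the result has one row per group
fixed <- df |>
  summarise(
    x_average = mean(x),
    x_sd = sd(x),
    .by = g
  )

fixed
```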
summarise() also wasn’t the best name for a function with this feature, as the name itself implies one row per group. After gathering some feedback, we’ve settled on a new verb with a more appropriate name, reframe(). We think of reframe() as a way to “do something” to each group, with no restrictions on the number of rows returned per group. The name has a nice connection to the tibble functions tibble::enframe() and tibble::deframe(), which are used for converting vectors to data frames and vice versa:
enframe(): Takes a vector, returns a data frame
deframe(): Takes a data frame, returns a vector
reframe(): Takes a data frame, returns a data frame
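The three functions can be seen side by side in a small sketch. The vector and data frame here are made up for illustration.

```r
library(dplyr)
library(tibble)

# Illustrative named vector
v <- c(a = 1, b = 2)

# enframe(): vector in, two-column data frame (`name`, `value`) out
as_df <- enframe(v)

# deframe(): two-column data frame in, named vector out
back <- deframe(as_df)

# reframe(): data frame in, data frame out, any number of rows per group.
# range() returns two values, so each group contributes two rows.
df <- tibble(g = c(1, 1, 1, 2, 2), x = c(4, 3, 6, 2, 8))
ranges <- df |>
  reframe(x_range = range(x), .by = g)
```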
One nice application of reframe() is computing quantiles at various probability thresholds. It’s particularly nice if we wrap quantile() into a helper that returns a data frame, which reframe() then automatically unpacks.
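The definition of the quantile_df() helper is missing from this section. One plausible version, consistent with the value and prob columns in the output that follows, is sketched here. The default probabilities are an assumption inferred from that output.

```r
library(dplyr)

# Hypothetical reconstruction of the helper: returns one row per probability,
# with the quantile in `value` and the probability in `prob`
quantile_df <- function(x, probs = c(0.25, 0.5, 0.75)) {
  tibble(
    value = quantile(x, probs, na.rm = TRUE),
    prob = probs
  )
}

quantile_df(c(4, 3, 6))
```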
    df |>
      reframe(quantile_df(x), .by = g)
    #> # A tibble: 6 × 3
    #>       g value  prob
    #>   <dbl> <dbl> <dbl>
    #> 1     1   3.5  0.25
    #> 2     1   4    0.5
    #> 3     1   5    0.75
    #> 4     2   3.5  0.25
    #> 5     2   5    0.5
    #> 6     2   6.5  0.75
This also works well if you want to apply it to multiple columns using across(). Because quantile_df() returns a tibble, we end up with packed data frame columns. You’ll often want to unpack these into their individual columns, and across() has gained a new .unpack argument in 1.1.0 that helps you do exactly that.
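The code for this step is not present in this section, so the following is a sketch under stated assumptions: quantile_df() is the hypothetical helper reconstructed from the earlier output, and the y column is invented for illustration. It compares the packed result with the one produced by .unpack = TRUE.

```r
library(dplyr)

# Hypothetical helper, reconstructed from the output shown earlier
quantile_df <- function(x, probs = c(0.25, 0.5, 0.75)) {
  tibble(
    value = quantile(x, probs, na.rm = TRUE),
    prob = probs
  )
}

# Assumed data; the `y` values are illustrative
df <- tibble(
  g = c(1, 1, 1, 2, 2),
  x = c(4, 3, 6, 2, 8),
  y = c(5, 1, 2, 8, 9)
)

# Without `.unpack`, `x` and `y` are each a packed data frame column
packed <- df |>
  reframe(across(x:y, quantile_df), .by = g)

# With `.unpack = TRUE`, the inner columns are spliced out and named
# "{outer}_{inner}", for example `x_value` and `x_prob`
unpacked <- df |>
  reframe(across(x:y, quantile_df, .unpack = TRUE), .by = g)

names(packed)
names(unpacked)
```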
We expect that seeing reframe() in a colleague’s code will serve as an extremely clear signal that something “special” is happening, because they’ve made a conscious decision to opt into the 1% case of returning multiple rows per group.
arrange() has undergone two user-facing changes in dplyr 1.1.0:
When sorting character vectors, the C locale is now the default, rather than the system locale
A new .locale argument, powered by stringi, allows you to explicitly request an alternative locale using a stringi locale identifier (like "en" for English)
These changes were made for two reasons:
Much faster performance by default, due to usage of a custom radix sort algorithm inspired by data.table
Improved reproducibility across R sessions, where different computers might use different system locales and different operating systems have different ways to specify the same system locale
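To make the difference between the two orderings concrete, here is a small sketch with made-up strings. In the C locale, uppercase letters sort before lowercase ones, whereas the English locale interleaves them alphabetically. The .locale call assumes the stringi package is installed.

```r
library(dplyr)

# Illustrative data mixing upper and lower case
words <- tibble(x = c("a", "B", "c"))

# Default in dplyr 1.1.0: C locale, so "B" sorts ahead of "a"
c_order <- arrange(words, x)$x
c_order
#> [1] "B" "a" "c"

# Explicit English locale (requires stringi): case-insensitive alphabetical order
en_order <- arrange(words, x, .locale = "en")$x
en_order
#> [1] "a" "B" "c"
```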
If you use arrange() for the purpose of grouping similar values together (and don’t care much about the specific locale that it uses to do so), then you’ll likely see performance improvements of up to 100x in dplyr 1.1.0. If you do care about the locale and supply .locale, you should still see improvements of up to 10x.
    # 10,000 random strings, sampled up to 1,000,000 rows
    dictionary <- stringi::stri_rand_strings(10000, length = 10, pattern = "[a-z]")
    str <- tibble(x = sample(dictionary, size = 1e6, replace = TRUE))
    str
    #> # A tibble: 1,000,000 × 1
    #>    x
    #>    <chr>
    #>  1 slpqkdtpyr
    #>  2 xtoucpndhc
    #>  3 vsvfoqcyqm
    #>  4 gnbpkwcmse
    #>  5 xutzdqxpsi
    #>  6 gkolsrndrz
    #>  7 mitqahkkou
    #>  8 eehfrrimhd
    #>  9 ymxxjczjsv
    #> 10 svpvizfxwe
    #> # … with 999,990 more rows
    # dplyr 1.0.10 (American English system locale)
    bench::mark(arrange(str, x))
    #> # A tibble: 1 × 6
    #>   expression           min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 arrange(str, x)    4.38s    4.89s     0.204    12.7MB    0.148

    # dplyr 1.1.0 (C locale default, 100x faster)
    bench::mark(arrange(str, x))
    #> # A tibble: 1 × 6
    #>   expression           min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 arrange(str, x)   42.3ms   46.6ms      20.8    22.4MB     46.0

    # dplyr 1.1.0 (American English `.locale`, 10x faster)
    bench::mark(arrange(str, x, .locale = "en"))
    #> # A tibble: 1 × 6
    #>   expression                           min   median `itr/sec` mem_alloc
    #>   <bch:expr>                      <bch:tm>   <bch:>     <dbl> <bch:byt>
    #> 1 arrange(str, x, .locale = "en")    377ms    430ms      2.21    27.9MB
    #> # … with 1 more variable: `gc/sec` <dbl>
We are hopeful that switching to a C locale default will have a relatively small amount of impact in exchange for much faster performance. To read more about the exact differences between the C locale and locales like American English or Spanish, see the coming soon post or our detailed tidyup. If you are having trouble converting an existing script over to the new behavior, you can set the temporary global option options(dplyr.legacy_locale = TRUE), which will revert to the pre-1.1.0 behavior of using the system locale. We expect to remove this option in a future release.
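As a rough sketch of how the option might be used while migrating a script, the snippet below turns it on for a single call and then restores the previous setting. The sample strings are made up, and the ordering under the legacy behavior depends on your system locale, so no output is shown.

```r
library(dplyr)

# Illustrative data
words <- tibble(x = c("a", "B", "c"))

# Temporarily opt back into the pre-1.1.0 system-locale ordering.
# options() returns the previous values, which lets us restore them afterwards.
old <- options(dplyr.legacy_locale = TRUE)
legacy_sorted <- arrange(words, x)
options(old)

# Back to the 1.1.0 default (C locale)
default_sorted <- arrange(words, x)
```

Restoring the option straight away keeps the legacy behavior scoped to the code that still needs it, which should make the eventual removal of the option easier to handle.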
A big thanks to the 88 contributors who helped make the 1.1.0 release possible by opening issues, contributing features and documentation, and asking questions! @7708801314520dym, @abalter, @aghaynes, @AlbertRapp, @AlexGaithuma, @algsat, @andrewbaxter439, @andrewpbray, @asadow, @asmlgkj, @barbosawf, @barnabasharris, @bart1, @bergsmat, @chrisbrownlie, @cjyetman, @CNUlichao, @daattali, @DanChaltiel, @davidchall, @DavisVaughan, @ddsjoberg, @donboyd5, @drmowinckels, @dxtxs1, @eitsupi, @eogoodwin, @erhoppe, @eutwt, @ggrothendieck, @grayskripko, @H-Mateus, @hadley, @haozhou1988, @hassanjfry, @Hesham999666, @hideaki, @jeffreypullin, @jic007, @jmbarbone, @jonspring, @jonthegeek, @jpeacock29, @kendonB, @kenkoonwong, @kevinushey, @krlmlr, @larry77, @latot, @lionel-, @llayman12, @LukasWallrich, @m-sostero, @machow, @mc-unimi, @mgacc0, @mgirlich, @MichelleSMA, @mine-cetinkaya-rundel, @moodymudskipper, @moriarais, @NicChr, @nstjhp, @omarwh, @orgadish, @rempsyc, @rorynolan, @ryanvoyack, @selkamand, @seth-cp, @shalom-lab, @shannonpileggi, @simonpcouch, @sjackson1997, @spono, @stibu81, @tfehring, @Theresaliu, @TimBMK, @TimTeaFan, @Torvaney, @turbanisch, @weiyangtham, @wurli, @xet869, @yuliaUU, @yutannihilation, and @zeehio.