dplyr 1.1.0 is coming soon
This article is originally published at https://www.tidyverse.org/blog/
dplyr 1.1.0 is coming soon! We haven’t started the official release process yet (where we inform maintainers), but that will start in the next few weeks, and then dplyr 1.1.0 is likely to be submitted to CRAN in late January 2023.
This is an exciting release for dplyr, incorporating a number of features that have been in flight for years, including:
- An inline alternative to group_by() that implements temporary grouping
- New join types, such as non-equi joins
- arrange() improvements with character vectors
- reframe(), a generalization of summarise()
This pre-release blog post will discuss these new features in more detail. By releasing this post before 1.1.0 is sent to CRAN, we’re hoping to get your feedback to catch any potential problems that we’ve missed! If you do find a bug, or have general feedback about the new features, we welcome discussion on the dplyr issues page.
You can see a full list of changes in the release notes. There are many additional improvements that couldn’t fit in a single blog post!
dplyr 1.1.0 is not on CRAN yet, but you can install the development version from GitHub with:
pak::pak("tidyverse/dplyr")
The development version is mostly stable, but is still subject to minor changes before the official release. We don’t encourage relying on it for production usage, but we would love for you to try out these new features.
Temporary grouping with .by
Verbs that work “by group,” such as mutate(), summarise(), filter(), and slice(), have gained an experimental new argument, .by, which allows for inline and temporary grouping. Grouping radically affects the computation of the dplyr verb you use it with, and one of the goals of .by is to allow you to place that grouping specification alongside the code that actually uses it. As an added benefit, with .by you no longer need to remember to ungroup() after summarise(), and summarise() won’t ever message you about how it’s handling the groups!
This feature was inspired by data.table, which has always used per-operation grouping.
We’ll explore .by with this expenses dataset, containing various costs tracked across id and region.
library(dplyr)

expenses <- tibble(
  id = c(1, 2, 1, 3, 1, 2, 3),
  region = c("A", "A", "A", "B", "B", "A", "A"),
  cost = c(25, 20, 19, 12, 9, 6, 6)
)
expenses
#> # A tibble: 7 × 3
#> id region cost
#> <dbl> <chr> <dbl>
#> 1 1 A 25
#> 2 2 A 20
#> 3 1 A 19
#> 4 3 B 12
#> 5 1 B 9
#> 6 2 A 6
#> 7 3 A 6
If I were to ask you to compute the average cost per region, you’d probably write something like:
expenses |>
  group_by(region) |>
  summarise(cost = mean(cost))
#> # A tibble: 2 × 2
#> region cost
#> <chr> <dbl>
#> 1 A 15.2
#> 2 B 10.5
With .by, you can now write:
expenses |>
  summarise(cost = mean(cost), .by = region)
#> # A tibble: 2 × 2
#> region cost
#> <chr> <dbl>
#> 1 A 15.2
#> 2 B 10.5
These two particular results look the same, but the behavior of .by diverges from group_by() when multiple group columns are involved:
expenses |>
  group_by(id, region) |>
  summarise(cost = mean(cost))
#> `summarise()` has grouped output by 'id'. You can override using the `.groups`
#> argument.
#> # A tibble: 5 × 3
#> # Groups: id [3]
#> id region cost
#> <dbl> <chr> <dbl>
#> 1 1 A 22
#> 2 1 B 9
#> 3 2 A 13
#> 4 3 A 6
#> 5 3 B 12
expenses |>
  summarise(cost = mean(cost), .by = c(id, region))
#> # A tibble: 5 × 3
#> id region cost
#> <dbl> <chr> <dbl>
#> 1 1 A 22
#> 2 2 A 13
#> 3 3 B 12
#> 4 1 B 9
#> 5 3 A 6
Usage of .by always results in an ungrouped data frame, regardless of the number of group columns involved.
You might also recognize that these results aren’t returned in exactly the same order. group_by() always sorts the grouping keys in ascending order, but .by retains the original ordering found in the data. If you need ordered summaries with .by, we recommend calling arrange() explicitly before or after summarizing.
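For example, a minimal sketch (reusing the expenses data from above) that restores the sorted-key ordering that group_by() would have produced:

```r
# Summarise per group with .by, then sort the keys explicitly
expenses |>
  summarise(cost = mean(cost), .by = c(id, region)) |>
  arrange(id, region)
```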
While here we’ve focused on using .by with summarise(), it also works with other verbs, like mutate() and slice():
expenses |>
  mutate(mean = mean(cost), .by = region)
#> # A tibble: 7 × 4
#> id region cost mean
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 A 25 15.2
#> 2 2 A 20 15.2
#> 3 1 A 19 15.2
#> 4 3 B 12 10.5
#> 5 1 B 9 10.5
#> 6 2 A 6 15.2
#> 7 3 A 6 15.2
expenses |>
  slice(2, .by = region)
#> # A tibble: 2 × 3
#> id region cost
#> <dbl> <chr> <dbl>
#> 1 2 A 20
#> 2 1 B 9
group_by() won’t ever disappear, but we are having a lot of fun writing new code with .by, and we think you will too.
Join improvements
All of the join functions in dplyr, such as left_join(), now accept a flexible join specification created through the new join_by() helper. join_by() allows you to specify your join conditions as expressions rather than as named character vectors.
join_by(x_id == y_id, region)
#> Join By:
#> - x_id == y_id
#> - region
This join specification matches x_id in the left-hand data frame with y_id in the right-hand one, and also matches between a commonly named region column, computing the following equi-join:
df1 <- tibble(x_id = c(1, 2, 2), region = c("A", "B", "A"), x = c(5, 10, 4))
df2 <- tibble(y_id = c(2, 1, 2), region = c("A", "A", "C"), y = c(12, 8, 7))
df1
#> # A tibble: 3 × 3
#> x_id region x
#> <dbl> <chr> <dbl>
#> 1 1 A 5
#> 2 2 B 10
#> 3 2 A 4
df2
#> # A tibble: 3 × 3
#> y_id region y
#> <dbl> <chr> <dbl>
#> 1 2 A 12
#> 2 1 A 8
#> 3 2 C 7
df1 |>
  left_join(df2, join_by(x_id == y_id, region))
#> # A tibble: 3 × 4
#> x_id region x y
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 A 5 8
#> 2 2 B 10 NA
#> 3 2 A 4 12
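For comparison, a sketch of the same equi-join written in the pre-1.1.0 named character vector syntax, which continues to work:

```r
# Names refer to columns of df1 (the left table), values to columns of df2
df1 |>
  left_join(df2, by = c("x_id" = "y_id", "region"))
```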
Non-equi joins
Allowing expressions in join_by() opens up a whole new world of joins in dplyr known as non-equi joins. As the name somewhat implies, these are joins that involve binary conditions other than equality. There are 4 particularly useful types of non-equi joins:
- Cross joins match every pair of rows and were already supported in dplyr.
- Inequality joins match using >, >=, <, or <= instead of ==.
- Rolling joins are based on inequality joins, but only find the closest match.
- Overlap joins are also based on inequality joins, but are specialized for working with ranges.
Non-equi joins were first requested back in 2016, and were the most-requested dplyr feature at the time they were finally implemented, with over 147 thumbs up! data.table has had support for non-equi joins for many years, and its implementation greatly inspired the one used in dplyr.
To demonstrate the different types of non-equi joins, imagine that you are in charge of the party planning committee for your office. Unfortunately, you only get to have one party per quarter, but it is your job to ensure that every employee is assigned to a single party. Upper management has provided the following 4 party dates:
parties <- tibble(
  q = 1:4,
  # date_parse() comes from the clock package
  party = date_parse(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
)
parties
#> # A tibble: 4 × 2
#> q party
#> <int> <date>
#> 1 1 2022-01-10
#> 2 2 2022-04-04
#> 3 3 2022-07-11
#> 4 4 2022-10-03
With this set of employees:
employees <- tibble(
  name = wakefield::name(100),
  birthday = date_parse("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
)
employees
#> # A tibble: 100 × 2
#> name birthday
#> <variable> <date>
#> 1 Seager 2022-08-26
#> 2 Nathion 2022-10-04
#> 3 Sametra 2022-06-13
#> 4 Netty 2022-05-12
#> 5 Yalissa 2022-05-28
#> 6 Mirai 2022-08-20
#> 7 Toyoko 2022-08-23
#> 8 Earlene 2022-04-21
#> 9 Abbegayle 2022-01-27
#> 10 Valyssa 2022-03-06
#> # … with 90 more rows
One way to start approaching this problem is to look for the party that happened directly before each birthday. You can do this with an inequality join:
employees |>
  left_join(parties, join_by(birthday >= party))
#> # A tibble: 251 × 4
#> name birthday q party
#> <variable> <date> <int> <date>
#> 1 Seager 2022-08-26 1 2022-01-10
#> 2 Seager 2022-08-26 2 2022-04-04
#> 3 Seager 2022-08-26 3 2022-07-11
#> 4 Nathion 2022-10-04 1 2022-01-10
#> 5 Nathion 2022-10-04 2 2022-04-04
#> 6 Nathion 2022-10-04 3 2022-07-11
#> 7 Nathion 2022-10-04 4 2022-10-03
#> 8 Sametra 2022-06-13 1 2022-01-10
#> 9 Sametra 2022-06-13 2 2022-04-04
#> 10 Netty 2022-05-12 1 2022-01-10
#> # … with 241 more rows
This looks like a good start, but we’ve assigned people with birthdays later in the year to multiple parties. We can restrict this to only the party that is closest to the employee’s birthday by using a rolling join. Rolling joins are activated by wrapping an inequality in closest().
closest <- employees |>
  left_join(parties, join_by(closest(birthday >= party)))
closest
#> # A tibble: 100 × 4
#> name birthday q party
#> <variable> <date> <int> <date>
#> 1 Seager 2022-08-26 3 2022-07-11
#> 2 Nathion 2022-10-04 4 2022-10-03
#> 3 Sametra 2022-06-13 2 2022-04-04
#> 4 Netty 2022-05-12 2 2022-04-04
#> 5 Yalissa 2022-05-28 2 2022-04-04
#> 6 Mirai 2022-08-20 3 2022-07-11
#> 7 Toyoko 2022-08-23 3 2022-07-11
#> 8 Earlene 2022-04-21 2 2022-04-04
#> 9 Abbegayle 2022-01-27 1 2022-01-10
#> 10 Valyssa 2022-03-06 1 2022-01-10
#> # … with 90 more rows
This is close to what we want, but isn’t quite right. It turns out that poor Della hasn’t been assigned to a party.
filter(closest, is.na(party))
#> # A tibble: 1 × 4
#> name birthday q party
#> <variable> <date> <int> <date>
#> 1 Della 2022-01-06 NA NA
This is because their birthday occurred before the first party date, 2022-01-10, so there wasn’t any “previous party” to match them to. It’s a little easier to fix this if we are explicit about the quarter start/end dates that form the ranges to look for matches in:
# Some helpers from {clock}
quarter_start <- function(x) {
  x <- as_year_quarter_day(x)
  x <- calendar_start(x, "quarter")
  as_date(x)
}

quarter_end <- function(x) {
  x <- as_year_quarter_day(x)
  x <- calendar_end(x, "quarter")
  as_date(x)
}
parties <- parties |>
  mutate(start = quarter_start(party), end = quarter_end(party))
parties
#> # A tibble: 4 × 4
#> q party start end
#> <int> <date> <date> <date>
#> 1 1 2022-01-10 2022-01-01 2022-03-31
#> 2 2 2022-04-04 2022-04-01 2022-06-30
#> 3 3 2022-07-11 2022-07-01 2022-09-30
#> 4 4 2022-10-03 2022-10-01 2022-12-31
Now that we have 4 distinct ranges of dates to work with, we’ll use an overlap join to figure out which range each birthday fell between(). Since we know that each birthday should be matched to exactly one party, we’ll also take this chance to set multiple, a new argument to the join functions that allows you to optionally "error" if a birthday is matched to multiple parties.
employees |>
  left_join(
    parties,
    join_by(between(birthday, start, end)),
    multiple = "error"
  )
#> # A tibble: 100 × 6
#> name birthday q party start end
#> <variable> <date> <int> <date> <date> <date>
#> 1 Seager 2022-08-26 3 2022-07-11 2022-07-01 2022-09-30
#> 2 Nathion 2022-10-04 4 2022-10-03 2022-10-01 2022-12-31
#> 3 Sametra 2022-06-13 2 2022-04-04 2022-04-01 2022-06-30
#> 4 Netty 2022-05-12 2 2022-04-04 2022-04-01 2022-06-30
#> 5 Yalissa 2022-05-28 2 2022-04-04 2022-04-01 2022-06-30
#> 6 Mirai 2022-08-20 3 2022-07-11 2022-07-01 2022-09-30
#> 7 Toyoko 2022-08-23 3 2022-07-11 2022-07-01 2022-09-30
#> 8 Earlene 2022-04-21 2 2022-04-04 2022-04-01 2022-06-30
#> 9 Abbegayle 2022-01-27 1 2022-01-10 2022-01-01 2022-03-31
#> 10 Valyssa 2022-03-06 1 2022-01-10 2022-01-01 2022-03-31
#> # … with 90 more rows
We consider multiple to be an important “quality control” argument to help you enforce constraints on the join procedure.
Multiple matches
Speaking of multiple, we’ve also given this argument an important default. When doing data analysis with equi-joins, it is often surprising when a join returns more rows than were present in the left-hand side table.
df1
#> # A tibble: 3 × 3
#> x_id region x
#> <dbl> <chr> <dbl>
#> 1 1 A 5
#> 2 2 B 10
#> 3 2 A 4
df2 <- tibble(y_id = c(1, 2, 1, 2), region = c("A", "B", "A", "A"), y = c(9, 10, 12, 4))
df2
#> # A tibble: 4 × 3
#> y_id region y
#> <dbl> <chr> <dbl>
#> 1 1 A 9
#> 2 2 B 10
#> 3 1 A 12
#> 4 2 A 4
df1 |>
  left_join(df2, join_by(x_id == y_id, region))
#> Warning in left_join(df1, df2, join_by(x_id == y_id, region)): Each row in `x` is expected to match at most 1 row in `y`.
#> ℹ Row 1 of `x` matches multiple rows.
#> ℹ If multiple matches are expected, set `multiple = "all"` to silence this
#> warning.
#> # A tibble: 4 × 4
#> x_id region x y
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 A 5 9
#> 2 1 A 5 12
#> 3 2 B 10 10
#> 4 2 A 4 4
In this case, row 1 of df1 matched both rows 1 and 3 of df2, so the output has 4 rows rather than df1’s 3. While this is standard SQL behavior, community feedback has shown that many people don’t expect this, and a number of people were horrified to learn that this was even possible! Because of this, we’ve made this case a warning by default, which you can silence with multiple = "all".
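A sketch of opting into the SQL-style behavior explicitly, reusing the df1 and df2 defined just above:

```r
# Allow each row of `x` to match any number of rows in `y`,
# silencing the multiple-match warning
df1 |>
  left_join(df2, join_by(x_id == y_id, region), multiple = "all")
```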
arrange() improvements with character vectors
arrange() now uses a new custom backend for generating the ordering. This generally improves performance, but it is especially apparent with character vectors.
# 10,000 random strings, sampled up to 1,000,000 rows
dictionary <- stringi::stri_rand_strings(10000, length = 10, pattern = "[a-z]")
str <- tibble(x = sample(dictionary, size = 1e6, replace = TRUE))
str
#> # A tibble: 1,000,000 × 1
#> x
#> <chr>
#> 1 btjgpowbav
#> 2 jrddujrxwt
#> 3 ofgkybvsoo
#> 4 dzyxfvwktu
#> 5 qobgfmkgof
#> 6 rmzjvtnpbf
#> 7 jxrqgxouqg
#> 8 empcmhnlqq
#> 9 nwfgauiurp
#> 10 hdswclaxys
#> # … with 999,990 more rows
# dplyr 1.0.10
bench::mark(arrange(str, x), iterations = 100)
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 arrange(str, x) 4.38s 4.89s 0.204 12.7MB 0.148
# dplyr 1.1.0
bench::mark(arrange(str, x), iterations = 100)
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 arrange(str, x) 42.3ms 46.6ms 20.8 22.4MB 46.0
For those keeping score, that is a 100x improvement! Now, I’ll be honest, I’m being a bit tricky here. The new backend for arrange() comes with a meaningful change in behavior: it now sorts character strings in the C locale by default, rather than in the much slower system locale (American English, for me). We made this change for two main reasons:
- Much faster performance by default
- Improved reproducibility across R sessions, where different computers might use different system locales
For English users, we expect this change to have fairly minimal impact. The largest difference in ordering between the C and American English locales has to do with capitalization. In the C locale, uppercase letters are always placed before any lowercase letters. In the American English locale, uppercase letters are placed directly after their lowercase equivalent.
df <- tibble(x = c("a", "B", "A", "b"))
arrange(df, x)
#> # A tibble: 4 × 1
#> x
#> <chr>
#> 1 A
#> 2 B
#> 3 a
#> 4 b
If you do need to order with a specific locale, you can specify the new .locale argument, which takes a locale identifier string, just like stringr::str_sort().
arrange(df, x, .locale = "en")
#> # A tibble: 4 × 1
#> x
#> <chr>
#> 1 a
#> 2 A
#> 3 b
#> 4 B
To use this optional .locale feature, you must have the stringi package installed, but you likely already do because it is installed with the tidyverse by default.
It is also worth noting that using .locale is still much faster than relying on the system locale.
# Compare with ~5 seconds above with dplyr 1.0.10
bench::mark(arrange(str, x, .locale = "en"), iterations = 100)
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:> <dbl> <bch:byt>
#> 1 arrange(str, x, .locale = "en") 377ms 430ms 2.21 27.9MB
#> # … with 1 more variable: `gc/sec` <dbl>
For non-English Latin script languages, such as Spanish, you may see more of a change, as characters such as ñ are ordered after z in the C locale, rather than directly after n:
df <- tibble(x = c("\u00F1", "n", "z"))
df
#> # A tibble: 3 × 1
#> x
#> <chr>
#> 1 ñ
#> 2 n
#> 3 z
arrange(df, x)
#> # A tibble: 3 × 1
#> x
#> <chr>
#> 1 n
#> 2 z
#> 3 ñ
arrange(df, x, .locale = "es")
#> # A tibble: 3 × 1
#> x
#> <chr>
#> 1 n
#> 2 ñ
#> 3 z
We are optimistic that this change is an overall net positive. We anticipate that many users use arrange() simply to group similar looking observations together, and we expect that the main places you’ll need to care about localized ordering are the few places where you are generating human readable output, such as a table or a chart, at which point you might consider using .locale.
If you are having trouble converting an existing script over to the new behavior, you can set the temporary global option options(dplyr.legacy_locale = TRUE), which will revert to the pre-1.1.0 behavior of using the system locale. We expect to remove this option in a future release.
To learn more low-level details about this change, you can read our tidyup.
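As a sketch, a script that depends on system-locale ordering could opt out temporarily:

```r
# Revert to pre-1.1.0 system-locale ordering for this session
options(dplyr.legacy_locale = TRUE)
arrange(tibble(x = c("a", "B", "A", "b")), x)

# Unset once the script has been updated for the C-locale default
options(dplyr.legacy_locale = FALSE)
```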
reframe(), a generalization of summarise()
In dplyr 1.0.0, we introduced a powerful new feature: summarise() could return per-group results of any length, rather than just length 1. For example:
table <- c("a", "b", "d", "f")

df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "c", "f", "d", "a")
)

df |>
  summarise(x = intersect(x, table), .by = g)
#> # A tibble: 5 × 2
#> g x
#> <dbl> <chr>
#> 1 1 a
#> 2 1 b
#> 3 2 f
#> 4 2 d
#> 5 2 a
While extremely powerful, community feedback has raised the valid concern that allowing summarise() to return any number of rows per group:
- Increases the chance for accidental bugs
- Is against the spirit of a “summary,” which implies 1 row per group
- Makes translation to dbplyr very difficult
We agree! In response to this, we’ve decided to walk back that change to summarise(), which will now throw a warning when either 0 or >1 rows are returned per group:
df |>
  summarise(x = intersect(x, table), .by = g)
#> Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
#> dplyr 1.1.0.
#> ℹ Please use `reframe()` instead.
#> ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
#> always returns an ungrouped data frame and adjust accordingly.
#> # A tibble: 5 × 2
#> g x
#> <dbl> <chr>
#> 1 1 a
#> 2 1 b
#> 3 2 f
#> 4 2 d
#> 5 2 a
That said, we still believe that this is a powerful tool, so we’ve moved these features to a new verb, reframe(). Think of reframe() as a generic tool for “doing something to each group,” with no restrictions on the number of rows returned per group.
df |>
  reframe(x = intersect(x, table), .by = g)
#> # A tibble: 5 × 2
#> g x
#> <dbl> <chr>
#> 1 1 a
#> 2 1 b
#> 3 2 f
#> 4 2 d
#> 5 2 a
One big difference between summarise() and reframe() is that reframe() always returns an ungrouped data frame, even if the input was a grouped data frame with multiple group columns. This simplifies reframe() immensely, as it doesn’t need to inherit the .groups argument of summarise(), and it never emits any messages.
We expect that you’ll continue to use summarise() much more often than reframe(), but if you ever find yourself applying a function to each group that returns an arbitrary number of rows, reframe() should be your go-to tool!
reframe() is one of the places we could use your feedback! We aren’t completely confident about this function name yet, so if you have any feedback about it or suggestions for an alternative, please leave a comment on this issue.