dplyr 1.1.0: Joins
This article is originally published at https://www.tidyverse.org/blog/
dplyr 1.1.0 is out now! This is a giant release, so we’re splitting the release announcement up into four blog posts which we’ll post over the course of this week. Today, we’re focusing on joins, including the new
join_by()
syntax, new warnings for multiple matches, inequality joins, rolling joins, and new tools for handling unmatched rows. To learn more about joins, you might want to read the updated
joins chapter in the upcoming 2nd edition of
R for Data Science.
You can install it from CRAN with:
install.packages("dplyr")
join_by()
Consider the following two tables, transactions
and companies
. transactions
tracks sales across various years for different companies, and companies
connects the short company id to its actual company name - either Patagonia (a fellow B-Corp!) or RStudio.
transactions <- tibble(
company = c("A", "A", "B", "B"),
year = c(2019, 2020, 2021, 2023),
revenue = c(50, 4, 10, 12)
)
transactions
#> # A tibble: 4 × 3
#> company year revenue
#> <chr> <dbl> <dbl>
#> 1 A 2019 50
#> 2 A 2020 4
#> 3 B 2021 10
#> 4 B 2023 12
companies <- tibble(
id = c("A", "B"),
name = c("Patagonia", "RStudio")
)
companies
#> # A tibble: 2 × 2
#> id name
#> <chr> <chr>
#> 1 A Patagonia
#> 2 B RStudio
To join these two tables together, we might use an inner join:
transactions |>
inner_join(companies, by = c(company = "id"))
#> # A tibble: 4 × 4
#> company year revenue name
#> <chr> <dbl> <dbl> <chr>
#> 1 A 2019 50 Patagonia
#> 2 A 2020 4 Patagonia
#> 3 B 2021 10 RStudio
#> 4 B 2023 12 RStudio
This works great, but has always felt a little clunky. Specifying c(company = "id")
is a little awkward because it uses =
, not ==
: here we’re asserting that we want company
to equal id
, not naming a function argument or performing assignment. We’ve improved on this with a new helper,
join_by()
, which takes expressions in a way that allows you to more naturally express this join:
join_by(company == id)
#> Join By:
#> - company == id
This join specification can be used as the by
argument in any of the *_join()
functions:
transactions |>
inner_join(companies, by = join_by(company == id))
#> # A tibble: 4 × 4
#> company year revenue name
#> <chr> <dbl> <dbl> <chr>
#> 1 A 2019 50 Patagonia
#> 2 A 2020 4 Patagonia
#> 3 B 2021 10 RStudio
#> 4 B 2023 12 RStudio
This small quality of life improvement is just one of the many new features that come with
join_by()
. We’ll look at more of these next.
Multiple matches
To make things a little more interesting, we’ll add one more column to companies
, and one more row:
companies <- tibble(
id = c("A", "B", "B"),
since = c(1973, 2009, 2022),
name = c("Patagonia", "RStudio", "Posit")
)
companies
#> # A tibble: 3 × 3
#> id since name
#> <chr> <dbl> <chr>
#> 1 A 1973 Patagonia
#> 2 B 2009 RStudio
#> 3 B 2022 Posit
This table now also tracks name changes that have happened over the course of a company’s history. In 2022, we changed our name from RStudio to Posit, so we’ve tracked that as an additional row in our dataset. Note that both RStudio and Posit are given an id
of B
, which links back to the transactions
table.
If we were to join these two tables together, ideally we’d bring over the name that was in effect when the transaction took place. For example, for the transaction in 2021, the company was still RStudio, so ideally we’d only match up against the RStudio row in companies
. If we colored the expected matches, they’d look something like this:
How can we do this? We can try the same join from before, but we won’t like the results:
faulty <- transactions |>
inner_join(companies, by = join_by(company == id))
#> Warning in inner_join(transactions, companies, by = join_by(company == id)): Each row in `x` is expected to match at most 1 row in `y`.
#> ℹ Row 3 of `x` matches multiple rows.
#> ℹ If multiple matches are expected, set `multiple = "all"` to silence this
#> warning.
faulty
#> # A tibble: 6 × 5
#> company year revenue since name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 2019 50 1973 Patagonia
#> 2 A 2020 4 1973 Patagonia
#> 3 B 2021 10 2009 RStudio
#> 4 B 2021 10 2022 Posit
#> 5 B 2023 12 2009 RStudio
#> 6 B 2023 12 2022 Posit
Company A
matches correctly, but since we only joined on the company id, we get multiple matches for each of company B
's transactions and end up with more rows than we started with. This is a problem, as we were expecting a 1:1 match for each row in transactions
. Multiple matches in equality joins like this one are typically unexpected (even though they are baked in to SQL) so we’ve also added a new warning to alert you when this happens. If multiple matches are expected, you can explicitly set multiple = "all"
to silence this warning. This also serves as a code “sign post” for future readers of your code to let them know that this is a join that is expected to increase the number of rows in the data. If multiple matches aren’t expected, you can also set multiple = "error"
to immediately halt the analysis. We expect this will be useful as a quality control check for production code where you might rerun analyses with new data on a rolling basis.
Inequality joins
To actually fix this issue, we’ll need to expand our join specification to include another condition. Let’s zoom in to just 2021:
filter(faulty, company == "B", year == 2021)
#> # A tibble: 2 × 5
#> company year revenue since name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 B 2021 10 2009 RStudio
#> 2 B 2021 10 2022 Posit
We want to retain the match with RStudio, but not with Posit (because the name hasn’t changed yet). One way to express this is by using the year
and since
columns to state that you only want a match if the transaction year
occurred after a name change:
# `year[i] >= since`?
2021 >= 2009
#> [1] TRUE
2021 >= 2022
#> [1] FALSE
Because
join_by()
accepts expressions, we can express this inequality directly inside the join specification:
join_by(company == id, year >= since)
#> Join By:
#> - company == id
#> - year >= since
transactions |>
inner_join(companies, join_by(company == id, year >= since))
#> # A tibble: 5 × 5
#> company year revenue since name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 2019 50 1973 Patagonia
#> 2 A 2020 4 1973 Patagonia
#> 3 B 2021 10 2009 RStudio
#> 4 B 2023 12 2009 RStudio
#> 5 B 2023 12 2022 Posit
This eliminated the 2021 match to Posit, as expected! This type of join is known as an inequality join, i.e. it involves at least one join expression containing one of the following inequality conditions: >=
, >
, <=
, or <
.
However, we still have 2 matches corresponding to the 2023 year. In this case, we only wanted the match to Posit. We can understand why we are still getting multiple matches here by running the same row-by-row analysis as before:
# `year[i] >= since`? Both are true!
2023 >= 2009
#> [1] TRUE
2023 >= 2022
#> [1] TRUE
To remove the last problematic match of the 2023 transaction to the RStudio name, we’ll need to refine our join specification one more time.
Rolling joins
Inequality conditions like year >= since
are powerful, but since the condition is only bounded on one side it is common for them to return a large number of matches. Since multiple matches are the typical case with inequality joins, we don’t get a warning like with the equality join, but we clearly still haven’t gotten the join right. As a reminder, here are where we still have too many matches:
transactions |>
inner_join(companies, join_by(company == id, year >= since)) |>
filter(company == "B", year == 2023)
#> # A tibble: 2 × 5
#> company year revenue since name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 B 2023 12 2009 RStudio
#> 2 B 2023 12 2022 Posit
We need a way to filter down the matches returned from year >= since
to only the most recent name change. In other words, we prefer the Posit match over the RStudio match because 2022 is closer to the transaction year of 2023 than 2009 is. We can express this in
join_by()
by using a helper named closest()
.
transactions |>
inner_join(companies, join_by(company == id, closest(year >= since)))
#> # A tibble: 4 × 5
#> company year revenue since name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 2019 50 1973 Patagonia
#> 2 A 2020 4 1973 Patagonia
#> 3 B 2021 10 2009 RStudio
#> 4 B 2023 12 2022 Posit
closest(year >= since)
finds all of the matches in since
for a particular year
, and then filters them down to only the closest match to that year
. This is known as a rolling join, because in this case it rolls the most recent name change forward to match up with the transaction. Rolling joins were popularized by data.table, and are related to ASOF
joins supported by some SQL flavors.
There is a third new class of joins supported by
join_by()
that we won’t discuss today known as overlap joins. These are particularly useful in time series where you are looking for cases where a date or range of dates from one table overlaps a range of dates in another table. There are three helpers for overlap joins:
between()
,
overlaps()
, and
within()
, which you can read more about
in the documentation.
Unmatched rows
I mentioned earlier that we expected a 1:1 match between transactions
and companies
. We saw that multiple
can help protect us from having too many matches, but what about not having enough? Consider what happens if we add a new company to transactions
without a corresponding match in companies
.
transactions <- transactions |>
tibble::add_row(company = "C", year = 2023, revenue = 15)
transactions
#> # A tibble: 5 × 3
#> company year revenue
#> <chr> <dbl> <dbl>
#> 1 A 2019 50
#> 2 A 2020 4
#> 3 B 2021 10
#> 4 B 2023 12
#> 5 C 2023 15
transactions |>
inner_join(
companies,
join_by(company == id, closest(year >= since))
)
#> # A tibble: 4 × 5
#> company year revenue since name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 2019 50 1973 Patagonia
#> 2 A 2020 4 1973 Patagonia
#> 3 B 2021 10 2009 RStudio
#> 4 B 2023 12 2022 Posit
We’ve accidentally lost the C
row! If you don’t expect any unmatched rows, you can now catch this problem automatically by using our other new quality control argument, unmatched
:
transactions |>
inner_join(
companies,
join_by(company == id, closest(year >= since)),
unmatched = "error"
)
#> Error in `inner_join()`:
#> ! Each row of `x` must have a match in `y`.
#> ℹ Row 5 of `x` does not have a match.
If you’ve been questioning why I’ve been using an
inner_join()
over a
left_join()
this whole time, unmatched
is why. We could use a
left_join()
:
transactions |>
left_join(
companies,
join_by(company == id, closest(year >= since)),
unmatched = "error"
)
#> # A tibble: 5 × 5
#> company year revenue since name
#> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 A 2019 50 1973 Patagonia
#> 2 A 2020 4 1973 Patagonia
#> 3 B 2021 10 2009 RStudio
#> 4 B 2023 12 2022 Posit
#> 5 C 2023 15 NA NA
But you’ll notice that we don’t get an error here. unmatched
will only error if the input that has the potential to drop rows has an unmatched row. The reason you’d use a
left_join()
is to ensure that rows from x
are always retained, so it wouldn’t make sense to error when rows from x
are also unmatched. If y
had unmatched rows instead, then it would have errored because those rows would otherwise be lost from the join. In an
inner_join()
, both inputs can potentially drop rows, so unmatched = "error"
checks for unmatched rows in both inputs.
Thanks for visiting r-craft.org
This article is originally published at https://www.tidyverse.org/blog/
Please visit source website for post related comments.