Pivoting data frames just got easier thanks to `pivot_wide()` and `pivot_long()`
This article is originally published at https://www.brodrigues.co/
There’s a lot going on in the development version of {tidyr}
. New functions for pivoting data
frames, pivot_wide()
and pivot_long()
are coming, and will replace the current functions,
spread()
and gather()
. spread()
and gather()
will remain in the package though:
You may have heard a rumour that gather/spread are going away. This is simply not true (they’ll stay around forever) but I am working on better replacements which you can learn about at https://t.co/sU2GzWeBaf. Now is a great time for feedback! #rstats
— Hadley Wickham (@hadleywickham) March 19, 2019
If you want to try out these new functions, you need to install the development version of {tidyr}
:
devtools::install_github("tidyverse/tidyr")
and you can read the vignette here. Because these functions are still being developed, some more changes might be introduced, but I guess that the main functionality will not change much.
Let’s play around with these functions and the mtcars
data set. First let’s load the packages and
the data:
library(tidyverse)
data(mtcars)
First, let’s create a wide dataset, by spreading the levels of the “am” column to two new columns:
mtcars_wide1 <- mtcars %>%
pivot_wide(names_from = "am", values_from = "mpg")
mtcars_wide1 %>%
select(`0`, `1`, everything())
## # A tibble: 32 x 11
## `0` `1` cyl disp hp drat wt qsec vs gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA 21 6 160 110 3.9 2.62 16.5 0 4 4
## 2 NA 21 6 160 110 3.9 2.88 17.0 0 4 4
## 3 NA 22.8 4 108 93 3.85 2.32 18.6 1 4 1
## 4 21.4 NA 6 258 110 3.08 3.22 19.4 1 3 1
## 5 18.7 NA 8 360 175 3.15 3.44 17.0 0 3 2
## 6 18.1 NA 6 225 105 2.76 3.46 20.2 1 3 1
## 7 14.3 NA 8 360 245 3.21 3.57 15.8 0 3 4
## 8 24.4 NA 4 147. 62 3.69 3.19 20 1 4 2
## 9 22.8 NA 4 141. 95 3.92 3.15 22.9 1 4 2
## 10 19.2 NA 6 168. 123 3.92 3.44 18.3 1 4 4
## # … with 22 more rows
pivot_wide()
’s arguments are quite explicit: names_from =
is where you specify the column that
will be spread across the data frame, meaning, the levels of this column will become new columns.
values_from =
is where you specify the column that will fill in the values of the new columns.
“0” and “1” are the new columns (“am” had two levels, 0
and 1
), which contain the miles per
gallon for manual and automatic cars respectively. Let’s also take a look at the data frame itself:
mtcars_wide1 %>%
select(`0`, `1`, everything())
## # A tibble: 32 x 11
## `0` `1` cyl disp hp drat wt qsec vs gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA 21 6 160 110 3.9 2.62 16.5 0 4 4
## 2 NA 21 6 160 110 3.9 2.88 17.0 0 4 4
## 3 NA 22.8 4 108 93 3.85 2.32 18.6 1 4 1
## 4 21.4 NA 6 258 110 3.08 3.22 19.4 1 3 1
## 5 18.7 NA 8 360 175 3.15 3.44 17.0 0 3 2
## 6 18.1 NA 6 225 105 2.76 3.46 20.2 1 3 1
## 7 14.3 NA 8 360 245 3.21 3.57 15.8 0 3 4
## 8 24.4 NA 4 147. 62 3.69 3.19 20 1 4 2
## 9 22.8 NA 4 141. 95 3.92 3.15 22.9 1 4 2
## 10 19.2 NA 6 168. 123 3.92 3.44 18.3 1 4 4
## # … with 22 more rows
Now suppose that we want to spread the values of “am” times “cyl”, and filling the data with the values of “mpg”:
mtcars_wide2 <- mtcars %>%
pivot_wide(names_from = c("am", "cyl"), values_from = "mpg")
mtcars_wide2 %>%
select(matches("^0|1"), everything())
## # A tibble: 32 x 14
## `1_6` `1_4` `0_6` `0_8` `0_4` `1_8` disp hp drat wt qsec vs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 NA NA NA NA NA 160 110 3.9 2.62 16.5 0
## 2 21 NA NA NA NA NA 160 110 3.9 2.88 17.0 0
## 3 NA 22.8 NA NA NA NA 108 93 3.85 2.32 18.6 1
## 4 NA NA 21.4 NA NA NA 258 110 3.08 3.22 19.4 1
## 5 NA NA NA 18.7 NA NA 360 175 3.15 3.44 17.0 0
## 6 NA NA 18.1 NA NA NA 225 105 2.76 3.46 20.2 1
## 7 NA NA NA 14.3 NA NA 360 245 3.21 3.57 15.8 0
## 8 NA NA NA NA 24.4 NA 147. 62 3.69 3.19 20 1
## 9 NA NA NA NA 22.8 NA 141. 95 3.92 3.15 22.9 1
## 10 NA NA 19.2 NA NA NA 168. 123 3.92 3.44 18.3 1
## # … with 22 more rows, and 2 more variables: gear <dbl>, carb <dbl>
As you can see, this is easily achieved by simply providing more columns to names_from =
.
Finally, it is also possible to use an optional data set which contains the specifications of the new columns:
mtcars_spec <- mtcars %>%
expand(am, cyl, .value = "mpg") %>%
unite(".name", am, cyl, remove = FALSE)
mtcars_spec
## # A tibble: 6 x 4
## .name am cyl .value
## <chr> <dbl> <dbl> <chr>
## 1 0_4 0 4 mpg
## 2 0_6 0 6 mpg
## 3 0_8 0 8 mpg
## 4 1_4 1 4 mpg
## 5 1_6 1 6 mpg
## 6 1_8 1 8 mpg
This optional data set defines how the columns “0_4”, “0_6” etc are constructed, and also the value that shall be used to fill in the values. “am” and “cyl” will be used to create the “.name” and the “mpg” column will be used for the “.value”:
mtcars %>%
pivot_wide(spec = mtcars_spec) %>%
select(matches("^0|1"), everything())
## # A tibble: 32 x 14
## `0_4` `0_6` `0_8` `1_4` `1_6` `1_8` disp hp drat wt qsec vs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA NA NA NA 21 NA 160 110 3.9 2.62 16.5 0
## 2 NA NA NA NA 21 NA 160 110 3.9 2.88 17.0 0
## 3 NA NA NA 22.8 NA NA 108 93 3.85 2.32 18.6 1
## 4 NA 21.4 NA NA NA NA 258 110 3.08 3.22 19.4 1
## 5 NA NA 18.7 NA NA NA 360 175 3.15 3.44 17.0 0
## 6 NA 18.1 NA NA NA NA 225 105 2.76 3.46 20.2 1
## 7 NA NA 14.3 NA NA NA 360 245 3.21 3.57 15.8 0
## 8 24.4 NA NA NA NA NA 147. 62 3.69 3.19 20 1
## 9 22.8 NA NA NA NA NA 141. 95 3.92 3.15 22.9 1
## 10 NA 19.2 NA NA NA NA 168. 123 3.92 3.44 18.3 1
## # … with 22 more rows, and 2 more variables: gear <dbl>, carb <dbl>
Using a spec is especially useful if you need to make new levels that are not in the data. For instance, suppose that there are actually 10-cylinder cars too, but they do not appear in our sample. We would like to make the fact that they’re missing explicit:
mtcars_spec2 <- mtcars %>%
expand(am, "cyl" = c(cyl, 10), .value = "mpg") %>%
unite(".name", am, cyl, remove = FALSE)
mtcars_spec2
## # A tibble: 8 x 4
## .name am cyl .value
## <chr> <dbl> <dbl> <chr>
## 1 0_4 0 4 mpg
## 2 0_6 0 6 mpg
## 3 0_8 0 8 mpg
## 4 0_10 0 10 mpg
## 5 1_4 1 4 mpg
## 6 1_6 1 6 mpg
## 7 1_8 1 8 mpg
## 8 1_10 1 10 mpg
mtcars %>%
pivot_wide(spec = mtcars_spec2) %>%
select(matches("^0|1"), everything())
## # A tibble: 32 x 16
## `0_4` `0_6` `0_8` `0_10` `1_4` `1_6` `1_8` `1_10` disp hp drat
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA NA NA NA NA 21 NA NA 160 110 3.9
## 2 NA NA NA NA NA 21 NA NA 160 110 3.9
## 3 NA NA NA NA 22.8 NA NA NA 108 93 3.85
## 4 NA 21.4 NA NA NA NA NA NA 258 110 3.08
## 5 NA NA 18.7 NA NA NA NA NA 360 175 3.15
## 6 NA 18.1 NA NA NA NA NA NA 225 105 2.76
## 7 NA NA 14.3 NA NA NA NA NA 360 245 3.21
## 8 24.4 NA NA NA NA NA NA NA 147. 62 3.69
## 9 22.8 NA NA NA NA NA NA NA 141. 95 3.92
## 10 NA 19.2 NA NA NA NA NA NA 168. 123 3.92
## # … with 22 more rows, and 5 more variables: wt <dbl>, qsec <dbl>,
## # vs <dbl>, gear <dbl>, carb <dbl>
As you can see, we now have two more columns have been added, and they are full of NA’s.
Now, let’s try to go from wide to long data sets, using pivot_long()
:
mtcars_wide1 %>%
pivot_long(cols = c(`1`, `0`), names_to = "am", values_to = "mpg") %>%
select(am, mpg, everything())
## # A tibble: 64 x 11
## am mpg cyl disp hp drat wt qsec vs gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 21 6 160 110 3.9 2.62 16.5 0 4 4
## 2 0 NA 6 160 110 3.9 2.62 16.5 0 4 4
## 3 1 21 6 160 110 3.9 2.88 17.0 0 4 4
## 4 0 NA 6 160 110 3.9 2.88 17.0 0 4 4
## 5 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1
## 6 0 NA 4 108 93 3.85 2.32 18.6 1 4 1
## 7 1 NA 6 258 110 3.08 3.22 19.4 1 3 1
## 8 0 21.4 6 258 110 3.08 3.22 19.4 1 3 1
## 9 1 NA 8 360 175 3.15 3.44 17.0 0 3 2
## 10 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2
## # … with 54 more rows
The arguments of pivot_long()
are quite explicit too, and similar to the ones in pivot_wide()
.
cols =
is where the user specifies the columns that need to be pivoted. names_to =
is where
the user can specify the name of the new columns, whose levels will be exactly the ones specified
to cols =
. values_to =
is where the user specifies the column name of the new column that
will contain the values.
It is also possible to specify the columns that should not be transformed, by using -
:
mtcars_wide1 %>%
pivot_long(cols = -matches("^[[:alpha:]]"), names_to = "am", values_to = "mpg") %>%
select(am, mpg, everything())
## # A tibble: 64 x 11
## am mpg cyl disp hp drat wt qsec vs gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 21 6 160 110 3.9 2.62 16.5 0 4 4
## 2 0 NA 6 160 110 3.9 2.62 16.5 0 4 4
## 3 1 21 6 160 110 3.9 2.88 17.0 0 4 4
## 4 0 NA 6 160 110 3.9 2.88 17.0 0 4 4
## 5 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1
## 6 0 NA 4 108 93 3.85 2.32 18.6 1 4 1
## 7 1 NA 6 258 110 3.08 3.22 19.4 1 3 1
## 8 0 21.4 6 258 110 3.08 3.22 19.4 1 3 1
## 9 1 NA 8 360 175 3.15 3.44 17.0 0 3 2
## 10 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2
## # … with 54 more rows
Here the columns that should not be modified are all those that start with a letter, hence the “1”
regular expression. It is also possible to remove all the NA
’s from the data frame, with na.rm =
.
mtcars_wide1 %>%
pivot_long(cols = c(`1`, `0`), names_to = "am", values_to = "mpg", na.rm = TRUE) %>%
select(am, mpg, everything())
## # A tibble: 32 x 11
## am mpg cyl disp hp drat wt qsec vs gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 21 6 160 110 3.9 2.62 16.5 0 4 4
## 2 1 21 6 160 110 3.9 2.88 17.0 0 4 4
## 3 1 22.8 4 108 93 3.85 2.32 18.6 1 4 1
## 4 0 21.4 6 258 110 3.08 3.22 19.4 1 3 1
## 5 0 18.7 8 360 175 3.15 3.44 17.0 0 3 2
## 6 0 18.1 6 225 105 2.76 3.46 20.2 1 3 1
## 7 0 14.3 8 360 245 3.21 3.57 15.8 0 3 4
## 8 0 24.4 4 147. 62 3.69 3.19 20 1 4 2
## 9 0 22.8 4 141. 95 3.92 3.15 22.9 1 4 2
## 10 0 19.2 6 168. 123 3.92 3.44 18.3 1 4 4
## # … with 22 more rows
We can also pivot data frames where the names of the columns are made of two or more variables,
for example in our mtcars_wide2
data frame:
mtcars_wide2 %>%
select(matches("^0|1"), everything())
## # A tibble: 32 x 14
## `1_6` `1_4` `0_6` `0_8` `0_4` `1_8` disp hp drat wt qsec vs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 NA NA NA NA NA 160 110 3.9 2.62 16.5 0
## 2 21 NA NA NA NA NA 160 110 3.9 2.88 17.0 0
## 3 NA 22.8 NA NA NA NA 108 93 3.85 2.32 18.6 1
## 4 NA NA 21.4 NA NA NA 258 110 3.08 3.22 19.4 1
## 5 NA NA NA 18.7 NA NA 360 175 3.15 3.44 17.0 0
## 6 NA NA 18.1 NA NA NA 225 105 2.76 3.46 20.2 1
## 7 NA NA NA 14.3 NA NA 360 245 3.21 3.57 15.8 0
## 8 NA NA NA NA 24.4 NA 147. 62 3.69 3.19 20 1
## 9 NA NA NA NA 22.8 NA 141. 95 3.92 3.15 22.9 1
## 10 NA NA 19.2 NA NA NA 168. 123 3.92 3.44 18.3 1
## # … with 22 more rows, and 2 more variables: gear <dbl>, carb <dbl>
All the columns that start with either “0” or “1” must be pivoted:
mtcars_wide2 %>%
pivot_long(cols = matches("0|1"), names_to = "am_cyl", values_to = "mpg", na.rm = TRUE) %>%
select(am_cyl, everything())
## # A tibble: 32 x 10
## am_cyl disp hp drat wt qsec vs gear carb mpg
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1_6 160 110 3.9 2.62 16.5 0 4 4 21
## 2 1_6 160 110 3.9 2.88 17.0 0 4 4 21
## 3 1_4 108 93 3.85 2.32 18.6 1 4 1 22.8
## 4 0_6 258 110 3.08 3.22 19.4 1 3 1 21.4
## 5 0_8 360 175 3.15 3.44 17.0 0 3 2 18.7
## 6 0_6 225 105 2.76 3.46 20.2 1 3 1 18.1
## 7 0_8 360 245 3.21 3.57 15.8 0 3 4 14.3
## 8 0_4 147. 62 3.69 3.19 20 1 4 2 24.4
## 9 0_4 141. 95 3.92 3.15 22.9 1 4 2 22.8
## 10 0_6 168. 123 3.92 3.44 18.3 1 4 4 19.2
## # … with 22 more rows
Now, there is one new column, “am_cyl” which must still be transformed by separating “am_cyl” into two new columns:
mtcars_wide2 %>%
pivot_long(cols = matches("0|1"), names_to = "am_cyl", values_to = "mpg", na.rm = TRUE) %>%
separate(am_cyl, into = c("am", "cyl"), sep = "_") %>%
select(am, cyl, everything())
## # A tibble: 32 x 11
## am cyl disp hp drat wt qsec vs gear carb mpg
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 6 160 110 3.9 2.62 16.5 0 4 4 21
## 2 1 6 160 110 3.9 2.88 17.0 0 4 4 21
## 3 1 4 108 93 3.85 2.32 18.6 1 4 1 22.8
## 4 0 6 258 110 3.08 3.22 19.4 1 3 1 21.4
## 5 0 8 360 175 3.15 3.44 17.0 0 3 2 18.7
## 6 0 6 225 105 2.76 3.46 20.2 1 3 1 18.1
## 7 0 8 360 245 3.21 3.57 15.8 0 3 4 14.3
## 8 0 4 147. 62 3.69 3.19 20 1 4 2 24.4
## 9 0 4 141. 95 3.92 3.15 22.9 1 4 2 22.8
## 10 0 6 168. 123 3.92 3.44 18.3 1 4 4 19.2
## # … with 22 more rows
It is also possible to achieve this using a data frame with the specification of what you need:
mtcars_spec_long <- mtcars_wide2 %>%
pivot_long_spec(matches("0|1"), values_to = "mpg") %>%
separate(name, c("am", "cyl"), sep = "_")
mtcars_spec_long
## # A tibble: 6 x 4
## .name .value am cyl
## <chr> <chr> <chr> <chr>
## 1 1_6 mpg 1 6
## 2 1_4 mpg 1 4
## 3 0_6 mpg 0 6
## 4 0_8 mpg 0 8
## 5 0_4 mpg 0 4
## 6 1_8 mpg 1 8
Providing this spec to pivot_long()
solves the issue:
mtcars_wide2 %>%
pivot_long(spec = mtcars_spec_long, na.rm = TRUE) %>%
select(am, cyl, everything())
## # A tibble: 32 x 11
## am cyl disp hp drat wt qsec vs gear carb mpg
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 6 160 110 3.9 2.62 16.5 0 4 4 21
## 2 1 6 160 110 3.9 2.88 17.0 0 4 4 21
## 3 1 4 108 93 3.85 2.32 18.6 1 4 1 22.8
## 4 0 6 258 110 3.08 3.22 19.4 1 3 1 21.4
## 5 0 8 360 175 3.15 3.44 17.0 0 3 2 18.7
## 6 0 6 225 105 2.76 3.46 20.2 1 3 1 18.1
## 7 0 8 360 245 3.21 3.57 15.8 0 3 4 14.3
## 8 0 4 147. 62 3.69 3.19 20 1 4 2 24.4
## 9 0 4 141. 95 3.92 3.15 22.9 1 4 2 22.8
## 10 0 6 168. 123 3.92 3.44 18.3 1 4 4 19.2
## # … with 22 more rows
Stay tuned to Hadley Wickham’s twitter as there will definitely be announcements soon!
Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me.
[:alpha:]↩
Thanks for visiting r-craft.org
This article is originally published at https://www.brodrigues.co/
Please visit source website for post related comments.