purrr Like a Kitten till the Lake Pipes RoaR
This article is originally published at https://www.finex.co
I really should make a minimal effort to resist opening a data analysis blog post with Beach Boys’ lyrics, but this time the combination is too apt. We use the purrr package to show how to let your pipes roar in R.
The tidyverse GitHub site contains a simple example illustrating how well pipes and purrr work together. For more learning, try Jenny Bryan’s purrr tutorial.
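First, load the tidyverse. purrr is one of the core tidyverse packages, so library(tidyverse) already attaches it; the explicit library(purrr) call below is redundant but harmless.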
library(tidyverse)
library(purrr)
#devtools::install_github("jennybc/repurrrsive")
#library(repurrrsive)
If you want to be more adventurous, you can install Jenny’s repurrrsive package from GitHub. The code for that is commented out in the chunk above, and it is not needed for the example in this post.
You Don’t Know What I Got
The great thing about using pipes is you can tie together a series of steps to produce a single output of exactly what you want. In this case, we are going to start with the base R dataset airquality and apply a number of functions to come up with the adjusted R-squared value for the Ozone data for each month between May and September. Along the way we will run a linear regression on the data to generate the adjusted R-squared values.
Here is what the full process looks like, from beginning to end.
airquality %>%
split(.$Month) %>%
map(~ lm(Ozone ~ Temp, data = .)) %>%
map(summary) %>%
map_dbl('adj.r.squared')
5 6 7 8 9
0.2781 0.3676 0.5024 0.3307 0.6742
The problem, of course, is the output generated by the intermediate steps stays behind the scenes. For a beginner, this can be a bit confusing because it isn’t clear what is going on. So let’s break the full chunk into its constituent pieces so we can see what purrr is doing and how pipes tie the whole thing together.
Start at the beginning and take things step-by-step.
Print the airquality data set to see what we are dealing with. The data were collected daily for five months, May through September, which gives 153 rows. Ozone looks to be the target, so it is natural to wonder if there is any relationship between it and the other variables.
airquality
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
13 11 290 9.2 66 5 13
14 14 274 10.9 68 5 14
15 18 65 13.2 58 5 15
16 14 334 11.5 64 5 16
17 34 307 12.0 66 5 17
18 6 78 18.4 57 5 18
19 30 322 11.5 68 5 19
20 11 44 9.7 62 5 20
21 1 8 9.7 59 5 21
22 11 320 16.6 73 5 22
23 4 25 9.7 61 5 23
24 32 92 12.0 61 5 24
25 NA 66 16.6 57 5 25
26 NA 266 14.9 58 5 26
27 NA NA 8.0 57 5 27
28 23 13 12.0 67 5 28
29 45 252 14.9 81 5 29
30 115 223 5.7 79 5 30
31 37 279 7.4 76 5 31
32 NA 286 8.6 78 6 1
33 NA 287 9.7 74 6 2
34 NA 242 16.1 67 6 3
35 NA 186 9.2 84 6 4
36 NA 220 8.6 85 6 5
37 NA 264 14.3 79 6 6
38 29 127 9.7 82 6 7
39 NA 273 6.9 87 6 8
40 71 291 13.8 90 6 9
41 39 323 11.5 87 6 10
42 NA 259 10.9 93 6 11
43 NA 250 9.2 92 6 12
44 23 148 8.0 82 6 13
45 NA 332 13.8 80 6 14
46 NA 322 11.5 79 6 15
47 21 191 14.9 77 6 16
48 37 284 20.7 72 6 17
49 20 37 9.2 65 6 18
50 12 120 11.5 73 6 19
51 13 137 10.3 76 6 20
52 NA 150 6.3 77 6 21
53 NA 59 1.7 76 6 22
54 NA 91 4.6 76 6 23
55 NA 250 6.3 76 6 24
56 NA 135 8.0 75 6 25
57 NA 127 8.0 78 6 26
58 NA 47 10.3 73 6 27
59 NA 98 11.5 80 6 28
60 NA 31 14.9 77 6 29
61 NA 138 8.0 83 6 30
62 135 269 4.1 84 7 1
63 49 248 9.2 85 7 2
64 32 236 9.2 81 7 3
65 NA 101 10.9 84 7 4
66 64 175 4.6 83 7 5
67 40 314 10.9 83 7 6
68 77 276 5.1 88 7 7
69 97 267 6.3 92 7 8
70 97 272 5.7 92 7 9
71 85 175 7.4 89 7 10
72 NA 139 8.6 82 7 11
73 10 264 14.3 73 7 12
74 27 175 14.9 81 7 13
75 NA 291 14.9 91 7 14
76 7 48 14.3 80 7 15
77 48 260 6.9 81 7 16
78 35 274 10.3 82 7 17
79 61 285 6.3 84 7 18
80 79 187 5.1 87 7 19
81 63 220 11.5 85 7 20
82 16 7 6.9 74 7 21
83 NA 258 9.7 81 7 22
84 NA 295 11.5 82 7 23
85 80 294 8.6 86 7 24
86 108 223 8.0 85 7 25
87 20 81 8.6 82 7 26
88 52 82 12.0 86 7 27
89 82 213 7.4 88 7 28
90 50 275 7.4 86 7 29
91 64 253 7.4 83 7 30
92 59 254 9.2 81 7 31
93 39 83 6.9 81 8 1
94 9 24 13.8 81 8 2
95 16 77 7.4 82 8 3
96 78 NA 6.9 86 8 4
97 35 NA 7.4 85 8 5
98 66 NA 4.6 87 8 6
99 122 255 4.0 89 8 7
100 89 229 10.3 90 8 8
101 110 207 8.0 90 8 9
102 NA 222 8.6 92 8 10
103 NA 137 11.5 86 8 11
104 44 192 11.5 86 8 12
105 28 273 11.5 82 8 13
106 65 157 9.7 80 8 14
107 NA 64 11.5 79 8 15
108 22 71 10.3 77 8 16
109 59 51 6.3 79 8 17
110 23 115 7.4 76 8 18
111 31 244 10.9 78 8 19
112 44 190 10.3 78 8 20
113 21 259 15.5 77 8 21
114 9 36 14.3 72 8 22
115 NA 255 12.6 75 8 23
116 45 212 9.7 79 8 24
117 168 238 3.4 81 8 25
118 73 215 8.0 86 8 26
119 NA 153 5.7 88 8 27
120 76 203 9.7 97 8 28
121 118 225 2.3 94 8 29
122 84 237 6.3 96 8 30
123 85 188 6.3 94 8 31
124 96 167 6.9 91 9 1
125 78 197 5.1 92 9 2
126 73 183 2.8 93 9 3
127 91 189 4.6 93 9 4
128 47 95 7.4 87 9 5
129 32 92 15.5 84 9 6
130 20 252 10.9 80 9 7
131 23 220 10.3 78 9 8
132 21 230 10.9 75 9 9
133 24 259 9.7 73 9 10
134 44 236 14.9 81 9 11
135 21 259 15.5 76 9 12
136 28 238 6.3 77 9 13
137 9 24 10.9 71 9 14
138 13 112 11.5 71 9 15
139 46 237 6.9 78 9 16
140 18 224 13.8 67 9 17
141 13 27 10.3 76 9 18
142 24 238 10.3 68 9 19
143 16 201 8.0 82 9 20
144 13 238 12.6 64 9 21
145 23 14 9.2 71 9 22
146 36 139 10.3 81 9 23
147 7 49 10.3 69 9 24
148 14 20 16.6 63 9 25
149 30 193 6.9 70 9 26
150 NA 145 13.2 77 9 27
151 14 191 14.3 75 9 28
152 18 131 8.0 76 9 29
153 20 223 11.5 68 9 30
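The first step is to break the data up by month, so we make use of base R’s split() function. Inside the pipe, the dot stands for the data frame arriving from the left-hand side, so .$Month is its Month column. Notice all the data is now grouped by month, with one data frame per month.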
airquality %>%
split(.$Month)
$`5`
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
13 11 290 9.2 66 5 13
14 14 274 10.9 68 5 14
15 18 65 13.2 58 5 15
16 14 334 11.5 64 5 16
17 34 307 12.0 66 5 17
18 6 78 18.4 57 5 18
19 30 322 11.5 68 5 19
20 11 44 9.7 62 5 20
21 1 8 9.7 59 5 21
22 11 320 16.6 73 5 22
23 4 25 9.7 61 5 23
24 32 92 12.0 61 5 24
25 NA 66 16.6 57 5 25
26 NA 266 14.9 58 5 26
27 NA NA 8.0 57 5 27
28 23 13 12.0 67 5 28
29 45 252 14.9 81 5 29
30 115 223 5.7 79 5 30
31 37 279 7.4 76 5 31
$`6`
Ozone Solar.R Wind Temp Month Day
32 NA 286 8.6 78 6 1
33 NA 287 9.7 74 6 2
34 NA 242 16.1 67 6 3
35 NA 186 9.2 84 6 4
36 NA 220 8.6 85 6 5
37 NA 264 14.3 79 6 6
38 29 127 9.7 82 6 7
39 NA 273 6.9 87 6 8
40 71 291 13.8 90 6 9
41 39 323 11.5 87 6 10
42 NA 259 10.9 93 6 11
43 NA 250 9.2 92 6 12
44 23 148 8.0 82 6 13
45 NA 332 13.8 80 6 14
46 NA 322 11.5 79 6 15
47 21 191 14.9 77 6 16
48 37 284 20.7 72 6 17
49 20 37 9.2 65 6 18
50 12 120 11.5 73 6 19
51 13 137 10.3 76 6 20
52 NA 150 6.3 77 6 21
53 NA 59 1.7 76 6 22
54 NA 91 4.6 76 6 23
55 NA 250 6.3 76 6 24
56 NA 135 8.0 75 6 25
57 NA 127 8.0 78 6 26
58 NA 47 10.3 73 6 27
59 NA 98 11.5 80 6 28
60 NA 31 14.9 77 6 29
61 NA 138 8.0 83 6 30
$`7`
Ozone Solar.R Wind Temp Month Day
62 135 269 4.1 84 7 1
63 49 248 9.2 85 7 2
64 32 236 9.2 81 7 3
65 NA 101 10.9 84 7 4
66 64 175 4.6 83 7 5
67 40 314 10.9 83 7 6
68 77 276 5.1 88 7 7
69 97 267 6.3 92 7 8
70 97 272 5.7 92 7 9
71 85 175 7.4 89 7 10
72 NA 139 8.6 82 7 11
73 10 264 14.3 73 7 12
74 27 175 14.9 81 7 13
75 NA 291 14.9 91 7 14
76 7 48 14.3 80 7 15
77 48 260 6.9 81 7 16
78 35 274 10.3 82 7 17
79 61 285 6.3 84 7 18
80 79 187 5.1 87 7 19
81 63 220 11.5 85 7 20
82 16 7 6.9 74 7 21
83 NA 258 9.7 81 7 22
84 NA 295 11.5 82 7 23
85 80 294 8.6 86 7 24
86 108 223 8.0 85 7 25
87 20 81 8.6 82 7 26
88 52 82 12.0 86 7 27
89 82 213 7.4 88 7 28
90 50 275 7.4 86 7 29
91 64 253 7.4 83 7 30
92 59 254 9.2 81 7 31
$`8`
Ozone Solar.R Wind Temp Month Day
93 39 83 6.9 81 8 1
94 9 24 13.8 81 8 2
95 16 77 7.4 82 8 3
96 78 NA 6.9 86 8 4
97 35 NA 7.4 85 8 5
98 66 NA 4.6 87 8 6
99 122 255 4.0 89 8 7
100 89 229 10.3 90 8 8
101 110 207 8.0 90 8 9
102 NA 222 8.6 92 8 10
103 NA 137 11.5 86 8 11
104 44 192 11.5 86 8 12
105 28 273 11.5 82 8 13
106 65 157 9.7 80 8 14
107 NA 64 11.5 79 8 15
108 22 71 10.3 77 8 16
109 59 51 6.3 79 8 17
110 23 115 7.4 76 8 18
111 31 244 10.9 78 8 19
112 44 190 10.3 78 8 20
113 21 259 15.5 77 8 21
114 9 36 14.3 72 8 22
115 NA 255 12.6 75 8 23
116 45 212 9.7 79 8 24
117 168 238 3.4 81 8 25
118 73 215 8.0 86 8 26
119 NA 153 5.7 88 8 27
120 76 203 9.7 97 8 28
121 118 225 2.3 94 8 29
122 84 237 6.3 96 8 30
123 85 188 6.3 94 8 31
$`9`
Ozone Solar.R Wind Temp Month Day
124 96 167 6.9 91 9 1
125 78 197 5.1 92 9 2
126 73 183 2.8 93 9 3
127 91 189 4.6 93 9 4
128 47 95 7.4 87 9 5
129 32 92 15.5 84 9 6
130 20 252 10.9 80 9 7
131 23 220 10.3 78 9 8
132 21 230 10.9 75 9 9
133 24 259 9.7 73 9 10
134 44 236 14.9 81 9 11
135 21 259 15.5 76 9 12
136 28 238 6.3 77 9 13
137 9 24 10.9 71 9 14
138 13 112 11.5 71 9 15
139 46 237 6.9 78 9 16
140 18 224 13.8 67 9 17
141 13 27 10.3 76 9 18
142 24 238 10.3 68 9 19
143 16 201 8.0 82 9 20
144 13 238 12.6 64 9 21
145 23 14 9.2 71 9 22
146 36 139 10.3 81 9 23
147 7 49 10.3 69 9 24
148 14 20 16.6 63 9 25
149 30 193 6.9 70 9 26
150 NA 145 13.2 77 9 27
151 14 191 14.3 75 9 28
152 18 131 8.0 76 9 29
153 20 223 11.5 68 9 30
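Printing every row makes it hard to see the shape of what split() returned. As a minimal sketch, the result can be inspected more compactly; the name by_month is just an illustrative choice, and the pipe is left out so the pieces are easier to see.

```r
library(purrr)

# Split the data frame into a named list of data frames, one per month
by_month <- split(airquality, airquality$Month)

# The list names are the month numbers, stored as character strings
names(by_month)

# Row count per month, returned as a named integer vector
map_int(by_month, nrow)

# Non-missing Ozone readings per month; with R's default NA handling,
# lm() only uses these rows when Ozone is the response
map_int(by_month, ~ sum(!is.na(.x$Ozone)))
```

The last line is worth a look before fitting any models, because it shows how few usable Ozone readings some months have.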
In the second step, we use purrr’s map() function to fit a linear regression model with lm() to each monthly group. We want the adjusted R-squared value for each month for Ozone. I played around with the variables a bit to find which one illustrated the adjusted R-squared values best and settled on Temp, but you can choose any other besides Month and Day.
The map() command applies the lm() function to each monthly group and yields the typical lm() output for each month. We now have five linear regression models, one for each month, but no adjusted R-squared values.
airquality %>%
split(.$Month) %>%
map(~ lm(Ozone ~ Temp, data = .))
$`5`
Call:
lm(formula = Ozone ~ Temp, data = .)
Coefficients:
(Intercept) Temp
-102.16 1.88
$`6`
Call:
lm(formula = Ozone ~ Temp, data = .)
Coefficients:
(Intercept) Temp
-91.99 1.55
$`7`
Call:
lm(formula = Ozone ~ Temp, data = .)
Coefficients:
(Intercept) Temp
-372.92 5.15
$`8`
Call:
lm(formula = Ozone ~ Temp, data = .)
Coefficients:
(Intercept) Temp
-238.86 3.56
$`9`
Call:
lm(formula = Ozone ~ Temp, data = .)
Coefficients:
(Intercept) Temp
-149.35 2.35
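The tilde in front of lm() is purrr’s shorthand for an anonymous function, which can look cryptic at first. The sketch below, using the illustrative names by_month, fits_formula and fits_function, writes the same step both ways and checks that they agree.

```r
library(purrr)

by_month <- split(airquality, airquality$Month)

# Formula shorthand, as in the pipeline: .x (or .) is the current list element
fits_formula <- map(by_month, ~ lm(Ozone ~ Temp, data = .x))

# The same call written as an ordinary anonymous function
fits_function <- map(by_month, function(df) lm(Ozone ~ Temp, data = df))

# Each element is a fitted model of class "lm"
map_chr(fits_formula, ~ class(.x)[1])

# Both versions produce the same coefficients for every month
all.equal(map(fits_formula, coef), map(fits_function, coef))
```

Which form to use is a matter of taste; the shorthand is more compact, while the explicit function makes it obvious what is being passed to lm().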
To generate the adjusted R-squared values, we need to apply the summary() function to each model with map(). This is the third step. Again, the results are familiar. We have the typical summary of the linear model, one for each month.
airquality %>%
split(.$Month) %>%
map(~ lm(Ozone ~ Temp, data = .)) %>%
map(summary)
$`5`
Call:
lm(formula = Ozone ~ Temp, data = .)
Residuals:
Min 1Q Median 3Q Max
-30.32 -8.62 -2.41 5.32 68.26
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -102.159 38.750 -2.64 0.0145 *
Temp 1.885 0.578 3.26 0.0033 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 18.9 on 24 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.307, Adjusted R-squared: 0.278
F-statistic: 10.6 on 1 and 24 DF, p-value: 0.00331
$`6`
Call:
lm(formula = Ozone ~ Temp, data = .)
Residuals:
Min 1Q Median 3Q Max
-12.99 -9.34 -6.31 11.08 23.27
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -91.991 51.312 -1.79 0.116
Temp 1.552 0.653 2.38 0.049 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.5 on 7 degrees of freedom
(21 observations deleted due to missingness)
Multiple R-squared: 0.447, Adjusted R-squared: 0.368
F-statistic: 5.65 on 1 and 7 DF, p-value: 0.0491
$`7`
Call:
lm(formula = Ozone ~ Temp, data = .)
Residuals:
Min 1Q Median 3Q Max
-32.11 -14.52 -1.16 7.58 75.29
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -372.92 84.45 -4.42 0.00018 ***
Temp 5.15 1.01 5.12 3e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.3 on 24 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.522, Adjusted R-squared: 0.502
F-statistic: 26.2 on 1 and 24 DF, p-value: 3.05e-05
$`8`
Call:
lm(formula = Ozone ~ Temp, data = .)
Residuals:
Min 1Q Median 3Q Max
-40.42 -17.65 -8.07 9.97 118.58
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -238.861 82.023 -2.91 0.0076 **
Temp 3.559 0.974 3.65 0.0013 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 32.5 on 24 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.357, Adjusted R-squared: 0.331
F-statistic: 13.4 on 1 and 24 DF, p-value: 0.00126
$`9`
Call:
lm(formula = Ozone ~ Temp, data = .)
Residuals:
Min 1Q Median 3Q Max
-27.45 -8.59 -3.69 11.04 31.39
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -149.347 23.688 -6.30 9.5e-07 ***
Temp 2.351 0.306 7.68 2.9e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.8 on 27 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.686, Adjusted R-squared: 0.674
F-statistic: 58.9 on 1 and 27 DF, p-value: 2.95e-08
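The printed summary hides the fact that each summary is really a list, and the adjusted R-squared is one of its named elements. As a small sketch for the May data only (fit_may and s are illustrative names), the element can be pulled out by name:

```r
# Fit the model for May alone and keep its summary
fit_may <- lm(Ozone ~ Temp, data = subset(airquality, Month == 5))
s <- summary(fit_may)

# The summary is a list; its element names include "adj.r.squared"
names(s)

# Extract the adjusted R-squared by name
s[["adj.r.squared"]]

# It is stored as a double, which matters when choosing a map_*() variant
is.double(s[["adj.r.squared"]])
```

Extracting a named element like this from every summary in the list is the job left for the final step.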
She Blows em Outta the Water Like You Never Seen
Now things get interesting, because we can put it all together.
Step 4 involves using the specialized map_dbl() command to pull the adjusted R-squared value from each month’s linear model summary and return them all in a single named vector. Passing the string 'adj.r.squared' tells map_dbl() to extract the element with that name from each summary.
How do we know the adjusted R-squared value is a double? Well, it looks like one, since it has a fractional part. But we could guess, too. If we were to use the map_int() command, we would get an error telling us the value is a double. If we guessed character, we could use map_chr(). In the purrr version used here that works, although newer releases may warn about the automatic conversion, and the output comes back as character strings, which isn’t what we want.
So we can recognize the data type or we can figure it out through trial and error. Either way we end up at the same place, back where we started.
airquality %>%
split(.$Month) %>%
map(~ lm(Ozone ~ Temp, data = .)) %>%
map(summary) %>%
map_dbl('adj.r.squared')
5 6 7 8 9
0.2781 0.3676 0.5024 0.3307 0.6742
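To make the difference between the map variants concrete, here is a hedged sketch that avoids the pipe and uses illustrative names (summaries, adj_list, adj_dbl, int_attempt). The map_int() call is wrapped in tryCatch() so the error is captured as text instead of stopping the script; the exact wording of that message depends on the installed purrr version.

```r
library(purrr)

# Rebuild the list of model summaries, one per month
by_month <- split(airquality, airquality$Month)
summaries <- map(map(by_month, ~ lm(Ozone ~ Temp, data = .x)), summary)

# map() with a string extracts the named element but still returns a list
adj_list <- map(summaries, "adj.r.squared")
typeof(adj_list)

# map_dbl() does the same extraction and simplifies to a named double vector
adj_dbl <- map_dbl(summaries, "adj.r.squared")
typeof(adj_dbl)

# map_int() fails because fractional values cannot be stored as integers;
# keep the error message rather than halting
int_attempt <- tryCatch(
  map_int(summaries, "adj.r.squared"),
  error = function(e) conditionMessage(e)
)
int_attempt
```

The typed variants act as a cheap check on your assumptions: if the values are not what you expected, you find out right away.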
And if That Ain’t Enough to Make You Flip Your Lid
That’s one more thing I’ve got to rethink, daddy.
UPDATE, 3 May 2018
Fellow R blogger Chuck Powell suggests this neat twist on the theme, which yields a more complete statistical table, changing only the last line. I like it. Thanks Chuck!
airquality %>%
split(.$Month) %>%
map(~ lm(Ozone ~ Temp, data = .)) %>%
map(summary) %>%
map_dfr(~ broom::glance(.), .id = "Month")
Month r.squared adj.r.squared sigma statistic p.value df
1 5 0.3070 0.2781 18.88 10.632 3.315e-03 2
2 6 0.4467 0.3676 14.48 5.651 4.909e-02 2
3 7 0.5223 0.5024 22.32 26.241 3.048e-05 2
4 8 0.3575 0.3307 32.46 13.353 1.256e-03 2
5 9 0.6858 0.6742 13.78 58.942 2.945e-08 2