Q3 2021 tidymodels roundup
This article is originally published at https://www.tidyverse.org/blog/
tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles. We now publish
regular updates here on the tidyverse blog summarizing recent developments in the tidymodels ecosystem. You can check out the
tidymodels tag to find all tidymodels blog posts here, including those that focus on a single package or more major releases. The purpose of these roundup posts is to keep you informed about any releases you may have missed and useful new functionality as we maintain these packages.
Since our last roundup post, there have been 9 CRAN releases of tidymodels packages. You can install these updates from CRAN with:
install.packages(c("baguette", "broom", "dials", "modeldata", "poissonreg", "recipes", "rules", "stacks", "textrecipes"))
NEWS files are linked here for each package; you’ll notice that many of these releases involve small updates for CRAN policy or changes that are not user-facing. We write code in these smaller, modular packages that we can release frequently to make models easier to deploy and our software easier to maintain, but it can be a lot to keep up with as a user! We want to take the opportunity here to highlight a couple of more important changes in these releases.
The dials package is one that you might not have thought much about yet, even if you are a regular tidymodels user. It is an infrastructure package for creating and managing hyperparameters as well as grids (both regular and non-regular) of hyperparameters. The most recent release of dials includes several new parameters, and Hannah Frick is now the package maintainer.
library(tidymodels) #> ── Attaching packages ──────────────────────────── tidymodels 0.1.3 ── #> ✓ broom 0.7.9 ✓ rsample 0.1.0 #> ✓ dials 0.0.10 ✓ tibble 3.1.4 #> ✓ dplyr 1.0.7 ✓ tidyr 1.1.3 #> ✓ infer 1.0.0 ✓ tune 0.1.6 #> ✓ modeldata 0.1.1 ✓ workflows 0.2.3 #> ✓ parsnip 0.1.7 ✓ workflowsets 0.1.0 #> ✓ purrr 0.3.4 ✓ yardstick 0.0.8 #> ✓ recipes 0.1.17 #> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ── #> x purrr::discard() masks scales::discard() #> x dplyr::filter() masks stats::filter() #> x dplyr::lag() masks stats::lag() #> x recipes::step() masks stats::step() #> • Use tidymodels_prefer() to resolve common conflicts. stop_iter() #> # Iterations Before Stopping (quantitative) #> Range: [3, 20]
You don’t typically use parameters from dials like this, though. Instead, the infrastructure these parameters provide is what allows us to fluently tune our hyperparameters. For example, we can use data on high-performance computing jobs to predict the class of those jobs with xgboost, choosing to try out different values for early stopping (along with another hyperparameter
data(hpc_data) hpc_folds <- vfold_cv(hpc_data, strata = class) hpc_formula <- class ~ compounds + input_fields + iterations + num_pending + hour stopping_spec <- boost_tree( trees = 500, learn_rate = 0.02, mtry = tune(), stop_iter = tune() ) %>% set_engine("xgboost", validation = 0.2) %>% set_mode("classification") early_stop_wf <- workflow(hpc_formula, stopping_spec)
We can now tune this
workflow() over the resamples
hpc_folds and find out which values for the hyperparameters turned out best.
doParallel::registerDoParallel() set.seed(123) early_stop_rs <- tune_grid(early_stop_wf, hpc_folds) #> i Creating pre-processing data to finalize unknown parameter: mtry show_best(early_stop_rs, "roc_auc") #> # A tibble: 5 × 8 #> mtry stop_iter .metric .estimator mean n std_err .config #> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr> #> 1 3 13 roc_auc hand_till 0.887 10 0.00337 Preprocessor… #> 2 2 11 roc_auc hand_till 0.887 10 0.00414 Preprocessor… #> 3 2 15 roc_auc hand_till 0.886 10 0.00342 Preprocessor… #> 4 4 7 roc_auc hand_till 0.885 10 0.00365 Preprocessor… #> 5 4 9 roc_auc hand_till 0.884 10 0.00353 Preprocessor…
In this case, the best value for the early stopping parameter is 13.
This recent screencast demonstrates how to tune early stopping for xgboost as well.
Several of the new parameter objects in this version of dials are prep work for supporting more options in tidymodels, such as tuning yet more hyperparameters of xgboost (like L1 and L2 penalties and whether to balance classes), generalized additive models, discriminant analysis models, recursively partitioned models, and a recipe step for sparse PCA. Stay tuned for more on these new options!
The most recent release of recipes is a robust one, including new recipe steps, bug fixes, documentation improvements, and better performance. You can dig into the NEWS file for more details, but check out a few of the most important changes.
We added a new recipe step for creating sine and cosine features,
step_harmonic(). We can use this to analyze data with periodic features, like
the annual number of sunspots. The
sunspot.year dataset has an observation every year, and the solar cycle is about 11 years long.
data(sunspot.year) sunspots <- tibble(year = 1700:1988, n_sunspot = sunspot.year) sun_split <- initial_time_split(sunspots) sun_train <- training(sun_split) sun_test <- testing(sun_split) sunspots_rec <- recipe(n_sunspot ~ year, data = sun_train) %>% step_harmonic(year, frequency = 1 / 11, cycle_size = 1, role = "predictor", keep_original_cols = FALSE) lm_spec <- linear_reg() sunspots_wf <- workflow(sunspots_rec, lm_spec) sunspots_fit <- fit(sunspots_wf, sun_train) sunspots_fit #> ══ Workflow [trained] ════════════════════════════════════════════════ #> Preprocessor: Recipe #> Model: linear_reg() #> #> ── Preprocessor ────────────────────────────────────────────────────── #> 1 Recipe Step #> #> • step_harmonic() #> #> ── Model ───────────────────────────────────────────────────────────── #> #> Call: #> stats::lm(formula = ..y ~ ., data = data) #> #> Coefficients: #> (Intercept) year_sin_1 year_cos_1 #> 42.904 9.691 20.504
We can now predict with this fitted linear model on the most recent years.
sunspots_fit %>% augment(sun_test) %>% select(year, n_sunspot, .pred) %>% pivot_longer(-year) %>% ggplot(aes(year, value, color = name)) + geom_line(alpha = 0.8, size = 1.2) + theme_minimal() + labs(x = NULL, y = "Number of sunspots", color = NULL)
Looks like there have been more sunspots in recent decades compared to the past!
build your own recipe steps, the new
recipes_eval_select() function is now available, powering the tidyselect semantics specific to recipes. The older
terms_select() function is now deprecated in favor of this new helper.
The recipes package is fairly extensive, and we have recently invested time and energy in refining the documentation to make it more navigable and clear, as well as easier to maintain and contribute to. Specific documentation pages with recent updates you may find helpful include:
One of the tidymodels packages is
modeldata, where we keep example datasets for vignettes, examples, and other similar uses. We have included two datasets,
okc_text, in modeldata based on real user data from the dating website OkCupid.
This dataset was sourced from Kim and Escobedo-Land (2015). Permission to use this dataset was explicitly granted by OkCupid, but since that time, concerns have been raised about the ethics of using this or similar data sets, for example to identify individuals. Xiao and Ma (2021) specifically address the possible misuse of this particular dataset, and we now agree that it isn’t a good option to use for examples or teaching. In the most recent release of modeldata, we have marked these datasets as deprecated. We have removed them from the development version of the package on GitHub, and they will be removed entirely in the next CRAN release. We especially want to thank Albert Kim, one of the authors of the original paper, for his thoughtful and helpful discussion.
One of the reasons we found the OkCupid dataset useful was that it included multiple text columns per observation, so removing these two datasets motivated us to look for a new option to include instead. We landed on
metadata for modern artwork from the Tate Gallery; if you used
okc_text in the past, we recommend switching to
tate_text. For example, we can count how many of these examples of artwork involve paper or canvas.
data(tate_text) library(textrecipes) paper_or_canvas <- c("paper", "canvas") recipe(~ ., data = tate_text) %>% step_tokenize(medium) %>% step_stopwords(medium, custom_stopword_source = paper_or_canvas, keep = TRUE) %>% step_tf(medium) %>% prep() %>% bake(new_data = NULL) %>% select(artist, year, starts_with("tf")) #> # A tibble: 4,284 × 4 #> artist year tf_medium_canvas tf_medium_paper #> <fct> <dbl> <dbl> <dbl> #> 1 Absalon 1990 0 0 #> 2 Auerbach, Frank 1990 0 1 #> 3 Auerbach, Frank 1990 0 1 #> 4 Auerbach, Frank 1990 0 1 #> 5 Auerbach, Frank 1990 1 0 #> 6 Ayres, OBE Gillian 1990 1 0 #> 7 Barlow, Phyllida 1990 0 1 #> 8 Baselitz, Georg 1990 0 1 #> 9 Beattie, Basil 1990 1 0 #> 10 Beuys, Joseph 1990 0 1 #> # … with 4,274 more rows
This artwork metadata was a Tidy Tuesday dataset earlier this year.
We’d like to extend our thanks to all of the contributors who helped make these releases during Q3 possible!
broom: @bcallaway11, @billdenney, @brshallo, @corybrunson, @crsh, @gregmacfarlane, @hfrick, @ilapros, @jamesrrae, @jennybc, @jthomasmock, @kaseyzapatka, @krivit, @LukasWallrich, @oskasf, @simonpcouch, and @tarensanders
recipes: @AndrewKostandy, @asiripanich, @atusy, @avrenli2, @czopluoglu, @DavisVaughan, @DesmondChoy, @EmilHvitfeldt, @felipeangelimvieira, @hfrick, @jennybc, @joeycouse, @juliasilge, @kadyb, @mmp3, @MrFlick, @NikKrieger, @PathosEthosLogos, @simonschoe, @SlowMo24, @topepo, @vspinu, and @yyhyun64
Please visit source website for post related comments.