Three ways errors are about to get better in tidymodels
This article was originally published at https://www.tidyverse.org/blog/
Twice a year, the tidymodels team comes together for “spring cleaning,” a week-long project devoted to package maintenance. Ahead of the week, we come up with a list of maintenance tasks that we’d like to see consistently implemented across our packages. Many of these tasks can be completed by running one usethis function, while others are much more involved, like issue triage. In tidymodels, triaging issues in our core packages helps us to better understand common ways that users struggle to wrap their heads around an API choice we’ve made or find the information they need. So, among other things, refinements to the wording of our error messages are a common output of our spring cleanings. This blog post will call out three kinds of changes to our erroring that came out of this spring cleaning:
- Improving existing errors: The outcome went missing
- Do something where we once did nothing: Predicting with things that can’t predict
- Make a place and point to it: Model formulas
To demonstrate, we’ll walk through some examples using the tidymodels packages:
    library(tidymodels)
    #> ── Attaching packages ──────────────────────────── tidymodels 1.1.1 ──
    #> ✔ broom        1.0.5          ✔ recipes      (dev version)
    #> ✔ dials        1.2.0          ✔ rsample      1.2.0
    #> ✔ dplyr        1.1.3          ✔ tibble       3.2.1
    #> ✔ ggplot2      3.4.4          ✔ tidyr        1.3.0
    #> ✔ infer        1.0.5          ✔ tune         (dev version)
    #> ✔ modeldata    1.2.0          ✔ workflows    1.1.3
    #> ✔ parsnip      (dev version)  ✔ workflowsets 1.0.1
    #> ✔ purrr        1.0.2          ✔ yardstick    1.2.0
    #> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
    #> ✖ purrr::discard() masks scales::discard()
    #> ✖ dplyr::filter()  masks stats::filter()
    #> ✖ dplyr::lag()     masks stats::lag()
    #> ✖ recipes::step()  masks stats::step()
    #> • Use suppressPackageStartupMessages() to eliminate package startup messages
Note that my installed versions include the current dev version of a few tidymodels packages. You can install those versions with:
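The install command itself did not survive in this copy of the post. As a stand-in, here is a minimal sketch. It assumes that the pak package is installed and that the development versions live in the tidymodels GitHub organization. The three package names are the ones whose versions in the startup message above are not plain release numbers, and dev_pkgs and dev_repos are names introduced here for illustration.

```r
# Assumption: dev versions are hosted at github.com/tidymodels/<package>.
dev_pkgs <- c("recipes", "tune", "parsnip")
dev_repos <- paste0("tidymodels/", dev_pkgs)
dev_repos

# Uncomment to install (requires the pak package and network access):
# pak::pak(dev_repos)
```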
The tidymodels packages focus on supervised machine learning problems, predicting the value of an outcome using predictors. For example, in the code:
    linear_spec <- linear_reg()
    linear_fit <- fit(linear_spec, mpg ~ hp, mtcars)
the mpg variable is the outcome. There are many ways that an analyst may mistakenly fail to pass an outcome. In the most straightforward case, they might omit the outcome on the LHS of the formula:
    fit(linear_spec, ~ hp, mtcars)
    #> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
    #>   incompatible dimensions
In this case, parsnip used to defer to the modeling engine to raise an error, which may or may not be informative.
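To make the idea of a missing left-hand side concrete, here is a minimal base R sketch of how a one-sided formula can be detected before any engine code runs. The helper has_outcome() is hypothetical and introduced only for illustration; parsnip's actual check is not shown in this post and may work differently.

```r
# Hypothetical helper, not part of parsnip: TRUE when a formula has a
# left-hand side. A one-sided formula like `~ hp` has length 2 (`~`, rhs);
# a two-sided formula like `mpg ~ hp` has length 3 (`~`, lhs, rhs).
has_outcome <- function(formula) {
  stopifnot(inherits(formula, "formula"))
  length(formula) == 3L
}

has_outcome(mpg ~ hp)  # two-sided
has_outcome(~ hp)      # one-sided
```

A check like this only covers the formula case. It would not catch an outcome that a preprocessing step removes later.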
There are many less obvious ways an analyst may mistakenly supply no outcome variable. For example, try spotting the issue in the following code, defining a recipe to perform principal component analysis (PCA) on the numeric variables in the data before fitting the model:
    mtcars_rec <-
      recipe(mpg ~ ., mtcars) %>%
      step_pca(all_numeric())

    workflow(mtcars_rec, linear_spec) %>%
      fit(mtcars)
    #> Error: object '.' not found
A head-scratcher! To help diagnose what’s happening here, we could first try seeing what data is actually being passed to the model.
    mtcars_rec_trained <-
      mtcars_rec %>%
      prep(mtcars)

    mtcars_rec_trained %>% bake(NULL)
    #> # A tibble: 32 × 5
    #>      PC1   PC2    PC3     PC4    PC5
    #>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>
    #>  1 -195.  12.8 -11.4   0.0164  2.17
    #>  2 -195.  12.9 -11.7  -0.479   2.11
    #>  3 -142.  25.9 -16.0  -1.34   -1.18
    #>  4 -279. -38.3 -14.0   0.157  -0.817
    #>  5 -399. -37.3  -1.38  2.56   -0.444
    #>  6 -248. -25.6 -12.2  -3.01   -1.08
    #>  7 -435.  20.9  13.9   0.801  -0.916
    #>  8 -160. -20.0 -23.3  -1.06    0.787
    #>  9 -172.  10.8 -18.3  -4.40   -0.836
    #> 10 -209.  19.7  -8.94 -2.58    1.33
    #> # ℹ 22 more rows
Mmm. What happened to mpg? We mistakenly told step_pca() to perform PCA on all of the numeric variables, not just the numeric predictors! As a result, it incorporated mpg into the principal components, removing each of the original numeric variables after the fact. Rewriting using the correct tidyselect specification, all_numeric_predictors():
    mtcars_rec_new <-
      recipe(mpg ~ ., mtcars) %>%
      step_pca(all_numeric_predictors())

    workflow(mtcars_rec_new, linear_spec) %>%
      fit(mtcars)
    #> ══ Workflow [trained] ════════════════════════════════════════════════
    #> Preprocessor: Recipe
    #> Model: linear_reg()
    #>
    #> ── Preprocessor ──────────────────────────────────────────────────────
    #> 1 Recipe Step
    #>
    #> • step_pca()
    #>
    #> ── Model ─────────────────────────────────────────────────────────────
    #>
    #> Call:
    #> stats::lm(formula = ..y ~ ., data = data)
    #>
    #> Coefficients:
    #> (Intercept)          PC1          PC2          PC3          PC4
    #>    43.39293      0.07609     -0.05266      0.57892      0.94890
    #>         PC5
    #>    -1.72569
Works like a charm. The errors we saw previously could be much more helpful, though. With the current development version of parsnip, the first one now looks like:
    fit(linear_spec, ~ hp, mtcars)
    #> Error:
    #> ! `linear_reg()` was unable to find an outcome.
    #> ℹ Ensure that you have specified an outcome column and that it hasn't
    #>   been removed in pre-processing.
Or, with workflows:
    workflow(mtcars_rec, linear_spec) %>%
      fit(mtcars)
    #> Error:
    #> ! `linear_reg()` was unable to find an outcome.
    #> ℹ Ensure that you have specified an outcome column and that it hasn't
    #>   been removed in pre-processing.
Earlier this year, Dr. Louise E. Sinks put out a wonderful blog post documenting what it felt like to approach the various object types defined in the tidymodels as a newcomer to the collection of packages. They wrote:
“I found it confusing that [...] fit_resamples, etc., did not all produce objects that contained the same information and could be acted on by the same functions.”
This makes sense. While we try to forefront the intended mental model for fitting and predicting with tidymodels in our APIs and documentation, we also need to be proactive in anticipating common challenges in constructing that mental model.
For example, we’ve found that it’s sometimes not clear to users which outputs they can call predict() on. One such situation, as Louise points out, is with fit_resamples():
    # fit a linear regression model to bootstrap resamples of mtcars
    mtcars_res <- fit_resamples(linear_reg(), mpg ~ ., bootstraps(mtcars))

    mtcars_res
    #> # Resampling results
    #> # Bootstrap sampling
    #> # A tibble: 25 × 4
    #>    splits          id          .metrics         .notes
    #>    <list>          <chr>       <list>           <list>
    #>  1 <split [32/11]> Bootstrap01 <tibble [2 × 4]> <tibble [0 × 3]>
    #>  2 <split [32/10]> Bootstrap02 <tibble [2 × 4]> <tibble [0 × 3]>
    #>  3 <split [32/16]> Bootstrap03 <tibble [2 × 4]> <tibble [0 × 3]>
    #>  4 <split [32/11]> Bootstrap04 <tibble [2 × 4]> <tibble [0 × 3]>
    #>  5 <split [32/10]> Bootstrap05 <tibble [2 × 4]> <tibble [0 × 3]>
    #>  6 <split [32/13]> Bootstrap06 <tibble [2 × 4]> <tibble [0 × 3]>
    #>  7 <split [32/16]> Bootstrap07 <tibble [2 × 4]> <tibble [0 × 3]>
    #>  8 <split [32/11]> Bootstrap08 <tibble [2 × 4]> <tibble [0 × 3]>
    #>  9 <split [32/11]> Bootstrap09 <tibble [2 × 4]> <tibble [0 × 3]>
    #> 10 <split [32/10]> Bootstrap10 <tibble [2 × 4]> <tibble [0 × 3]>
    #> # ℹ 15 more rows
With previous tidymodels versions, mistakenly trying to predict with this object resulted in the following output:
    predict(mtcars_res)
    #> Error in UseMethod("predict") :
    #>   no applicable method for 'predict' applied to an object of class
    #>   "c('resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"
Some R developers may recognize this error as what results when no predict() method is defined for tune_results objects. We didn’t define one because prediction isn’t well-defined for tuning results. But this error message does little to help a user understand why that’s the case.
We’ve recently made some changes to error more informatively in this case. We do so by defining a “dummy” predict() method for tuning results, implemented only for the sake of erroring more informatively. The same code will now give the following output:
    predict(mtcars_res)
    #> Error in `predict()`:
    #> ! `predict()` is not well-defined for tuning results.
    #> ℹ To predict with the optimal model configuration from tuning
    #>   results, ensure that the tuning result was generated with the
    #>   control option `save_workflow = TRUE`, run `fit_best()`, and
    #>   then predict using `predict()` on its output.
    #> ℹ To collect predictions from tuning results, ensure that the
    #>   tuning result was generated with the control option `save_pred
    #>   = TRUE` and run `collect_predictions()`.
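The second suggestion in that message can be sketched as follows. This is a minimal example, assuming the tidymodels packages are installed. The object names ctrl, mtcars_res_pred, and mtcars_preds are introduced here, and set.seed() and times = 5 are choices made only to keep the example quick and repeatable.

```r
library(tidymodels)

# Ask tune to keep the out-of-sample predictions for each resample.
ctrl <- control_resamples(save_pred = TRUE)

set.seed(1)
mtcars_res_pred <- fit_resamples(
  linear_reg(),
  mpg ~ .,
  bootstraps(mtcars, times = 5),
  control = ctrl
)

# One row per held-out observation per resample, with a `.pred` column.
mtcars_preds <- collect_predictions(mtcars_res_pred)
head(mtcars_preds)
```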
We hope new error messages like this will help to get folks back on track.
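The “dummy method” pattern is easy to reproduce in base R. The sketch below uses a made-up class, toy_tune_results, and plain stop(). It illustrates the technique only and is not the implementation in tune, whose message is the one shown above.

```r
# Hypothetical class standing in for tuning results.
toy_results <- structure(list(), class = "toy_tune_results")

# A "dummy" S3 method: it exists only to raise a helpful error.
predict.toy_tune_results <- function(object, ...) {
  stop(
    "`predict()` is not well-defined for tuning results. ",
    "Fit a final model first, then call `predict()` on that fit.",
    call. = FALSE
  )
}

# tryCatch() captures the error so its message can be inspected.
msg <- tryCatch(
  predict(toy_results),
  error = function(e) conditionMessage(e)
)
msg
```

Because the method is registered for the class, S3 dispatch reaches it before falling through to the generic “no applicable method” failure.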
In R, formulas provide a compact, symbolic notation to specify model terms. Many modeling functions in R make use of “specials,” or nonstandard notations used in formulas. Specials are defined and handled as a special case by a given modeling package. parsnip defers to engine packages to handle specials, so you can work with them as usual. For example, the mgcv package provides support for generalized additive models in R, and defines a special called s() to indicate smoothing terms. You can interface with it via tidymodels like so:
    # define a generalized additive model specification
    gam_spec <- gen_additive_mod("regression")

    # fit the specification using a formula with specials
    fit(gam_spec, mpg ~ cyl + s(disp, k = 5), mtcars)
    #> parsnip model object
    #>
    #>
    #> Family: gaussian
    #> Link function: identity
    #>
    #> Formula:
    #> mpg ~ cyl + s(disp, k = 5)
    #>
    #> Estimated degrees of freedom:
    #> 3.39  total = 5.39
    #>
    #> GCV score: 6.380152
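For a sense of what makes a term “special,” base R’s terms() can flag calls to a named function inside a formula without evaluating anything. The sketch below shows that general mechanism only and makes no claim about how any particular engine processes its formulas. The object names are introduced here.

```r
# A formula using the `s()` special; nothing here is evaluated, so mgcv
# does not need to be loaded.
gam_formula <- mpg ~ cyl + s(disp, k = 5)

# Ask terms() to flag calls to `s` as specials.
gam_terms <- terms(gam_formula, specials = "s")

# Position of the `s()` term among the formula's variables
# (mpg, cyl, s(disp, k = 5)).
special_index <- attr(gam_terms, "specials")$s
special_index

# Functions called in the formula, besides the operators `~` and `+`.
called_fns <- setdiff(
  all.names(gam_formula),
  c(all.vars(gam_formula), "~", "+")
)
called_fns
```

A preprocessing tool that wants plain column names can use a check along the lines of called_fns to notice that s is a function call rather than a variable.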
While parsnip can handle specials just fine, the package is often used in conjunction with the greater tidymodels package ecosystem, which defines its own pre-processing infrastructure and functionality via packages like hardhat and recipes. The specials defined in many modeling packages introduce conflicts with that infrastructure. To support specials while also maintaining consistent syntax elsewhere in the ecosystem, tidymodels distinguishes between two types of formulas: preprocessing formulas and model formulas. Preprocessing formulas determine the input variables, while model formulas determine the model structure.
This is a tricky abstraction, and one that users have tripped up on in the past. Users could generate all sorts of different errors by 1) mistakenly passing model formulas where preprocessing formulas were expected, or 2) forgetting to pass a model formula where it’s needed. For an example of 1), we could pass recipes the same formula we passed to parsnip:
    recipe(mpg ~ cyl + s(disp, k = 5), mtcars)
    #> Error in `inline_check()`:
    #> ! No in-line functions should be used here; use steps to
    #>   define baking actions.
But we just used a special with another tidymodels function! Rude!
Or, to demonstrate 2), we pass the preprocessing formula as we ought to but forget to provide the model formula:
    gam_wflow <-
      workflow() %>%
      add_formula(mpg ~ .) %>%
      add_model(gam_spec)

    gam_wflow %>% fit(mtcars)
    #> Error in `fit_xy()`:
    #> ! `fit()` must be used with GAM models (due to its use of formulas).
Uh, but I did just use fit()!
Since the distinction between model formulas and preprocessing formulas comes up in functions across tidymodels, we decided to create a central page that documents the concept itself, hopefully making the syntax associated with it come more easily to users. Then, we link to it all over the place. For example, those errors now look like:
    recipe(mpg ~ cyl + s(disp, k = 5), mtcars)
    #> Error in `inline_check()`:
    #> ✖ No in-line functions should be used here.
    #> ℹ The following function was found: `s`.
    #> ℹ Use steps to do transformations instead.
    #> ℹ If your modeling engine uses special terms in formulas, pass that
    #>   formula to workflows as a model formula
    #>   (`?parsnip::model_formula()`).
    gam_wflow %>% fit(mtcars)
    #> Error:
    #> ! When working with generalized additive models, please supply
    #>   the model specification to `workflows::add_model()` along with a
    #>   `formula` argument.
    #> ℹ See `?parsnip::model_formula()` to learn more.
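Following the advice in that last message, a workflow that keeps the two kinds of formula separate might look like the sketch below. It assumes the tidymodels and mgcv packages are installed. The names gam_wflow_fixed and gam_fit, and the choice of cyl and disp as predictors, are introduced here for illustration.

```r
library(tidymodels)

gam_spec <- gen_additive_mod("regression")

gam_wflow_fixed <-
  workflow() %>%
  # Preprocessing formula: which columns are the outcome and predictors.
  add_formula(mpg ~ cyl + disp) %>%
  # Model formula: how those columns enter the model, specials included.
  add_model(gam_spec, formula = mpg ~ cyl + s(disp, k = 5))

gam_fit <- fit(gam_wflow_fixed, mtcars)
gam_fit
```

The preprocessing formula contains no function calls, so the preprocessing infrastructure accepts it, while the s() special appears only in the model formula, which is handed to the engine.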
While I’ve only outlined three, there are all sorts of improvements to error messages on their way to the tidymodels packages in upcoming releases. If you happen to stumble across them, we hope they quickly set you back on the right path. 🗺