
Three ways errors are about to get better in tidymodels

by Posts | Tidyverse · November 10, 2023

This article is originally published at https://www.tidyverse.org/blog/

Twice a year, the tidymodels team comes together for “spring cleaning,” a week-long project devoted to package maintenance. Ahead of the week, we come up with a list of maintenance tasks that we’d like to see consistently implemented across our packages. Many of these tasks can be completed by running one usethis function, while others are much more involved, like issue triage.¹ In tidymodels, triaging issues in our core packages helps us better understand where users struggle to wrap their heads around an API choice we’ve made or to find the information they need. So, among other things, refinements to the wording of our error messages are a common output of our spring cleanings. This blog post will call out three kinds of changes to our erroring that came out of this spring cleaning:

Improving existing errors: The outcome went missing
Do something where we once did nothing: Predicting with things that can’t predict
Make a place and point to it: Model formulas

To demonstrate, we’ll walk through some examples using the tidymodels packages:

library(tidymodels)
#> ── Attaching packages ──────────────────────────── tidymodels 1.1.1 ──
#> ✔ broom        1.0.5          ✔ recipes      1.0.8.9000
#> ✔ dials        1.2.0          ✔ rsample      1.2.0     
#> ✔ dplyr        1.1.3          ✔ tibble       3.2.1     
#> ✔ ggplot2      3.4.4          ✔ tidyr        1.3.0     
#> ✔ infer        1.0.5          ✔ tune         1.1.2.9000
#> ✔ modeldata    1.2.0          ✔ workflows    1.1.3     
#> ✔ parsnip      1.1.1.9001     ✔ workflowsets 1.0.1     
#> ✔ purrr        1.0.2          ✔ yardstick    1.2.0
#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Use suppressPackageStartupMessages() to eliminate package startup messages

Note that my installed versions include the current dev version of a few tidymodels packages. You can install those versions with:

pak::pak(paste0("tidymodels/", c("tune", "parsnip", "recipes")))

The outcome went missing 👻

The tidymodels packages focus on supervised machine learning problems, predicting the value of an outcome using predictors.² For example, in the code:

linear_spec <- linear_reg()

linear_fit <- fit(linear_spec, mpg ~ hp, mtcars)

The mpg variable is the outcome. There are many ways that an analyst may mistakenly fail to pass an outcome. In the most straightforward case, they might omit the outcome on the LHS of the formula:

fit(linear_spec, ~ hp, mtcars)
#> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
#>   incompatible dimensions

In this case, parsnip used to defer to the modeling engine to raise an error, which may or may not be informative.
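A missing left-hand side is easy to detect before any engine gets involved. The following is a minimal base R sketch of that idea, not parsnip's actual check; has_outcome() is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper (not part of parsnip): a two-sided formula is a call
# with three elements (`~`, LHS, RHS), while a one-sided formula has two.
has_outcome <- function(f) {
  inherits(f, "formula") && length(f) == 3L
}

has_outcome(mpg ~ hp)
#> [1] TRUE
has_outcome(~ hp)
#> [1] FALSE
```

A check like this only covers the formula case; it cannot see an outcome that a preprocessor removes later.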

There are many less obvious ways an analyst may mistakenly supply no outcome variable. For example, try spotting the issue in the following code, defining a recipe to perform principal component analysis (PCA) on the numeric variables in the data before fitting the model:

mtcars_rec <-
  recipe(mpg ~ ., mtcars) %>%
  step_pca(all_numeric())

workflow(mtcars_rec, linear_spec) %>% fit(mtcars)
#> Error: object '.' not found

A head-scratcher! To help diagnose what’s happening here, we could first try seeing what data is actually being passed to the model.

mtcars_rec_trained <-
  mtcars_rec %>% 
  prep(mtcars) 

mtcars_rec_trained %>% bake(NULL)
#> # A tibble: 32 × 5
#>      PC1   PC2    PC3     PC4    PC5
#>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>
#>  1 -195.  12.8 -11.4   0.0164  2.17 
#>  2 -195.  12.9 -11.7  -0.479   2.11 
#>  3 -142.  25.9 -16.0  -1.34   -1.18 
#>  4 -279. -38.3 -14.0   0.157  -0.817
#>  5 -399. -37.3  -1.38  2.56   -0.444
#>  6 -248. -25.6 -12.2  -3.01   -1.08 
#>  7 -435.  20.9  13.9   0.801  -0.916
#>  8 -160. -20.0 -23.3  -1.06    0.787
#>  9 -172.  10.8 -18.3  -4.40   -0.836
#> 10 -209.  19.7  -8.94 -2.58    1.33 
#> # ℹ 22 more rows

Mmm. What happened to mpg? We mistakenly told step_pca() to perform PCA on all of the numeric variables, not just the numeric predictors! As a result, it incorporated mpg into the principal components, removing each of the original numeric variables after the fact. Rewriting using the correct tidyselect specification all_numeric_predictors():

mtcars_rec_new <- 
  recipe(mpg ~ ., mtcars) %>%
  step_pca(all_numeric_predictors())

workflow(mtcars_rec_new, linear_spec) %>% fit(mtcars)
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_pca()
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> 
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#> 
#> Coefficients:
#> (Intercept)          PC1          PC2          PC3          PC4  
#>    43.39293      0.07609     -0.05266      0.57892      0.94890  
#>         PC5  
#>    -1.72569

Works like a charm. The errors we saw previously could be much more helpful, though. With the current development version of parsnip, the formula example now gives:

fit(linear_spec, ~ hp, mtcars)
#> Error:
#> ! `linear_reg()` was unable to find an outcome.
#> ℹ Ensure that you have specified an outcome column and that it hasn't
#>   been removed in pre-processing.

Or, with workflows:

workflow(mtcars_rec, linear_spec) %>% fit(mtcars)
#> Error:
#> ! `linear_reg()` was unable to find an outcome.
#> ℹ Ensure that you have specified an outcome column and that it hasn't
#>   been removed in pre-processing.

Much better.
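If you would rather catch a vanished outcome before fitting, one option is to prep the recipe and look at the column names it produces. This is only a sketch under that assumption; outcome_survives() is a hypothetical helper, not a recipes function.

```r
library(recipes)

# Hypothetical helper: prep the recipe on its own training data and report
# whether the named outcome column is still present afterwards.
outcome_survives <- function(rec, outcome) {
  baked <- bake(prep(rec), new_data = NULL)
  outcome %in% names(baked)
}

rec_bad  <- recipe(mpg ~ ., mtcars) %>% step_pca(all_numeric())
rec_good <- recipe(mpg ~ ., mtcars) %>% step_pca(all_numeric_predictors())

outcome_survives(rec_bad, "mpg")
#> [1] FALSE
outcome_survives(rec_good, "mpg")
#> [1] TRUE
```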

Predicting with things that can’t predict

Earlier this year, Dr. Louise E. Sinks put out a wonderful blog post documenting what it felt like to approach the various object types defined in tidymodels as a newcomer to the collection of packages. They wrote:

I found it confusing that fit, last_fit, fit_resamples, etc., did not all produce objects that contained the same information and could be acted on by the same functions.

This makes sense. While we try to forefront the intended mental model for fitting and predicting with tidymodels in our APIs and documentation, we also need to be proactive in anticipating common challenges in constructing that mental model.

For example, we’ve found that it’s sometimes not clear to users which outputs they can call predict() on. One such situation, as Louise points out, is with fit_resamples():

# fit a linear regression model to bootstrap resamples of mtcars
mtcars_res <- fit_resamples(linear_reg(), mpg ~ ., bootstraps(mtcars))

mtcars_res
#> # Resampling results
#> # Bootstrap sampling 
#> # A tibble: 25 × 4
#>    splits          id          .metrics         .notes          
#>    <list>          <chr>       <list>           <list>          
#>  1 <split [32/11]> Bootstrap01 <tibble [2 × 4]> <tibble [0 × 3]>
#>  2 <split [32/10]> Bootstrap02 <tibble [2 × 4]> <tibble [0 × 3]>
#>  3 <split [32/16]> Bootstrap03 <tibble [2 × 4]> <tibble [0 × 3]>
#>  4 <split [32/11]> Bootstrap04 <tibble [2 × 4]> <tibble [0 × 3]>
#>  5 <split [32/10]> Bootstrap05 <tibble [2 × 4]> <tibble [0 × 3]>
#>  6 <split [32/13]> Bootstrap06 <tibble [2 × 4]> <tibble [0 × 3]>
#>  7 <split [32/16]> Bootstrap07 <tibble [2 × 4]> <tibble [0 × 3]>
#>  8 <split [32/11]> Bootstrap08 <tibble [2 × 4]> <tibble [0 × 3]>
#>  9 <split [32/11]> Bootstrap09 <tibble [2 × 4]> <tibble [0 × 3]>
#> 10 <split [32/10]> Bootstrap10 <tibble [2 × 4]> <tibble [0 × 3]>
#> # ℹ 15 more rows

With previous tidymodels versions, mistakenly trying to predict with this object resulted in the following output:

predict(mtcars_res)
#> Error in UseMethod("predict") : 
#>   no applicable method for 'predict' applied to an object of class
#>   "c('resample_results', 'tune_results', 'tbl_df', 'tbl', 'data.frame')"

Some R developers may recognize this as the error that results when no predict() method is defined for an object’s class, here tune_results. We didn’t define one because prediction isn’t well-defined for tuning results. But this error message does little to help a user understand why that’s the case.

We’ve recently made some changes to error more informatively in this case. We do so by defining a “dummy” predict() method for tuning results, implemented only for the sake of erroring more informatively. The same code will now give the following output:

predict(mtcars_res)
#> Error in `predict()`:
#> ! `predict()` is not well-defined for tuning results.
#> ℹ To predict with the optimal model configuration from tuning
#>   results, ensure that the tuning result was generated with the
#>   control option `save_workflow = TRUE`, run `fit_best()`, and
#>   then predict using `predict()` on its output.
#> ℹ To collect predictions from tuning results, ensure that the
#>   tuning result was generated with the control option `save_pred
#>   = TRUE` and run `collect_predictions()`.

References to important concepts or functions, like control options, fit_best(), and collect_predictions(), link to the help-files for those functions using cli’s erroring tools.
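To make the “dummy method” idea concrete, here is a small sketch of the general pattern using a made-up class. It is not the code in tune: demo_results is a hypothetical class, and the message text is shortened for illustration. The cli inline markup {.fun ...} and {.cls ...} is what formats (and, in supporting terminals, links) function and class names.

```r
library(cli)

# A made-up class standing in for results that cannot predict;
# `demo_results` is not a tidymodels class.
res <- structure(list(), class = "demo_results")

# A predict() method whose only job is to raise an informative error.
predict.demo_results <- function(object, ...) {
  cli_abort(c(
    "{.fun predict} is not well-defined for {.cls demo_results} objects.",
    "i" = "To collect predictions, see {.fun collect_predictions}."
  ))
}

# Capture the error rather than halting, so the message can be inspected.
err <- tryCatch(predict(res), error = identity)
cat(conditionMessage(err), "\n")
```

Without the method, the same call would fall through to the generic "no applicable method" error shown earlier.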

We hope new error messages like this will help to get folks back on track.
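Following the second hint in that message, a minimal sketch of collecting predictions from resampling results might look like the following. The object names, the seed, and the choice of five bootstrap resamples are illustrative assumptions made to keep the example quick.

```r
library(tidymodels)

set.seed(1)

# Ask fit_resamples() to keep the assessment-set predictions.
ctrl <- control_resamples(save_pred = TRUE)

mtcars_res_pred <- fit_resamples(
  linear_reg(),
  mpg ~ .,
  bootstraps(mtcars, times = 5),
  control = ctrl
)

# One row per held-out observation per resample.
mtcars_preds <- collect_predictions(mtcars_res_pred)
names(mtcars_preds)
```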

Model formulas

In R, formulas provide a compact, symbolic notation to specify model terms. Many modeling functions in R make use of “specials,” or nonstandard notations used in formulas. Specials are defined and handled as a special case by a given modeling package. parsnip defers to engine packages to handle specials, so you can work with them as usual. For example, the mgcv package provides support for generalized additive models in R, and defines a special called s() to indicate smoothing terms. You can interface with it via tidymodels like so:

# define a generalized additive model specification
gam_spec <- gen_additive_mod("regression")

# fit the specification using a formula with specials
fit(gam_spec, mpg ~ cyl + s(disp, k = 5), mtcars)
#> parsnip model object
#> 
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> mpg ~ cyl + s(disp, k = 5)
#> 
#> Estimated degrees of freedom:
#> 3.39  total = 5.39 
#> 
#> GCV score: 6.380152
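For a sense of how a modeling function can recognize a special in the first place, base R’s terms() accepts a specials argument. The snippet below is only an illustration of that mechanism; it makes no claim about how mgcv or parsnip process formulas internally.

```r
# Ask terms() to flag calls to `s` as a special.
tt <- terms(mpg ~ cyl + s(disp, k = 5), specials = "s")

# Index of the special among the formula's variables (response included).
attr(tt, "specials")$s

# The term labels keep the special call as written.
attr(tt, "term.labels")
```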

While parsnip can handle specials just fine, the package is often used in conjunction with the greater tidymodels package ecosystem, which defines its own pre-processing infrastructure and functionality via packages like hardhat and recipes. The specials defined in many modeling packages introduce conflicts with that infrastructure. To support specials while also maintaining consistent syntax elsewhere in the ecosystem, tidymodels delineates between two types of formulas: preprocessing formulas and model formulas. Preprocessing formulas determine the input variables, while model formulas determine the model structure.

This is a tricky abstraction, and one that users have tripped up on in the past. Users could generate all sorts of different errors by 1) mistakenly passing model formulas where preprocessing formulas were expected, or 2) forgetting to pass a model formula where it’s needed. For an example of 1), we could pass recipes the same formula we passed to parsnip:

recipe(mpg ~ cyl + s(disp, k = 5), mtcars)
#> Error in `inline_check()`:
#> ! No in-line functions should be used here; use steps to 
#>   define baking actions.

But we just used a special with another tidymodels function! Rude!

Or, to demonstrate 2), we pass the preprocessing formula as we ought to but forget to provide the model formula:

gam_wflow <- 
  workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(gam_spec) 

gam_wflow %>% fit(mtcars)
#> Error in `fit_xy()`:
#> ! `fit()` must be used with GAM models (due to its use of formulas).

Uh, but I did just use fit()!

Since the distinction between model formulas and preprocessing formulas comes up in functions across tidymodels, we decided to create a central page that documents the concept itself, hopefully making the syntax associated with it come more easily to users. Then, we linked to it all over the place. For example, those errors now look like:

recipe(mpg ~ cyl + s(disp, k = 5), mtcars)
#> Error in `inline_check()`:
#> ✖ No in-line functions should be used here.
#> ℹ The following function was found: `s`.
#> ℹ Use steps to do transformations instead.
#> ℹ If your modeling engine uses special terms in formulas, pass that
#>   formula to workflows as a model formula
#>   (`?parsnip::model_formula()`).

Or:

gam_wflow %>% fit(mtcars)
#> Error:
#> ! When working with generalized additive models, please supply
#>   the model specification to `workflows::add_model()` along with a
#>   `formula` argument.
#> ℹ See `?parsnip::model_formula()` to learn more.
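Following the hint in that last message, a sketch of the corrected workflow passes both formulas: the preprocessing formula to add_formula() and the model formula to add_model(). The name gam_wflow_fixed and the particular predictors chosen here are illustrative assumptions.

```r
library(tidymodels)

gam_spec <- gen_additive_mod("regression")

gam_wflow_fixed <-
  workflow() %>%
  # preprocessing formula: which columns are the outcome and predictors
  add_formula(mpg ~ cyl + disp) %>%
  # model formula: how those terms enter the model, specials included
  add_model(gam_spec, formula = mpg ~ cyl + s(disp, k = 5))

gam_fit <- fit(gam_wflow_fixed, mtcars)
```

The preprocessing formula stays free of specials, so it plays nicely with hardhat and recipes, while the engine still sees s().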

While I’ve only outlined three, there are all sorts of improvements to error messages on their way to the tidymodels packages in upcoming releases. If you happen to stumble across them, we hope they quickly set you back on the right path. 🗺

¹ Issue triage consists of categorizing, prioritizing, and consolidating issues in a repository’s issue tracker.
² See the tidyclust package for unsupervised learning with tidymodels!
