Updates for recipes extension packages
We’re tickled pink to announce the releases of extension packages that followed the recent release of
recipes 0.2.0. recipes is a package for preprocessing data before using it in models or visualizations. You can think of it as a mash-up of
model.matrix() and dplyr.
You can install these updates from CRAN with something like the following (assuming the packages covered in this post are embed, textrecipes, and themis):
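```r
# Assumed set of updated extension packages discussed in this post.
install.packages(c("embed", "textrecipes", "themis"))
```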
Each package's NEWS file lists the full set of changes. In this post we will go over some of the bigger changes within and between these packages; a lot of the smaller changes were made to bring the extension packages up to the same standard as recipes itself.
A new step
step_smotenc() was added thanks to
Robert Gregg. This step applies the
SMOTENC algorithm to synthetically generate observations from minority classes. Unlike the existing SMOTE method, which only operates on numeric predictors, SMOTENC can handle a mix of categorical and numerical predictors.
The hpc_data data set illustrates this use case neatly. It contains characteristics of HPC Unix jobs and how long they took to run (the outcome column is class). The outcome is fairly imbalanced, with some classes having almost 10 times fewer observations than others. One way to deal with an imbalance like this is to over-sample the minority classes.
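A quick way to see the imbalance is to tabulate the outcome column (a minimal sketch; output omitted):

```r
library(dplyr)
library(modeldata)

data(hpc_data)

# Count how many jobs fall into each speed class; the rarest classes
# have roughly a tenth of the rows of the most common one.
count(hpc_data, class, sort = TRUE)
```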
Using step_smotenc() with the over_ratio argument, we can make sure that all classes are over-sampled to have at least half as many observations as the largest class.
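For example, a minimal sketch of a recipe using this step, assuming step_smotenc() comes from themis and is used alongside recipes, could look like this:

```r
library(recipes)
library(themis)  # assumed home of step_smotenc()
library(modeldata)

data(hpc_data)

# Over-sample so that every class ends up with at least half as many
# rows as the largest class.
hpc_rec <- recipe(class ~ ., data = hpc_data) %>%
  step_smotenc(class, over_ratio = 0.5)

# prep() trains the step; bake(new_data = NULL) returns the up-sampled data.
hpc_rec %>% prep() %>% bake(new_data = NULL) %>% dplyr::count(class)
```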
The methods implemented in themis now have standalone functions, so you can apply these algorithms without having to create a recipe.
```r
smotenc(hpc_data, "class", over_ratio = 0.5)
#> # A tibble: 5,768 × 8
#>    protocol compounds input_fields iterations num_pending  hour day   class
#>    <fct>        <dbl>        <dbl>      <dbl>       <dbl> <dbl> <fct> <fct>
#>  1 E              997          137         20           0  14   Tue   F
#>  2 E               97          103         20           0  13.8 Tue   VF
#>  3 E              101           75         10           0  13.8 Thu   VF
#>  4 E               93           76         20           0  10.1 Fri   VF
#>  5 E              100           82         20           0  10.4 Fri   VF
#>  6 E              100           82         20           0  16.5 Wed   VF
#>  7 E              105           88         20           0  16.4 Fri   VF
#>  8 E               98           95         20           0  16.7 Fri   VF
#>  9 E              101           91         20           0  16.2 Fri   VF
#> 10 E               95           92         20           0  10.8 Wed   VF
#> # … with 5,758 more rows
```
In textrecipes, we added the selectors all_tokenized() and all_tokenized_predictors() to make it easier to select tokenized columns, similar to the all_numeric_predictors() selector in recipes.
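As a rough sketch of how this selector fits into a pipeline (assuming a tokenize-then-count workflow with step_tf(); the column choice here is just for illustration):

```r
library(textrecipes)
library(modeldata)

data(tate_text)

# Tokenize the text column, then refer to every tokenized column at once
# with all_tokenized_predictors() instead of naming each one.
tate_rec <- recipe(~ medium, data = tate_text) %>%
  step_tokenize(medium) %>%
  step_tf(all_tokenized_predictors())
```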
The most important step in textrecipes is
step_tokenize(), as you need it to generate the tokens that other steps modify. We have found that this function had become overloaded with functionality as support for more and more types of tokenization was added. To address this, we have created new specialized tokenization steps:
step_tokenize() has gotten cousin steps such as
step_tokenize_bpe() and step_tokenize_wordpiece(), which wrap the tokenizers.bpe and wordpiece tokenizers, respectively.
In addition to being easier to manage code-wise, these new functions also allow for more compact, more readable code with better tab completion.
```r
data(tate_text)

# Old
tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(
    medium,
    engine = "tokenizers.bpe",
    training_options = list(vocab_size = 1000)
  )

# New
tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize_bpe(medium, vocabulary_size = 1000)
```
step_feature_hash() is now soft deprecated in embed in favor of
step_dummy_hash() in textrecipes. The embed version uses TensorFlow, which for some use cases is quite a dependency. One thing to keep an eye out for when moving over is that the textrecipes version uses
num_terms instead of
num_hash to denote the number of columns to output.
```r
data(Sacramento)

# Old recipe
embed_rec <- recipe(price ~ zip, data = Sacramento) %>%
  step_feature_hash(zip, num_hash = 64)
#> Loaded Tensorflow version 2.8.0

# New recipe
textrecipes_rec <- recipe(price ~ zip, data = Sacramento) %>%
  step_dummy_hash(zip, num_terms = 64)
```
We’d like to extend our thanks to all of the contributors who helped make these releases possible!