textrecipes 0.0.1
We’re delighted to announce the release of textrecipes 0.0.1 on CRAN. textrecipes implements a collection of new steps for the recipes package to deal with text preprocessing. textrecipes is still in early development so any and all feedback is highly appreciated.
You can install it by running:
install.packages("textrecipes")
New steps
The steps introduced here can be split into three types, those that:
- convert character vectors to list-columns and vice versa,
- modify the elements in list-columns, and
- convert list-columns to numeric variables.
This allows for greater flexibility in the preprocessing tasks that can be done while staying inside the recipes framework. This also prevents having a single step with many arguments.
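As a rough map, the steps used in this post fall into the three types as follows (this is only a subset; the package documentation lists all available steps):
# Type 1 (character vector <-> list-column):  step_tokenize()
# Type 2 (modify elements of a list-column):  step_stem(), step_tokenfilter()
# Type 3 (list-column -> numeric variables):  step_tf(), step_texthash()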
Workflows
First we start by creating a recipe object from the original data.
data("okc_text")
rec_obj <- recipe(~ ., okc_text)
rec_obj
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> predictor 10
The workflow in textrecipes so far starts with step_tokenize(), followed by a combination of type-1 and type-2 steps, and ends with a type-3 step. step_tokenize() wraps the tokenizers package for tokenization, but other tokenization functions can be used via the custom_token argument. More information about the arguments can be found in the documentation. The shortest possible recipe is step_tokenize() followed directly by a type-3 step.
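As a quick illustration of the custom_token argument, the tokenizer below is a hypothetical whitespace splitter (not part of the package); any function that takes a character vector and returns a list of character vectors can be used.
# Hypothetical custom tokenizer: split each document on whitespace.
# A function supplied to custom_token overrides the token argument.
space_tokenizer <- function(x) strsplit(x, "\\s+")

rec_obj %>%
  step_tokenize(essay0, custom_token = space_tokenizer) %>%
  step_tf(essay0)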
### Feature hashing done on word tokens
rec_obj %>%
  step_tokenize(essay0) %>% # token argument defaults to "words"
  step_texthash(essay0)
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> predictor 10
#>
#> Operations:
#>
#> Tokenization for essay0
#> Feature hashing with essay0
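To see the effect of such a recipe, it can be estimated with prep() and the processed training data extracted with juice(). This is only a sketch: the number and names of the resulting hash columns come from the step_texthash() defaults.
# Sketch: estimate the recipe on okc_text and inspect the result.
hash_rec <- rec_obj %>%
  step_tokenize(essay0) %>%
  step_texthash(essay0) %>%
  prep(training = okc_text)

# juice() returns the processed training set; essay0 is replaced by
# hash columns whose number is set by step_texthash()'s defaults.
juice(hash_rec)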
### Counting character occurrences
rec_obj %>%
  step_tokenize(essay0, token = "characters") %>%
  step_tf(essay0)
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> predictor 10
#>
#> Operations:
#>
#> Tokenization for essay0
#> Term frequency with essay0
If one wanted to count only the 100 most frequently used words after stemming, type-2 steps are needed. Here we use step_stem() to perform stemming via the SnowballC package and step_tokenfilter() to keep only the 100 most frequent tokens.
rec_obj %>%
  step_tokenize(essay0) %>%
  step_stem(essay0) %>%
  step_tokenfilter(essay0, max_tokens = 100) %>%
  step_tf(essay0)
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> predictor 10
#>
#> Operations:
#>
#> Tokenization for essay0
#> Stemming for essay0
#> Text filtering for essay0
#> Term frequency with essay0
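Such a recipe can then be trained once and applied to data. Below is a minimal sketch, assuming a current version of recipes where bake() takes a new_data argument (older versions named it newdata).
# Sketch: train the recipe, then apply the learned preprocessing.
word_count_rec <- rec_obj %>%
  step_tokenize(essay0) %>%
  step_stem(essay0) %>%
  step_tokenfilter(essay0, max_tokens = 100) %>%
  step_tf(essay0) %>%
  prep(training = okc_text)

# The 100 tokens selected during prep() are reused here, so any data
# passed to bake() gets the same set of count columns.
bake(word_count_rec, new_data = okc_text)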
For more combinations, please consult the documentation and the vignette, which includes recipe examples.
Acknowledgements
A big thank you goes out to the 6 people who contributed to this release: @ClaytonJY, @DavisVaughan, @EmilHvitfeldt, @jwijffels, @kanishkamisra, and @topepo.