textrecipes 0.0.1
We’re delighted to announce the release of textrecipes 0.0.1 on CRAN. textrecipes implements a collection of new steps for the recipes package to deal with text preprocessing. textrecipes is still in early development so any and all feedback is highly appreciated.
You can install it by running:
install.packages("textrecipes")
New steps
The steps introduced here can be split into three types, those that:
- convert character vectors to list-columns and vice versa,
- modify the elements in list-columns, and
- convert list-columns to numeric variables.
This allows for greater flexibility in the preprocessing tasks that can be done while staying inside the recipes framework. This also prevents having a single step with many arguments.
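As a rough map, the steps used in this post fall into the three types as follows (this is only a subset; the package documentation lists all available steps):
# Type 1 (character vector <-> list-column):  step_tokenize()
# Type 2 (modify elements of a list-column):  step_stem(), step_tokenfilter()
# Type 3 (list-column -> numeric variables):  step_tf(), step_texthash()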
Workflows
First we start by creating a recipe object from the original data.
data("okc_text")
rec_obj <- recipe(~ ., okc_text)
rec_obj
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> predictor 10
The workflow in textrecipes so far starts with step_tokenize(), followed by a combination of type-1 and type-2 steps, and ends with a type-3 step. step_tokenize() wraps the tokenizers package for tokenization, but other tokenization functions can be used via the custom_token argument. More information about the arguments can be found in the documentation. The shortest possible recipe is step_tokenize() followed directly by a type-3 step.
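As a quick illustration of the custom_token argument, the tokenizer below is a hypothetical whitespace splitter (not part of the package); any function that takes a character vector and returns a list of character vectors can be used.
# Hypothetical custom tokenizer: split each document on whitespace.
# A function supplied to custom_token overrides the token argument.
space_tokenizer <- function(x) strsplit(x, "\\s+")

rec_obj %>%
  step_tokenize(essay0, custom_token = space_tokenizer) %>%
  step_tf(essay0)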
### Feature hashing done on word tokens
rec_obj %>%
  step_tokenize(essay0) %>% # token argument defaults to "words"
  step_texthash(essay0)
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> predictor 10
#>
#> Operations:
#>
#> Tokenization for essay0
#> Feature hashing with essay0
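To see the effect of such a recipe, it can be estimated with prep() and the processed training data extracted with juice(). This is only a sketch: the number and names of the resulting hash columns come from the step_texthash() defaults.
# Sketch: estimate the recipe on okc_text and inspect the result.
hash_rec <- rec_obj %>%
  step_tokenize(essay0) %>%
  step_texthash(essay0) %>%
  prep(training = okc_text)

# juice() returns the processed training set; essay0 is replaced by
# hash columns whose number is set by step_texthash()'s defaults.
juice(hash_rec)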
### Counting character occurrences
rec_obj %>%
  step_tokenize(essay0, token = "characters") %>%
  step_tf(essay0)
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> predictor 10
#>
#> Operations:
#>
#> Tokenization for essay0
#> Term frequency with essay0
If one wanted to count only the 100 most frequently used words after stemming, type-2 steps are needed. Here we use step_stem() to perform stemming via the SnowballC package and step_tokenfilter() to keep only the 100 most frequent tokens.
rec_obj %>%
  step_tokenize(essay0) %>%
  step_stem(essay0) %>%
  step_tokenfilter(essay0, max_tokens = 100) %>%
  step_tf(essay0)
#> Data Recipe
#>
#> Inputs:
#>
#> role #variables
#> predictor 10
#>
#> Operations:
#>
#> Tokenization for essay0
#> Stemming for essay0
#> Text filtering for essay0
#> Term frequency with essay0
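Such a recipe can then be trained once and applied to data. Below is a minimal sketch, assuming a current version of recipes where bake() takes a new_data argument (older versions named it newdata).
# Sketch: train the recipe, then apply the learned preprocessing.
word_count_rec <- rec_obj %>%
  step_tokenize(essay0) %>%
  step_stem(essay0) %>%
  step_tokenfilter(essay0, max_tokens = 100) %>%
  step_tf(essay0) %>%
  prep(training = okc_text)

# The 100 tokens selected during prep() are reused here, so any data
# passed to bake() gets the same set of count columns.
bake(word_count_rec, new_data = okc_text)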
For more combinations, please consult the documentation and the vignette, which includes recipe examples.
Acknowledgements
A big thank you goes out to the 6 people who contributed to this release: @ClaytonJY, @DavisVaughan, @EmilHvitfeldt, @jwijffels, @kanishkamisra, and @topepo.