Updates for parsnip packages
This article was originally published at https://www.tidyverse.org/blog/
We’re delighted to announce the release of parsnip 0.2.1. parsnip is a unified modeling interface for tidymodels.
This release of parsnip precipitated releases of our parsnip extension packages: baguette, discrim, plsmod, poissonreg, and rules. It also allowed us to release an additional package called multilevelmod (see the section below). We’ve kept CRAN busy!
You can see a full list of recent parsnip changes in the release notes. You can install the entire set from CRAN with:
install.packages("parsnip")
install.packages("baguette")
install.packages("discrim")
install.packages("multilevelmod")
install.packages("plsmod")
install.packages("poissonreg")
install.packages("rules")
Let’s look at a summary of the changes, which are almost entirely in parsnip, before looking at multilevelmod.
There are a lot of improvements in this version of parsnip. The main changes are described below.
A previous version of parsnip added a nice feature where the help page for each model showed the engines that are available. One confusing aspect of this was that the list depended on which packages were loaded. It also didn’t tell users which engines are possible.
Now, parsnip shows all of the known engines and labels which require extension packages. Here’s a screenshot of what you get with
This list will not change within a version of parsnip; we’ll update it with each release.
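The available engines can also be queried from the console. As a minimal sketch, parsnip's show_engines() returns the engines registered in the current session, so, unlike the help page, its result can still depend on which extension packages are loaded. The choice of "linear_reg" is only for illustration.

```r
library(parsnip)

# Engines currently registered for linear regression in this session.
# Loading an extension package (e.g. multilevelmod) can add rows.
lin_engines <- show_engines("linear_reg")
lin_engines
```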
We’ve added a model function for the excellent Bayesian Additive Regression Trees (BART) approach and an engine for the dbarts package. The model is an ensemble of trees that is assembled using Bayesian estimation methods. It typically has very good predictive performance and can also generate estimates of the predictive posterior variance as well as prediction intervals.
A good overview of this model is: Bayesian Additive Regression Trees: A Review and Look Forward (pdf).
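A BART specification follows the usual parsnip pattern. The sketch below only builds the model specification and fits nothing; the value trees = 50 is an arbitrary illustrative choice, not a recommendation.

```r
library(parsnip)

# Model specification only; nothing is fit here.
# trees = 50 is an arbitrary illustrative value.
bart_spec <-
  bart(trees = 50) %>%
  set_engine("dbarts") %>%
  set_mode("regression")

bart_spec
```

Fitting this specification would additionally require the dbarts package to be installed.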
Within parsnip, a "glm" engine was added for linear regression. An engine value of "brulee" was added for linear, logistic, and multinomial regression as well as for neural networks. The brulee package is new and fits models using torch (look for a blog post soon on this package).
As discussed below, the multilevelmod package adds a lot more engines for linear(ish) models, such as "stan_glmer". There are similar engines for logistic and Poisson regression.
The multilevelmod package has been simmering for a while on GitHub. Its engines are useful for fitting a variety of models that go by a litany of different names: mixed effects models, random coefficient models, variance component models, hierarchical linear models, and so on.
One aspect of these models is that they mostly work with the formula method, which specifies both the model terms and also which of these are “random effects”.
As an example, let’s look at the measurement system analysis (MSA) data in the package. In these data, 56 separate items were measured twice using a laboratory test. The lab would like to understand how noisy their data are and if different samples can be distinguished from one another. Here’s a plot of the data:
library(ggplot2)
library(parsnip)
library(multilevelmod)

data(msa_data)

msa_data %>%
  ggplot() +
  aes(x = reorder(id, value), y = value, col = replicate, pch = replicate) +
  geom_point(alpha = 1/2, cex = 3) +
  labs(x = NULL, y = "lab result") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 90),
    legend.position = "top"
  )
With this data set, the goal is to estimate how much of the variation in the lab test is due to the different samples (as it should be since they are different) or measurement noise. The latter term could be associated with day-to-day differences, people-to-people differences etc. It might also be irreducible noise. In any case, we’d like to get estimates of these two sources of variation.
A straightforward way to estimate this is to use a repeated measurements model that treats the samples as randomly selected from a population and independent of one another. We can add a random intercept term that is different for each sample. From this, the sample-to-sample variance can be computed.
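Written out, a random intercept model of this kind can be sketched as follows. The symbols are our own notation rather than anything from the packages: y_ij is replicate j of sample i, mu is the overall mean, b_i is the random intercept for sample i, and epsilon_ij is the measurement noise. The normality assumptions are the usual ones for this type of model.

```latex
y_{ij} = \mu + b_i + \varepsilon_{ij}, \qquad
b_i \sim N(0, \sigma^2_{\text{id}}), \qquad
\varepsilon_{ij} \sim N(0, \sigma^2_{\varepsilon})

\operatorname{Var}(y_{ij}) = \sigma^2_{\text{id}} + \sigma^2_{\varepsilon}
```

The two variance terms correspond to the two sources of variation we want to estimate: sample-to-sample differences and measurement noise.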
There are a lot of packages that can do this but we’ll use the lme4 package:
msa_model <-
  linear_reg() %>%
  set_engine("lmer") %>%
  # The formula has (1|id) which means that each sample (=id) should
  # have a different intercept (=1)
  fit(value ~ (1|id), data = msa_data)

msa_model
## parsnip model object
##
## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ (1 | id)
##    Data: data
## REML criterion at convergence: 163.0314
## Random effects:
##  Groups   Name        Std.Dev.
##  id       (Intercept) 0.6397
##  Residual             0.2618
## Number of obs: 112, groups:  id, 56
## Fixed Effects:
## (Intercept)
##      0.8778
We can see from this output that the sample-to-sample variance is 0.6397^2 ≈ 0.4092, which gives a percentage of the total variance of:
0.6397 ^ 2 / (0.6397 ^ 2 + 0.2618 ^ 2) * 100
## [1] 85.6539
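The same arithmetic can be wrapped in a small function so it is easy to reuse with other fits. This is only a sketch: variance_share() is a hypothetical helper name, not part of parsnip or multilevelmod, and the standard deviations are copied by hand from the printed model output.

```r
# Hypothetical helper: percentage of total variance attributable to the
# grouping factor, given the two standard deviations from the model output.
variance_share <- function(sd_group, sd_resid) {
  var_group <- sd_group^2
  var_resid <- sd_resid^2
  100 * var_group / (var_group + var_resid)
}

# Standard deviations copied from the printed fit
sample_share <- variance_share(sd_group = 0.6397, sd_resid = 0.2618)
round(sample_share, 4)
```

The result agrees with the 85.6539 computed directly.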
There is a lot more that can be done with these models in terms of prediction and inference. If you are interested in more about multilevelmod, take a look at the Get Started vignette.
We’d like to thank all of the contributors to these packages since their last releases: @asshah4, @batpigandme, @bshor, @cimentadaj, @daaronr, @davestr2, @DavisVaughan, @deschen1, @dfalbel, @dietrichson, @edgararuiz, @EmilHvitfeldt, @fabrice-rossi, @frequena, @ghost, @gmcmacran, @hfrick, @JB304245, @Jeffrothschild, @jennybc, @jonthegeek, @josefortou, @juliasilge, @kcarnold, @maspotts, @mattwarkentin, @meenakshi-kushwaha, @miepstei, @mmp3, @NickCH-K, @nikhilpathiyil, @nvelden, @p-lemercier, @psads-git, @RaymondBalise, @rmflight, @saadaslam, @Shafi2016, @shuckle16, @sitendug, @ssh352, @stephenhillphd, @stevenpawley, @Steviey, @t-kalinowski, @t-neumann, @tiagomaie, @topepo, @tsengj, @ttrodrigz, @wdkeyzer, @yitao-li, @zenggyu