ANOVA in R: afex may be the solution you are looking for
This article is originally published at http://singmann.org
Prelude: When you start with
R and try to estimate a standard ANOVA , which is relatively simple in commercial software like SPSS,
R kind of sucks. Especially for unbalanced designs or designs with repeated-measures replicating the results from such software in base
R may require considerable effort. For a newcomer (and even an old timer) this can be somewhat off-putting. After I had gained experience developing my first package and was once again struggling with
R and ANOVA I had enough and decided to develop
afex. If you know this feeling,
afex is also for you.
A new version of
afex (0.18-0) has been accepted on CRAN a few days ago. This version only fixes a small bug that was introduced in the last version.
aov_ez did not work with more than one
covariate (thanks to
tkerwin for reporting this bug).
I want to use this opportunity to introduce one of the main functionalities of
afex. It provides a set of functions that make calculating ANOVAs easy. In the default settings,
afex automatically uses appropriate orthogonal contrasts for factors, transforms numerical variables into factors, uses so-called Type III sums of squares, and allows for any number of factors including repeated-measures (or within-subjects) factors and mixed/split-plot designs. Together this guarantees that the ANOVA results correspond to the results obtained from commercial statistical packages such as
SAS. On top of this, the ANOVA object returned by afex (of class
afex_aov) can be directly used for follow-up or post-hoc tests/contrasts using the
lsmeans package .
Let me illustrate how to calculate an ANOVA with a simple example. We use data courtesy of Andrew Heathcote and colleagues . The data are lexical decision and word naming latencies for 300 words and 300 nonwords from 45 participants. Here we only look at three factors:
taskis a between subjects (or independent-samples) factor: 25 participants worked on the lexical decision task (
lexdec; i.e., participants had to make a binary decision whether or not the presented string is a word or nonword) and 20 participants on the naming task (
naming; i.e., participant had to say the presented string out loud).
stimulusis a repeated-measures or within-subjects factor that codes whether a presented string was a
lengthis also a repeated-measures factor that gives the number of characters of the presented strings with three levels: 3, 4, and 5.
The dependent variable is the response latency or response time for each presented string. More specifically, as is common in the literature we analyze the log of the response times,
log_rt. After excluding erroneous responses each participants responded to between 135 and 150 words and between 124 and 150 nonwords. To use this data in an ANOVA one needs to aggregate the data such that only one observation per participant and cell of the design (i.e., combination of all factors) remains. As we will see,
afex does this automatically for us (this is one of the features I blatantly stole from
library(afex) data("fhch2010") # load data (comes with afex) mean(!fhch2010$correct) # error rate #  0.01981546 fhch <- droplevels(fhch2010[ fhch2010$correct,]) # remove errors str(fhch2010) # structure of the data # 'data.frame': 13222 obs. of 10 variables: # $ id : Factor w/ 45 levels "N1","N12","N13",..: 1 1 1 1 1 1 1 1 ... # $ task : Factor w/ 2 levels "naming","lexdec": 1 1 1 1 1 1 1 1 1 1 ... # $ stimulus : Factor w/ 2 levels "word","nonword": 1 1 1 2 2 1 2 2 1 2 ... # $ density : Factor w/ 2 levels "low","high": 2 1 1 2 1 2 1 1 1 1 ... # $ frequency: Factor w/ 2 levels "low","high": 1 2 2 2 2 2 1 2 1 2 ... # $ length : Factor w/ 3 levels "4","5","6": 3 3 2 2 1 1 3 2 1 3 ... # $ item : Factor w/ 600 levels "abide","acts",..: 363 121 ... # $ rt : num 1.091 0.876 0.71 1.21 0.843 ... # $ log_rt : num 0.0871 -0.1324 -0.3425 0.1906 -0.1708 ... # $ correct : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
We first load the data and remove the roughly 2% errors. The structure of the
data.frame (obtained via
str()) shows us that the data has a few more factors than discussed here. To specify our ANOVA we first use function
aov_car() which works very similar to base
R aov(), but as all
afex functions uses
car::Anova() (read as function
Anova() from package
car) as the backend for calculating the ANOVA.
Specifying an ANOVA
(a1 <- aov_car(log_rt ~ task*length*stimulus + Error(id/(length*stimulus)), fhch)) # Contrasts set to contr.sum for the following variables: task # Anova Table (Type 3 tests) # # Response: log_rt # Effect df MSE F ges p.value # 1 task 1, 43 0.23 13.38 *** .22 .0007 # 2 length 1.83, 78.64 0.00 18.55 *** .008 <.0001 # 3 task:length 1.83, 78.64 0.00 1.02 .0004 .36 # 4 stimulus 1, 43 0.01 173.25 *** .17 <.0001 # 5 task:stimulus 1, 43 0.01 87.56 *** .10 <.0001 # 6 length:stimulus 1.70, 72.97 0.00 1.91 .0007 .16 # 7 task:length:stimulus 1.70, 72.97 0.00 1.21 .0005 .30 # --- # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1 # # Sphericity correction method: GG # Warning message: # More than one observation per cell, aggregating the data using mean (i.e, fun_aggregate = mean)!
The printed output is an ANOVA table that could basically be copied to a manuscript as is. One sees the terms in column
Effect, the degrees of freedoms (
df), the mean-squared error (
MSE, I would probably remove this column in a manuscript), the F-value (
F, which also contains the significance stars), and the p-value (
p.value). The only somewhat uncommon column is
ges which provides generalized eta-squared, ‘the recommended effect size statistics for repeated measure designs’ . The standard output also reports Greenhouse-Geisser (
GG) corrected df for repeated-measures factors with more than two levels (to account for possible violations of sphericity). Note that these corrected df are not integers.
We can also see a warning notifying us that
afex has detected that each participant and cell of the design provides more than one observation which are then automatically aggregated using
mean. The warning serves the purpose to notify the user in case this was not intended (i.e., when there should be only one observation per participant and cell of the design). The warning can be suppressed via specifying
fun_aggregate = mean explicitly in the call to
The formula passed to
aov_car basically needs to be the same as for standard
aov with a few differences:
- It must have an
Errorterm specifying the column containing the participant (or unit of observation) identifier (e.g., minimally
+Error(id)). This is necessary to allow the automatic aggregation even in designs without repeated-measures factor.
- Repeated-measures factors only need to be defined in the
Errorterm and do not need to be enclosed by parentheses. Consequently, the following call produces the same ANOVA:
aov_car(log_rt ~ task + Error(id/length*stimulus), fhch)
In addition to
afex provides two further function for calculating ANOVAs. These function produce the same output but differ in the way how to specify the ANOVA.
aov_ezallows the ANOVA specification not via a
charactervectors (and is similar to
aov_ez(id = "id", dv = "log_rt", fhch, between = "task", within = c("length", "stimulus"))
aov_4requires a formula for which the
idand repeated-measures factors need to be specified as in
lme4::lmer()(with the same simplification that repeated-measures factors only need to be specified in the random part):
aov_4(log_rt ~ task + (length*stimulus|id), fhch) aov_4(log_rt ~ task*length*stimulus + (length*stimulus|id), fhch)
A common requirement after the omnibus test provided by the ANOVA is some-sort of follow-up analysis. For this purpose,
afex is fully integrated with
For example, assume we are interested in the significant
task:stimulus interaction. As a first step we might want to investigate the marginal means of these two factors:
lsmeans(a1, c("stimulus","task")) # NOTE: Results may be misleading due to involvement in interactions # stimulus task lsmean SE df lower.CL upper.CL # word naming -0.34111656 0.04250050 48.46 -0.42654877 -0.25568435 # nonword naming -0.02687619 0.04250050 48.46 -0.11230839 0.05855602 # word lexdec 0.00331642 0.04224522 47.37 -0.08165241 0.08828525 # nonword lexdec 0.05640801 0.04224522 47.37 -0.02856083 0.14137684 # # Results are averaged over the levels of: length # Confidence level used: 0.95
From this we can see naming trials seems to be generally slower (as a reminder, the dv is log-transformed RT in seconds, so values below 0 correspond to RTs bewteen 0 and 1), It also appears that the difference between
nonword trials is larger in the
naming task then in the
lexdec task. We test this with the following code using a few different
lsmeans function. We first use
lsmeans again, but this time using
task as the conditioning variable specified in
by. Then we use
pairs() for obtaining all pairwise comparisons within each conditioning strata (i.e., level of
task). This provides us already with the correct tests, but does not control for the family-wise error rate across both tests. To get those, we simply
update() the returned results and remove the conditioning by setting
by=NULL. In the call to update we can already specify the method for error control and we specify
'holm', because it is uniformly more powerful than Bonferroni.
# set up conditional marginal means: (ls1 <- lsmeans(a1, c("stimulus"), by="task")) # task = naming: # stimulus lsmean SE df lower.CL upper.CL # word -0.34111656 0.04250050 48.46 -0.42654877 -0.25568435 # nonword -0.02687619 0.04250050 48.46 -0.11230839 0.05855602 # # task = lexdec: # stimulus lsmean SE df lower.CL upper.CL # word 0.00331642 0.04224522 47.37 -0.08165241 0.08828525 # nonword 0.05640801 0.04224522 47.37 -0.02856083 0.14137684 # # Results are averaged over the levels of: length # Confidence level used: 0.95 update(pairs(ls1), by=NULL, adjust = "holm") # contrast task estimate SE df t.ratio p.value # word - nonword naming -0.31424037 0.02080113 43 -15.107 <.0001 # word - nonword lexdec -0.05309159 0.01860509 43 -2.854 0.0066 # # Results are averaged over the levels of: length # P value adjustment: holm method for 2 tests
Hmm. These results show that the
stimulus effects in both
task conditions are independently significant. Obviously, the difference between them must also be significant then, or?
pairs(update(pairs(ls1), by=NULL)) # contrast estimate SE df t.ratio p.value # wrd-nnwrd,naming - wrd-nnwrd,lexdec -0.2611488 0.02790764 43 -9.358 <.0001
They obviously are. As a reminder, the interaction is testing exactly this, the difference of the difference. And we can actually recover the F-value of the interaction using
lsmeans alone by invoking yet another of its functions,
test(pairs(update(pairs(ls1), by=NULL)), joint=TRUE) # df1 df2 F p.value # 1 43 87.565 <.0001
These last two example were perhaps not particularly interesting from a statistical point of view, but show an important ability of
lsmeans. Any set of estimated marginal means produced by
lsmeans, including any sort of (custom) contrasts, can be used again for further tests or calculating new sets of marginal means. And with
test() we can even obtain joint F-tests over several parameters using
lsmeans is extremely powerful and one of my most frequently used packages that basically performs all tests following an omnibus test (and in its latest version it directly interfaces with
rstanarm so it can now also be used for a lot of Bayesian stuff, but this is the topic of another blog post).
lsmeans can also be used directly for plotting by envoking
lsmip(a1, task ~ stimulus)
lsmip does not add error bars to the estimated marginal means, but only plots the point estimates. There are mainly two reasons for this. First, as soon as repeated-measures factors are involved, it is difficult to decide which error bars to plot. Standard error bars based on the standard error of the mean are not appropriate for within-subjects comparisons. For those, one would need to use a within-subject intervals (see also here or here). Especially for plots as the current one with both independent-samples and repeated-measures factors (i.e., mixed within-between designs or split-plot designs) no error bar will allow comparisons across both dimensions. Second, only ‘if the SE [i.e., standard error] of the mean is exactly 1/2 the SE of the difference of two means — which is almost never the case — it would be appropriate to use overlapping confidence intervals to test comparisons of means’ (lsmeans author Russel Lenth, the link provides an alternative).
We can also use
lsmeans in combination with
lattice to plot the results on the unconstrained scale (i.e., after back-transforming tha data from the log scale to the original scale of response time in seconds). The plot is not shown here.
lsm1 <- summary(lsmeans(a1, c("stimulus","task"))) lsm1$lsmean <- exp(lsm1$lsmean) require(lattice) xyplot(lsmean ~ stimulus, lsm1, group = task, type = "b", auto.key = list(space = "right"))
afexprovides a set of functions that make specifying standard ANOVAs for an arbitrary number of between-subjects (i.e., independent-sample) or within-subjects (i.e., repeated-measures) factors easy:
- In its default settings, the
afexANOVA functions replicate the results of commercial statistical packages such as SPSS or SAS (using orthogonal contrasts and Type III sums of squares).
- Fitted ANOVA models can be passed to
lsmeansfor follow-up tests, custom contrast tests, and plotting.
- For specific questions visit the new afex support forum: afex.singmann.science (I think we just need someone to ask the first ANOVA question to get the ball rolling).
- For more examples see the vignette or here (blog post by Ulf Mertens) or download the full example R script used here.
As a caveat, let me end this post with some cautionary remarks from Douglas Bates (
fortunes::fortune(184)) who explains why ANOVA in
R is supposed to not be the same as in other software packages (i.e., he justifies why it ‘sucks’):
You must realize that R is written by experts in statistics and statistical computing who, despite popular opinion, do not believe that everything in SAS and SPSS is worth copying. Some things done in such packages, which trace their roots back to the days of punched cards and magnetic tape when fitting a single linear model may take several days because your first 5 attempts failed due to syntax errors in the JCL or the SAS code, still reflect the approach of “give me every possible statistic that could be calculated from this model, whether or not it makes sense”. The approach taken in R is different. The underlying assumption is that the useR is thinking about the analysis while doing it.
— Douglas Bates (in reply to the suggestion to include type III sums of squares and lsmeans in base R to make it more similar to SAS or SPSS)
R-help (March 2007)
Maybe he is right, but maybe what I have described here is useful to some degree.
Thanks for visiting r-craft.org
This article is originally published at http://singmann.org
Please visit source website for post related comments.