Model averaging in ecology: a review of Bayesian, information-theoretic and tactical approaches for predictive inference
This article is originally published at https://theoreticalecology.wordpress.com
This (co-)guest post by Carsten F. Dormann & Florian Hartig summarizes a comprehensive review on model averaging for predictive inference, just published in Ecological Monographs.
Dormann, C.F., Calabrese, J.M., Guillera-Arroita, G., Matechou, E., Bahn, V., Bartoń, K., et al. (in press). Model averaging in ecology: a review of Bayesian, information-theoretic and tactical approaches for predictive inference. Ecol Monogr, doi: 10.1002/ecm.1309
When times are dire, and data are scarce, quantitative ecologists (or quantitative scientists in general) often reach into their quiver for an arrow called model averaging.
Model averaging refers to the practice of using several models at once when making predictions (the focus of our review), or when inferring parameters (the focus of other papers, and of some recent controversy; see, e.g., Banner & Higgs, 2017). There are literally thousands of publications across the disciplines that practice “classical” model averaging, i.e. averaging a few or many models that one could also use “stand-alone”. Additionally, model averaging, as a principle, underlies many of the most commonly used machine-learning methods (e.g. the bagging of trees in random forests, or the averaging of neural-network predictions). We devoted only a few sentences in the appendix of the paper to this, but we think that the link between classical model averaging and machine learning is not sufficiently appreciated and could be further explored.
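To make that link concrete, here is a minimal sketch (ours, not from the paper) of bagging as unweighted model averaging: the same flexible model is fitted to bootstrap resamples of a toy data set, and the resulting predictions are simply averaged.

```python
# Minimal sketch (not from the paper): bagging as unweighted model averaging.
# The same flexible model is fitted to bootstrap resamples of a toy data set
# and the resulting predictions are averaged with equal weights.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 50)   # toy data
x_new = np.linspace(0, 1, 200)

preds = []
for _ in range(100):                                  # 100 bootstrap "member models"
    idx = rng.integers(0, len(x), len(x))             # resample with replacement
    coefs = np.polyfit(x[idx], y[idx], deg=6)         # a deliberately flexible model
    preds.append(np.polyval(coefs, x_new))

bagged = np.mean(preds, axis=0)                       # the (unweighted) model average
```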
In ecology, averaging of statistical models is heavily dominated by the “information-theoretic” framework popularised by Burnham & Anderson (2002), while alternative methods used in other scientific fields are less well known. When we set out in March 2015, in the form of a workshop, to conduct a comprehensive review of the wealth of model-averaging approaches, we anticipated this diversity, but not the road full of potholes that we encountered. Studies and information about the topic are fragmented across disciplines, many of which have developed their own ideas and terminology for approaching the model-averaging problem. Moreover, the field is largely characterized by a hands-on approach, in which alternative ways to average and to quantify uncertainties are proposed in abundance, but with very little “cleaning up” of what works and what doesn’t. As a consequence, what started as a small workshop developed into a multi-author, multi-year activity that culminated in a multi-faceted publication, in which the actual technical description of the various available model-averaging algorithms is only one part.
Apart from mapping the method jungle, our review explains, probably for the first time in the ecological literature,
- why and when model averaging (MA) works, and what this depends on (see our explanation of how bias, (co)variance and the uncertainty of weight estimation influence the benefits of MA);
- how to quantify the uncertainty of model-averaged predictions, and why obtaining good uncertainty estimates is substantially harder than one might expect.
The goal of this post is to whet your appetite, not to reproduce the entire paper. Thus, in what follows, we will only have a superficial look at the ingredients of each of these points.
Bias, (co)variance and weight uncertainty
The first part of our paper shows how the error of model-averaged predictions can be decomposed into the bias and error (co)variance of the contributing models, plus the uncertainty of weight estimation. Some key insights are:
- If our different models err systematically, but equally on the high and the low side, then their average has less bias.
- If our models vary stochastically, but all in the same way, then there is little point in averaging them. MA becomes more useful the lower the covariance between estimates.
- If all our models are more or less great (or poor), we can save ourselves the trouble of estimating weights.
Here are some titbits of explanation:
First off, prediction uncertainty, quantified e.g. as the mean squared error (MSE), is the sum of the squared bias and the variance. Hence, we can decompose the effect of model averaging into its effect on bias and its effect on variance.
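This decomposition is easy to check numerically. A minimal sketch with a simulated, deliberately biased predictor (the toy numbers are ours, purely for illustration):

```python
# Numerical check of MSE = bias^2 + variance for a single predictive model.
import numpy as np

rng = np.random.default_rng(42)
truth = 2.0
predictions = truth + 0.5 + rng.normal(0, 1.0, 100_000)  # biased (+0.5), noisy predictions

bias = predictions.mean() - truth
variance = predictions.var()
mse = np.mean((predictions - truth) ** 2)

print(mse, bias**2 + variance)  # the two numbers are (essentially) identical
```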
The first point, about systematic error, is usually not so relevant for statistical models. Classical/typical/good statistical models are unbiased, i.e. their mean prediction does not deviate from the truth. For process-based models, this need not be the case. If a process is specified wrongly, the model’s predictions may be consistently too high or too low. Averaging predictions from different process models, whose biases go in either direction, should therefore cancel part of the error and hence reduce the bias of the averaged prediction, which explains why model averaging is popular in process-based modelling communities such as climate modelling.
The second point, about variance, is more relevant for statistical models. Variance refers to the fact that an ideal statistical model gets it right on average (no bias), but will still make an error in each single application (variance). For an unbiased model, predictions will have a smaller error if their variance is lower. We show that, as a consequence of error propagation, the variance of the averaged prediction depends on the variance of each contributing model, as well as on the covariances among their predictions. Thus, if all models made identical predictions, the covariance would cancel any benefit of averaging. If, however, the models’ predictions are completely uncorrelated, averaging reduces the variance of the averaged prediction substantially.
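As an illustration (ours, not from the paper): for M models with equal error variance σ² and pairwise error correlation ρ, the equally weighted average has variance σ²(1 + (M − 1)ρ)/M, which the following sketch checks by simulation.

```python
# Sketch (our illustration): how correlation among model errors limits the
# variance reduction from averaging. For M models with equal error variance
# sigma^2 and pairwise correlation rho, the equally weighted average has
# variance sigma^2 * (1 + (M - 1) * rho) / M.
import numpy as np

def var_of_average(rho, M=5, sigma=1.0, n=200_000, seed=0):
    rng = np.random.default_rng(seed)
    cov = sigma**2 * (np.full((M, M), rho) + (1 - rho) * np.eye(M))
    errors = rng.multivariate_normal(np.zeros(M), cov, size=n)  # correlated model errors
    return errors.mean(axis=1).var()                            # variance of the average

for rho in (0.0, 0.5, 0.9):
    print(rho, var_of_average(rho))   # approx 0.2, 0.6, 0.92 for M = 5, sigma = 1
```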

Fig. 1 from Dormann et al. – conceptual depiction of the contributions of error to model averaging.
Hang on!? So only if my models make very different predictions (which might be worrying for some) do I get the full benefits of model averaging? Correct!
And it gets more complicated.
There is another factor influencing the variance, which is the weighting of the models. If we throw all models we can get our hands on willy-nilly into an averaging procedure, then surely we need to sort the wheat from the chaff first. It seems illogical to allow a crappy model to ruin our model average, so we need to downweight it. Or, as the advice in many papers reads: “Only average plausible models.”
Here it gets really confusing in the literature, because that is exactly what many highly successful machine-learning approaches do not do. For example, in bagging, a commonly used machine-learning principle, all models are averaged, and they are not even weighted!
The underlying issue is that, when estimating model weights, we may accrue substantial uncertainty, and this uncertainty also propagates into our model-averaged prediction (Claeskens et al. 2016)! Indeed, it may often be wiser not to compute model weights at all, if we have already pre-selected our models, as is the common procedure in economics and with the IPCC earth-system models.
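The following toy simulation (ours, not one of the paper’s case studies) illustrates the point made by Claeskens et al. (2016): weights estimated from a small validation sample add noise of their own, and simple equal weighting is then hard to beat.

```python
# Toy illustration (not from the paper) of the "forecast combination puzzle"
# discussed by Claeskens et al. (2016): weights estimated from little data add
# noise and can make the combined forecast worse than simple equal weighting.
import numpy as np

rng = np.random.default_rng(7)
truth = 0.0
M, n_val, n_rep = 4, 20, 5_000
sigmas = np.array([1.0, 1.1, 1.2, 1.3])   # unbiased models with similar error variances
mse_equal, mse_estimated = [], []

for _ in range(n_rep):
    val = truth + rng.normal(0, sigmas, size=(n_val, M))   # small validation set
    new = truth + rng.normal(0, sigmas, size=M)            # one new prediction per model

    w_est = 1.0 / np.mean((val - truth) ** 2, axis=0)      # inverse-MSE weights
    w_est /= w_est.sum()

    mse_equal.append((new.mean() - truth) ** 2)
    mse_estimated.append((np.dot(w_est, new) - truth) ** 2)

print(np.mean(mse_equal), np.mean(mse_estimated))  # equal weights are hard to beat here
```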
How uncertain is the model-averaged prediction?
After having established that model averaging can (in the right circumstances) improve predictions, let us turn to the second presumed benefit of model averaging, a better representation of uncertainty.
A commonly cited reason to use model averaging is that we cannot decide which of our candidate models is the correct one, and therefore want to include them all to better represent our structural uncertainty. The obvious question then is: how do we compute an uncertainty estimate for a model average? As ingredients we (possibly) have (a) a prediction from each model, (b) a standard error for each model’s prediction, e.g. from bootstrapping, (c) the model weights, and (d) the unknown uncertainty in the model weights. How do we brew these into one 95% confidence interval for the model-averaged prediction?
Again, we shall not disclose the details as given in the paper, but this issue caused some serious head-scratching among the authors (each by herself, of course).
As a teaser: there are a few proposals for how to construct frequentist confidence intervals, but they are by and large problematic. Some assume perfect correlation of predictions and rely on “non-standard mathematics”, others assume perfect independence and work surprisingly well in our little test run. (Our personal all-time favourite, the full model, of course did best, but that is not a very helpful finding for any process modeller.)
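As one example of such a proposal, here is a sketch (variable names and numbers are ours) of a widely cited recipe going back to Burnham & Anderson (2002), in which each model’s standard error is inflated by that model’s deviation from the averaged prediction, bearing in mind the caveat above that such intervals can be problematic.

```python
# Sketch (our variable names) of the "unconditional" standard error going back
# to Burnham & Anderson (2002): each model's standard error is combined with
# its squared deviation from the model-averaged prediction.
import numpy as np
from scipy import stats

def model_averaged_interval(preds, ses, weights, level=0.95):
    """preds: per-model predictions; ses: per-model standard errors;
    weights: model weights summing to one."""
    preds, ses, weights = map(np.asarray, (preds, ses, weights))
    avg = np.sum(weights * preds)
    se_avg = np.sum(weights * np.sqrt(ses**2 + (preds - avg) ** 2))
    z = stats.norm.ppf(0.5 + level / 2)
    return avg, (avg - z * se_avg, avg + z * se_avg)

# made-up example with three candidate models
print(model_averaged_interval([1.8, 2.3, 2.9], [0.4, 0.5, 0.6], [0.5, 0.3, 0.2]))
```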
However, it should be noted that things are not so bad if one is only interested in the predictive error (which can be obtained by cross-validation), or if one works in a Bayesian framework, where posterior predictive intervals are more natural to compute.
How to compute the model-averaging weights?
Finally, we come to the topic that you all must have waited for: what’s the best method to compute the weights? We gave it away already: it’s hard to say, because there are many proposals out there, far more than informative method comparisons.
We divided the method zoo into three sections: one for Bayesians, one for “information-criterion (IC) folks”, and one for practically oriented folks (aka machine learners & co).
The pure Bayesian side is theoretically simple, but difficult implementation-wise (we’re talking here about the problem of estimating marginal likelihoods of the models, e.g. by reversible-jump MCMC or some other approximations).
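Conceptually, once the marginal likelihoods are available, the Bayesian weights are simply posterior model probabilities; a schematic sketch (with made-up numbers, since estimating the marginal likelihoods is the hard part):

```python
# Schematic sketch (ours): once marginal likelihoods p(D | M_i) are available
# (the hard part), Bayesian model-averaging weights are posterior model probabilities.
import numpy as np

log_marginal_lik = np.array([-120.3, -121.0, -125.7])   # made-up numbers
log_prior = np.log(np.ones(3) / 3)                       # equal prior model probabilities

log_post = log_marginal_lik + log_prior
weights = np.exp(log_post - log_post.max())              # subtract the max for numerical stability
weights /= weights.sum()
print(weights)   # posterior model probabilities = BMA weights
```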
The information-theoretic approaches are theoretically somewhat more dubious (because they seem to head strongly in the Bayesian direction, with model weights being something akin to model probabilities, but then verbally shun Bayesian viewpoints), but they are well established computationally.
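For reference, the information-criterion recipe in a nutshell, i.e. Akaike weights as popularized by Burnham & Anderson (2002), shown here with made-up AIC values:

```python
# Akaike weights as popularized by Burnham & Anderson (2002):
# w_i = exp(-0.5 * delta_i) / sum_j exp(-0.5 * delta_j), with delta_i = AIC_i - min(AIC).
import numpy as np

aic = np.array([312.4, 313.1, 317.8])     # made-up AIC values for three candidate models
delta = aic - aic.min()
weights = np.exp(-0.5 * delta)
weights /= weights.sum()

predictions = np.array([2.1, 2.4, 1.7])   # made-up per-model predictions
print(weights, np.dot(weights, predictions))   # weights and the model-averaged prediction
```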
The smorgasbord (this word was chosen to reflect the European dominance in the author collective) of approaches fitting neither category, which we labelled “tactical”, comprised both the sound and the obscure. In short, we summarize here all the approaches that directly target a reduction of predictive error, be it by machine-learning principles or by verbal argument. Key examples are stacking and jackknife model averaging.
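To give a flavour of the tactical camp, here is a minimal stacking sketch (ours): weights are chosen to minimize the squared error of the combined out-of-sample (cross-validated) predictions, constrained to be non-negative and to sum to one.

```python
# Minimal stacking sketch (ours): choose weights that minimize the squared error
# of the combined out-of-sample (e.g. cross-validated) predictions, constrained
# to be non-negative and to sum to one.
import numpy as np
from scipy.optimize import minimize

def stacking_weights(cv_preds, y):
    """cv_preds: (n_observations, n_models) out-of-sample predictions; y: observed values."""
    M = cv_preds.shape[1]
    objective = lambda w: np.mean((y - cv_preds @ w) ** 2)
    result = minimize(objective,
                      x0=np.full(M, 1.0 / M),
                      bounds=[(0, 1)] * M,
                      constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}])
    return result.x

# made-up example: two models' cross-validated predictions for five observations
cv_preds = np.array([[1.0, 1.4], [2.1, 1.8], [2.9, 3.3], [4.2, 3.7], [5.1, 4.6]])
y = np.array([1.1, 2.0, 3.1, 3.9, 5.0])
print(stacking_weights(cv_preds, y))
```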
Detailed explanations of each approach are given in the paper, and we also ran most methods through two case studies. We found little in our results to justify the dominance of AIC-based model averaging. And model-averaging did not necessarily outperform single models.

Fig. 5 from Dormann et al. – Performance of different methods in a case study
Take-home message
Model averaging has no super-powers. Claims of “combining the best from all models” are plain nonsense. As with most other statistical methods, close inspection shows that model averaging has benefits and costs, and an analyst must weigh them carefully against each other to decide which approach is most suitable for their problem.
Benefits include a possible reduction of predictive error. Costs include the fact that this does not always work, and that confidence intervals (and also p-values) are difficult to provide.
To reduce prediction error, we recommend cross-validation-based approaches, which are specifically designed to achieve this goal. Embracing model structural uncertainty is certainly a laudable ambition, but the precise mathematics are complicated, and robust methods that work out of the box are not yet available.
Literature
Banner, K.M. & Higgs, M.D. (2017). Considerations for assessing model averaging of regression coefficients. Ecological Applications, 28, 78–93.
Burnham, K.P. & Anderson, D.R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd edn. Springer, New York.
Claeskens, G., Magnus, J.R., Vasnev, A.L. & Wang, W. (2016). The forecast combination puzzle: a simple theoretical explanation. International Journal of Forecasting, 32, 754–762.
Dormann, C.F., Calabrese, J.M., Guillera-Arroita, G., Matechou, E., Bahn, V., Bartoń, K., et al. (in press). Model averaging in ecology: a review of Bayesian, information-theoretic and tactical approaches for predictive inference. Ecol Monogr, doi: 10.1002/ecm.1309