Bayesian, frequentist and statistical learning perspectives on penalising model complexity
This article is originally published at https://theoreticalecology.wordpress.com
In regression analysis, a common problem is deciding on the right functional form of the fitted model. On the one hand, we would like to make the model as flexible as possible, so that it can adjust itself bias-free to the true data-generating process. On the other hand, the more freedom we give the model, the more uncertain the parameter estimates become (= variance), which means that beyond a certain point the total model error increases with complexity. This phenomenon, known as the bias-variance trade-off, leads to the insight that we should limit or penalise model complexity to obtain reasonable inferences.
As a result of this insight, a large number of statistical approaches exist where we try to optimise an objective of the form:
Quality(M) = L(M) – complexityPenalty(M)
where M is the model, L(M) is the (log-)likelihood, and complexityPenalty(M) adds a penalty for the model’s complexity. Examples of this structure are information criteria such as the AIC / BIC, shrinkage estimators such as the lasso / ridge (L1 / L2) penalties, and the wiggliness penalty in GAMs.
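To make this objective concrete, here is a toy sketch (in Python rather than R, with made-up data and hypothetical function names) of the ridge (L2) version for a simple no-intercept linear regression with Gaussian errors:

```python
import math

def gaussian_log_lik(y, yhat, sigma=1.0):
    """Log-likelihood of observations y under N(yhat, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (yi - yh)**2 / (2 * sigma**2)
               for yi, yh in zip(y, yhat))

def ridge_quality(y, yhat, betas, lam):
    """Quality(M) = L(M) - complexityPenalty(M), with an L2 penalty
    lam * sum(beta^2); lam = 0 recovers the plain log-likelihood."""
    penalty = lam * sum(b**2 for b in betas)
    return gaussian_log_lik(y, yhat) - penalty

# Toy data: predictions from a model with a single slope beta = 2
x = [0.0, 1.0, 2.0, 3.0]
y = [0.1, 2.2, 3.9, 6.1]
beta = 2.0
yhat = [beta * xi for xi in x]

print(ridge_quality(y, yhat, [beta], lam=0.5))
```

Larger coefficients or a larger lam reduce Quality(M), which is exactly the "penalise complexity" behaviour described above.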
When these techniques are introduced in stats classes, they are usually motivated as a means to reduce overfitting, based on the arguments I gave above. It is also well known (though perhaps less widely) that many of these penalties can be reinterpreted as Bayesian priors. For example, shrinkage penalties such as the lasso (L1) or the ridge (L2) are equivalent to a double-exponential (Laplace) and a normal prior on the regression parameters, respectively (see Fig. 1). Likewise, wiggliness penalties in GAMs can be reinterpreted as priors on functional simplicity (see Miller, David L. (2019)).
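To sketch the equivalence for the ridge case (a standard derivation, written here for a parameter vector β with penalty strength λ): maximising the penalised log-likelihood is the same as maximising the log-posterior under independent zero-mean normal priors,

```latex
\hat{\beta}
  = \arg\max_{\beta} \Big[ \log L(\beta) - \lambda \sum_j \beta_j^2 \Big]
  = \arg\max_{\beta} \Big[ \log L(\beta) + \sum_j \log \mathcal{N}(\beta_j \mid 0, \tau^2) \Big],
\qquad \lambda = \frac{1}{2\tau^2},
```

since log N(β | 0, τ²) = −β²/(2τ²) + const. The same argument with a Laplace prior of scale b gives the lasso with λ = 1/b: stronger penalties correspond to tighter priors around zero.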

One may therefore be tempted to re-interpret complexity penalties from statistical learning such as L1/L2 as an a-priori preference for simplicity, similar to Occam’s razor. This, however, misses an important point: in statistical learning, the strength of the penalty is usually estimated from the data. L1/L2 complexity penalties, for example, are usually optimised via cross-validation. Thus, the simplicity preference in these statistical learning methods is not really a priori (which is what you would expect if we had a fundamental, scientific, data-independent preference for simplicity); rather, it is something that is adjusted from the data to optimise the bias-variance trade-off. Note also that, in low-data situations, the penalty may easily favour models that are far simpler than the truth.
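The data-driven nature of the penalty is easy to see in a small sketch (pure Python, hypothetical data, a one-predictor no-intercept ridge with its closed-form estimator): the "prior strength" λ is simply whichever grid value minimises out-of-sample error.

```python
def ridge_fit(x, y, lam):
    """Closed-form ridge estimate for a one-predictor, no-intercept model."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi**2 for xi in x) + lam)

def loo_cv_error(x, y, lam):
    """Leave-one-out cross-validated squared error for a given penalty."""
    err = 0.0
    for i in range(len(x)):
        xt, yt = x[:i] + x[i+1:], y[:i] + y[i+1:]
        beta = ridge_fit(xt, yt, lam)
        err += (y[i] - beta * x[i])**2
    return err / len(x)

# Hypothetical noisy data around y = 2x
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1.3, 1.8, 3.4, 3.7, 5.6, 5.8]

# The penalty strength is chosen by cross-validation, i.e. from the data
grid = [0.0, 0.01, 0.1, 0.5, 1.0, 5.0]
best = min(grid, key=lambda lam: loo_cv_error(x, y, lam))
print("chosen lambda:", best, "beta:", ridge_fit(x, y, best))
```

Nothing in this procedure expresses a data-independent preference for small coefficients: with different data, a different λ (and hence a different implied "prior") would be selected.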
This is the reason why classical L1/L2 regularisations are better interpreted as “empirical Bayesian” rather than fully Bayesian. Empirical Bayesian methods use the Bayesian framework for inference, but with priors that are estimated from the data. Empirical and fully Bayesian perspectives can be switched or mixed, though. One could, for example, add additional data-independent priors on simplicity in a model, and in some sense the common Bayesian practice of placing “weakly informative” (data-independent) priors on regression parameters could be interpreted as a light fundamental preference of Bayesians for simplicity.
How does that help us in practice? Well, for example, I am a big fan of shrinkage estimators and would nearly always prefer them over variable selection. The reason why they are rarely used in ecology, however, is that frequentist regression packages that use shrinkage (such as glmnet) don’t calculate p-values. The reason is that obtaining calibrated p-values or CIs with nominal coverage for shrinkage estimators is hard, which suggests that the latter are better understood as statistical learning methods that optimise predictive error than as frequentist methods with controlled error rates. If we re-interpret the shrinkage penalty as a prior in a Bayesian analysis, however, we naturally get normal posterior estimates that can be interpreted quite straightforwardly for inference. Thus, if you want to apply L1 / L2 penalties in a regression without losing the ability to discuss the statistical evidence for an effect, just do it Bayesian!
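As a sketch of that last point (pure Python with hypothetical data, using the conjugate normal model with known noise sd for simplicity; in a real analysis one would use a package such as brms or rstanarm): a normal shrinkage prior on a slope yields a normal posterior, from which a credible interval follows directly.

```python
import math

def slope_posterior(x, y, sigma=1.0, tau=1.0):
    """Posterior of the slope in y = beta*x + N(0, sigma^2)
    under a normal shrinkage prior beta ~ N(0, tau^2).
    Conjugacy makes the posterior normal with closed-form mean/sd."""
    precision = sum(xi**2 for xi in x) / sigma**2 + 1.0 / tau**2
    mean = (sum(xi * yi for xi, yi in zip(x, y)) / sigma**2) / precision
    sd = math.sqrt(1.0 / precision)
    return mean, sd

# Hypothetical noisy data around y = 2x
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1.3, 1.8, 3.4, 3.7, 5.6, 5.8]
mean, sd = slope_posterior(x, y, sigma=1.0, tau=1.0)
lo, hi = mean - 1.96 * sd, mean + 1.96 * sd
print(f"posterior mean {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Here the interval excludes zero, so we can discuss the evidence for the effect directly, while the prior still shrinks the estimate towards zero exactly as an L2 penalty would.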
References
Miller, David L. (2019). “Bayesian views of generalized additive modelling.” arXiv preprint arXiv:1902.01330.
Polson, N. G., & Sokolov, V. (2019). Bayesian regularization: From Tikhonov to horseshoe. Wiley Interdisciplinary Reviews: Computational Statistics, 11(4), e1463.
Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681-686.