Bayesian, frequentist and statistical learning perspectives on penalising model complexity
This article is originally published at https://theoreticalecology.wordpress.com
In regression analysis, a common problem is deciding on the right functional form of the fitted model. On the one hand, we would like to make the model as flexible as possible, so that it can adjust itself bias-free to the true data-generating process. On the other hand, the more freedom we give the model, the more uncertain the parameter estimates become (= variance), which means that beyond a certain point the total model error increases with complexity. This phenomenon, known as the bias-variance trade-off, leads to the insight that we should limit or penalise model complexity to obtain reasonable inferences.
As a result of this insight, a large number of statistical approaches exist where we try to optimise an objective of the form:
Quality(M) = L(M) – complexityPenalty(M)
where M is the model, L(M) is the (log-)likelihood, and complexityPenalty(M) adds a penalty for the model’s complexity. Examples of this structure are information criteria such as the AIC / BIC, shrinkage estimators such as the lasso / ridge (L1 / L2) penalties, and the wiggliness penalty in GAMs.
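To make this objective concrete, here is a toy sketch (in Python rather than R, with made-up data and hypothetical function names) of the ridge (L2) version for a simple no-intercept linear regression with Gaussian errors:

```python
import math

def gaussian_log_lik(y, yhat, sigma=1.0):
    """Log-likelihood of observations y under N(yhat, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (yi - yh)**2 / (2 * sigma**2)
               for yi, yh in zip(y, yhat))

def ridge_quality(y, yhat, betas, lam):
    """Quality(M) = L(M) - complexityPenalty(M), with an L2 penalty
    lam * sum(beta^2); lam = 0 recovers the plain log-likelihood."""
    penalty = lam * sum(b**2 for b in betas)
    return gaussian_log_lik(y, yhat) - penalty

# Toy data: predictions from a model with a single slope beta = 2
x = [0.0, 1.0, 2.0, 3.0]
y = [0.1, 2.2, 3.9, 6.1]
beta = 2.0
yhat = [beta * xi for xi in x]

print(ridge_quality(y, yhat, [beta], lam=0.5))
```

Larger coefficients or a larger lam reduce Quality(M), which is exactly the "penalise complexity" behaviour described above.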
When these techniques are introduced in stats classes, they are usually motivated as a means to reduce overfitting, based on the arguments I gave above. It is also well known (though perhaps less widely) that many of these penalties can be reinterpreted as Bayesian priors. For example, shrinkage penalties such as the lasso (L1) or the ridge (L2) are equivalent to a double-exponential (Laplace) and a normal prior on the regression parameters, respectively (see Fig. 1). Likewise, wiggliness penalties in GAMs can be reinterpreted as priors on functional simplicity (see Miller, David L. (2019)).
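To sketch the equivalence for the ridge case (a standard derivation, written here for a parameter vector β with penalty strength λ): maximising the penalised log-likelihood is the same as maximising the log-posterior under independent zero-mean normal priors,

```latex
\hat{\beta}
  = \arg\max_{\beta} \Big[ \log L(\beta) - \lambda \sum_j \beta_j^2 \Big]
  = \arg\max_{\beta} \Big[ \log L(\beta) + \sum_j \log \mathcal{N}(\beta_j \mid 0, \tau^2) \Big],
\qquad \lambda = \frac{1}{2\tau^2},
```

since log N(β | 0, τ²) = −β²/(2τ²) + const. The same argument with a Laplace prior of scale b gives the lasso with λ = 1/b: stronger penalties correspond to tighter priors around zero.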

One may therefore be tempted to re-interpret complexity penalties from statistical learning such as L1/L2 as an a-priori preference for simplicity, similar to Occam’s razor. This, however, misses an important point: in statistical learning, the strength of the penalty is usually estimated from the data. L1/L2 complexity penalties, for example, are usually optimised via cross-validation. Thus, the simplicity preference in these statistical learning methods is not really a priori (which is what you would expect if we had a fundamental, scientific, data-independent preference for simplicity); rather, it is something that is adjusted from the data to optimise the bias-variance trade-off. Note also that, in low-data situations, the penalty may easily favour models that are far simpler than the truth.
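The data-driven nature of the penalty is easy to see in a small sketch (pure Python, hypothetical data, a one-predictor no-intercept ridge with its closed-form estimator): the "prior strength" λ is simply whichever grid value minimises out-of-sample error.

```python
def ridge_fit(x, y, lam):
    """Closed-form ridge estimate for a one-predictor, no-intercept model."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi**2 for xi in x) + lam)

def loo_cv_error(x, y, lam):
    """Leave-one-out cross-validated squared error for a given penalty."""
    err = 0.0
    for i in range(len(x)):
        xt, yt = x[:i] + x[i+1:], y[:i] + y[i+1:]
        beta = ridge_fit(xt, yt, lam)
        err += (y[i] - beta * x[i])**2
    return err / len(x)

# Hypothetical noisy data around y = 2x
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1.3, 1.8, 3.4, 3.7, 5.6, 5.8]

# The penalty strength is chosen by cross-validation, i.e. from the data
grid = [0.0, 0.01, 0.1, 0.5, 1.0, 5.0]
best = min(grid, key=lambda lam: loo_cv_error(x, y, lam))
print("chosen lambda:", best, "beta:", ridge_fit(x, y, best))
```

Nothing in this procedure expresses a data-independent preference for small coefficients: with different data, a different λ (and hence a different implied "prior") would be selected.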
This is the reason why classical L1/L2 regularisations are better interpreted as “empirical Bayesian” rather than fully Bayesian. Empirical Bayesian methods use the Bayesian framework for inference, but with priors that are estimated from the data. Empirical and fully Bayesian perspectives can be switched or mixed, though. One could, for example, add additional data-independent priors on simplicity in a model, and in some sense the common Bayesian practice of placing “weakly informative” (data-independent) priors on regression parameters could be interpreted as a light fundamental preference of Bayesians for simplicity.
How does that help us in practice? Well, for example, I am a big fan of shrinkage estimators and would nearly always prefer them over variable selection. The reason why they are rarely used in ecology, however, is that frequentist regression packages that use shrinkage (such as glmnet) don’t calculate p-values. The reason is that obtaining calibrated p-values or CIs with nominal coverage for shrinkage estimators is hard, which suggests that the latter are better understood as statistical learning methods that optimise predictive error than as frequentist methods with controlled error rates. If we re-interpret the shrinkage penalty as a prior in a Bayesian analysis, however, we naturally get normal posterior estimates that can be interpreted quite straightforwardly for inference. Thus, if you want to apply L1 / L2 penalties in a regression without losing the ability to discuss the statistical evidence for an effect, just do it Bayesian!
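As a sketch of that last point (pure Python with hypothetical data, using the conjugate normal model with known noise sd for simplicity; in a real analysis one would use a package such as brms or rstanarm): a normal shrinkage prior on a slope yields a normal posterior, from which a credible interval follows directly.

```python
import math

def slope_posterior(x, y, sigma=1.0, tau=1.0):
    """Posterior of the slope in y = beta*x + N(0, sigma^2)
    under a normal shrinkage prior beta ~ N(0, tau^2).
    Conjugacy makes the posterior normal with closed-form mean/sd."""
    precision = sum(xi**2 for xi in x) / sigma**2 + 1.0 / tau**2
    mean = (sum(xi * yi for xi, yi in zip(x, y)) / sigma**2) / precision
    sd = math.sqrt(1.0 / precision)
    return mean, sd

# Hypothetical noisy data around y = 2x
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [1.3, 1.8, 3.4, 3.7, 5.6, 5.8]
mean, sd = slope_posterior(x, y, sigma=1.0, tau=1.0)
lo, hi = mean - 1.96 * sd, mean + 1.96 * sd
print(f"posterior mean {mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Here the interval excludes zero, so we can discuss the evidence for the effect directly, while the prior still shrinks the estimate towards zero exactly as an L2 penalty would.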
References
Miller, David L. (2019). “Bayesian views of generalized additive modelling.” arXiv preprint arXiv:1902.01330.
Polson, N. G., & Sokolov, V. (2019). Bayesian regularization: From Tikhonov to horseshoe. Wiley Interdisciplinary Reviews: Computational Statistics, 11(4), e1463.
Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681-686.