R News / Statistics

Another take on the Hyvärinen score for model comparison

by Stephane Shao · September 22, 2018

This article is originally published at https://statisfaction.wordpress.com

Exact log-Bayes factors (log-BF) and H-factors (HF) of M₁ against M₂, computed for 100 independent samples (thin solid lines) of 1000 observations generated as i.i.d. N(1,1), under three increasingly vague priors for θ₁.

In a former post, Pierre wrote about Bayesian model comparison and the limitations of Bayes factors in the presence of vague priors. Here we are, one year later, and I am happy to announce that our joint work with Jie Ding and Vahid Tarokh has been recently accepted for publication. As way of celebrating, allow me to give you another take on the matter. Given some observations $y_{1:T} = (y_1,...,y_T)$ generated as independent draws from $\mathcal{N}(1,1)$ , consider the seemingly simple task of choosing between the following two Normal models

$M_1$ : $Y_1,...,Y_T\,|\,\theta_1 \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(\theta_1,1)$ , $\theta_1 \sim \mathcal{N}(0,\sigma_0^2)$ ,

$M_2$ : $Y_1,...,Y_T\,|\,\theta_2 \,\stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0,\theta_2)$ , $\theta_2 \sim \text{Inv-}\chi^2(0.1,1)$ ,

where $\sigma_0>0$ is a known hyperparameter that controls how vague the prior on $\theta_1$ is. In such a setting, only model $M_1$ is well-specified (in the sense of containing the true data generating process), and we can reasonably demand from any sensible model selection criterion to be able to correctly select $M_1$ over $M_2$ . The log-Bayes factor, defined as $\log \text{BF} = \log p(y_{1:T}|M_1) - \log p(y_{1:T}|M_2)$ , fits the bill asymptotically, as it is almost surely positive (thus choosing $M_1$ ) for all $T$ large enough. However, for any fixed sample size $T$ (no matter how large), making the prior on $\theta_1$ more vague by increasing $\sigma_0$ drives $p(y_{1:T}|M_1)$ to 0 and $\log \text{BF}$ to $-\infty$ , which has the undesirable effect of tricking the log-Bayes factor into preferring the misspecified model $M_2$ . This is illustrated with $T=1000$ in the above picture, where $\log(\sigma_0)$ is increased from 0 to 150, then 350. The alarming point is not so much that the value of $\log p(y_{1:T}|M_1)$ varies when the prior changes (as it should, mathematically speaking), but rather that these variations can be unbounded and arbitrary, even for changes of prior that have no impact on the inference stage. This phenomenon is concerning when using Bayes factors to perform model selection, since the choice of one prior is able to dictate the conclusion of the procedure even though the “fit” of the model is virtually unchanged beyond a certain vagueness: indeed, as $\sigma_0$ increases, the posterior $p(\theta_1|y_{1:T},M_1)$ stabilizes to $\mathcal{N}(1,1/T)$ , leading the posterior predictive distributions $p(y_{T+1}|y_{1:T},M_1)$ given by $\mathcal{N}(1,1+1/T)\approx \mathcal{N}(1,1)$ to essentially coincide with the data generating distribution, for any large fixed $T$ and all priors with $\sigma_0$ large enough; yet, as soon as $\sigma_0$ becomes too large, the Bayes factor mechanically (and wrongly) rules out $M_1$ against other misspecified models.

One possible solution to address this sensitivity to priors’ vagueness is to take a decision theoretic approach, by regarding the Bayes factor as a decision rule consisting in choosing the model $M$ that minimizes the prequential score $\sum_{t=1}^T S(y_t, p(dy_t|y_{1:t-1},M))$ , for the particular choice of score function $S(y, p) = -\log p(y)$ , known as the log-score. A score function has an associated divergence function $D_S(p,q) = \mathbb{E}_{p}[S(Y, q)-S(Y, p)]$ , which quantifies the expected loss incurred when trying to use a distribution $q$ to predict a random outcome $Y$ , whose true distribution is $p$ . In particular, the divergence associated to the log-score is the well-known Kullback-Leibler divergence $\text{KL}(p,q) = \mathbb{E}_p[\log p(Y) - \log q(Y)]$ . Finding an alternative to the Bayes factor thus becomes a matter of using a different divergence, and Dawid & Musio (2015) suggest using

$D_\mathcal{H}(p,q) = \mathbb{E}_p[\,\|\nabla\log p(Y) - \nabla\log q(Y)\|^2\,]$ ,

sometimes referred to as the relative Fisher information divergence. Its key feature is that, contrary to the Kullback-Leibler divergence, its value remains unchanged when multiplying $q(Y)$ by an arbitrary constant (which is effectively what happens when making a prior more vague, as its normalizing constant becomes arbitrary). The associated score is the Hyvärinen score, defined as

$\mathcal{H}(y,p) = 2\Delta\log p(y) + \|\nabla\log p(Y)\|^2$ .

One would then select the model $M$ minimizing the H-score, defined as $\sum_{t=1}^T \mathcal{H}(y_t, p(dy_t|y_{1:t-1},M))$ , which is now robust to any arbitrary vagueness in the specification of priors, as shown in the picture (where H-factors simply refer to differences of H-scores). In practice, H-scores need to be estimated. This can be achieved consistently using sequential Monte Carlo methods (e.g. IBIS when the likelihood is available, or SMC² for state-space models with intractable likelihoods). We can also prove (under strong assumptions) that the H-score leads to consistent model selection. If interested, more can be learned here; the code for the figure is provided here.

Thanks for reading !

Thanks for visiting r-craft.org
This article is originally published at https://statisfaction.wordpress.com
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Another take on the Hyvärinen score for model comparison

You may also like...

Categories

Another take on the Hyvärinen score for model comparison

You may also like...

What hyper-parameters are, and what to do with them; an illustration with ridge regression

Analyzing the bachelor franchise ratings with gtrendsR!

http://r-addict.com 2020-04-15 04:28:11

Categories