
Metropolis-in-Gibbs Sampling and Runtime Analysis with profvis

by AO · November 8, 2017

This article is originally published at https://stablemarkets.wordpress.com

First off, here are the previous posts in my Bayesian sampling series:

In the first post, I illustrated Gibbs Sampling – an algorithm for getting draws from a posterior when conditional posteriors are known. In the second post, I showed that if we can vectorize, then drawing a whole “block” per iteration will increase the speed of the sampler.

For many models, like logistic regression models, there are no conjugate priors – so Gibbs is not directly applicable. And as we saw in the first post, the brute force grid method is much too slow to scale to real-world settings.

This post shows how we can use Metropolis-Hastings (MH) to sample from non-conjugate conditional posteriors within each blocked Gibbs iteration – a much better alternative than the grid method.

I’ll illustrate the algorithm, give some R code results (all code posted on my GitHub), and then profile the R code to identify the bottlenecks in the MH algorithm.

The Model

The simulated data for this example is a cross-sectional dataset with $N=1000$ patients. There is one binary outcome, $Y$ , a binary treatment variable, $A$ , and one confounder, age. Age is a categorical variable with 3 levels. I control for it using 2 dummies (with one category as reference). I model this with a Bayesian logistic regression:

$logit(P(Y=1)) = \beta_0 + \beta_1A + \beta_2age_1 + \beta_3age_2 = X\vec \beta$

$\beta_0, \beta_1, \beta_2, \beta_3 \sim N(\lambda, \phi)$

Above, $\lambda$ is assumed known. Note I’m using $X$ to denote the 1000×4 model matrix and $\vec \beta$ to denote the 4×1 parameter vector. I also place an inverse gamma prior on $\phi$ with known hyper-parameters. This is a fairly realistic motivating example for the Metropolis-in-Gibbs:

We have a binary outcome for which we employ a non-linear link function.
We have a confounder for which we need to adjust.
We are estimating more parameters than we care about. In this setting, we really care about the estimate of the treatment effect $\beta_1$ , so the other coefficients are in some sense nuisance parameters. I wouldn’t say this is a “high-dimensional” setting, but it’s definitely going to strain the sampler.
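To make the setup concrete, here is a minimal sketch of how data like this could be simulated. The coefficient values, the 50/50 treatment split, and the age labels are all hypothetical choices for illustration; they are not the values behind the results in this post.

```r
set.seed(1)                        # for reproducibility
N <- 1000                          # number of patients, as described above

# Binary treatment and a 3-level age category (labels are placeholders)
A   <- rbinom(N, size = 1, prob = 0.5)
age <- factor(sample(c("young", "middle", "old"), N, replace = TRUE),
              levels = c("young", "middle", "old"))

# model.matrix() adds the intercept and 2 age dummies (first level = reference)
X <- model.matrix(~ A + age)

# Hypothetical "true" coefficients: intercept, treatment, age dummy 1, age dummy 2
beta_true <- c(-1, 0.5, 0.3, 0.8)

# plogis() is base R's expit (inverse logit)
p <- plogis(as.vector(X %*% beta_true))
Y <- rbinom(N, size = 1, prob = p)

dim(X)                             # 1000 rows, 4 columns
```

Building $X$ with model.matrix() keeps the dummy coding and the reference category explicit, which is handy when matching columns to coefficients later.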

Unnormalized Conditional Posteriors

Let’s look at the (unnormalized) conditional posteriors of this model. I won’t go through the derivation, but it follows the same procedure used in my previous posts.
$log f( \vec \beta | \phi, X, Y) = \sum_{i=1}^N \left[ y_i\cdot log(p_i) + (1 - y_i) \cdot log(1 - p_i) \right] - \frac{1}{2\phi}(\vec \beta - \vec \lambda)'(\vec \beta - \vec \lambda) + C$
Recall we are modeling $p_i = expit(x_i' \vec \beta)$ . Here $x_i$ denotes the $i^{th}$ row of the model matrix, $X$ , and $i$ indexes the patient. $\vec \lambda$ is the 4×1 vector with every element equal to $\lambda$ , and $C$ is a constant that does not depend on $\vec \beta$ .
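In R, this log density could be written roughly as follows. This is a sketch: the function name matches the one profiled later in this post, but the argument list is my assumption here, not necessarily what the posted code uses.

```r
# Log of the unnormalized conditional posterior of beta (additive constant dropped).
# Arguments are assumed for illustration: beta (coefficient vector), phi (prior
# variance), lambda (prior mean), X (model matrix), Y (0/1 outcome vector).
log_cond_post_beta <- function(beta, phi, lambda, X, Y) {
  eta <- as.vector(X %*% beta)

  # log(p_i) and log(1 - p_i), computed on the log scale for numerical stability;
  # note that 1 - expit(eta) = expit(-eta)
  log_p   <- plogis(eta,  log.p = TRUE)
  log_1mp <- plogis(-eta, log.p = TRUE)

  log_lik   <- sum(Y * log_p + (1 - Y) * log_1mp)
  log_prior <- -sum((beta - lambda)^2) / (2 * phi)

  log_lik + log_prior
}

# Tiny check: with beta = 0 every p_i is 0.5 and the prior term vanishes,
# so the result is 2 * log(0.5)
X_toy <- cbind(1, c(0, 1))
Y_toy <- c(0, 1)
val <- log_cond_post_beta(c(0, 0), phi = 10, lambda = 0, X = X_toy, Y = Y_toy)
```

Working on the log scale avoids underflow: a product of 1000 probabilities is easily smaller than the smallest representable double.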

There is no conjugacy here! This conditional distribution is not a known distribution so we can’t simply sample from it using Gibbs. Rather, within each Gibbs iteration we need another sampling step to draw from this conditional posterior. This second sampler will be the MH sampler. If you need a refresher on Gibbs, see the previous posts linked above.

Side note: the conditional posterior of $\phi$ is conjugate. So within each Gibbs iteration we can use a standard function to sample from the inverse gamma. No need for a second sampling step here. Phew!
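For reference, here is a sketch of that conjugate draw. I’m assuming the prior is parameterized as an inverse gamma with shape $a$ and scale $b$ , in which case the conditional posterior is an inverse gamma with shape $a + 4/2$ and scale $b + \frac{1}{2}(\vec \beta - \vec \lambda)'(\vec \beta - \vec \lambda)$ . The function name and the hyper-parameter values below are hypothetical.

```r
# Draw phi from its inverse gamma conditional posterior.
# Assumes phi ~ InvGamma(shape = a, scale = b) a priori (an assumption of this sketch).
rcond_post_phi <- function(beta, lambda, a, b) {
  shape <- a + length(beta) / 2
  scale <- b + sum((beta - lambda)^2) / 2
  # If G ~ Gamma(shape, rate = scale), then 1 / G ~ InvGamma(shape, scale)
  1 / rgamma(1, shape = shape, rate = scale)
}

set.seed(2)
phi_draw <- rcond_post_phi(beta = c(-1, 0.5, 0.3, 0.8), lambda = 0, a = 5, b = 10)
```

Base R has no inverse gamma sampler, so taking the reciprocal of a gamma draw is the usual workaround.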

Metropolis-Hastings

The goal is to sample from the density $f(\vec \beta | \phi, X, Y)$ , which we can only evaluate up to a constant. Note this is a 4-dimensional density. For simplicity of explanation, assume we only have one $\beta$ and that its conditional density is unimodal. The MH sampler works as follows:

1. Choose an initial value $\beta^{(0)}$ to get the sampling started.
2. Pick a distribution centered around $\beta^{(0)}$ . This is called a proposal distribution and, in the standard case, must be symmetric around $\beta^{(0)}$ . E.g. we could use a $N(\beta^{(0)}, \sigma^2)$ . For now let’s assume we set the variance of the proposal distribution to some constant.
3. Draw a proposal, $\beta^*$ , from the proposal distribution. Then calculate the ratio of the unnormalized densities evaluated at the current proposal, $\beta^*$ , and at the previous value, $\beta^{(0)}$ : $r = \frac{f(\beta^*)}{f(\beta^{(0)})}$
4. If this ratio is greater than 1, then the density at the current proposal is higher than the density at the previous value. So we “accept” the proposal and set $\beta^{(1)} = \beta^*$ . If the ratio is less than 1, then the density at the current proposal is lower than the density at the previous value. In this case, we accept $\beta^*$ with probability $r$ ; otherwise we reject it and set $\beta^{(1)} = \beta^{(0)}$ . Then we repeat steps 2-4 using a proposal distribution centered around $\beta^{(1)}$ .
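The steps above can be sketched in a few lines of R. This toy version targets a one-dimensional standard normal (my choice for illustration, not the model in this post) and works with the log of the ratio $r$ , which is numerically safer than dividing two tiny densities.

```r
set.seed(3)

# Toy unnormalized log density: a standard normal without its constant
log_f <- function(beta) -0.5 * beta^2

n_iter   <- 5000
prop_sd  <- 1                 # proposal standard deviation, held constant
beta     <- numeric(n_iter)
beta[1]  <- 3                 # step 1: initial value, deliberately off-center
accepted <- 0

for (i in 2:n_iter) {
  # steps 2-3: draw a proposal from N(previous value, prop_sd^2)
  beta_star <- rnorm(1, mean = beta[i - 1], sd = prop_sd)

  # log of r = f(beta_star) / f(previous value)
  log_r <- log_f(beta_star) - log_f(beta[i - 1])

  # step 4: accept with probability min(1, r)
  if (log(runif(1)) < log_r) {
    beta[i]  <- beta_star
    accepted <- accepted + 1
  } else {
    beta[i] <- beta[i - 1]    # rejection: the chain repeats the previous value
  }
}

accept_rate <- accepted / (n_iter - 1)
```

Comparing log(runif(1)) to log_r handles both cases of step 4 at once: when $r > 1$ the log ratio is positive and the proposal is always accepted.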

So, proposals that yield a higher conditional posterior evaluation are always accepted. However, proposals with lower density evaluations are only sometimes accepted – the lower the relative density evaluation of the proposal, the lower the probability of its acceptance (intuitive!). Over many iterations, draws from the posterior’s high density areas are accepted and the sequence of accepted proposals “climbs” to the high density area. Once the sequence arrives in this high-density area, it tends to remain there. So, in many ways you can view MH as a stochastic “hill-climbing” algorithm. My CompSci friend tells me it is also similar to something called simulated annealing.

The notation extends easily to our 4-dimensional example: the proposal distribution is now a 4-dimensional multivariate Gaussian. Instead of a scalar variance parameter $\sigma^2$ , we have a covariance matrix. Our proposal is therefore a vector of coefficients. In this sense, we are running a blocked Gibbs – using MH to draw a whole block of coefficients per iteration.
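Here is a rough sketch of what one such block update could look like. The function name is the one that appears in the profile later in this post, but the arguments and body are my assumptions for illustration, and the target used to exercise it is a toy 4-dimensional standard normal rather than the logistic model.

```r
# One MH update for a whole coefficient vector.
# beta_prev: previous draw; log_post: function returning the log unnormalized
# density; prop_chol: lower-triangular Cholesky factor of the proposal covariance.
rcond_post_beta_mh <- function(beta_prev, log_post, prop_chol, ...) {
  # Symmetric multivariate normal proposal centered at beta_prev:
  # prop_chol %*% z has covariance prop_chol %*% t(prop_chol)
  z <- rnorm(length(beta_prev))
  beta_star <- beta_prev + as.vector(prop_chol %*% z)

  # Two density evaluations: at the proposal and at the previous draw
  log_r <- log_post(beta_star, ...) - log_post(beta_prev, ...)

  if (log(runif(1)) < log_r) beta_star else beta_prev
}

set.seed(4)
toy_log_post <- function(b) -0.5 * sum(b^2)   # toy target, not the real model
prop_chol    <- t(chol(0.5 * diag(4)))        # chol() returns the upper factor

draws <- matrix(NA_real_, nrow = 2000, ncol = 4)
draws[1, ] <- rep(2, 4)
for (i in 2:nrow(draws)) {
  draws[i, ] <- rcond_post_beta_mh(draws[i - 1, ], toy_log_post, prop_chol)
}
```

Passing the Cholesky factor instead of the covariance matrix means the factorization happens once, outside the loop, rather than on every iteration.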

Some comments:

The variance of the jumping distribution is an important parameter. If the variance is too low, the current proposal will likely be very close to the last value and so $r$ will be close to 1. We will therefore accept very frequently, but because the accepted values are so close to each other we climb to the high density area slowly over many iterations. If the variance is too high, most proposals land far out in low density areas and get rejected, so the sequence stays stuck at the same value for long stretches. Literature suggests that in high dimensions (more than 5 parameters), the optimal acceptance rate is about 23%.
Many of the “adaptive” MH methods are variants of the basic algorithm described here, but include a tuning period to find the jumping distribution variance that yields the optimal acceptance rate.
The most computationally intensive part of MH is the density evaluations. For each Gibbs iteration, we have to evaluate the 4-dimensional density twice: once at the current proposal and once at the previously accepted proposal.
Although the notation extends to high dimensions easily, the performance itself worsens in higher dimensions. The reasons for this are quite technical but super interesting. This paper by Michael Betancourt explains the shortfalls of Gibbs and MH in higher dimensions and outlines how Hamiltonian Monte Carlo (HMC) overcomes these difficulties. As I understand it: in higher dimensions, density does not equal volume. Since getting to the high-volume regions is really what we want, and since standard MH gets to high-density regions, in high dimensions MH fails to explore high-volume areas efficiently. In low dimensions, density approximates volume well so it’s not an issue.
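The effect of the jumping variance on the acceptance rate is easy to see in a small experiment. The sketch below uses a toy 4-dimensional standard normal target and three proposal standard deviations that I picked for illustration; none of these numbers come from the model in this post.

```r
set.seed(5)

# Toy target: 4-dimensional standard normal, log density up to a constant
log_f <- function(b) -0.5 * sum(b^2)

# Run a random-walk MH chain and return the share of accepted proposals
accept_rate <- function(prop_sd, n_iter = 4000, d = 4) {
  b     <- rep(0, d)
  n_acc <- 0
  for (i in seq_len(n_iter)) {
    b_star <- b + rnorm(d, sd = prop_sd)
    if (log(runif(1)) < log_f(b_star) - log_f(b)) {
      b     <- b_star
      n_acc <- n_acc + 1
    }
  }
  n_acc / n_iter
}

# Too small, moderate, and too large proposal standard deviations
rates <- sapply(c(0.05, 1, 10), accept_rate)
```

The acceptance rate falls as the proposal standard deviation grows: tiny steps are almost always accepted but barely move, while huge steps are almost always rejected. Tuning aims for the middle ground.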

Results

All (commented!) code producing these results is available on my GitHub. So here are the MCMC chains of our 4 parameters of interest. The red lines indicate true values.

There’s some room for improvement:

The acceptance rate is only 18%. I could have tuned the jumping distribution covariance matrix to get closer to the optimal rate.
I think more iterations would definitely help here. These chains look okay, but are still fairly autocorrelated.

The nice thing about the Bayesian paradigm is that all inference is done using the posterior distribution. The coefficient estimates now are on the log-odds scale, but if we wanted odds ratios, we just exponentiate the posterior draws. If we want an interval estimate of the odds ratio, then we could just grab the 2.5 and 97.5 percentiles of the exponentiated posterior draws. It’s as simple as that – no delta-method junk.
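In code, that post-processing is only a couple of lines. The draws below are simulated placeholders standing in for the MCMC output for $\beta_1$ ; they are not the results from this post.

```r
set.seed(6)

# Placeholder posterior draws of beta_1 (a log odds ratio); in practice these
# would be the post-burn-in MCMC draws for the treatment coefficient
beta1_draws <- rnorm(5000, mean = 0.5, sd = 0.15)

# Exponentiate each draw to get posterior draws of the odds ratio
or_draws <- exp(beta1_draws)

or_median <- median(or_draws)
or_ci     <- quantile(or_draws, probs = c(0.025, 0.975))  # 95% credible interval
```

Because the transformation is applied draw by draw, the interval automatically reflects the skew of the odds ratio scale.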

I mentioned before that MH is costly because the log posterior must be evaluated twice per Gibbs iteration. Below is a profile analysis using the R package profvis showing this. The for-loop runs a Gibbs iteration. Within each Gibbs iteration, I call the function rcond_post_beta_mh() which uses MH to produce a draw from the conditional posterior of the parameter vector. Notice that it takes up the bulk of the runtime.

Diving into rcond_post_beta_mh(), we see that the subroutine log_cond_post_beta() is the bottleneck in the MH run. This function is the log conditional posterior density of the beta vector, which is evaluated twice per iteration.
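If you want to try this kind of profiling yourself, here is a minimal sketch. The sampler below is a toy stand-in with hypothetical function names and made-up data, not the code from my GitHub, and the profvis call only runs in an interactive session with the package installed.

```r
set.seed(7)
X <- cbind(1, matrix(rnorm(3000), ncol = 3))   # toy 1000 x 4 design matrix
Y <- rbinom(1000, size = 1, prob = 0.5)        # toy binary outcome

# Toy log posterior: logistic log-likelihood plus a N(0, 10) prior on each coefficient
log_post <- function(b) {
  eta <- as.vector(X %*% b)
  sum(Y * plogis(eta, log.p = TRUE) + (1 - Y) * plogis(-eta, log.p = TRUE)) -
    sum(b^2) / 20
}

# One MH update: note the two calls to log_post()
mh_step <- function(b) {
  b_star <- b + rnorm(length(b), sd = 0.05)
  if (log(runif(1)) < log_post(b_star) - log_post(b)) b_star else b
}

run_sampler <- function(n_iter) {
  draws <- matrix(0, nrow = n_iter, ncol = 4)
  for (i in 2:n_iter) draws[i, ] <- mh_step(draws[i - 1, ])
  draws
}

draws <- run_sampler(2000)

# Profile the sampler; printing the result opens an interactive flame graph
if (interactive() && requireNamespace("profvis", quietly = TRUE)) {
  print(profvis::profvis(run_sampler(2000)))
}
```

In the flame graph, time spent inside log_post() should sit underneath mh_step(), mirroring the nesting described above.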

That’s all for now. Feel free to leave comments if you see errors (I type these up fairly quickly so I apologize in advance).

I may do another post diving deeper into Hamiltonian Monte Carlo, which I alluded to earlier in this post. More likely, this will be my last post on sampling methods. I’d like to move on to some Bayesian nonparametrics in the future.
