Understanding mixture models and expectation-maximization (using baseball statistics)
This article is originally published at http://varianceexplained.org
Previously in this series:
- Understanding the beta distribution
- Understanding empirical Bayes estimation
- Understanding credible intervals
- Understanding the Bayesian approach to false discovery rates
- Understanding Bayesian A/B testing
- Understanding beta binomial regression
- Understanding empirical Bayesian hierarchical modeling
In this series on empirical Bayesian methods on baseball data, we’ve been treating our overall distribution of batting averages as a beta distribution, which is a simple distribution between 0 and 1 that has a single peak. But what if that weren’t a good fit? For example, what if we had a multimodal distribution, with multiple peaks?
In this post, we’re going to consider what to do when the distribution of your binomial proportions has multiple peaks, and you don’t know which observations belong to which cluster. For example, so far in our analysis we’ve been filtering out pitchers, who tend to have a much lower batting average than non-pitchers. If you include them, the data looks something like this:
These batting averages certainly don’t look like a single beta distribution- it’s more like two separate ones mixed together. Imagine that you didn’t know which players were pitchers, and you wanted to separate the data into two groups according to your best prediction. This is very common in practical machine learning applications, such as clustering and segmentation.
In this post we’ll examine mixture models, where we treat the distribution of batting averages as a mixture of two beta-binomial distributions, and need to guess which player belongs to which group. This will also introduce the concept of an expectation-maximization algorithm, which is important in both Bayesian and frequentist statistics. We’ll show how to calculate a posterior probability for the cluster each player belongs to, and see that mixture models are still a good fit for the empirical Bayes framework.
Setup
As usual, I’ll start with some code you can use to catch up if you want to follow along in R. If you want to understand what it does in more depth, check out the previous posts in this series. (As always, all the code in this post can be found here).1
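Condensed, the setup looks something like this (a sketch rather than the exact original code; it assumes the Lahman, dplyr, and tidyr packages, and the cutoff for who counts as a pitcher is an arbitrary choice):

```r
library(dplyr)
library(tidyr)
library(Lahman)

# Career hit / at-bat totals for modern National League batters, keeping pitchers in
career <- Batting %>%
  filter(AB > 0, lgID == "NL", yearID >= 1980) %>%
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  ungroup() %>%
  mutate(average = H / AB)

# Add readable names (this table is called Master in older versions of Lahman)
career <- career %>%
  inner_join(People %>% select(playerID, nameFirst, nameLast), by = "playerID") %>%
  unite(name, nameFirst, nameLast, sep = " ")

# Record who actually pitched, so we can check our clusters at the end;
# the "more than three games pitched" cutoff is an arbitrary choice here
pitcher_ids <- Pitching %>%
  group_by(playerID) %>%
  summarize(gamesPitched = sum(G)) %>%
  filter(gamesPitched > 3) %>%
  pull(playerID)

career <- career %>%
  mutate(isPitcher = playerID %in% pitcher_ids)
```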
We’ve been filtering out pitchers in the previous posts, which makes batting averages look roughly like a beta distribution. But when we leave them in, as I showed above, the data looks a lot less like a beta:
The dashed density curve represents the beta distribution we would naively fit to this data. We can see that unlike our earlier analysis, where we’d filtered out pitchers, the beta is not a good fit- but that it’s plausible that we could fit the data using two beta distributions, one for pitchers and one for non-pitchers.
In this example, we know which players are pitchers and which aren’t. But if we didn’t, we would need to assign each player to a distribution, or “cluster”, before performing shrinkage on it. In a real analysis it’s not realistic that we wouldn’t know which players are pitchers, but it’s an excellent illustrative example of a mixture model and of expectation-maximization algorithms.
Expectation-maximization
The challenge of mixture models is that at the start, we don’t know which observations belong to which cluster, nor what the parameters of each distribution are. It’s difficult to solve both problems at the same time, so an expectation-maximization (EM) algorithm estimates them one at a time, alternating between the two.
The first thing to do in an EM clustering algorithm is to assign our clusters randomly:
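A minimal sketch of that step (the seed, the cluster labels, and the name starting_data are all arbitrary):

```r
set.seed(2017)

# Give every player a random starting cluster, A or B
starting_data <- career %>%
  mutate(cluster = factor(sample(c("A", "B"), n(), replace = TRUE)))
```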
Maximization
Now that we’ve got cluster assignments, what do the densities of each cluster look like?
Well, that doesn’t look like much of a division- they have basically the same density! That’s OK: one of the nice features of expectation-maximization is that we don’t actually have to start with good clusters to end up with a good result.
We’ll now write a function for fitting a beta-binomial distribution using maximum likelihood estimation (and the dbetabinom.ab function from the VGAM package). This is a process we’ve done in multiple posts before, including the appendix of one of the first ones. We’re just encapsulating it into a function.
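A sketch of such a function (fit_bb_mle is an illustrative name, and the starting values and lower bounds are arbitrary):

```r
fit_bb_mle <- function(x, n) {
  # Negative log-likelihood of the beta-binomial with shape parameters alpha and beta
  ll <- function(alpha, beta) {
    -sum(VGAM::dbetabinom.ab(x, n, alpha, beta, log = TRUE))
  }
  m <- stats4::mle(ll, start = list(alpha = 3, beta = 10),
                   method = "L-BFGS-B", lower = c(0.001, 0.001))
  ab <- stats4::coef(m)
  tibble(alpha = ab[1], beta = ab[2], number = length(x))
}
```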
(The number column I added will be useful in the next step). For example, here are the alpha and beta chosen for the entire data as a whole:
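With the sketch above, that's a one-line call (output not shown here):

```r
fit_bb_mle(career$H, career$AB)
```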
But now we’re working with a mixture model. This time, we’re going to fit the model within each of our (randomly assigned) clusters:
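One way to do that with the sketch above (fits is an illustrative name):

```r
# Maximum likelihood beta-binomial fit within each (randomly assigned) cluster
fits <- starting_data %>%
  group_by(cluster) %>%
  group_modify(~ fit_bb_mle(.x$H, .x$AB)) %>%
  ungroup()
```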
Another component of this model is the prior probability that a player is in cluster A or cluster B, which we set to 50-50 when we were assigning random clusters. We can estimate our new iteration of this based on the total number of assignments in each group, which is why we included the number column:
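Continuing the sketch, the prior is just each cluster's share of the players:

```r
# The prior for each cluster is the fraction of players currently assigned to it
fits <- fits %>%
  mutate(prior = number / sum(number))
```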
Much as the within-cluster densities only changed a little, the priors only changed a little as well. This was the maximization step: find the maximum likelihood parameters (in this case, two alpha/beta values, and a per-cluster probability), pretending we knew the assignments.
Expectation
We now have a distribution for each cluster. It’s worth noting that these are pretty similar distributions, and that neither is a good fit to the data.
However, notice that due to a small random difference, cluster B is slightly more likely than cluster A for batting averages above about .2, and vice versa below .2.
Consider therefore that each player has a likelihood it would have been generated from cluster A, and a likelihood it would have been generated from cluster B (being sure to weight each by the prior probability of being in A or B):
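In code, that could look like this (crosses is an illustrative name; it pairs every player with both sets of cluster parameters):

```r
# Pair every player with both clusters' parameters, then compute the
# prior-weighted likelihood of the player's record under each cluster
crosses <- career %>%
  crossing(fits) %>%
  mutate(likelihood = prior * VGAM::dbetabinom.ab(H, AB, alpha, beta))
```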
For example, consider Jeff Abbott, who got 11 hits out of 42 at-bats. He had a 4.35% chance of getting that if he were in cluster A, but a 4.76% chance if he were in cluster B. For that reason (even though it’s a small difference), we’ll put him in B. Similarly we’ll put Kyle Abbott in cluster A: 3/31 was more likely to come from that distribution.
We can do that for every player using group_by and top_n:
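Roughly like this, continuing the sketch (assignments is an illustrative name):

```r
# Keep, for each player, the cluster with the higher weighted likelihood
assignments <- crosses %>%
  group_by(playerID) %>%
  top_n(1, likelihood) %>%
  ungroup()
```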
That’s the expectation step: assigning each person to the most likely cluster. How do our assignments look after that?
Something really important happened here: even though the two beta models we’d fit were very similar, we still split up the data rather neatly. Generally batters with a higher average ended up in cluster B, while batters with a lower average were in cluster A. (Note that due to B having a slightly higher prior probability, it was possible for players with a low average- but also a low AB- to be assigned to cluster B).
Expectation-Maximization
The above two steps got to a better set of assignments than our original, random ones. But there’s no reason to believe these are as good as we can get. So we repeat the two steps, choosing new parameters for each distribution in the mixture and then making new assignments each time.
For example, now that we’ve reassigned each player’s cluster, we could re-fit the beta-binomial with the new assignments. Those distributions would look like this:
Unlike our first model fit, we can see that cluster A and cluster B have diverged a lot. Now we can take those parameters and perform a new expectation step, then another maximization, and so on: generally we will do this multiple times, as an iterative process. This is the heart of an expectation-maximization algorithm, where we switch between assigning clusters (expectation) and fitting the model from those clusters (maximization).
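Here's one way the whole loop could be wired up, continuing the sketch above (iterate_em and iterations are illustrative names, and running five iterations is an arbitrary choice):

```r
library(purrr)

# One full EM iteration: refit the two beta-binomials and the cluster
# priors (maximization), then reassign every player (expectation)
iterate_em <- function(state, ...) {
  fits <- state$assignments %>%
    group_by(cluster) %>%
    group_modify(~ fit_bb_mle(.x$H, .x$AB)) %>%
    ungroup() %>%
    mutate(prior = number / sum(number))

  assignments <- state$assignments %>%
    select(playerID, name, H, AB, isPitcher) %>%
    crossing(fits) %>%
    mutate(likelihood = prior * VGAM::dbetabinom.ab(H, AB, alpha, beta)) %>%
    group_by(playerID) %>%
    top_n(1, likelihood) %>%
    ungroup()

  list(assignments = assignments, fits = fits)
}

# Run several iterations, keeping every intermediate state
iterations <- accumulate(1:5, iterate_em,
                         .init = list(assignments = assignments, fits = fits))
```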
Here I used the accumulate function from the purrr package, which is useful for running data through the same function repeatedly and keeping intermediate states. I haven’t seen others use this tidy approach to EM algorithms, and there are existing R approaches to mixture models.2 But I like this approach both because it’s transparent about what we’re doing in each iteration, and because our iterations are now combined in a tidy format, which is convenient to summarize and visualize.
For example, how did our assignments change over the course of the iteration?
We notice that only the first few iterations led to a shift in the assignments, after which it appears to converge. Similarly, how did the estimated beta distributions change over these iterations?
This confirms that it took about three iterations to converge, after which the estimates stayed about the same. Also notice that in the process, cluster B became much more likely than cluster A, which makes sense since there are more non-pitchers than pitchers in the dataset.
Assigning players to clusters
We now have some final parameters for each cluster:
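In the sketch above, they're simply the fits from the last iteration (final_parameters is an illustrative name):

```r
# Alpha, beta, and prior probability for each cluster after the final iteration
final_parameters <- last(iterations)$fits
final_parameters
```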
How would we assign players to clusters, and get a posterior probability that the player belongs to that cluster? Well, let’s arbitrarily pick the six players that each batted exactly 100 times:
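Something like this (a sketch; batter_100 is the table referred to below):

```r
# The players with exactly 100 career at-bats in this dataset
batter_100 <- career %>%
  filter(AB == 100) %>%
  arrange(desc(average))
batter_100
```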
Where would we classify each of them? Well, we’d consider the likelihood each would get the number of hits they did if they were a pitcher (cluster A) or a non-pitcher (cluster B):
By Bayes’ Theorem, we can simply use the ratio of one likelihood (say, cluster A’s) to the sum of the two likelihoods to get the posterior probability:
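In code, continuing the sketch (batter_100_posterior is an illustrative name):

```r
# Prior-weighted likelihood of each player's record under each cluster,
# normalized within each player to give a posterior cluster probability
batter_100_posterior <- batter_100 %>%
  crossing(final_parameters) %>%
  mutate(likelihood = prior * VGAM::dbetabinom.ab(H, AB, alpha, beta)) %>%
  group_by(playerID) %>%
  mutate(posterior = likelihood / sum(likelihood)) %>%
  ungroup()
```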
Based on this, we feel confident that Juan Nicasio and Jose de Jesus are pitchers, and that the others probably aren’t. And we’d be right! (Check out the isPitcher column in the batter_100 table above).
This allows us to assign all players in the dataset to one of the two clusters.
Since we know whether each player actually is a pitcher or not, we can also get a confusion matrix. How many pitchers were accidentally assigned to cluster B, and how many non-pitchers were assigned to cluster A? In this case we’ll look only at the ones for which we had at least 80% confidence in our classification.
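Here's a sketch of both steps; career_clustered is an illustrative name, and it assumes cluster A ended up as the pitcher-heavy cluster in this run:

```r
# Assign every player to their most probable cluster
career_clustered <- career %>%
  crossing(final_parameters) %>%
  mutate(likelihood = prior * VGAM::dbetabinom.ab(H, AB, alpha, beta)) %>%
  group_by(playerID) %>%
  mutate(posterior = likelihood / sum(likelihood)) %>%
  top_n(1, posterior) %>%
  ungroup()

# Confusion matrix for the confident (at least 80% posterior) classifications
career_clustered %>%
  filter(posterior >= 0.8) %>%
  count(isPitcher, cluster)
```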
Not bad, considering the only information we used was the batting average- and note that we didn’t even use data on which players were pitchers to train the model, but just let the clusters define themselves.
It looks like we were a lot more likely to call a pitcher a non-pitcher than vice versa. There’s a lot more we could do to examine this model, how well calibrated its posterior estimates are, and what kinds of pitchers may be mistaken for non-pitchers (e.g. good batters who pitched only a few times), but we won’t consider them in this post.
Empirical Bayes shrinkage with a mixture model
We’ve gone to all this work to compute posterior probabilities for each player’s cluster assignment. How can we use this in empirical Bayes shrinkage, or with the other methods we’ve described in this series?
Well, consider that all of our other methods have worked because the posterior was another beta distribution (thanks to the beta being the conjugate prior of the binomial). However, now that each point might belong to one of two beta distributions, our posterior will be a mixture of betas. This mixture is made up of the posterior from each cluster, weighted by the probability the point belongs to that cluster.
For example, consider the six players who had exactly 100 at-bats. Their posterior distributions would look like this:
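A sketch of how those mixture posteriors could be computed and plotted, assuming the batter_100_posterior table from earlier (the grid of batting averages is arbitrary):

```r
library(ggplot2)

# Each player's posterior is a mixture of the two per-cluster posteriors
# Beta(alpha + H, beta + AB - H), weighted by the posterior cluster probability
batter_100_posterior %>%
  crossing(x = seq(0.05, 0.35, 0.001)) %>%
  mutate(density = posterior * dbeta(x, alpha + H, beta + AB - H)) %>%
  group_by(name, x) %>%
  summarize(density = sum(density)) %>%
  ungroup() %>%
  ggplot(aes(x, density)) +
  geom_line() +
  facet_wrap(~ name) +
  labs(x = "Batting average", y = "Posterior density")
```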
For example, we are pretty sure that Jose de Jesus and Juan Nicasio are part of the “pitcher” cluster, so that makes up most of their posterior mass, and all of Ryan Shealy’s density is in the “non-pitcher” cluster. However, we’re pretty split on Mike Mahoney- he could be a pitcher who is unusually good at batting, or a non-pitcher who is unusually bad.
Can we perform shrinkage like we did in that early post? If our goal is still to find the mean of each posterior, then yes! Thanks to linearity of expected value, we can simply average the two distribution means, weighing each by the probability the player belongs to that cluster:
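Here's a sketch of that calculation for the whole dataset (career_shrunken and eb_estimate are illustrative names):

```r
# Each cluster's posterior mean, averaged with weights equal to the
# posterior probability of belonging to that cluster
career_shrunken <- career %>%
  crossing(final_parameters) %>%
  mutate(likelihood = prior * VGAM::dbetabinom.ab(H, AB, alpha, beta),
         posterior_mean = (alpha + H) / (alpha + beta + AB)) %>%
  group_by(playerID, name, average) %>%
  summarize(eb_estimate = sum(likelihood / sum(likelihood) * posterior_mean)) %>%
  ungroup()
```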
For example, we are pretty sure that Jose de Jesus and Juan Nicasio are part of the “pitcher” cluster, which means they mostly get shrunken towards that center. We are quite certain Ryan Shealy is not a pitcher, so he’ll be updated based entirely on that distribution.
Notice that instead of shrinking towards a single value, the batting averages are now shrunken towards two centers: a higher value for the non-pitcher cluster and a lower value for the pitcher cluster. Ones that are exactly in between don’t really get shrunken in either direction- they’re “pulled equally”.
(For simplicity’s sake I didn’t use our beta-binomial regression approach in this model, but that could easily be added to take into account the relationship between average and AB).
Not all of the methods we’ve used in this series are so easy to adapt to a multimodal distribution. For example, a credible interval is ambiguous in a multimodal distribution (see here for more), and we’d need to rethink our approach to Bayesian A/B testing. But since we do have a posterior distribution for each player- even though it’s not a beta- we’d be able to face these challenges.
What’s Next: Combining into an R package
We’ve introduced a number of statistical techniques in this series for dealing with empirical Bayes, the beta-binomial relationship, A/B testing, etc. Since I’ve provided the code at each stage, you could certainly apply these methods to your own data (as some already have!). However, you’d probably find yourself copying and pasting a rather large amount of code, which isn’t necessarily something you want to do every time you run into this kind of binomial data (it takes you out of the flow of your own data analysis).
In my next post I’ll introduce an R package for performing empirical Bayes on binomial data that encapsulates many of the analyses we’ve performed in this series. These statistical techniques aren’t original to me (most can be found in an elementary Bayesian statistics textbook), but providing them in a convenient R package can still be useful for the community. We’ll also go over some of the choices one makes in developing a statistics package in R, particularly one that is compatible with the tidy tools we’ve been using.
I’m changing this analysis slightly to look only at National League batters since the year 1980. Why? Because National League pitchers are required to bat (while American League pitchers typically aren’t), and because looking at modern batters helps reduce the noise within each group. ↩
I should note that I haven’t yet gotten an existing R mixture model package to work with a beta-binomial model like we do in this post. If you have an approach you’d recommend, please share it in the comments or on Twitter! ↩