Categorical distribution, structure of the second kind and Gumbel-max trick
This article is originally published at https://statisfaction.wordpress.com
This post is about a way of sampling from a Categorical distribution, which appears in Arthur Dempter‘s approach to inference as a generalization of Bayesian inference (see Figure 1 in “A Generalization of Bayesian Inference”, 1968), under the name “structure of the second kind”. It’s the starting point of my on-going work with Ruobin Gong and Paul Edlefsen, which I’ll write about on another day. This sampling mechanism turns out to be strictly equivalent to the “Gumbel-max” trick that got some attention in machine learning see e.g. this blog post by Francis Bach.
Let’s look at the figure above: the encompassing triangle is equivalent to the “simplex” with 3 vertices (K vertices more generally). Any point within the triangle is a convex combination of the vertices, where are non-negative “weights” summing to one, and where are the vertices. The weights are the “barycentric coordinates” of the point. Any point in the triangle induces a partition into K sets . Each “sub-simplex” can be obtained by considering the entire simplex and replacing vertex by . It has a volume equal to relative to the volume of the entire simplex. Can you see why? If not, it’s OK, great scientific endeavors require a certain degree of trust and optimism.
Since the volume of each is , if we sample a point uniformly within the encompassing simplex, it will land within with probability . In other words we can sample from a Categorical distribution with probabilities by sampling uniformly within the simplex, and by identifying which index k is such that the point lands in . This appears in various places in Arthur Dempster’s articles (see references below), because Categorical distributions provide a pedagogical setting for new methods of statistical inference, and because this sampling mechanism does not rely on any arbitrary ordering of the categories (contrarily to “inverse transform sampling”).
How does this relate to the Gumbel-max trick? One way of sampling uniformly within the simplex is to sample Exponentials(1) and to define weights . Furthermore, a point is within for a given , if and only if for all . The next figure illustrates such inequalities: the points with coordinates satisfying are under/above some line that originates from the vertex opposite the segment and goes through .
An Exponential(1) is also minus the logarithm of a Uniform(0,1). Putting all these pieces together, a Uniform point in the simplex is within if and only if, for all ,
Since is a Gumbel variable, the above mechanism is equivalent to where are independent Gumbel variables. It’s the Gumbel-max trick!
- It’s hard to trace back the first instance of this sampling mechanism, but it appears in various of Arthur Dempster’s articles, e.g. “New methods for reasoning towards posterior distributions based on sample data”, 1966, and it is discussed at length in “A class of random convex polytopes”, 1972.
- The connection occurred to me while reading Xi’an’s blog post, which points to this interesting article on Emil Gumbel, academic in Heidelberg up to his exile in 1932, “pioneer of modern data journalism” and active opponent to the nazis. Quoting from the article, “His fate was sealed when, at a speech in memory of the 700,000 who had perished of hunger in the winter of 1916/17, he remarked that a rutabaga would certainly be a better memorial than a scantily clad virgin with a palm frond”.
- The Gumbel-max trick is interesting for many reasons, it amounts to viewing sampling as an optimization program, it can be “relaxed” in various useful ways, etc. In Art Dempster’s work that sampling mechanism is appealing because of its invariance by relabeling of the categories (“category 2” is not between “category 1” and “category 3”). This matters when performing inference with Categorical distributions (i.e. with count data) using Art Dempster’s approach, because the estimation depends on the choice of sampling mechanism and not simply on the likelihood function.
Please visit source website for post related comments.