When not to use Gaussian Mixture Model (EM clustering)
This article is originally published at https://hameddaily.blogspot.com/
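$$p(x) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j), \qquad \sum_{j=1}^{K} \pi_j = 1$$
Eq. (1)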
In other words, the idea of EM clustering is that there are K clusters and that the points in the j-th cluster follow a normal distribution with mean μj and covariance matrix Σj. Each point xi in the dataset has a soft assignment to the K clusters. This soft assignment is determined by πj N(xi|μj, Σj), normalized so that the K values for a point sum to one. One can convert this soft probabilistic assignment into a hard membership by picking the most likely cluster (the cluster with the highest assignment probability).
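The soft and hard assignments described above can be sketched in a few lines. This is a minimal one-dimensional illustration, not the code behind the figures in this post: the function names (`normal_pdf`, `soft_assignment`, `hard_assignment`) and the example parameters are assumptions made for the sketch, and variances stand in for the covariance matrices Σj.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def soft_assignment(x, weights, means, variances):
    """Return the probability that x belongs to each cluster.

    Each cluster j contributes weights[j] * N(x | means[j], variances[j]);
    dividing by the sum turns these scores into probabilities.
    """
    scores = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(scores)
    return [s / total for s in scores]

def hard_assignment(x, weights, means, variances):
    """Index of the most likely cluster for x."""
    resp = soft_assignment(x, weights, means, variances)
    return max(range(len(resp)), key=resp.__getitem__)

# Hypothetical mixture: two equally weighted clusters centered at 0 and 4.
weights = [0.5, 0.5]
means = [0.0, 4.0]
variances = [1.0, 1.0]

resp = soft_assignment(1.0, weights, means, variances)
label = hard_assignment(1.0, weights, means, variances)
print(resp, label)
```

A point halfway between the two means (x = 2.0 here) gets an assignment of 0.5 to each cluster, which is exactly the information a hard membership throws away.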
- Non-Gaussian dataset: as the formulation makes clear, GMM assumes an underlying Gaussian generative distribution. However, many practical datasets do not satisfy this assumption. I study the effect of a non-Gaussian dataset in two cases:
  - The number of clusters is known
  - The number of clusters is unknown
Example:
Fig. 1: Example of GMM clustering
Let us study this in more detail.
1- Non-Gaussian dataset
A. Number of clusters is known
When the data is not normal, there is no guarantee that EM clustering will pick up the right clusters. Look at the following example. Clearly, the data on the left has two clusters. However, GMM clustering cannot recognize the two.
Fig. 2: If the data is not Gaussian, GMM clustering may not recognize the right clusters
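B. Number of clusters is unknown
As Eq. (1) shows, EM clustering needs to know the number of clusters (K) in advance. But what if you do not know the number of clusters? What if you want to use EM clustering in production and the number of clusters differs from customer to customer? A common approach is to fit the model for a range of K values and keep the one with the lowest Bayesian information criterion (BIC).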
Fig. 3: If the data is Gaussian, we can find the number of clusters using BIC
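To make BIC-based selection concrete, here is a minimal sketch on synthetic one-dimensional data. It is not the code used to produce the figures: the EM routine `fit_gmm_1d`, the quantile-based initialization, the variance floor, and the two-cluster toy dataset are all assumptions chosen to keep the example self-contained. BIC is computed as (number of free parameters) × ln(n) − 2 × log-likelihood, and lower is better.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_scores(x, weights, means, variances):
    # The tiny floor avoids division by zero and log(0) for points far from every component.
    return [w * normal_pdf(x, m, v) + 1e-300 for w, m, v in zip(weights, means, variances)]

def fit_gmm_1d(data, k, n_iter=100):
    """Fit a 1-D GMM with plain EM and return the final log-likelihood."""
    n = len(data)
    ordered = sorted(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    # Assumed initialization: equal weights, means at evenly spaced quantiles.
    weights = [1.0 / k] * k
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [grand_var] * k
    for _ in range(n_iter):
        # E-step: soft assignment of every point to every component.
        resp = []
        for x in data:
            scores = mixture_scores(x, weights, means, variances)
            total = sum(scores)
            resp.append([s / total for s in scores])
        # M-step: re-estimate weights, means and variances from the soft counts.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor keeps a component from collapsing
    return sum(math.log(sum(mixture_scores(x, weights, means, variances))) for x in data)

def bic(log_lik, k, n):
    """BIC for a 1-D GMM: k means + k variances + (k - 1) free weights."""
    n_params = 3 * k - 1
    return n_params * math.log(n) - 2.0 * log_lik

# Toy data (an assumption for this sketch): two Gaussian clusters of equal size.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(6.0, 1.0) for _ in range(150)]

bic_by_k = {k: bic(fit_gmm_1d(data, k), k, len(data)) for k in range(1, 5)}
best_k = min(bic_by_k, key=bic_by_k.get)
print(bic_by_k, best_k)
```

On data like this, which really is a mixture of two Gaussians, one would expect the lowest BIC at or near K=2, since extra components add parameters without improving the fit much. In practice a library implementation with full covariance matrices and several random restarts would be the safer choice.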
But if the data is not normal, BIC no longer points to the correct number of clusters. Let us use the dataset in Figure 1. This time, instead of telling EM clustering that K=2, we try to find the best K. The result is not promising: BIC suggests 12 clusters. Why? Because each of those 12 clusters is approximately normally distributed, which matches the underlying assumption of GMM.
Fig. 4: If the data is not Gaussian, the correct number of clusters is hard to find, because GMM looks for a number of clusters such that each one is normally distributed
2- Uneven clusters
Fig. 5: Uneven cluster sizes lead to a higher misclassification error rate