GMM is a generative model, i.e. it estimates the parameters of a distribution.
A GMM models data as coming from $k$ Gaussian components (clusters), mixed with proportions $(\pi_1,\dots,\pi_k)$ ($\sum_i \pi_i = 1$).
We can write:
\[X \sim \mathcal{N}(\mu_{Z},\Sigma_{Z}) \quad \text{with} \quad Z \sim \pi\]
Thus, $X$ has a density which is a weighted average of all the Gaussian densities:
\[p_\theta(x) = \sum_{j=1}^{k}\pi_j f_j(x)\]
where $f_j$ is the density of $\mathcal{N}(\mu_j, \Sigma_j)$.
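A minimal numerical sketch of this generative process, assuming NumPy and SciPy are available (the component values below are illustrative, not from the notes):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical 2-D mixture with k = 3 components (all values illustrative).
pi = np.array([0.5, 0.3, 0.2])                             # mixing proportions, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])       # component means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2), np.eye(2)])  # component covariances

# Sampling: draw Z ~ pi, then X | Z = j ~ N(mu_j, Sigma_j).
n = 1000
z = rng.choice(len(pi), size=n, p=pi)
x = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])

# Mixture density: p_theta(x) = sum_j pi_j * f_j(x).
def p_theta(points):
    return sum(pi_j * multivariate_normal(mu_j, Sigma_j).pdf(points)
               for pi_j, mu_j, Sigma_j in zip(pi, mu, Sigma))

print(p_theta(x[:5]))  # mixture density at the first five samples
```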
Estimation

We want to estimate $\theta = (\pi, \mu, \Sigma)$ where:
$\pi=(\pi_1,…,\pi_k)$, $\mu=(\mu_1,…,\mu_k)$, $\Sigma=(\Sigma_1,…,\Sigma_k)$
To do so, we use the maximum likelihood method (product of densities across all samples):
\[p_\theta(x)=\prod_{i=1}^n p_\theta(x_i)\]
\[l(\theta)=\log\left(\prod_{i=1}^n p_\theta(x_i)\right)=\sum_{i=1}^n \log p_\theta(x_i)\]
We thus need to find $\arg\max_\theta\, l(\theta)$.
Problem: the likelihood function is not convex!
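To see what is being maximized, here is a sketch of the log-likelihood computed stably with log-sum-exp over the $k$ components (reusing the names from the sketch above; `logsumexp` is from SciPy):

```python
from scipy.special import logsumexp

# l(theta) = sum_i log p_theta(x_i) = sum_i log sum_j pi_j f_j(x_i),
# evaluated in log space for numerical stability.
def log_likelihood(points):
    # log_pdfs[i, j] = log(pi_j) + log f_j(x_i)
    log_pdfs = np.column_stack([
        np.log(pi_j) + multivariate_normal(mu_j, Sigma_j).logpdf(points)
        for pi_j, mu_j, Sigma_j in zip(pi, mu, Sigma)
    ])
    return logsumexp(log_pdfs, axis=1).sum()

print(log_likelihood(x))
```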
The expectation-maximization (EM) algorithm is used when we have latent variables (variables that are not directly observed).
Let $z_n=(z_{n1},\dots,z_{nk})$ be the latent one-hot vector indicating which component generated observation $x_n$. We can express the density as a joint function with respect to $z$:
\[p_\theta(x,z)=p_\theta(z)\,p_\theta(x|z)\]
The rest is similar to the example in the Stats notes (based on this document), using the Gaussian density instead of the Binomial density.
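Concretely, the quantity maximized in the M-step below is the expected complete-data log-likelihood (a standard expansion, stated here for reference):
\[\mathbb{E}_z[\log p_\theta(x,z)] = \sum_{n=1}^N \sum_{k} \mathbb{E}[z_{nk}]\left(\log \pi_k + \log f_k(x_n)\right)\]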
Initialize the parameters $\{\mu_1, \mu_2, \mu_3\}$, $\{\Sigma_1, \Sigma_2, \Sigma_3\}$ and the mixing proportions $\{\pi_1, \pi_2, \pi_3\}$ (here with $k = 3$ components). Each iteration then alternates an E-step, which computes the responsibilities $\mathbb{E}[z_{nk}] = \frac{\pi_k f_k(x_n)}{\sum_j \pi_j f_j(x_n)}$ under the current parameters, and an M-step, which re-estimates the parameters.
At a (local) optimum:
$\frac{\partial\, \mathbb{E}[\text{log-likelihood}]}{\partial \Sigma_k} = 0$
$\frac{\partial\, \mathbb{E}[\text{log-likelihood}]}{\partial \mu_k} = 0$
$\frac{\partial\, \mathbb{E}[\text{log-likelihood}]}{\partial \pi_k} = 0$ (subject to $\sum_k \pi_k = 1$, handled with a Lagrange multiplier)
Solving these gives the closed-form updates:
\[\pi_k = \frac{\sum_{n=1}^N \mathbb{E}[z_{nk}]}{N}\]
\[\mu_k = \frac{\sum_{n=1}^N \mathbb{E}[z_{nk}]\,x_n}{\sum_{n=1}^N \mathbb{E}[z_{nk}]}\]
\[\Sigma_k = \frac{\sum_{n=1}^N \mathbb{E}[z_{nk}](x_n-\mu_k)(x_n-\mu_k)^T}{\sum_{n=1}^N \mathbb{E}[z_{nk}]}\]
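A minimal sketch of one EM iteration implementing exactly these updates (continuing the earlier NumPy/SciPy sketch; convergence checks omitted):

```python
# One full EM iteration (reusing x, pi, mu, Sigma from the earlier sketch).
def em_step(points, pi, mu, Sigma):
    n_samples = len(points)
    # E-step: responsibilities E[z_nk] = pi_k f_k(x_n) / sum_j pi_j f_j(x_n)
    weighted = np.column_stack([
        pi_k * multivariate_normal(mu_k, Sigma_k).pdf(points)
        for pi_k, mu_k, Sigma_k in zip(pi, mu, Sigma)
    ])
    resp = weighted / weighted.sum(axis=1, keepdims=True)

    # M-step: the closed-form updates derived above.
    Nk = resp.sum(axis=0)                    # sum_n E[z_nk], one value per cluster
    pi_new = Nk / n_samples                  # pi_k
    mu_new = (resp.T @ points) / Nk[:, None] # mu_k, weighted means
    Sigma_new = np.array([                   # Sigma_k, weighted covariances
        (resp[:, k, None] * (points - mu_new[k])).T @ (points - mu_new[k]) / Nk[k]
        for k in range(len(pi))
    ])
    return pi_new, mu_new, Sigma_new

pi, mu, Sigma = em_step(x, pi, mu, Sigma)  # repeat until the log-likelihood stabilizes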
Difference with K-Means

Both methods focus on globular (convex) clusters. The main differences (see the sketch after this list):

- GMM considers that every observation belongs to every cluster with a certain probability, i.e. it takes uncertainty into account. In K-Means, every observation belongs to exactly one cluster, with probability 1.
- GMM estimates a full variance-covariance matrix per cluster. K-Means implicitly uses the same isotropic covariance matrix for every cluster.
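To see the soft-versus-hard assignment difference concretely, a short sketch with scikit-learn (assumed available; `x` comes from the first sketch above):

```python
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

# GMM returns one probability per cluster; K-Means returns a single hard label.
gmm_probs = GaussianMixture(n_components=3, random_state=0).fit(x).predict_proba(x)
kmeans_labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(x)

print(gmm_probs[0])      # soft memberships summing to 1 across the 3 clusters
print(kmeans_labels[0])  # a single cluster index, membership probability = 1
```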
