GMM is a generative model, i.e. it estimates the parameters of a distribution.
A GMM models data as coming from $k$ Gaussian components (clusters), mixed with proportions $(\pi_1,\dots,\pi_k)$ ($\sum_i \pi_i = 1$).
We can write:
\[X \sim \mathcal{N}(\mu_{Z},\Sigma_{Z}) \quad \text{with} \quad Z \sim \pi\]
Thus, $X$ has a density which is a weighted average of all the Gaussian densities:
\[p_\theta(x) = \sum_{j=1}^{k}\pi_j f_j(x)\]
where $f_j$ is the density of $\mathcal{N}(\mu_j, \Sigma_j)$.
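A minimal numerical sketch of this generative process, assuming NumPy and SciPy are available (the component values below are illustrative, not from the notes):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical 2-D mixture with k = 3 components (all values illustrative).
pi = np.array([0.5, 0.3, 0.2])                             # mixing proportions, sum to 1
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])       # component means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2), np.eye(2)])  # component covariances

# Sampling: draw Z ~ pi, then X | Z = j ~ N(mu_j, Sigma_j).
n = 1000
z = rng.choice(len(pi), size=n, p=pi)
x = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])

# Mixture density: p_theta(x) = sum_j pi_j * f_j(x).
def p_theta(points):
    return sum(pi_j * multivariate_normal(mu_j, Sigma_j).pdf(points)
               for pi_j, mu_j, Sigma_j in zip(pi, mu, Sigma))

print(p_theta(x[:5]))  # mixture density at the first five samples
```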
Estimation

We want to estimate $\theta = (\pi, \mu, \Sigma)$ where:
$\pi=(\pi_1,…,\pi_k)$, $\mu=(\mu_1,…,\mu_k)$, $\Sigma=(\Sigma_1,…,\Sigma_k)$
To do so, we use the maximum likelihood method (product of densities across all samples):
\[p_\theta(x)=\prod_{i=1}^n p_\theta(x_i)\]
\[l(\theta)=\log\left(\prod_{i=1}^n p_\theta(x_i)\right)=\sum_{i=1}^n \log p_\theta(x_i)\]
We thus need to find $\arg\max_\theta\, l(\theta)$.
Problem: the likelihood function is not convex!
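To see what is being maximized, here is a sketch of the log-likelihood computed stably with log-sum-exp over the $k$ components (reusing the names from the sketch above; `logsumexp` is from SciPy):

```python
from scipy.special import logsumexp

# l(theta) = sum_i log p_theta(x_i) = sum_i log sum_j pi_j f_j(x_i),
# evaluated in log space for numerical stability.
def log_likelihood(points):
    # log_pdfs[i, j] = log(pi_j) + log f_j(x_i)
    log_pdfs = np.column_stack([
        np.log(pi_j) + multivariate_normal(mu_j, Sigma_j).logpdf(points)
        for pi_j, mu_j, Sigma_j in zip(pi, mu, Sigma)
    ])
    return logsumexp(log_pdfs, axis=1).sum()

print(log_likelihood(x))
```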
The expectation-maximization (EM) algorithm is used when we have latent variables (variables that are not directly observed).
Let $z_n=(z_{n1},\dots,z_{nk})$ be the latent one-hot vector indicating which component generated observation $x_n$. We can express the density as a joint function with respect to $z$:
\[p_\theta(x,z)=p_\theta(z)\,p_\theta(x|z)\]
The rest is similar to the example in the Stats notes (based on this document), using the Gaussian density instead of the Binomial density.
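Concretely, the quantity maximized in the M-step below is the expected complete-data log-likelihood (a standard expansion, stated here for reference):
\[\mathbb{E}_z[\log p_\theta(x,z)] = \sum_{n=1}^N \sum_{k} \mathbb{E}[z_{nk}]\left(\log \pi_k + \log f_k(x_n)\right)\]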
Initialize the parameters $\{\mu_1, \mu_2, \mu_3\}$, $\{\Sigma_1, \Sigma_2, \Sigma_3\}$ and the mixing proportions $\{\pi_1, \pi_2, \pi_3\}$ (here with $k = 3$ components). Each iteration then alternates an E-step, which computes the responsibilities $\mathbb{E}[z_{nk}] = \frac{\pi_k f_k(x_n)}{\sum_j \pi_j f_j(x_n)}$ under the current parameters, and an M-step, which re-estimates the parameters.
At a (local) optimum:
$\frac{\partial\, \mathbb{E}[\text{log-likelihood}]}{\partial \Sigma_k} = 0$
$\frac{\partial\, \mathbb{E}[\text{log-likelihood}]}{\partial \mu_k} = 0$
$\frac{\partial\, \mathbb{E}[\text{log-likelihood}]}{\partial \pi_k} = 0$ (subject to $\sum_k \pi_k = 1$, handled with a Lagrange multiplier)
Solving these gives the closed-form updates:
\[\pi_k = \frac{\sum_{n=1}^N \mathbb{E}[z_{nk}]}{N}\]
\[\mu_k = \frac{\sum_{n=1}^N \mathbb{E}[z_{nk}]\,x_n}{\sum_{n=1}^N \mathbb{E}[z_{nk}]}\]
\[\Sigma_k = \frac{\sum_{n=1}^N \mathbb{E}[z_{nk}](x_n-\mu_k)(x_n-\mu_k)^T}{\sum_{n=1}^N \mathbb{E}[z_{nk}]}\]
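A minimal sketch of one EM iteration implementing exactly these updates (continuing the earlier NumPy/SciPy sketch; convergence checks omitted):

```python
# One full EM iteration (reusing x, pi, mu, Sigma from the earlier sketch).
def em_step(points, pi, mu, Sigma):
    n_samples = len(points)
    # E-step: responsibilities E[z_nk] = pi_k f_k(x_n) / sum_j pi_j f_j(x_n)
    weighted = np.column_stack([
        pi_k * multivariate_normal(mu_k, Sigma_k).pdf(points)
        for pi_k, mu_k, Sigma_k in zip(pi, mu, Sigma)
    ])
    resp = weighted / weighted.sum(axis=1, keepdims=True)

    # M-step: the closed-form updates derived above.
    Nk = resp.sum(axis=0)                    # sum_n E[z_nk], one value per cluster
    pi_new = Nk / n_samples                  # pi_k
    mu_new = (resp.T @ points) / Nk[:, None] # mu_k, weighted means
    Sigma_new = np.array([                   # Sigma_k, weighted covariances
        (resp[:, k, None] * (points - mu_new[k])).T @ (points - mu_new[k]) / Nk[k]
        for k in range(len(pi))
    ])
    return pi_new, mu_new, Sigma_new

pi, mu, Sigma = em_step(x, pi, mu, Sigma)  # repeat until the log-likelihood stabilizes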
Difference with K-Means

Both methods focus on globular (convex) clusters. The main differences (see the sketch after this list):

- GMM considers that every observation belongs to every cluster with a certain probability, i.e. it takes uncertainty into account. In K-Means, every observation belongs to exactly one cluster, with probability 1.
- GMM estimates a full variance-covariance matrix per cluster. K-Means implicitly uses the same isotropic covariance matrix for every cluster.
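To see the soft-versus-hard assignment difference concretely, a short sketch with scikit-learn (assumed available; `x` comes from the first sketch above):

```python
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

# GMM returns one probability per cluster; K-Means returns a single hard label.
gmm_probs = GaussianMixture(n_components=3, random_state=0).fit(x).predict_proba(x)
kmeans_labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(x)

print(gmm_probs[0])      # soft memberships summing to 1 across the 3 clusters
print(kmeans_labels[0])  # a single cluster index, membership probability = 1
```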
