PCA’s objective is to approximate the data in a low-dimensional space.
Inertia
Inertia $I_M = \sum_{i=1}^n p_i ||x_i - g||_M^2$
where $g^T = (\bar{x}^{(1)},\dots,\bar{x}^{(p)})$ is the vector of variable means, also called the center of gravity.
Note: the inertia in physics is analogous to the variance in statistics.
–> The inertia is thus the weighted average of the squared distances between each observation and the center of gravity.
–> $p_i$ is the weight given to each observation. Most of the time, $p_i = \frac{1}{n}$ (every observation contributes equally to the analysis).
–> The distance $||.||_M$ depends on the chosen metric $M$ (a small numerical sketch follows this list).
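A quick numerical sketch of this definition (toy data invented for the illustration, uniform weights $p_i = 1/n$, and the identity metric $M = I$); it also checks the identity $I_M = Tr(MV)$ derived just below:
import numpy as np

# Toy dataset: n = 4 observations, p = 2 variables (invented for the example)
X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [0.0, 1.0],
              [4.0, 5.0]])
n, p = X.shape

weights = np.full(n, 1.0 / n)   # p_i = 1/n: every observation weighs the same
M = np.eye(p)                   # identity metric -> usual Euclidean distance
g = weights @ X                 # center of gravity (weighted column means)

# Inertia: weighted sum of squared M-distances to the center of gravity
diffs = X - g
inertia = sum(w * d @ M @ d for w, d in zip(weights, diffs))

# Same value as Tr(MV), with V the weighted covariance matrix (derived below)
V = diffs.T @ np.diag(weights) @ diffs
print(inertia, np.trace(M @ V))   # the two numbers match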
If the data are centered:
\[I_M = \sum_{i=1}^n p_i x_i^T M x_i\]
Since $I_M \in \mathbb{R}$ (a scalar is equal to its own trace):
\[I_M = Tr\Big(\sum_{i=1}^n p_i x_i^T M x_i\Big)\]
Thanks to the linearity of the trace and its cyclic property ($Tr(AB) = Tr(BA)$):
\[I_M = Tr\Big(M \sum_{i=1}^n p_i x_i x_i^T\Big)\]
With $V = Cov(X) = \sum_{i=1}^n p_i x_i x_i^T$ (the data being centered):
\[I_M = Tr(MV)\]
Projection
In order to represent the data in a low-dimensional space, we use projections.
The projection should distort the initial space as little as possible, that is:
=> minimize the distances between the observations and their projections
=> maximize the average of squared distances between projected points
=> maximize inertia of the projected points
In the figure below, maximizing the inertia leads to choosing the projection on the right since $d_2 > d_1$.
The figure above also shows that PCA achieves a dual objective: finding the direction of maximum variance of the projected points (green arrow = projected variance) while minimizing the squared distances from the initial data to the projected space (blue arrow). In that sense, PCA can be seen as a least-squares problem, with the difference that the distances are measured orthogonally to the fitted subspace rather than along a response axis as in OLS.
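To make the idea concrete, here is a minimal sketch (toy correlated 2-D data generated only for this illustration): it scans candidate unit directions, records the inertia of the projected points, and checks that the best direction coincides, up to sign, with the leading eigenvector of the covariance matrix, i.e. the first principal axis.
import numpy as np

rng = np.random.default_rng(0)
# Toy correlated 2-D data, centered (invented for illustration)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)
X = X - X.mean(axis=0)

# Scan unit directions u(theta) and record the variance (inertia) of the projections
thetas = np.linspace(0, np.pi, 1000)
variances = []
for theta in thetas:
    u = np.array([np.cos(theta), np.sin(theta)])
    proj = X @ u                      # signed coordinates of the projected points
    variances.append(np.mean(proj ** 2))

best_theta = thetas[np.argmax(variances)]
best_u = np.array([np.cos(best_theta), np.sin(best_theta)])

# Compare with the leading eigenvector of the covariance matrix
V = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(V)
lead = eigvecs[:, -1]                 # eigh returns eigenvalues in ascending order

print(best_u, lead)                   # equal up to sign and grid resolution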
Let $P$ be a projector and $V = Cov(X) = X^TDX$ (with $D = diag(p_1,\dots,p_n)$ the weight matrix and $X$ centered). The projected observations are the rows of $XP^T$ (each row $x_i^T$ becomes $(Px_i)^T = x_i^TP^T$), so the covariance matrix of the projected points is:
\[\begin{align*} V_P &= (XP^T)^TD(XP^T) \\ &= PX^TDXP^T \\ &= PVP^T \end{align*}\]
Note: an $M$-orthogonal projector $P$ satisfies $P^2=P$ (idempotence) and $MP = P^TM$ ($M$-symmetry)
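These properties can be checked numerically. The sketch below (random centered data, an arbitrary diagonal metric $M$, and a random direction $a$, all invented for the example) builds the $M$-orthogonal projector onto a single direction using the formula given in the next section and verifies idempotence, $M$-symmetry, and $V_P = PVP^T$:
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)                      # centered data
D = np.eye(n) / n                           # uniform weight matrix
M = np.diag([1.0, 2.0, 0.5])                # an arbitrary (diagonal) metric

a = rng.standard_normal(p)
P = np.outer(a, a) @ M / (a @ M @ a)        # M-orthogonal projector onto span(a)

print(np.allclose(P @ P, P))                # idempotence: P^2 = P
print(np.allclose(M @ P, P.T @ M))          # M-symmetry: MP = P^T M

V = X.T @ D @ X                             # covariance with weight matrix D
X_proj = X @ P.T                            # projected observations (rows)
V_P = X_proj.T @ D @ X_proj
print(np.allclose(V_P, P @ V @ P.T))        # V_P = P V P^T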
Optimisation
As seen previously, the objective is to maximize the inertia. Combining the previous two paragraphs, we can express the inertia of the projected points:
\[Ip_M = Tr(V_PM) = Tr(PVP^TM)\]
$Tr(PVP^TM) = Tr(PVMP)$ since $P^TM = MP$
$~~~~~~~~~~~~~~~~~~~~~~~~= Tr(VMP^2)$ since $Tr(AB) = Tr(BA)$
$~~~~~~~~~~~~~~~~~~~~~~~~= Tr(VMP)$ since $P^2 = P$
Thus the optimisation problem is:
\[\max~Ip_M = \max~Tr(VMP)\]
The objective is to find the line (in black in the figure above) going through $g$ that maximizes the inertia. Let $a$ be a direction vector of this line. The $M$-orthogonal projector onto the line spanned by $a$ is:
\[P=a(a'Ma)^{-1}a'M\]
(Indeed, one can check that $P^2=P$ and $MP = P^TM$.)
$Tr(VMP) = Tr(VMa(a'Ma)^{-1}a'M)$
$~~~~~~~~~~~~~~~~~~= \frac{1}{a'Ma}Tr(VMaa'M)$
$~~~~~~~~~~~~~~~~~~= \frac{Tr(a'MVMa)}{a'Ma}$
$~~~~~~~~~~~~~~~~~~= \frac{a'MVMa}{a'Ma}$ since $a'MVMa$ is a scalar
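A quick numerical sanity check of this trace identity (random symmetric $V$, diagonal metric $M$, and direction $a$, all invented for the example):
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = rng.standard_normal((p, p))
V = A @ A.T                                  # a symmetric PSD "covariance" matrix
M = np.diag(rng.uniform(0.5, 2.0, size=p))   # an arbitrary diagonal metric
a = rng.standard_normal(p)

P = np.outer(a, a) @ M / (a @ M @ a)         # projector onto span(a)
lhs = np.trace(V @ M @ P)
rhs = (a @ M @ V @ M @ a) / (a @ M @ a)
print(np.isclose(lhs, rhs))                  # True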
In order to obtain the maximum, we use the first-order optimality condition:
\[\frac{d}{da}\Big(\frac{a'MVMa}{a'Ma}\Big) = 0\]
With $\frac{d}{da}\big(\frac{a'MVMa}{a'Ma}\big) = \frac{(a'Ma)\,2MVMa-(a'MVMa)\,2Ma}{(a'Ma)^2}$ (quotient rule, using the symmetry of $M$ and $V$), the previous equation becomes:
$MVMa = (\frac{a'MVMa}{a'Ma})Ma$
Since $\frac{a'MVMa}{a'Ma}$ is a scalar, let's replace it by $\lambda$ and multiply both sides on the left by $M^{-1}$ ($M$ is invertible):
\[VMa = \lambda a\]
Based on the definition of eigenvalues, $\lambda$ is thus an eigenvalue of $VM$ and $a$ an associated eigenvector.
We can thus rewrite the optimization problem:
\[\max~Ip_M = \max~Tr(VMP) = \max~\lambda\]
This final result leads to the theorem:
The lower-dimensional space is spanned by the eigenvectors of $VM$ (of the covariance matrix $V$ when $M = I$) associated with the largest eigenvalues.
Summary of the proof:
Minimization of the distortion => maximization of the inertia of the projected points => selection of the eigenvectors associated with the largest eigenvalues of the covariance matrix.
In practice (with $M = I$), we simply use the spectral decomposition of the covariance matrix:
\[Cov(X) = \frac{1}{n}X^TX = U\Lambda U^T\]
where $X$ is centered, $U$ is the matrix with the eigenvectors in columns, and $\Lambda$ is diagonal. $\Lambda$ is the new covariance matrix in the orthogonal (all covariances are zero) projected space. The eigenvectors represent the main directions of variability of the reduced space, whereas the eigenvalues represent the magnitude of the variance along those directions. See more details in a step-by-step example here.
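The sketch below (random centered 3-D data, generated for the example) illustrates this: `np.linalg.eigh` recovers the decomposition, and expressing the data in the eigenvector basis gives a diagonal covariance matrix with the eigenvalues on the diagonal:
import numpy as np

rng = np.random.default_rng(3)
# Toy correlated data (invented for the example)
X = rng.multivariate_normal([0, 0, 0], [[2.0, 0.8, 0.3],
                                        [0.8, 1.0, 0.2],
                                        [0.3, 0.2, 0.5]], size=1000)
X = X - X.mean(axis=0)

V = X.T @ X / X.shape[0]                 # covariance of the centered data
eigvals, U = np.linalg.eigh(V)           # V = U diag(eigvals) U^T, U orthogonal

print(np.allclose(V, U @ np.diag(eigvals) @ U.T))   # spectral decomposition holds

X_new = X @ U                            # coordinates in the eigenvector basis
V_new = X_new.T @ X_new / X.shape[0]
print(np.round(V_new, 3))                # ~diagonal, eigenvalues on the diagonal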
Implementation
import numpy as np

def PCA(X, num_components):
    # Standardize variables (column-wise)
    X_std = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    # Compute covariance matrix
    cov_mat = np.cov(X_std, rowvar=False)
    # Find eigenvalues and eigenvectors (eigh: the covariance matrix is symmetric)
    eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)
    # Reminder: cov_mat == eigen_vectors @ np.diag(eigen_values) @ eigen_vectors.T
    # Sort eigenvalues and eigenvectors in decreasing order
    sorted_index = np.argsort(eigen_values)[::-1]
    sorted_eigenvalue = eigen_values[sorted_index]
    sorted_eigenvectors = eigen_vectors[:, sorted_index]
    # Keep eigenvectors associated with the highest eigenvalues
    eigenvector_subset = sorted_eigenvectors[:, 0:num_components]
    # Compute the reduced space
    X_reduced = X_std @ eigenvector_subset
    # X_reduced is the projection of the initial dataset;
    # the columns of eigenvector_subset span the projection subspace
    return X_reduced
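A minimal usage sketch of the `PCA` function above (random data generated only for this example); since the retained directions are orthogonal, the covariance of the reduced data should be approximately diagonal:
import numpy as np

rng = np.random.default_rng(4)
# Invented 4-D data with unequal variances
X = rng.multivariate_normal([0, 0, 0, 0],
                            np.diag([4.0, 2.0, 1.0, 0.5]) + 0.3,
                            size=500)

X_reduced = PCA(X, num_components=2)
print(X_reduced.shape)                                # (500, 2)
print(np.round(np.cov(X_reduced, rowvar=False), 3))   # approximately diagonal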
Note (1): standardizing is good practice before PCA; if it is not done, variables with high variances will dominate the components (as illustrated in the sketch below). Useful details here and here.
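To illustrate this note, the sketch below (two invented features carrying the same underlying signal, one on a much larger scale) shows that, without standardization, the leading eigenvector is dominated by the large-scale feature, while both features get comparable weights after standardization:
import numpy as np

rng = np.random.default_rng(5)
z = rng.standard_normal(300)                          # shared underlying signal
x1 = z + 0.5 * rng.standard_normal(300)               # unit-scale feature
x2 = 1000 * (z + 0.5 * rng.standard_normal(300))      # same signal, much larger scale
X = np.column_stack([x1, x2])

# Without standardization: the leading eigenvector is almost entirely x2
_, vecs_raw = np.linalg.eigh(np.cov(X, rowvar=False))
print(np.round(vecs_raw[:, -1], 3))                   # ~[0, 1] up to sign

# With standardization: both features get comparable weights
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
_, vecs_std = np.linalg.eigh(np.cov(X_std, rowvar=False))
print(np.round(vecs_std[:, -1], 3))                   # ~[0.7, 0.7] up to sign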
Explanation of the components
The principal components are the result of a matrix multiplication; consequently, they can be written as linear combinations of the initial features:
\[\begin{align*} \tilde{X} &= X \tilde{U}\\ &= \begin{pmatrix} 1 & 2\\ 3 & 4\\ 5 & 6 \end{pmatrix} \begin{pmatrix} 7 & 2\\ 5 & 4 \end{pmatrix} \\ &=\bigg( 7\begin{pmatrix} 1\\ 3\\ 5 \end{pmatrix} + 5\begin{pmatrix} 2\\ 4\\ 6 \end{pmatrix} ~~~~ 2\begin{pmatrix} 1\\ 3\\ 5 \end{pmatrix} + 4\begin{pmatrix} 2\\ 4\\ 6 \end{pmatrix}\bigg) \\ &=\bigg( 7f_1 + 5f_2 ~~~~ 2f_1 + 4f_2 \bigg)\\ &= \bigg( PC1 ~~~~ PC2 \bigg) \end{align*}\]
The helper below expresses each principal component as normalized weights of the original features:
def explain_pca(eigenvector_subset, feature_names=None):
    dict_res = {}
    weight_list = [[] for _ in range(eigenvector_subset.shape[1])]
    if feature_names is None:
        # Default feature names when none are provided
        feature_names = ['feat ' + str(i) for i in range(eigenvector_subset.shape[0])]
    for idx_col in range(eigenvector_subset.shape[1]):
        pc_str = 'PC' + str(idx_col + 1)
        # Normalize by the sum of absolute loadings of the component
        sum_col = np.sum(abs(eigenvector_subset[:, idx_col]))
        for idx_row in range(eigenvector_subset.shape[0]):
            weight = round(eigenvector_subset[idx_row, idx_col] / sum_col, 2)
            new_explanation = '{} = {}'.format(feature_names[idx_row], weight)
            if pc_str in dict_res:
                dict_res[pc_str].append(new_explanation)
            else:
                dict_res[pc_str] = [new_explanation]
            weight_list[idx_col].append(weight)
    return dict_res, weight_list
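A usage sketch of `explain_pca` (the data and the feature names 'height', 'weight' and 'age' are invented for the illustration); the eigenvector subset is computed with the same steps as in the `PCA` function above:
import numpy as np

rng = np.random.default_rng(6)
# Invented data and feature names, for illustration only
X = rng.multivariate_normal([0, 0, 0], [[1.0, 0.6, 0.2],
                                        [0.6, 1.0, 0.1],
                                        [0.2, 0.1, 1.0]], size=400)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cov_mat = np.cov(X_std, rowvar=False)
eigen_values, eigen_vectors = np.linalg.eigh(cov_mat)
order = np.argsort(eigen_values)[::-1]
eigenvector_subset = eigen_vectors[:, order][:, :2]   # keep the first 2 PCs

explanations, weights = explain_pca(eigenvector_subset,
                                    feature_names=['height', 'weight', 'age'])
print(explanations['PC1'])   # e.g. ['height = ...', 'weight = ...', 'age = ...']
print(weights[0])            # the corresponding numeric weights of PC1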