I GOTCHU BBG
Okay so. Imagine your data is like a chaotic group chat with 50 people all talking at once. It's a LOT. PCA is that one bestie who goes "okay girlies, let's SUMMARISE." It takes your high-dimensional data and squishes it into fewer dimensions while keeping the most important info intact.
Official definition: PCA is an unsupervised dimensionality reduction technique that finds new axes (Principal Components) along which the data has maximum variance, then projects the data onto those axes.
| assumption | what it means in plain english | vibe |
|---|---|---|
| Linearity | PCA only finds straight-line relationships between features | 📏 |
| Large variance = info | The more a feature varies, the more important it is. Boring features get yeeted. | 📣 |
| Standardised data | All features must be on the same scale or the big numbers bully the small ones | ⚖️ |
| Orthogonality | Each PC is perpendicular (90°) to every other PC. They never overlap. | 📐 |
| Continuous data | Works best with numbers, not categories like "cat" or "dog" | 🔢 |
| fancy word | what it means | remember it as |
|---|---|---|
| Eigenvector | The DIRECTION of a Principal Component: a unit vector \(\mathbf{v}\) where \(\|\mathbf{v}\|=1\) | The arrow pointing to where the drama is 🏹 |
| Eigenvalue | HOW MUCH variance (\(\lambda\)) is in that direction. Bigger = more important. | The drama level on a scale of 1–100 💣 |
| Principal Component | A new axis = linear combination of original features, ranked by importance | The main character energy axis ⭐ |
| Loading | Correlation between an original feature and a PC (the entries of the eigenvector) | Which friend is actually holding up the group 💪 |
| Score | Where a data point lands after projection: \(z = \mathbf{x} \cdot \mathbf{v}\) | Your data point's new address 📍 |
| Covariance | How two features move together: positive, negative, or not at all | Are they besties or enemies? 💕💔 |
| Scree Plot | Graph of \(\lambda_i\) in descending order. Find the "elbow" = how many PCs to keep. | The Netflix scroll of eigenvalues 📺 |
| Explained Variance | Percentage of total info captured: \(\lambda_i / \sum_j \lambda_j\) | How much gossip one person knows 🗣️ |
Every feature must be on the same scale. Without this, salary = 50000 bulldozes age = 25, even if age matters equally. We compute the z-score for every value in every feature column:
\[z = \frac{x - \mu}{\sigma}\]
where \(\displaystyle\mu = \frac{1}{n}\sum_{i=1}^{n} x_i\) is the column mean, and \(\displaystyle\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n-1}}\) is the (sample) standard deviation.
🧮 Example: Feature values \(= \{2,\ 4,\ 6,\ 8,\ 10\}\)
\[\mu = \frac{2+4+6+8+10}{5} = 6.0\]
\[\sigma = \sqrt{\frac{(2-6)^2+(4-6)^2+(6-6)^2+(8-6)^2+(10-6)^2}{4}} = \sqrt{\frac{16+4+0+4+16}{4}} = \sqrt{10} \approx 3.16\]
For \(x=2\): \(\;z = \dfrac{2-6}{\sqrt{10}} \approx -1.26\). For \(x=10\): \(\;z = \dfrac{10-6}{\sqrt{10}} \approx +1.26\).
✅ After this, every feature has mean \(= 0\) and standard deviation \(= 1\).
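Receipts for the example above, as a tiny NumPy sketch (NumPy is my choice here, not mandated by anything; any library with mean/std works):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

mu = x.mean()            # column mean: 6.0
sigma = x.std(ddof=1)    # sample std with n-1 in the denominator: sqrt(10) ≈ 3.162
z = (x - mu) / sigma     # z-scores

print(z.round(2))        # [-1.26 -0.63  0.    0.63  1.26]
```

Note `ddof=1`: NumPy's default `std()` divides by \(n\), but the formula above uses \(n-1\).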
Now we measure how each pair of features varies together. The result is a \(p \times p\) symmetric matrix (where \(p\) = number of features). For the full data matrix \(\mathbf{X}\) (already mean-centred), the covariance matrix is:
\[\mathbf{C} = \frac{\mathbf{X}^\top \mathbf{X}}{n-1}\]
🧮 Example: 3 data points, 2 features \(x_1,\, x_2\) (centred using the means below):
\[\mathbf{X} = \begin{pmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 4 \end{pmatrix}, \qquad \bar{x}_1 = 2,\quad \bar{x}_2 = 3\]
\[\text{Cov}(x_1, x_2) = \frac{(1-2)(2-3)+(2-2)(3-3)+(3-2)(4-3)}{3-1} = \frac{(1)+(0)+(1)}{2} = 1.0\]
Similarly \(\text{Cov}(x_1,x_1)=1\) and \(\text{Cov}(x_2,x_2)=1\), so:
\[\mathbf{C} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}\]
Diagonal entries = variance of each feature. Off-diagonal = covariance between pairs.
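You can sanity-check this hand calculation with `np.cov` (note `rowvar=False`, because our rows are samples and our columns are features; `np.cov` uses the \(n-1\) denominator by default):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0]])

# rowvar=False → columns are treated as features
C = np.cov(X, rowvar=False)
print(C)
# [[1. 1.]
#  [1. 1.]]
```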
We decompose the covariance matrix. Each eigenvector gives a direction (Principal Component); its eigenvalue tells us how much variance lives in that direction.
\(\mathbf{v}\) = eigenvector (direction), \(\lambda\) = eigenvalue (amount of variance). They satisfy \(\mathbf{C}\mathbf{v} = \lambda\mathbf{v}\). To find \(\lambda\), solve the characteristic equation:
\[\det(\mathbf{C} - \lambda\mathbf{I}) = 0\]
🧮 Example with \(\mathbf{C} = \begin{pmatrix}2&1\\1&2\end{pmatrix}\):
Find eigenvalues:
\[\det\!\left(\mathbf{C} - \lambda\mathbf{I}\right) = \det\begin{pmatrix}2-\lambda & 1\\1 & 2-\lambda\end{pmatrix} = (2-\lambda)^2 - 1 = 0\]
\[\lambda^2 - 4\lambda + 3 = 0 \implies (\lambda - 3)(\lambda - 1) = 0\]
\[\boxed{\lambda_1 = 3, \qquad \lambda_2 = 1}\]
Find eigenvectors: substitute each \(\lambda\) into \((\mathbf{C}-\lambda\mathbf{I})\mathbf{v}=\mathbf{0}\):
For \(\lambda_1 = 3\): \[\mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\end{pmatrix} \approx \begin{pmatrix}0.707\\0.707\end{pmatrix}\]
For \(\lambda_2 = 1\): \[\mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\-1\end{pmatrix} \approx \begin{pmatrix}0.707\\-0.707\end{pmatrix}\]
Note: \(\mathbf{v}_1 \cdot \mathbf{v}_2 = 0\), so they are orthogonal ✅
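Nobody solves the characteristic polynomial by hand in practice; for a symmetric matrix like \(\mathbf{C}\), `np.linalg.eigh` does it for you (shown here on the same example matrix):

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is for symmetric matrices; it returns eigenvalues in ASCENDING order
eigvals, eigvecs = np.linalg.eigh(C)

# sort descending so PC1 (biggest eigenvalue) comes first
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

print(eigvals)        # [3. 1.]
print(eigvecs[:, 0])  # ±[0.707 0.707] — the overall sign is arbitrary
```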
Rank your Principal Components. Largest eigenvalue = PC1 (most important). Calculate how much variance each one explains:
\[\text{Explained variance ratio of PC}_i = \frac{\lambda_i}{\sum_j \lambda_j}\]
🧮 From our example, \(\lambda_1 = 3,\; \lambda_2 = 1\):
\[\text{PC1:} \quad \frac{3}{3+1} = \frac{3}{4} = 75\%\]
\[\text{PC2:} \quad \frac{1}{3+1} = \frac{1}{4} = 25\%\]
Cumulative: PC1 + PC2 = 100%. If you're applying the 95% rule, PC1 alone (75%) isn't quite enough here, so keep both.
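The explained-variance bookkeeping is two lines of NumPy:

```python
import numpy as np

eigvals = np.array([3.0, 1.0])          # eigenvalues, PC1 first

ratios = eigvals / eigvals.sum()        # explained variance per PC
print(ratios)                           # [0.75 0.25]
print(np.cumsum(ratios))                # [0.75 1.  ] — cumulative
```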
Three ways to decide how many PCs to keep (pick your weapon):
1. Scree plot: keep the PCs before the "elbow" where the eigenvalues flatten out.
2. Cumulative explained variance: keep the smallest \(k\) whose PCs together capture your threshold (commonly 95%).
3. Kaiser criterion: on standardised data, keep PCs with \(\lambda_i > 1\) (they explain more than a single original feature would).
Stack your chosen \(k\) eigenvectors as columns into a projection matrix \(\mathbf{W}\) (shape \(p \times k\)), then multiply to get the new reduced dataset:
\[\mathbf{Z} = \mathbf{X}_{\text{std}}\,\mathbf{W} \qquad (m \times p)(p \times k) \;\to\; m \times k\]
\(m\) = number of samples, \(p\) = original features, \(k\) = chosen components (\(k \ll p\))
🧮 Example projection: projecting 2 data points onto PC1 only (\(k=1\)):
\[\mathbf{X}_{\text{std}} = \begin{pmatrix}1.2 & 0.8 \\ -0.5 & 0.3\end{pmatrix}, \qquad \mathbf{w}_1 = \begin{pmatrix}0.707 \\ 0.707\end{pmatrix}\]
\[\mathbf{Z} = \mathbf{X}_{\text{std}} \cdot \mathbf{w}_1 = \begin{pmatrix}(1.2)(0.707)+(0.8)(0.707)\\(-0.5)(0.707)+(0.3)(0.707)\end{pmatrix} = \begin{pmatrix}1.414\\-0.141\end{pmatrix}\]
Your 2-feature data is now 1 number per point → dimension reduced from 2 to 1! ✅
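The same projection in NumPy, with the PC1 direction hard-coded from the example above:

```python
import numpy as np

X_std = np.array([[ 1.2, 0.8],
                  [-0.5, 0.3]])
w1 = np.array([0.70710678, 0.70710678])  # PC1 direction (1/sqrt(2), 1/sqrt(2))

Z = X_std @ w1                           # one score per data point
print(Z.round(3))                        # [ 1.414 -0.141]
```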
Raw data: 4 samples, 2 features (height & weight, made up for the example):
| Sample | \(x_1\) (height) | \(x_2\) (weight) |
|---|---|---|
| 1 | 2.5 | 2.4 |
| 2 | 0.5 | 0.7 |
| 3 | 2.2 | 2.9 |
| 4 | 1.9 | 2.2 |
Step 1: Compute means:
\[\bar{x}_1 = \frac{2.5+0.5+2.2+1.9}{4} = \frac{7.1}{4} = 1.775\]
\[\bar{x}_2 = \frac{2.4+0.7+2.9+2.2}{4} = \frac{8.2}{4} = 2.05\]
Step 2: Covariance matrix entries:
\[\text{Cov}(x_1,x_1) = \frac{(2.5-1.775)^2+(0.5-1.775)^2+(2.2-1.775)^2+(1.9-1.775)^2}{3}\] \[= \frac{(0.725)^2+(-1.275)^2+(0.425)^2+(0.125)^2}{3} = \frac{0.526+1.626+0.181+0.016}{3} = \frac{2.349}{3} \approx 0.783\]
\[\text{Cov}(x_1,x_2) \approx 0.785 \qquad \text{Cov}(x_2,x_2) \approx 0.897\]
\[\mathbf{C} = \begin{pmatrix}0.783 & 0.785 \\ 0.785 & 0.897\end{pmatrix}\]
Step 3: Eigenvalues via characteristic equation:
\[\text{tr}(\mathbf{C}) = 0.783 + 0.897 = 1.680 \qquad \det(\mathbf{C}) = (0.783)(0.897)-(0.785)^2 = 0.702-0.616 = 0.086\]
\[\lambda = \frac{\text{tr}(\mathbf{C}) \pm \sqrt{\,\text{tr}(\mathbf{C})^2 - 4\det(\mathbf{C})\,}}{2} = \frac{1.680 \pm \sqrt{2.822-0.344}}{2} = \frac{1.680 \pm \sqrt{2.478}}{2}\]
\[\boxed{\lambda_1 \approx 1.627 \qquad \lambda_2 \approx 0.053}\]
Step 4: Explained variance:
\[\text{PC1:}\quad \frac{1.627}{1.627+0.053} = \frac{1.627}{1.680} \approx 96.8\%\quad \checkmark\;\text{Keep only PC1!}\]
\[\text{PC2:}\quad \frac{0.053}{1.680} \approx 3.2\%\quad \text{(basically vibes, drop it)}\]
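The whole walkthrough, end to end, as a minimal NumPy sketch (centring only, no standardisation, to match the hand calculation on this data):

```python
import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2]])

Xc = X - X.mean(axis=0)              # Step 1: mean-centre each feature
C = Xc.T @ Xc / (len(X) - 1)         # Step 2: covariance matrix

eigvals, eigvecs = np.linalg.eigh(C) # Step 3: eigendecomposition
order = np.argsort(eigvals)[::-1]    # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

ratios = eigvals / eigvals.sum()     # Step 4: explained variance
Z = Xc @ eigvecs[:, :1]              # Step 5: project onto PC1 only

print(eigvals.round(3))              # eigenvalues, PC1 first
print(ratios.round(3))               # explained-variance ratios
```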
Instead of computing the covariance matrix, sklearn uses Singular Value Decomposition (SVD) on the data matrix directly, which is more numerically stable!
| SVD component | shape | maps to PCA as... |
|---|---|---|
| \(\mathbf{V}\) (right singular vectors) | \(p \times p\) | Eigenvectors of \(\mathbf{C}\) |
| Singular values \(\sigma_i\) | \(\min(m,p)\) values | Eigenvalues via \(\lambda_i = \sigma_i^2/(n-1)\) |
| \(\mathbf{U}\boldsymbol{\Sigma}\) | \(m \times p\) | PCA scores (projected data) |
The algebraic link between SVD and eigendecomposition:
\[\mathbf{C} = \frac{\mathbf{X}^\top\mathbf{X}}{n-1} = \frac{(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top)^\top(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top)}{n-1} = \mathbf{V}\,\frac{\boldsymbol{\Sigma}^2}{n-1}\,\mathbf{V}^\top\]
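This identity is easy to verify numerically on random data; the sketch below compares the two routes with NumPy only (sklearn's `PCA` takes the SVD route internally):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)                    # SVD-based PCA still needs centring
n = len(Xc)

# Route 1: eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (n - 1)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]   # descending

# Route 2: SVD of the centred data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# singular values map to eigenvalues via sigma^2 / (n-1)
print(np.allclose(s ** 2 / (n - 1), eigvals))    # True
```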
that's literally it. that's PCA. you know PCA now. you're literally a data scientist 💅✨
🔮 the sacred words of exam prophecy 🔮
Cylinder Dumeer Exam Damal
you literally understood PCA. you are the principal component of your friend group. go SLAY that exam bestie 💃🎉✨