๐ŸŽ€ โœจ ๐ŸŒธ โœจ ๐ŸŽ€

PCA Cheat Sheet

I GOTCHU BBG

๐Ÿง  ez to learn โœจ exam-ready ๐Ÿ“ maths included ๐Ÿ’– cutesy but important
๐ŸŒธ chapter 1 โ€” the vibe check
๐ŸŒธ

๐Ÿคฏ what even IS PCA??

Okay so. Imagine your data is like a chaotic group chat with 50 people all talking at once. It's a LOT. PCA is that one bestie who goes "okay girlies, let's SUMMARISE." It takes your high-dimensional data and squishes it into fewer dimensions while keeping the most important info intact.

It's like Marie Kondo-ing your dataset. Keep what sparks joy (variance). Throw out the clutter. ๐Ÿงนโœจ

Official definition: PCA is an unsupervised dimensionality reduction technique that finds new axes (Principal Components) along which the data has maximum variance, then projects the data onto those axes.

"Why does PCA exist?" โ€” because nobody wants to visualise 50 dimensions. Not even your professor. Especially not your professor. ๐Ÿ˜ญ
unsupervised · dimensionality reduction · linear method · preserves variance
๐ŸŒท chapter 2 โ€” before you even start
๐Ÿ’œ

๐Ÿ“‹ assumptions (the rules of the club)

• Linearity: PCA only finds straight-line relationships between features ๐Ÿ“
• Large variance = info: the more a feature varies, the more important it is; boring features get yeeted ๐Ÿ“Š
• Standardised data: all features must be on the same scale or the big numbers bully the small ones โš–๏ธ
• Orthogonality: each PC is perpendicular (90ยฐ) to every other PC; they never overlap ๐Ÿ“
• Continuous data: works best with numbers, not categories like "cat" or "dog" ๐Ÿ”ข
PCA is picky about its assumptions. Respect the rules or the math breaks and cries ๐Ÿ˜ค
๐Ÿ’ฌ chapter 3 โ€” vocab drop (memorise these girlie)
๐ŸŒฟ

๐Ÿ—๏ธ the glossary of girlies

• Eigenvector: the DIRECTION of a Principal Component โ€” a unit vector \(\mathbf{v}\) where \(\|\mathbf{v}\|=1\). Remember it as the arrow pointing to where the drama is ๐Ÿน
• Eigenvalue: HOW MUCH variance (\(\lambda\)) is in that direction; bigger = more important. The drama level on a scale of 1โ€“100 ๐Ÿ“ฃ
• Principal Component: a new axis, a linear combination of the original features, ranked by importance. The main character energy axis โญ
• Loading: the correlation between an original feature and a PC (the values in the eigenvector). Which friend is actually holding up the group ๐Ÿ’
• Score: where a data point lands after projection: \(z = \mathbf{x} \cdot \mathbf{v}\). Your data point's new address ๐Ÿ“
• Covariance: how two features move together (positive, negative, or not at all). Are they besties or enemies? ๐Ÿ’•๐Ÿ‘Š
• Scree Plot: a graph of the \(\lambda_i\) in descending order; find the "elbow" to decide how many PCs to keep. The Netflix scroll of eigenvalues ๐Ÿ“บ
• Explained Variance: the percentage of total info a PC captures: \(\lambda_i / \sum_j \lambda_j\). How much gossip one person knows ๐Ÿ—ฃ๏ธ

โœจ chapter 4 โ€” the sacred ritual (all 6 steps)
๐ŸŽ€

๐Ÿช„ PCA step by step (the algorithm, bestie)

1๏ธโƒฃ ๐Ÿงน Standardise your data โ€” z-score everything!

Every feature must be on the same scale. Without this, salary = 50000 bulldozes age = 25, even if age matters equally. We compute the z-score for every value in every feature column.

\[ z_i = \frac{x_i - \mu}{\sigma} \]

where  \(\displaystyle\mu = \frac{1}{n}\sum_{i=1}^{n} x_i\)  is the column mean, and  \(\displaystyle\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n-1}}\)  is the standard deviation.

๐Ÿงฎ Example: Feature values \(= \{2,\ 4,\ 6,\ 8,\ 10\}\)

\[\mu = \frac{2+4+6+8+10}{5} = 6.0\]

\[\sigma = \sqrt{\frac{(2-6)^2+(4-6)^2+(6-6)^2+(8-6)^2+(10-6)^2}{4}} = \sqrt{\frac{16+4+0+4+16}{4}} = \sqrt{10} \approx 3.16\]

For \(x=2\): \(\;z = \dfrac{2-6}{3.16} \approx -1.27\)   For \(x=10\): \(\;z = \dfrac{10-6}{3.16} \approx +1.27\)

โœ… After this, every feature has mean \(= 0\) and standard deviation \(= 1\).

Step 1 is not optional. Skip it and your PCA will be as unhinged as a group project where everyone used different fonts ๐Ÿ˜ญ
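The z-score step above can be sketched in a few lines of NumPy (a minimal sketch using the example values; `ddof=1` gives the sample standard deviation with the \(n-1\) denominator used above):

```python
import numpy as np

# the example feature column from above
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

mu = x.mean()             # 6.0
sigma = x.std(ddof=1)     # sample std with n-1 denominator: sqrt(10)

z = (x - mu) / sigma      # z-scores, symmetric around 0
```

After this, `z.mean()` is 0 and `z.std(ddof=1)` is 1 โ€” exactly the โœ… property above.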
2๏ธโƒฃ ๐Ÿค Compute the Covariance Matrix

Now we measure how each pair of features varies together. The result is a \(p \times p\) symmetric matrix (where \(p\) = number of features).

\[ \text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \]

For the full data matrix \(\mathbf{X}\) (already mean-centred), the covariance matrix is:

\[ \mathbf{C} = \frac{1}{n-1}\,\mathbf{X}^\top \mathbf{X} \]

๐Ÿงฎ Example โ€” 3 data points, 2 features \(x_1,\, x_2\) (the means get subtracted inside the formula, so no pre-processing is needed here):

\[\mathbf{X} = \begin{pmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 4 \end{pmatrix}, \qquad \bar{x}_1 = 2,\quad \bar{x}_2 = 3\]

\[\text{Cov}(x_1, x_2) = \frac{(1-2)(2-3)+(2-2)(3-3)+(3-2)(4-3)}{3-1} = \frac{(1)+(0)+(1)}{2} = 1.0\]

Similarly \(\text{Cov}(x_1,x_1)=1\) and \(\text{Cov}(x_2,x_2)=1\), so:

\[\mathbf{C} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}\]

Diagonal entries = variance of each feature. Off-diagonal = covariance between pairs.

Cov > 0 โ†’ besties ๐Ÿ’•  |  Cov < 0 โ†’ enemies ๐Ÿ‘Š  |  Cov โ‰ˆ 0 โ†’ they don't know each other exist ๐Ÿคท
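The tiny covariance example above can be checked with NumPy (`rowvar=False` says columns are features; `np.cov` uses the same \(n-1\) denominator):

```python
import numpy as np

# the 3-point, 2-feature example from above
X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0]])

# rows = samples, columns = features
C = np.cov(X, rowvar=False)
# C is the 2x2 matrix [[1, 1], [1, 1]] from the worked example
```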
3๏ธโƒฃ ๐Ÿ” Compute Eigenvalues & Eigenvectors

We decompose the covariance matrix. Each eigenvector gives a direction (Principal Component); its eigenvalue tells us how much variance lives in that direction.

\[ \mathbf{C}\,\mathbf{v} = \lambda\,\mathbf{v} \]

\(\mathbf{v}\) = eigenvector (direction), \(\lambda\) = eigenvalue (amount of variance). To find \(\lambda\), solve:

\[ \det(\mathbf{C} - \lambda\,\mathbf{I}) = 0 \]

๐Ÿงฎ Example with \(\mathbf{C} = \begin{pmatrix}2&1\\1&2\end{pmatrix}\):

Find eigenvalues:

\[\det\!\left(\mathbf{C} - \lambda\mathbf{I}\right) = \det\begin{pmatrix}2-\lambda & 1\\1 & 2-\lambda\end{pmatrix} = (2-\lambda)^2 - 1 = 0\]

\[\lambda^2 - 4\lambda + 3 = 0 \implies (\lambda - 3)(\lambda - 1) = 0\]

\[\boxed{\lambda_1 = 3, \qquad \lambda_2 = 1}\]

Find eigenvectors โ€” substitute each \(\lambda\) into \((\mathbf{C}-\lambda\mathbf{I})\mathbf{v}=\mathbf{0}\):

For \(\lambda_1 = 3\): \[\mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\end{pmatrix} \approx \begin{pmatrix}0.707\\0.707\end{pmatrix}\]

For \(\lambda_2 = 1\): \[\mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\-1\end{pmatrix} \approx \begin{pmatrix}0.707\\-0.707\end{pmatrix}\]

Note: \(\mathbf{v}_1 \cdot \mathbf{v}_2 = 0\) โ€” they are orthogonal โœ…

Eigenvectors are the "main character directions" of your data. Eigenvalues are how much screen time each gets. ๐ŸŽฌ
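A quick NumPy check of the eigen-step (a sketch; `np.linalg.eigh` is built for symmetric matrices like \(\mathbf{C}\) and returns eigenvalues in ascending order, so we flip to biggest-first):

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh handles symmetric matrices; eigenvalues come back ascending
eigvals, eigvecs = np.linalg.eigh(C)

# reorder biggest-first so PC1 comes first
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]        # [3.0, 1.0]
eigvecs = eigvecs[:, order]     # columns are the unit eigenvectors

# the PCs really are orthogonal: v1 . v2 = 0
dot = eigvecs[:, 0] @ eigvecs[:, 1]
```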
4๏ธโƒฃ ๐Ÿ“Š Sort Eigenvalues Biggest โ†’ Smallest

Rank your Principal Components. Largest eigenvalue = PC1 (most important). Calculate how much variance each one explains:

\[ \%\,\text{variance explained by PC}_i \;=\; \frac{\lambda_i}{\displaystyle\sum_{j=1}^{p}\lambda_j} \times 100 \]

๐Ÿงฎ From our example โ€” \(\lambda_1 = 3,\; \lambda_2 = 1\):

\[\text{PC1:} \quad \frac{3}{3+1} = \frac{3}{4} = 75\%\]

\[\text{PC2:} \quad \frac{1}{3+1} = \frac{1}{4} = 25\%\]

Cumulative: PC1 + PC2 = 100%. If the 95% rule applies, PC1 alone (75%) isn't enough here โ€” keep both.

PC1 is the main character. PC2 is the best friend. Everything after that is background extras. ๐ŸŒŸ
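The sort-and-percentage step is one line each in NumPy, using the eigenvalues from the example:

```python
import numpy as np

eigvals = np.array([3.0, 1.0])          # already sorted biggest-first

explained = eigvals / eigvals.sum()     # [0.75, 0.25] -> PC1 75%, PC2 25%
cumulative = np.cumsum(explained)       # [0.75, 1.00]
```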
5๏ธโƒฃ โœ‚๏ธ Choose How Many Components to Keep (\(k\))

Three ways to decide โ€” pick your weapon:

๐ŸŽฏ Scree Plot: Plot \(\lambda_i\) vs PC index. Find the elbow โ€” keep PCs before the sharp drop.
๐Ÿ“ˆ Cumulative Variance: Keep the smallest \(k\) such that \(\displaystyle\sum_{i=1}^{k} \frac{\lambda_i}{\sum_j \lambda_j} \geq 0.95\). Most common rule!
๐Ÿ‘‘ Kaiser Rule: Keep only PCs where \(\lambda_i > 1\) โ€” on standardised data every original variable has variance 1, so these PCs explain more than any single original variable.
The scree plot elbow is basically "vibes-based statistics." Your prof will love you for knowing this. ๐Ÿ“บโœ‚๏ธ
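The cumulative-variance and Kaiser rules above can be sketched as follows (the eigenvalues here are hypothetical, just for illustration):

```python
import numpy as np

# hypothetical sorted eigenvalues for a 5-feature dataset
eigvals = np.array([4.0, 2.5, 1.0, 0.3, 0.2])

explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)       # [0.5, 0.8125, 0.9375, 0.975, 1.0]

# cumulative variance rule: smallest k with >= 95% captured
k = int(np.searchsorted(cumulative, 0.95) + 1)    # keeps 4 PCs here

# Kaiser rule: keep PCs with eigenvalue > 1
k_kaiser = int((eigvals > 1).sum())               # keeps 2 PCs here
```

Note the two rules can disagree, as they do on these made-up numbers; say which one you're using.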
6๏ธโƒฃ ๐Ÿš€ Project Data onto the New PC Axes

Stack your chosen \(k\) eigenvectors as columns into a projection matrix \(\mathbf{W}\), then multiply to get the new reduced dataset \(\mathbf{Z}\).

\[ \mathbf{Z}_{m \times k} = \mathbf{X}_{\text{std},\; m \times p} \;\cdot\; \mathbf{W}_{p \times k} \]

\(m\) = number of samples, \(p\) = original features, \(k\) = chosen components (\(k \ll p\))

๐Ÿงฎ Example projection โ€” projecting 2 data points onto PC1 only (\(k=1\)):

\[\mathbf{X}_{\text{std}} = \begin{pmatrix}1.2 & 0.8 \\ -0.5 & 0.3\end{pmatrix}, \qquad \mathbf{w}_1 = \begin{pmatrix}0.707 \\ 0.707\end{pmatrix}\]

\[\mathbf{Z} = \mathbf{X}_{\text{std}} \cdot \mathbf{w}_1 = \begin{pmatrix}(1.2)(0.707)+(0.8)(0.707)\\(-0.5)(0.707)+(0.3)(0.707)\end{pmatrix} = \begin{pmatrix}1.414\\-0.141\end{pmatrix}\]

Your 2-feature data is now 1 number per point โœ… โ€” dimension reduced from 2 to 1!

You went from 50-dimensional chaos to a clean 2-feature banger. PCA really said "less is more bestie." ๐Ÿ’…โœจ
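The projection above, replayed in NumPy (same numbers, PC1 only):

```python
import numpy as np

X_std = np.array([[1.2, 0.8],
                  [-0.5, 0.3]])

# PC1 direction as a p x k matrix with k = 1
W = np.array([[0.707],
              [0.707]])

Z = X_std @ W        # shape (2, 1): one score per data point
# Z is approximately [[1.414], [-0.141]], matching the hand calculation
```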

๐Ÿงฎ chapter 5 โ€” full worked example
๐ŸŒป

๐Ÿ“ doing the whole thing start to finish

Raw data โ€” 4 samples, 2 features (height & weight, made up for the example; both live on similar scales, so we only mean-centre here instead of fully standardising):

Sample    \(x_1\) (height)    \(x_2\) (weight)
  1             2.5                 2.4
  2             0.5                 0.7
  3             2.2                 2.9
  4             1.9                 2.2

Step 1 โ€” Compute means:

\[\bar{x}_1 = \frac{2.5+0.5+2.2+1.9}{4} = \frac{7.1}{4} = 1.775\]

\[\bar{x}_2 = \frac{2.4+0.7+2.9+2.2}{4} = \frac{8.2}{4} = 2.05\]

Step 2 โ€” Covariance matrix entries:

\[\text{Cov}(x_1,x_1) = \frac{(2.5-1.775)^2+(0.5-1.775)^2+(2.2-1.775)^2+(1.9-1.775)^2}{3}\] \[= \frac{(0.725)^2+(-1.275)^2+(0.425)^2+(0.125)^2}{3} = \frac{0.526+1.626+0.181+0.016}{3} = \frac{2.349}{3} \approx 0.783\]

\[\text{Cov}(x_1,x_2) \approx 0.785 \qquad \text{Cov}(x_2,x_2) \approx 0.897\]

\[\mathbf{C} = \begin{pmatrix}0.783 & 0.785 \\ 0.785 & 0.897\end{pmatrix}\]

Step 3 โ€” Eigenvalues via characteristic equation:

\[\text{tr}(\mathbf{C}) = 0.783 + 0.897 = 1.680 \qquad \det(\mathbf{C}) = (0.783)(0.897)-(0.785)^2 = 0.702-0.616 = 0.086\]

\[\lambda = \frac{\text{tr}(\mathbf{C}) \pm \sqrt{\,\text{tr}(\mathbf{C})^2 - 4\det(\mathbf{C})\,}}{2} = \frac{1.680 \pm \sqrt{2.822-0.344}}{2} = \frac{1.680 \pm \sqrt{2.478}}{2}\]

\[\boxed{\lambda_1 \approx 1.627 \qquad \lambda_2 \approx 0.053}\]

Step 4 โ€” Explained variance:

\[\text{PC1:}\quad \frac{1.627}{1.627+0.053} = \frac{1.627}{1.680} \approx 96.9\%\quad \checkmark\;\text{Keep only PC1!}\]

\[\text{PC2:}\quad \frac{0.053}{1.680} \approx 3.1\%\quad \text{(basically vibes โ€” drop it)}\]

96.9% with just one component?? PCA really said "one is enough" and ATE. ๐Ÿ’…
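The whole worked example can be replayed in NumPy to confirm the numbers (`np.cov` mean-centres internally and uses the \(n-1\) denominator):

```python
import numpy as np

# the 4-sample height/weight data from the worked example
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2]])

C = np.cov(X, rowvar=False)                 # 2x2 covariance matrix

eigvals = np.linalg.eigvalsh(C)[::-1]       # biggest-first: ~[1.627, 0.053]

explained = eigvals / eigvals.sum()         # PC1 ~ 96.9%, PC2 ~ 3.1%
```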
๐Ÿ’ป chapter 6 โ€” the python code (copy and paste bestie)
๐Ÿ’™

๐Ÿ sklearn code (actual exam weapon)

# step 0: some data so this runs as-is (swap in your own matrix, bestie)
import numpy as np
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2]])

# step 1: import the girlies
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# step 2: standardise (NEVER skip this!!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# step 3: apply PCA (choose k=2 components)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# step 4: check how much variance each PC explains
pca.explained_variance_ratio_   # fraction of variance per PC, biggest first
pca.components_                 # the eigenvectors, one per row (loadings)
pca.explained_variance_         # the eigenvalues themselves

# bonus: auto-choose k for a 95% variance threshold
pca_auto = PCA(n_components=0.95)
X_auto = pca_auto.fit_transform(X_scaled)

sklearn does all the eigen-maths for you. But you still have to know HOW it works for the exam ๐Ÿ˜‡
๐ŸŽฏ chapter 7 โ€” use it or lose it
โœ… use PCA when...

  • Too many features (high dimensionality)
  • Features are correlated / multicollinear
  • You want to visualise data in 2D/3D
  • You want to speed up ML training
  • You want to remove noise from data
  • You need to compress data
โŒ

skip PCA when...

  • You need feature interpretability
  • Data is non-linear (use Kernel PCA)
  • Dataset is tiny (you'll overfit)
  • Features are already independent
  • You have categorical data only
  • Accuracy matters more than compression
๐Ÿ‘

๐Ÿ”„ PCA via SVD (the sneaky alternative)

Instead of computing the covariance matrix, sklearn uses Singular Value Decomposition (SVD) on the data matrix directly โ€” more numerically stable!

\[ \mathbf{X} = \mathbf{U}\,\boldsymbol{\Sigma}\,\mathbf{V}^\top \]
How each SVD piece maps to PCA:

• \(\mathbf{V}\) (right singular vectors, shape \(p \times p\)) โ†’ the eigenvectors of \(\mathbf{C}\)
• \(\sigma_i^2/(n-1)\) (squared singular values, rescaled) โ†’ the eigenvalues \(\lambda_i\)
• \(\mathbf{U}\boldsymbol{\Sigma}\) (shape \(m \times p\)) โ†’ the PCA scores (projected data)

The algebraic link between SVD and eigendecomposition:

\[\mathbf{C} = \frac{\mathbf{X}^\top\mathbf{X}}{n-1} = \frac{(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top)^\top(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top)}{n-1} = \mathbf{V}\,\frac{\boldsymbol{\Sigma}^2}{n-1}\,\mathbf{V}^\top\]

SVD is PCA's undercover alias. Same results, fancier math. Very spy coded. ๐Ÿ•ต๏ธโ€โ™€๏ธ
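The SVD-equals-eigendecomposition claim is easy to verify numerically (a sketch on random mean-centred data; both routes give the same \(\lambda_i\)):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Xc = X - X.mean(axis=0)          # mean-centre first!
n = Xc.shape[0]

# route 1: eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (n - 1)
eig_route = np.linalg.eigvalsh(C)[::-1]        # biggest-first

# route 2: squared singular values of the centred data matrix
s = np.linalg.svd(Xc, compute_uv=False)        # already sorted biggest-first
svd_route = s ** 2 / (n - 1)

# same eigenvalues either way
```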

๐Ÿšจ EXAM TRAPS โ€” don't be a victim bestie

• Forgetting to standardise first (the big-number features bully the small ones)
• Mixing up eigenvectors (the DIRECTION) and eigenvalues (HOW MUCH variance)
• Calling PCA supervised (it never touches labels, it's unsupervised)
• Dividing by \(n\) instead of \(n-1\) in the covariance formula
• Forgetting the PCs are orthogonal (every PC sits at 90ยฐ to the others)

โšก chapter 8 โ€” all formulas (screenshot this bestie)
๐Ÿ“

๐Ÿ“‹ the formula wall

Z-score (standardise)
\[ z_i = \frac{x_i - \mu}{\sigma} \]
Covariance
\[ \text{Cov}(X,Y) = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{n-1} \]
Covariance matrix (compact)
\[ \mathbf{C} = \frac{1}{n-1}\,\mathbf{X}^\top\mathbf{X} \]
Characteristic equation
\[ \det(\mathbf{C} - \lambda\,\mathbf{I}) = 0 \]
Explained variance %
\[ \text{EV}_i = \frac{\lambda_i}{\displaystyle\sum_{j=1}^{p}\lambda_j} \times 100 \]
Projection
\[ \mathbf{Z} = \mathbf{X}_{\text{std}} \cdot \mathbf{W}_{p \times k} \]
SVD & its link to PCA
\[ \mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top \;\;\implies\;\; \mathbf{C} = \frac{\mathbf{X}^\top\mathbf{X}}{n-1} = \mathbf{V}\,\frac{\boldsymbol{\Sigma}^2}{n-1}\,\mathbf{V}^\top \]

๐ŸŒธ the 6 steps at a glance

1๏ธโƒฃ Standardise 2๏ธโƒฃ Covariance Matrix 3๏ธโƒฃ Eigenvectors 4๏ธโƒฃ Sort by ฮป 5๏ธโƒฃ Choose k 6๏ธโƒฃ Project!

that's literally it. that's PCA. you know PCA now. you're literally a data scientist ๐ŸŽ“โœจ


๐ŸŒธ โœจ ๐Ÿ’… โœจ ๐ŸŒธ

๐ŸŽ€ the sacred words of exam prophecy ๐ŸŽ€

Cylinder  Dumeer  Exam  Damal
๐ŸŽ€ ๐Ÿ’œ ๐Ÿ’› โœจ ๐Ÿ’™ ๐Ÿงก ๐ŸŽ€

you literally understood PCA. you are the principal component of your friend group. go SLAY that exam bestie ๐Ÿ†๐Ÿ’–โœจ