I GOTCHU BBG
Okay so. Imagine your data is like a chaotic group chat with 50 people all talking at once. It's a LOT. PCA is that one bestie who goes "okay girlies, let's SUMMARISE." It takes your high-dimensional data and squishes it into fewer dimensions while keeping the most important info intact.
Official definition: PCA is an unsupervised dimensionality reduction technique that finds new axes (Principal Components) along which the data has maximum variance, then projects the data onto those axes.
| assumption | what it means in plain english | vibe |
|---|---|---|
| Linearity | PCA only finds straight-line relationships between features | 📏 |
| Large variance = info | The more a feature varies, the more important it is. Boring features get yeeted. | 📣 |
| Standardised data | All features must be on the same scale or the big numbers bully the small ones | ⚖️ |
| Orthogonality | Each PC is perpendicular (90°) to every other PC. They never overlap. | 📐 |
| Continuous data | Works best with numbers, not categories like "cat" or "dog" | 🔢 |
| fancy word | what it means | remember it as |
|---|---|---|
| Eigenvector | The DIRECTION of a Principal Component: a unit vector \(\mathbf{v}\) where \(\|\mathbf{v}\|=1\) | The arrow pointing to where the drama is 🏹 |
| Eigenvalue | HOW MUCH variance (\(\lambda\)) is in that direction. Bigger = more important. | The drama level on a scale of 1–100 💣 |
| Principal Component | A new axis = linear combination of original features, ranked by importance | The main character energy axis ⭐ |
| Loading | Correlation between an original feature and a PC (the entries of the eigenvector) | Which friend is actually holding up the group 💪 |
| Score | Where a data point lands after projection: \(z = \mathbf{x} \cdot \mathbf{v}\) | Your data point's new address 📍 |
| Covariance | How two features move together: positive, negative, or not at all | Are they besties or enemies? 💕💔 |
| Scree Plot | Graph of \(\lambda_i\) in descending order. Find the "elbow" = how many PCs to keep. | The Netflix scroll of eigenvalues 📺 |
| Explained Variance | Percentage of total info captured: \(\lambda_i / \sum_j \lambda_j\) | How much gossip one person knows 🗣️ |
Every feature must be on the same scale. Without this, salary = 50000 bulldozes age = 25, even if age matters equally. We compute the z-score for every value in every feature column:
\[z = \frac{x - \mu}{\sigma}\]
where \(\displaystyle\mu = \frac{1}{n}\sum_{i=1}^{n} x_i\) is the column mean, and \(\displaystyle\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n-1}}\) is the (sample) standard deviation.
🧮 Example: Feature values \(= \{2,\ 4,\ 6,\ 8,\ 10\}\)
\[\mu = \frac{2+4+6+8+10}{5} = 6.0\]
\[\sigma = \sqrt{\frac{(2-6)^2+(4-6)^2+(6-6)^2+(8-6)^2+(10-6)^2}{4}} = \sqrt{\frac{16+4+0+4+16}{4}} = \sqrt{10} \approx 3.16\]
For \(x=2\): \(\;z = \dfrac{2-6}{\sqrt{10}} \approx -1.26\). For \(x=10\): \(\;z = \dfrac{10-6}{\sqrt{10}} \approx +1.26\).
✅ After this, every feature has mean \(= 0\) and standard deviation \(= 1\).
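Receipts for the example above, as a tiny NumPy sketch (NumPy is my choice here, not mandated by anything; any library with mean/std works):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

mu = x.mean()            # column mean: 6.0
sigma = x.std(ddof=1)    # sample std with n-1 in the denominator: sqrt(10) ≈ 3.162
z = (x - mu) / sigma     # z-scores

print(z.round(2))        # [-1.26 -0.63  0.    0.63  1.26]
```

Note `ddof=1`: NumPy's default `std()` divides by \(n\), but the formula above uses \(n-1\).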
Now we measure how each pair of features varies together. The result is a \(p \times p\) symmetric matrix (where \(p\) = number of features). For the full data matrix \(\mathbf{X}\) (already mean-centred), the covariance matrix is:
\[\mathbf{C} = \frac{\mathbf{X}^\top \mathbf{X}}{n-1}\]
🧮 Example: 3 data points, 2 features \(x_1,\, x_2\) (centred using the means below):
\[\mathbf{X} = \begin{pmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 4 \end{pmatrix}, \qquad \bar{x}_1 = 2,\quad \bar{x}_2 = 3\]
\[\text{Cov}(x_1, x_2) = \frac{(1-2)(2-3)+(2-2)(3-3)+(3-2)(4-3)}{3-1} = \frac{(1)+(0)+(1)}{2} = 1.0\]
Similarly \(\text{Cov}(x_1,x_1)=1\) and \(\text{Cov}(x_2,x_2)=1\), so:
\[\mathbf{C} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}\]
Diagonal entries = variance of each feature. Off-diagonal = covariance between pairs.
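You can sanity-check this hand calculation with `np.cov` (note `rowvar=False`, because our rows are samples and our columns are features; `np.cov` uses the \(n-1\) denominator by default):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0]])

# rowvar=False → columns are treated as features
C = np.cov(X, rowvar=False)
print(C)
# [[1. 1.]
#  [1. 1.]]
```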
We decompose the covariance matrix. Each eigenvector gives a direction (Principal Component); its eigenvalue tells us how much variance lives in that direction.
\(\mathbf{v}\) = eigenvector (direction), \(\lambda\) = eigenvalue (amount of variance). They satisfy \(\mathbf{C}\mathbf{v} = \lambda\mathbf{v}\). To find \(\lambda\), solve the characteristic equation:
\[\det(\mathbf{C} - \lambda\mathbf{I}) = 0\]
🧮 Example with \(\mathbf{C} = \begin{pmatrix}2&1\\1&2\end{pmatrix}\):
Find eigenvalues:
\[\det\!\left(\mathbf{C} - \lambda\mathbf{I}\right) = \det\begin{pmatrix}2-\lambda & 1\\1 & 2-\lambda\end{pmatrix} = (2-\lambda)^2 - 1 = 0\]
\[\lambda^2 - 4\lambda + 3 = 0 \implies (\lambda - 3)(\lambda - 1) = 0\]
\[\boxed{\lambda_1 = 3, \qquad \lambda_2 = 1}\]
Find eigenvectors: substitute each \(\lambda\) into \((\mathbf{C}-\lambda\mathbf{I})\mathbf{v}=\mathbf{0}\):
For \(\lambda_1 = 3\): \[\mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\end{pmatrix} \approx \begin{pmatrix}0.707\\0.707\end{pmatrix}\]
For \(\lambda_2 = 1\): \[\mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\-1\end{pmatrix} \approx \begin{pmatrix}0.707\\-0.707\end{pmatrix}\]
Note: \(\mathbf{v}_1 \cdot \mathbf{v}_2 = 0\), so they are orthogonal ✅
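Nobody solves the characteristic polynomial by hand in practice; for a symmetric matrix like \(\mathbf{C}\), `np.linalg.eigh` does it for you (shown here on the same example matrix):

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is for symmetric matrices; it returns eigenvalues in ASCENDING order
eigvals, eigvecs = np.linalg.eigh(C)

# sort descending so PC1 (biggest eigenvalue) comes first
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

print(eigvals)        # [3. 1.]
print(eigvecs[:, 0])  # ±[0.707 0.707] — the overall sign is arbitrary
```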
Rank your Principal Components. Largest eigenvalue = PC1 (most important). Calculate how much variance each one explains:
\[\text{Explained variance ratio of PC}_i = \frac{\lambda_i}{\sum_j \lambda_j}\]
🧮 From our example, \(\lambda_1 = 3,\; \lambda_2 = 1\):
\[\text{PC1:} \quad \frac{3}{3+1} = \frac{3}{4} = 75\%\]
\[\text{PC2:} \quad \frac{1}{3+1} = \frac{1}{4} = 25\%\]
Cumulative: PC1 + PC2 = 100%. If you're applying the 95% rule, PC1 alone (75%) isn't quite enough here, so keep both.
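The explained-variance bookkeeping is two lines of NumPy:

```python
import numpy as np

eigvals = np.array([3.0, 1.0])          # eigenvalues, PC1 first

ratios = eigvals / eigvals.sum()        # explained variance per PC
print(ratios)                           # [0.75 0.25]
print(np.cumsum(ratios))                # [0.75 1.  ] — cumulative
```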
Three ways to decide how many PCs to keep (pick your weapon):
1. Scree plot: keep the PCs before the "elbow" where the eigenvalues flatten out.
2. Cumulative explained variance: keep the smallest \(k\) whose PCs together capture your threshold (commonly 95%).
3. Kaiser criterion: on standardised data, keep PCs with \(\lambda_i > 1\) (they explain more than a single original feature would).
Stack your chosen \(k\) eigenvectors as columns into a projection matrix \(\mathbf{W}\) (shape \(p \times k\)), then multiply to get the new reduced dataset:
\[\mathbf{Z} = \mathbf{X}_{\text{std}}\,\mathbf{W} \qquad (m \times p)(p \times k) \;\to\; m \times k\]
\(m\) = number of samples, \(p\) = original features, \(k\) = chosen components (\(k \ll p\))
🧮 Example projection: projecting 2 data points onto PC1 only (\(k=1\)):
\[\mathbf{X}_{\text{std}} = \begin{pmatrix}1.2 & 0.8 \\ -0.5 & 0.3\end{pmatrix}, \qquad \mathbf{w}_1 = \begin{pmatrix}0.707 \\ 0.707\end{pmatrix}\]
\[\mathbf{Z} = \mathbf{X}_{\text{std}} \cdot \mathbf{w}_1 = \begin{pmatrix}(1.2)(0.707)+(0.8)(0.707)\\(-0.5)(0.707)+(0.3)(0.707)\end{pmatrix} = \begin{pmatrix}1.414\\-0.141\end{pmatrix}\]
Your 2-feature data is now 1 number per point → dimension reduced from 2 to 1! ✅
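The same projection in NumPy, with the PC1 direction hard-coded from the example above:

```python
import numpy as np

X_std = np.array([[ 1.2, 0.8],
                  [-0.5, 0.3]])
w1 = np.array([0.70710678, 0.70710678])  # PC1 direction (1/sqrt(2), 1/sqrt(2))

Z = X_std @ w1                           # one score per data point
print(Z.round(3))                        # [ 1.414 -0.141]
```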
Raw data: 4 samples, 2 features (height & weight, made up for the example):
| Sample | \(x_1\) (height) | \(x_2\) (weight) |
|---|---|---|
| 1 | 2.5 | 2.4 |
| 2 | 0.5 | 0.7 |
| 3 | 2.2 | 2.9 |
| 4 | 1.9 | 2.2 |
Step 1: Compute means:
\[\bar{x}_1 = \frac{2.5+0.5+2.2+1.9}{4} = \frac{7.1}{4} = 1.775\]
\[\bar{x}_2 = \frac{2.4+0.7+2.9+2.2}{4} = \frac{8.2}{4} = 2.05\]
Step 2: Covariance matrix entries:
\[\text{Cov}(x_1,x_1) = \frac{(2.5-1.775)^2+(0.5-1.775)^2+(2.2-1.775)^2+(1.9-1.775)^2}{3}\] \[= \frac{(0.725)^2+(-1.275)^2+(0.425)^2+(0.125)^2}{3} = \frac{0.526+1.626+0.181+0.016}{3} = \frac{2.349}{3} \approx 0.783\]
\[\text{Cov}(x_1,x_2) \approx 0.785 \qquad \text{Cov}(x_2,x_2) \approx 0.897\]
\[\mathbf{C} = \begin{pmatrix}0.783 & 0.785 \\ 0.785 & 0.897\end{pmatrix}\]
Step 3: Eigenvalues via characteristic equation:
\[\text{tr}(\mathbf{C}) = 0.783 + 0.897 = 1.680 \qquad \det(\mathbf{C}) = (0.783)(0.897)-(0.785)^2 = 0.702-0.616 = 0.086\]
\[\lambda = \frac{\text{tr}(\mathbf{C}) \pm \sqrt{\,\text{tr}(\mathbf{C})^2 - 4\det(\mathbf{C})\,}}{2} = \frac{1.680 \pm \sqrt{2.822-0.344}}{2} = \frac{1.680 \pm \sqrt{2.478}}{2}\]
\[\boxed{\lambda_1 \approx 1.627 \qquad \lambda_2 \approx 0.053}\]
Step 4: Explained variance:
\[\text{PC1:}\quad \frac{1.627}{1.627+0.053} = \frac{1.627}{1.680} \approx 96.8\%\quad \checkmark\;\text{Keep only PC1!}\]
\[\text{PC2:}\quad \frac{0.053}{1.680} \approx 3.2\%\quad \text{(basically vibes, drop it)}\]
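The whole walkthrough, end to end, as a minimal NumPy sketch (centring only, no standardisation, to match the hand calculation on this data):

```python
import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2]])

Xc = X - X.mean(axis=0)              # Step 1: mean-centre each feature
C = Xc.T @ Xc / (len(X) - 1)         # Step 2: covariance matrix

eigvals, eigvecs = np.linalg.eigh(C) # Step 3: eigendecomposition
order = np.argsort(eigvals)[::-1]    # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

ratios = eigvals / eigvals.sum()     # Step 4: explained variance
Z = Xc @ eigvecs[:, :1]              # Step 5: project onto PC1 only

print(eigvals.round(3))              # eigenvalues, PC1 first
print(ratios.round(3))               # explained-variance ratios
```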
Instead of computing the covariance matrix, sklearn uses Singular Value Decomposition (SVD) on the data matrix directly, which is more numerically stable!
| SVD component | shape | maps to PCA as... |
|---|---|---|
| \(\mathbf{V}\) (right singular vectors) | \(p \times p\) | Eigenvectors of \(\mathbf{C}\) |
| Singular values \(\sigma_i\) | \(\min(m,p)\) values | Eigenvalues via \(\lambda_i = \sigma_i^2/(n-1)\) |
| \(\mathbf{U}\boldsymbol{\Sigma}\) | \(m \times p\) | PCA scores (projected data) |
The algebraic link between SVD and eigendecomposition:
\[\mathbf{C} = \frac{\mathbf{X}^\top\mathbf{X}}{n-1} = \frac{(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top)^\top(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top)}{n-1} = \mathbf{V}\,\frac{\boldsymbol{\Sigma}^2}{n-1}\,\mathbf{V}^\top\]
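This identity is easy to verify numerically on random data; the sketch below compares the two routes with NumPy only (sklearn's `PCA` takes the SVD route internally):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Xc = X - X.mean(axis=0)                    # SVD-based PCA still needs centring
n = len(Xc)

# Route 1: eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (n - 1)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]   # descending

# Route 2: SVD of the centred data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# singular values map to eigenvalues via sigma^2 / (n-1)
print(np.allclose(s ** 2 / (n - 1), eigvals))    # True
```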
that's literally it. that's PCA. you know PCA now. you're literally a data scientist 💅✨
🔮 the sacred words of exam prophecy 🔮
Cylinder Dumeer Exam Damal
you literally understood PCA. you are the principal component of your friend group. go SLAY that exam bestie 💃🎉✨