Motivation
Handle High Multicollinearity
Existing Solutions:
1. Variable Selection (stepwise/forward/backward)
Cons: Every time a variable is dropped, some information is lost
Visualization of Many Features
Existing Solutions:
1. Pairwise scatter plots: pC2 = p*(p-1)/2 plots, where p is the number of variables
Cons: if p=20, this would mean 190 plots!
There must be a better way of doing this. The goal is to find an algorithm that reduces the number of variables without losing much information, i.e. PCA.
Use Cases
1. Dimensionality reduction without losing much information.
2. Easy Data Visualization and Exploratory Data Analysis
3. Create uncorrelated features/variables that can be an input to a prediction model
4. Uncovering latent variables/themes/concepts
5. Noise reduction in the dataset
Prerequisite Knowledge
Building Blocks:
1. The basis of a space:
Set of linearly independent vectors/directions that span the entire space, i.e. any point in the space can be represented as a linear combination of these vectors.
Ex: Each row of a dataset is a point in the space, and each column is a basis direction (any point is represented in terms of the columns).
2. Basis transformation:
The process of converting your information from one basis to another, or representing your data in new columns different from the original ones. This is often done for convenience, efficiency or just common sense.
Ex: Dropping or adding a column to the dataset.
3. Variance as information:
Variance = Information
If two variables are highly correlated, together they don't add much more information than each does individually, so one of them can be dropped.
In 2D geometry, the X and Y axes are the dimensions. i = (1, 0) is the unit vector in the X direction and j = (0, 1) is the unit vector in the Y direction. For a point a, (ax, ay) are the units to move in the 'i' and 'j' directions to reach 'a', written as ax i + ay j. Any point in 2D space can be represented in terms of 'i' and 'j', so 'i' and 'j' form the basis of the space. They are linearly independent, i.e. 'i' cannot be expressed in terms of 'j' and vice versa.
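As a quick sketch of building blocks 1 and 2, the numpy snippet below represents a point in the standard (i, j) basis and then re-expresses it in a rotated basis; the point a = (3, 2) and the 45-degree rotation are made-up illustrative choices, not anything prescribed by PCA itself.

import numpy as np

# Standard basis of 2D space
i = np.array([1.0, 0.0])
j = np.array([0.0, 1.0])

# The point a = (3, 2): move 3 units along i and 2 units along j
a = 3 * i + 2 * j              # array([3., 2.])

# Basis transformation: represent the same point in a rotated basis
# (a 45-degree rotation, chosen only for illustration)
u = np.array([1.0, 1.0]) / np.sqrt(2)
v = np.array([-1.0, 1.0]) / np.sqrt(2)
B = np.column_stack([u, v])    # columns are the new basis vectors

a_new = np.linalg.solve(B, a)  # coordinates of the same point in the (u, v) basis
print(a_new)                   # approx [3.54, -0.71]

The point itself never moves; only the columns (directions) used to describe it change. PCA's basis transformation works the same way, except that the new directions are chosen to capture maximum variance.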
What does it do?
PCA is one of a family of techniques for taking high-dimensional data, and using the dependencies between the variables to represent it in a more tractable, lower-dimensional basis, without losing too much information.
Given p features/variables in a dataset, PCA finds principal components that
1. are linear combinations of the original features, and
2. capture the maximum variance in the dataset.
Mathematical Representation
Z₁ = φ₁₁X₁ + φ₂₁X₂ + ... + φₚ₁Xₚ, with the loadings normalized so that φ₁₁² + φ₂₁² + ... + φₚ₁² = 1
The above equation represents the First Principal Component. PCA finds the φ values (loadings) such that the variance of Z₁ is maximum. The Second Principal Component is the linear combination that has maximal variance among all linear combinations uncorrelated with Z₁. In the same way, each additional component captures incremental variance. The algorithm calculates 'k' principal components, where k <= p (p is the number of variables in the dataset).
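As a hedged sketch of this optimization: for centred data, the loading vector φ₁ that maximizes Var(Z₁) is the top eigenvector of the covariance matrix. This eigendecomposition route is mathematically equivalent to the SVD approach described in the next section; the toy data, seed and variable names below are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
# Toy data: two correlated variables (made up for illustration)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=200)
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)        # centre the data first

# The loading vector phi_1 that maximizes Var(Z1) is the top eigenvector
# of the covariance matrix (its entries satisfy sum(phi^2) = 1)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
phi1 = eigvecs[:, -1]          # eigenvector with the largest eigenvalue
Z1 = Xc @ phi1                 # scores on the first principal component

print(phi1, Z1.var(ddof=1), eigvals[-1])  # Var(Z1) equals the top eigenvalue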
Workings of PCA
1. Find Principal Components
a) Using SVD (Singular Value Decomposition)
2. Choose the optimal number of principal components (k); a sketch for this step follows the code example below.
Singular Value Decomposition (SVD)
'Decomposition' because it breaks the original (centred) data matrix X into 3 new matrices: X = UΣVᵀ. The columns of V (rows of Vᵀ) are the principal component directions, and the singular values in Σ determine how much variance each component captures.
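Below is a minimal sketch of PCA via SVD, assuming the data matrix has been centred; it uses numpy's linalg.svd and compares the result against sklearn's PCA, and the random toy data is made up for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))             # toy data: 100 observations, 3 features
Xc = X - X.mean(axis=0)                   # SVD-based PCA works on centred data

# Decompose the centred data matrix into three matrices: Xc = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt                           # rows of Vt are the principal directions
explained_variance = S**2 / (len(X) - 1)  # variance captured by each component

# sklearn's PCA follows the same SVD route (up to the sign of each component)
pca = PCA(n_components=3).fit(X)
print(np.allclose(np.abs(components), np.abs(pca.components_)))
print(np.allclose(explained_variance, pca.explained_variance_))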
Code
from sklearn.decomposition import PCA
import numpy as np

# Each row of X is an observation, each column a feature
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

pca = PCA(n_components=2)  # keep both principal components
pca.fit(X)                 # centres X and finds the component directions via SVD
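Building on this, step 2 of the workings (choosing k) is commonly handled by looking at the cumulative explained-variance ratio and keeping the smallest k that crosses a chosen threshold. A hedged sketch follows, where the digits dataset and the 95% threshold are only illustrative choices.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                 # 64 features; used only as an example dataset

pca = PCA().fit(X)                     # fit with all components first
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Pick the smallest k whose components explain at least 95% of the variance
k = int(np.searchsorted(cumulative, 0.95) + 1)
print(k, cumulative[k - 1])

Equivalently, sklearn's PCA accepts a fractional n_components (e.g. PCA(n_components=0.95)), which selects the smallest k that explains at least that share of the variance.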