
Principal Component Analysis Cheat Sheet (DRAFT) by

A machine learning algorithm

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Motivation

Handle High Multicollinearity
Existing Solutions:
1. Variable Selection (stepwise/forward/backward)
Cons: Each time a variable is dropped, some information is lost.
Visualization of many features
Existing Solutions:
1. Pairwise scatter plots: pC2 = p*(p-1)/2 plots, where p is the number of variables
Cons: if p=20, this would mean 190 plots!
There must be a better way of doing this. The goal is to find an algorithm that reduces the number of variables without losing information, i.e. PCA.

Use Cases

1. Dimensionality Reduction without losing information.
2. Easy Data Visualization and Exploratory Data Analysis.
3. Create uncorrelated features/variables that can be an input to a prediction model (see the sketch after this list).
4. Uncovering latent variables/themes/concepts.
5. Noise reduction in dataset.
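
A minimal sketch of use case 3, assuming scikit-learn's Pipeline, LogisticRegression and a synthetic dataset (none of these appear elsewhere in this sheet):

# PCA decorrelates the features before they reach a prediction model.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # toy data

model = Pipeline([("pca", PCA(n_components=5)),           # uncorrelated inputs
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(X, y)
print(model.score(X, y))   # training accuracy of the PCA + classifier pipeline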

Prerequisite Knowledge

Building Blocks:
1. The basis of a space:
A set of linearly independent vectors/directions that span the entire space, i.e. any point in the space can be represented as a combination of these vectors.
Ex: Each row of a dataset is a point in the space. Each column is a basis vector (any point is represented in terms of the columns).
2. Basis transformation:
The process of converting your information from one set of basis vectors to another, i.e. representing your data in new columns different from the original ones. Often done for convenience, efficiency or just common sense.
Ex: Dropping or adding a column to the dataset.
3. Variance as information:
Variance = Information
If two variables are highly correlated, together they don't add much more information than either does individually, so one of them can be dropped.
In 2D geometry, the X and Y axes are the dimensions. i = (1,0) is the unit vector in the X direction and j = (0,1) is the unit vector in the Y direction. For a point a, (ax, ay) are the units to move in the i and j directions to reach a, so a can be written as ax*i + ay*j. Any point in 2D space can be represented in terms of i and j, so i and j are a 'basis of the space'. They are independent, i.e. i can't be expressed in terms of j and vice versa. A numerical sketch follows below.
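
A small numerical sketch of the basis idea, using NumPy (the point and the rotated basis are illustrative choices, not from the sheet):

# Express a point in the standard basis, then in a rotated basis.
import numpy as np

i = np.array([1.0, 0.0])          # unit vector in the X direction
j = np.array([0.0, 1.0])          # unit vector in the Y direction
a = 3 * i + 2 * j                 # point a = (3, 2) written as ax*i + ay*j
print(a)                          # [3. 2.]

# Basis transformation: coordinates of the same point in a rotated basis.
theta = np.pi / 4
B = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # columns are the rotated basis vectors
print(np.linalg.solve(B, a))                       # coordinates of a in the new basis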
 

What does it do?

PCA is one of a family of techniques for taking high-dimensional data, and using the dependencies between the variables to represent it in a more tractable, lower-dimensional basis, without losing too much information.

Given p features/variables in a dataset, PCA finds the principal components such that
1. each principal component is a linear combination of the original features, and
2. the principal components capture the maximum variance in the dataset.

Mathematical Representation

The First Principal Component can be written as Z₁ = φ₁₁X₁ + φ₂₁X₂ + ... + φₚ₁Xₚ, where X₁, ..., Xₚ are the original features. PCA finds the φ values (loadings) such that the variance of Z₁ is maximum, subject to the loadings being normalised (φ₁₁² + φ₂₁² + ... + φₚ₁² = 1). The Second Principal Component is the linear combination with maximal variance among all those that are uncorrelated with Z₁. In this way, each additional component captures incremental variance. The algorithm calculates 'k' principal components, where k <= p (p is the number of variables in the dataset).
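
A short sketch of this equation, assuming scikit-learn and a toy 2-feature dataset (neither is from the sheet): the first component's loadings have unit norm, and Z₁ is the loading-weighted sum of the mean-centred features.

# Compute Z1 by hand from the loadings and compare with sklearn's transform.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

pca = PCA(n_components=2).fit(X)
phi = pca.components_[0]                  # loadings (phi) of the first component
print(np.sum(phi ** 2))                   # 1.0 -- the loadings are normalised

Z1_by_hand = (X - X.mean(axis=0)) @ phi   # Z1 = phi11*X1 + phi21*X2 on centred data
Z1_sklearn = pca.transform(X)[:, 0]
print(np.allclose(Z1_by_hand, Z1_sklearn))   # True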

Workings of PCA

1. Find Principal Components
a) Using SVD (Singular Value Decomposition)
2. Choose optimal number of principal components (k).
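
One common way to do step 2, sketched with scikit-learn (the 90% threshold and the random data are illustrative assumptions, not from the sheet): keep the smallest k whose cumulative explained variance ratio crosses the threshold.

# Choose k from the cumulative explained variance ratio.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                       # stand-in dataset

pca = PCA().fit(X)                                   # fit all p components first
cumvar = np.cumsum(pca.explained_variance_ratio_)    # cumulative variance captured
k = int(np.searchsorted(cumvar, 0.90)) + 1           # smallest k covering 90% of variance
print(k, cumvar)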

Singular Value Decomposition (SVD)

'Decomposition' because it breaks the original (mean-centred) data matrix X into 3 new matrices: X = U Σ Vᵀ. The columns of V give the principal component directions, and the singular values in Σ determine how much variance each component captures.
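
A sketch of the SVD route, assuming NumPy and reusing the toy matrix from the Code section below; the comparison with sklearn is illustrative, not part of the sheet.

# PCA via SVD on the mean-centred data matrix: Xc = U @ diag(S) @ Vt
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
Xc = X - X.mean(axis=0)                              # centre the data first

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
print(Vt)                                            # rows = principal component directions
print(S ** 2 / (len(X) - 1))                         # variance captured by each component

pca = PCA(n_components=2).fit(X)
print(pca.components_)                               # matches Vt up to sign
print(pca.explained_variance_)                       # matches S**2 / (n - 1)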

Code

from sklearn.decomposition import PCA
import numpy as np

# Toy dataset: 6 samples, 2 features
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

pca = PCA(n_components=2)      # keep k=2 principal components
pca.fit(X)

print(pca.components_)                 # loadings (phi) of each principal component
print(pca.explained_variance_ratio_)   # share of variance captured by each component