
M5 Machine Learning Cheat Sheet (DRAFT)

Cheat sheet for ML quiz

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Supervised learning

Uses labelled training data, where features are mapped to known labels/targets, to predict outcomes for new, unseen data (the test set)
Classification: predicts categorical outcomes
Logistic regression: a parametric classifier; passes a linear combination of inputs through the logistic (sigmoid) function; the decision boundary classifies everything on one side as 0 and the other side as 1; if the data is not linearly separable the error rate is non-zero (see the sketch after this list)
Regression: predicts continuous outcomes
Cross-validation: for tuning hyperparameters and choosing between models; prevents overfitting and data leakage by keeping the test data separate
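A minimal classification sketch, assuming X is a feature matrix and y holds binary labels (the names and test_size are illustrative):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LogisticRegression().fit(X_train, y_train)   # sigmoid over a linear combination of inputs
print(clf.score(X_test, y_test))                   # accuracy on unseen test data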

Scaling

Brings features into comparable ranges, giving faster and more stable model convergence, especially for distance-based algorithms
Normalisation: constrains values to a fixed range e.g. [0,1] or [-1,1]; MinMaxScaler() or Normalizer()
Standardisation: transforms the data to mean 0 and variance/sd 1 (z-scoring), making it unitless; StandardScaler()
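A minimal sketch of both scalers, assuming X_train and X_test are numeric feature arrays:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
X_norm = MinMaxScaler().fit_transform(X_train)   # normalisation: each feature squashed into [0, 1]
scaler = StandardScaler().fit(X_train)           # standardisation: mean 0, sd 1 (z-scores)
X_train_std = scaler.transform(X_train)          # fit on the training data only...
X_test_std = scaler.transform(X_test)            # ...then reuse the training statistics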

Evaluation metrics (linear regression)

R2: proportion of variance explained by model features; closer to 1 is better
MAE: average magnitude of the errors; easily interpretable (same units as the target) and relatively robust to outliers; smaller is better
MSE: average squared difference between predicted and actual values, sensitive to outliers; RMSE is its square root (back in the target's units); smaller is better
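Computing these with scikit-learn, assuming y_test and y_pred are arrays of actual and predicted values:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np
print(r2_score(y_test, y_pred))              # closer to 1 is better
print(mean_absolute_error(y_test, y_pred))   # same units as the target
mse = mean_squared_error(y_test, y_pred)
print(mse, np.sqrt(mse))                     # MSE and RMSE; smaller is better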

Evaluation metrics (classification)

False Positives are misdiagnoses, so precision (TP/(TP+FP)) measures how many predicted positives are actual TPs
False Negatives are missed diagnoses, so recall/sensitivity (TP/(TP+FN)) measures how many actual positives are identified as TPs
ROC-AUC: area under the curve of true positive rate vs false positive rate; closer to 1 is better
Specificity: TN/(TN+FP)
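A minimal sketch, assuming binary labels with hard predictions y_pred and positive-class probabilities y_scores:
from sklearn.metrics import precision_score, recall_score, roc_auc_score, confusion_matrix
print(precision_score(y_test, y_pred))    # TP / (TP + FP)
print(recall_score(y_test, y_pred))       # TP / (TP + FN)
print(roc_auc_score(y_test, y_scores))    # needs scores/probabilities, not hard labels
print(confusion_matrix(y_test, y_pred))   # rows = actual, columns = predicted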

Supervised Learning Pipeline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
X = data.drop(columns='target')
y = data['target']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)   # fit the scaler on the training data only
x_test = scaler.transform(x_test)         # reuse the training statistics (no data leakage)
model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)

K-fold cross validation

Splits the dataset into k equal-sized folds, using k-1 folds for training and the remaining fold for validation, repeating k times to get an average performance score; useful when data is limited because every data point is used for both training and validation; leave-one-out cross-validation (LOOCV) is the special case where k equals the number of samples
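A minimal sketch, assuming model, X and y as defined in the pipeline above:
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True))   # 5-fold CV
print(scores.mean())                                                        # average score across folds
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())                 # LOOCV: k = number of samples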
 

K-Nearest Neighbours (KNN)

Non-parametric classifier that looks at the K points in the training set nearest to the test input x, then takes the average (majority vote) of these neighbours; memory-based/instance-based learning; works well given a good distance metric (e.g. Euclidean) and sufficient training data; poor performance under high dimensionality; KNeighborsClassifier(n_neighbors=3).fit(X, y)
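A minimal sketch, assuming the scaled x_train/x_test arrays from the pipeline above:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')   # distance-based, so scale the features first
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)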

L1 vs L2 regularisation in regression

L1 (Lasso): penalises the sum of absolute coefficients, setting some coefficients to 0 (feature selection); may jeopardise accuracy on small datasets
L2 (Ridge): penalises the sum of squared coefficients, shrinking them and penalising higher weights
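A minimal sketch (the alpha values are illustrative regularisation strengths):
from sklearn.linear_model import Lasso, Ridge
lasso = Lasso(alpha=0.1).fit(x_train, y_train)    # L1: some coefficients driven exactly to 0
ridge = Ridge(alpha=1.0).fit(x_train, y_train)    # L2: coefficients shrunk but rarely exactly 0
print(lasso.coef_, ridge.coef_)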

Parametric vs Non-parametric models

Parametric models: fixed number of parameters, depending on e.g. the number of features (regression, Naive Bayes) or the number of centroids (k-means clustering); faster but make stronger assumptions
Non-parametric models: make few assumptions about the dataset; the number of parameters grows with the amount of training data e.g. KNN, decision trees, random forests, kernel SVMs; flexible but computationally expensive

Unsupervised Learning

K-means clustering: uses Euclidean distance (scale features!) and iteratively minimises inertia (within-cluster sum of squares)
k cluster centroids chosen at random → each datapoint assigned to the cluster with the nearest centroid → each centroid updated to the mean of all points assigned to that cluster, repeated until assignments stabilise
Elbow method: determines the optimal number of clusters by plotting inertia against k and picking the point where the decrease levels off (see the sketch below)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300)
y_kmeans = kmeans.fit_predict(feat_array)
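A sketch of the elbow method, assuming feat_array is the scaled feature matrix used above:
from sklearn.cluster import KMeans
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++').fit(feat_array)
    inertias.append(km.inertia_)   # within-cluster sum of squares
# plot inertias against k and pick the 'elbow' where the curve flattens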

Marginalisation

Sum of the joint probability values where X = x over all possible values of Y
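In symbols, for a discrete joint distribution: P(X = x) = Σ_y P(X = x, Y = y)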
 

Conditional probability
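The standard definition, assuming P(B) > 0: P(A | B) = P(A ∩ B) / P(B), the probability of A given that B has occurred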

Bayes Rule
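The standard form: P(A | B) = P(B | A) · P(A) / P(B); in classification terms, P(label | features) ∝ P(features | label) · P(label)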

Base-rate fallacy: ignoring prior (base-rate) probabilities overstates how likely a positive result is to be a true positive; additionally use precision or confusion matrices

Naive Bayes

Assumes features are independent; requires only a small amount of training data to estimate parameters; the aim is to predict P(label | features); fast, but a poor probability estimator

Gaussian Naive Bayes classifier

from sklearn.naive_bayes import GaussianNB
GNB = GaussianNB(var_smoothing=0.5)
GNB.fit(x_train, y_train)
y_pred = GNB.predict(x_test)
y_pred_probs = GNB.predict_proba(x_test)
Compute calibration curves and the Brier score (lower is better), vary the classification decision threshold (typically 0.5) and assess AUC, and use GridSearch to vary the var_smoothing parameter (see the sketch below)
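A possible tuning sketch; the var_smoothing grid values are illustrative and a binary positive class is assumed:
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import brier_score_loss
grid = GridSearchCV(GaussianNB(), {'var_smoothing': [1e-9, 1e-3, 0.5]}, cv=5)
grid.fit(x_train, y_train)
probs = grid.predict_proba(x_test)[:, 1]   # probability of the positive class
print(brier_score_loss(y_test, probs))     # lower is better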

Support Vector Machines (SVMs)

Supervised classifier that attempts to separate the classes of data with a hyperplane, assuming the 2 categories are (approximately) linearly separable in the chosen feature space
Optimal hyperplane: maximises the margin to the nearest training points, minimising the effect of noise and the hinge loss, which helps prevent overfitting
Types of kernels: linear, poly, rbf, sigmoid; more complex kernels have higher capacity and a greater risk of overfitting
from sklearn.svm import SVC
svc = SVC(kernel='linear')
svc.fit(X_train, y_train)
Compute accuracy and decision boundaries; use GridSearch to tune the kernel and C hyperparameters (large C = small margin); see the sketch below
Pros: works in high-dimensional spaces; memory-efficient since it only uses the support vectors; versatile with different kernels
Limitations: do not provide direct probability estimates; poor performance if the number of features is much greater than the number of samples
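A possible tuning sketch; the grid values are illustrative:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))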

Decision Trees (DTs)

Non-parametric supervised learning for both classification and regression
Selects features iteratively based on a criterion: lowest entropy/highest information gain, or Gini impurity, i.e. how impure (mixed) the classes are within a node
node = feature, branch = choice, leaves = outcome
from sklearn.tree import DecisionTreeClassifier; dt = DecisionTreeClassifier(); dt.fit(X_train, y_train)
Compute accuracy and decision boundaries; use GridSearch to tune the criterion and tree-depth parameters (see the tree-plotting sketch below)
Limitations: prone to overfitting; poor generalisability; high variance (slight changes in the dataset can drastically change the splits, complicating interpretation); unstable, since errors at the top affect lower splits due to the hierarchical nature; biased if the dataset is unbalanced
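A minimal sketch to visualise the fitted tree (max_depth=3 is illustrative):
from sklearn import tree
import matplotlib.pyplot as plt
dt = tree.DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
tree.plot_tree(dt, filled=True)   # nodes show the feature test, leaves show the predicted class
plt.show()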

Random Forest

Ensemble model that consists of multiple trees/base estimators; overcomes the limitations of DTs
Averaging: build several independent estimators and average their predictions, reducing the variance/overfitting of the combined estimator
Pasting: random subsets of the dataset are drawn as random subsets of the samples (without replacement)
Bagging/bootstrapping: samples are drawn with replacement
Random Subspaces: random subsets of the dataset are drawn as random subsets of the features
Random Patches: base estimators are built on subsets of both samples and features
Boosting: base estimators are built sequentially, with each new/combined estimator trying to reduce bias and underfitting
XGBoost parallelises the construction of each tree; Gradient Boosting fits each new tree sequentially (iteratively) to the residuals of the previous ones
from sklearn.ensemble import RandomForestClassifier; rf = RandomForestClassifier(); rf.fit(X_train, y_train); use a pipeline for heterogeneous ensembles
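A sketch contrasting averaging/bagging with boosting (the n_estimators values are illustrative):
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier
rf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)       # averaging over bootstrapped trees
bag = BaggingClassifier(n_estimators=50).fit(X_train, y_train)            # bagging wrapper around decision trees
gb = GradientBoostingClassifier(n_estimators=100).fit(X_train, y_train)   # boosting: sequential fits to residuals
print(rf.score(X_test, y_test), bag.score(X_test, y_test), gb.score(X_test, y_test))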