
Machine Learning in R and Python Cheat Sheet (DRAFT) by

Data Processing and Machine Learning in R and Python

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Introduction

This cheat sheet provides a side-by-side comparison of basic data processing techniques and machine learning models in R and Python.


Load dataset in R

library(datasets)
Load the built-in datasets package
data(iris)
Load the iris dataset
head(iris)
Look up the first 6 rows of the dataset
summary(iris)
Get summary statistics for each column
names(iris)
Get the column names

Data prepro­cessing in R

scaling = preProcess(data, method = c('center', 'scale'))
Create a scaling transform from the data (caret)
data_scaled = predict(scaling, data)
Apply the scaling to the data
train_partition = createDataPartition(y, p = 0.8, list = FALSE)
Balanced split based on the outcome (80/20 split)
data_train = data[train_partition, ]
Select the training rows
data_test = data[-train_partition, ]
Select the test rows

Supervised learning models in R

model = lm(y ~ x, data = data)
Simple linear regression
model = lm(y ~ x1 + x2 + x3, data = data)
Multiple linear regression
summary(model)
Print summary statistics of the fitted linear model
predictions = predict(object, newdata)
Make predictions from a fitted model object
model = glm(y ~ x1 + x2 + x3, data = data, family = 'binomial')
Logistic regression
model = svm(y ~ x1 + x2 + x3, data = data, params)
Support vector machine (SVM, package e1071)
model = rpart(y ~ x1 + x2 + x3, data = data, params)
Decision tree
model = randomForest(y ~ x1 + x2 + x3, data = data, params)
Random forest
data_xgb = xgb.DMatrix(data, label = label)
Convert the data into DMatrix format
model = xgb.train(params, data_xgb, nrounds)
Gradient boosting model (xgboost)
predictions = knn(train, test, cl, k)
k-NN with labels cl and k neighbors (package class)

Unsupervised learning models in R

model = kmeans(x, centers)
K-means clustering with the given number of centers
model = prcomp(x, params)
Principal component analysis (PCA)

Model perfor­mance in R

RMSE(pred, actual)
Root mean square error
R2(pred, actual, form = 'traditional')
Proportion of the variance explained by the model
mean(actual == pred)
Accuracy (proportion of correct predictions)
confusionMatrix(pred, actual)
Confusion matrix (caret: predictions first, reference second)
auc(actual, pred)
Area under the ROC curve
f1Score(actual, pred)
F1 score: harmonic mean of precision and recall

Data visual­ization in R

geom_point(x, y, color, size, fill, alpha)
Scatter plot
geom_line(x, y, color, size, fill, alpha, linetype)
Line plot
geom_bar(x, y, color, size, fill, alpha)
Bar chart
geom_boxplot(x, y, color)
Box plot
geom_tile(x, y, color, fill)
Heatmap
 

Import file in Python

import pandas as pd
Import the package
df = pd.read_csv('file.csv')
Read a CSV file into a DataFrame
df.head(n)
Look up the first n rows of the dataset
df.describe()
Get summary statistics for each column
df.columns
Get the column names
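A minimal sketch of the loading-and-inspecting steps above. Since `pd.read_csv` needs a file on disk, this example builds the DataFrame from scikit-learn's bundled iris dataset instead; the `head`/`describe`/`columns` calls work the same on any DataFrame.

```python
# Load iris into a pandas DataFrame via scikit-learn (no CSV file required)
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame                       # 150 rows: 4 feature columns + 'target'

print(df.head(3))                     # first 3 rows
print(df.describe().loc['mean'])      # per-column means from the summary table
print(list(df.columns))               # column names
```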

Data Processing in Python

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Split the dataset into training (80%) and test (20%) sets
scaler = StandardScaler()
Standardize features by removing the mean and scaling to unit variance
X_train = scaler.fit_transform(X_train)
Fit the scaler on X_train and transform it
X_test = scaler.transform(X_test)
Transform X_test with the fitted scaler
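The split-then-scale workflow above can be sketched end to end. The data here is synthetic, made up purely for illustration; the key point is that the scaler is fit on the training set only and its statistics are reused on the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: 100 samples, 3 features, binary labels (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit on training data only
X_test = scaler.transform(X_test)         # reuse the training-set mean/variance
```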

Supervised learning models in Python

model = LinearRegression()
Linear regression
model.fit(X_train, y_train)
Fit the linear model
model.predict(X_test)
Predict using the linear model
LogisticRegression().fit(X_train, y_train)
Logistic regression
LinearSVC().fit(X_train, y_train)
Train a linear SVM (solves the primal problem)
SVC().fit(X_train, y_train)
Train a kernel SVM (solves the dual problem)
DecisionTreeClassifier().fit(X_train, y_train)
Decision tree classifier
RandomForestClassifier().fit(X_train, y_train)
Random forest classifier
GradientBoostingClassifier().fit(X_train, y_train)
Gradient boosting for classification
XGBClassifier().fit(X_train, y_train)
XGBoost classifier
KNeighborsClassifier().fit(X_train, y_train)
k-NN classifier
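All of the estimators above share the same fit/predict pattern; a minimal end-to-end sketch with logistic regression on iris (any other classifier in the list slots in the same way):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)   # raise max_iter so the solver converges
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = (y_pred == y_test).mean()             # fraction of correct test predictions
```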

Unsupervised learning models in Python

KMeans().fit(X)
K-means clustering
PCA().fit(X)
Principal component analysis (PCA)
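A minimal sketch of both unsupervised models on iris: K-means assigns a cluster label to each row, and PCA projects the four features onto two components.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_                      # cluster assignment per row

pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)                  # project onto the first 2 components
```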

Model perfor­mance in Python

metrics.mean_squared_error(y_true, y_pred, squared=False)
Root mean squared error
metrics.r2_score(y_true, y_pred)
Proportion of the variance explained by the model
metrics.confusion_matrix(y_true, y_pred)
Confusion matrix
metrics.accuracy_score(y_true, y_pred)
Accuracy classification score
metrics.roc_auc_score(y_true, y_score)
Compute ROC-AUC from prediction scores
metrics.f1_score(y_true, y_pred, average='macro')
Harmonic mean of precision and recall
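A minimal sketch of the classification metrics above on toy labels (the labels and scores are made up for illustration):

```python
from sklearn import metrics

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.9, 0.8, 0.4, 0.2]  # predicted probabilities for class 1

acc = metrics.accuracy_score(y_true, y_pred)     # 4 of 6 predictions correct
cm = metrics.confusion_matrix(y_true, y_pred)    # rows: true class, cols: predicted class
auc = metrics.roc_auc_score(y_true, y_score)     # needs scores, not hard labels
f1 = metrics.f1_score(y_true, y_pred, average='macro')
```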

Data visual­ization in Python

sns.scatterplot(x, y, hue, size)
Scatter plot
sns.lineplot(x, y, hue, size)
Line plot
sns.barplot(x, y, hue)
Bar chart
sns.boxplot(x, y, hue)
Box plot
sns.heatmap(data, linecolor, linewidths)
Heatmap
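A minimal sketch of one of the plot types above. The DataFrame here is made up for illustration, and the non-interactive Agg backend is used so the figure can be rendered without a display.

```python
import matplotlib
matplotlib.use('Agg')          # non-interactive backend: no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data, made up for illustration
df = pd.DataFrame({'x': [1, 2, 3, 4],
                   'y': [2.0, 4.1, 5.9, 8.2],
                   'group': ['a', 'a', 'b', 'b']})

ax = sns.scatterplot(data=df, x='x', y='y', hue='group')
ax.set_title('Scatter plot')
plt.savefig('scatter.png')     # write the figure to a file
```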