
Hands-On Machine Learning Cheat Sheet (DRAFT)

Based on Hands-On Machine Learning with Scikit-Learn & TensorFlow

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Tips

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer another function. For example, if there are many outlier districts, you may consider using the Mean Absolute Error (MAE).
Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: it is the notion of distance you are familiar with. It is also called the ℓ2 norm, noted ∥ · ∥2 (or just ∥ · ∥).
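A minimal sketch of both measures with scikit-learn (y_true and y_pred are hypothetical arrays of true targets and predictions):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # ℓ2-based, sensitive to outliers
mae = mean_absolute_error(y_true, y_pred)           # ℓ1-based, more robust to outliers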

Handling Text and Categorical Attributes

Converts classes into numbers
from sklearn.preprocessing import LabelEncoder
 
encoder = LabelEncoder()
 
housing_cat_encoded = encoder.fit_transform(housing_cat)  # housing_cat: the column of categories
Turns a categorical attribute into a sparse matrix where each column is a class and each row is an observation
from sklearn.preprocessing import OneHotEncoder
 
encoder = OneHotEncoder()
 
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
One issue with the integer representation is that ML algorithms will assume that two nearby values are more similar than two distant values; one-hot encoding avoids this.
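Note that with scikit-learn 0.20+, OneHotEncoder accepts string categories directly, so the LabelEncoder step can be skipped; a minimal sketch (ocean_proximity is the book's categorical column):

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing[["ocean_proximity"]])
housing_cat_1hot.toarray()  # convert the sparse matrix to a dense NumPy array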
 

Visualizing data

Scatter plot
import matplotlib.pyplot as plt

data.plot(kind="scatter", x="longitude", y="latitude",
          alpha=0.1,                 # transparency reveals high-density areas
          s=column,                  # column whose values set the point sizes
          c="column",                # column whose values set the point colors
          cmap=plt.get_cmap("jet"),  # color scheme
          colorbar=True,             # show a color bar
          label="pop")               # label of the points
places a legend on the axis
plt.legend()
Plot with histograms and scatter plots
from pandas.plotting import scatter_matrix  # pandas.tools.plotting in older pandas versions
 
scatter_matrix(housing[attributes], figsize=(12, 8))  # attributes: a list of column names
Some attributes have a tail-heavy distribution, so you may want to transform them (e.g., by computing their logarithm).
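A minimal sketch of such a transform, assuming a tail-heavy column such as population:

import numpy as np

housing["population_log"] = np.log(housing["population"])  # compress the long tail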

Feature Scaling

Takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers.
from sklearn.pipeline import Pipeline
Machine Learning algorithms don't perform well when the input numerical attributes have very different scales; StandardScaler standardizes them.
StandardScaler()
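A minimal sketch of a numerical pipeline combining these pieces (step names are arbitrary; housing_num is assumed to be the numerical part of the data):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values
    ("std_scaler", StandardScaler()),               # rescale attributes
])
housing_num_tr = num_pipeline.fit_transform(housing_num)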

Training and Evaluating on the Training Set
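
A minimal sketch of this step, assuming prepared features housing_prepared and labels housing_labels (names follow the book's running example):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)    # train on the training set
predictions = lin_reg.predict(housing_prepared)  # predict on the same data
lin_rmse = np.sqrt(mean_squared_error(housing_labels, predictions))  # training RMSE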

 
 

Correlations

correl­ation matrix
data.c­orr()
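To see how much each attribute correlates with a target column (median_house_value is the book's target), a minimal sketch:

corr_matrix = data.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)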

Data cleaning

Drops rows with NA values
housing.dropna(subset=["total_bedrooms"])
Returns the data set without a given column or row (in this case, a column)
housing.drop("total_bedrooms", axis=1)
fills NA values with a given value (e.g., the median)
housing["total_bedrooms"].fillna(value)
Imputer
from sklearn.impute import SimpleImputer
Replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value
imputer = SimpleImputer(strategy="median")
Computes the median of each attribute and stores the result in its statistics_ instance variable
imputer.fit(housing_num)
Returns the values the imputer computed
imputer.statistics_
Replaces missing values with the corresponding computed values (returns a NumPy array)
X = imputer.transform(housing_num)
Transforms it back into a DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns)
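
Putting the imputer steps together in one pass (fit_transform combines fit and transform):

import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
X = imputer.fit_transform(housing_num)  # compute medians and fill NAs
housing_tr = pd.DataFrame(X, columns=housing_num.columns)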