
Machine Learning with Python Cookbook Cheat Sheet (DRAFT) by

Based on Machine Learning with Python Cookbook

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Introduction

Creating a row vector
np.array([1, 2, 3])
Creating a column vector
np.array([[1], [2], [3]])
Creating a matrix
np.array([[1, 2], [1, 2], [1, 2]])
Creating a sparse matrix
from scipy import sparse

sparse.csr_matrix(matrix)  # stores only the indices of the non-zero elements
Select all elements of a vector
vector[:]
Select all rows and the second column
matrix[:, 1:2]
View number of rows and columns
matrix.shape
View number of elements
matrix.size
View number of dimensions
matrix.ndim
Applying operations to elements
add_100 = lambda i: i + 100

vectorized_add_100 = np.vectorize(add_100)

vectorized_add_100(matrix)
Maximum value in an array
np.max(matrix)
Minimum value in an array
np.min(matrix)
Return mean
np.mean(matrix)
Return variance
np.var(matrix)
Return standard deviation
np.std(matrix)
Reshaping arrays
matrix.reshape(2, 6)
Transposing a vector or matrix
matrix.T
Transform a matrix into a one-dimensional array
matrix.flatten()
Return matrix rank (the maximal number of linearly independent columns of the matrix)
np.linalg.matrix_rank(matrix)
Calculating the determinant
np.linalg.det(matrix)
Getting the diagonal of a matrix
matrix.diagonal(offset=1)  # offset shifts the diagonal up or down; it can be negative
Return trace (sum of the diagonal elements)
matrix.trace()
Finding eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(matrix)
Calculating dot products (sum of the products of the elements of two vectors)
np.dot(vector_a, vector_b)
Add two matrices
np.add(matrix_a, matrix_b)
Subtract two matrices
np.subtract(matrix_a, matrix_b)

Alternatively, we can simply use the + and - operators
Multiplying matrices
np.dot(matrix_a, matrix_b)

Alternatively, in Python 3.5+ we can use the @ operator
Multiply two matrices element-wise
matrix_a * matrix_b
Inverting a matrix
np.linalg.inv(matrix)
Set seed for random value generation
np.random.seed(0)
Generate three random floats between 0.0 and 1.0
np.random.random(3)
Generate three random integers between 0 and 10
np.random.randint(0, 11, 3)
Draw three numbers from a normal distribution with mean 0.0 and standard deviation 1.0
np.random.normal(0.0, 1.0, 3)
Draw three numbers from a logistic distribution with mean 0.0 and scale 1.0
np.random.logistic(0.0, 1.0, 3)
Draw three numbers greater than or equal to 1.0 and less than 2.0
np.random.uniform(1.0, 2.0, 3)
We select elements from matrices and vectors much as we do in R, using [row, column] indexing and slices.
# Find the maximum element in each column
np.max(matrix, axis=0) -> array([7, 8, 9])
One useful argument in reshape is -1, which effectively means "as many as needed," so reshape(1, -1) means one row and as many columns as needed.
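A minimal sketch of reshape with -1; the matrix values here are illustrative:
import numpy as np
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
matrix.reshape(2, 6)    # 2 rows, 6 columns
matrix.reshape(1, -1)   # one row, as many columns as needed
matrix.reshape(-1)      # flatten into a one-dimensional array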

Clustering

Clustering Using K-Means
Load libraries
from sklearn.cluster import KMeans
Create k-means object
cluster = KMeans(n_clusters=3, random_state=0, n_jobs=-1)
Train model
model = cluster.fit(features_std)
Predict observation's cluster
model.predict(new_observation)
View predicted classes
model.labels_
Speeding Up K-Means Clustering
Load libraries
from sklearn.cluster import MiniBatchKMeans
Create k-means object
cluster = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)
Train model
model = cluster.fit(features_std)
Clustering Using Meanshift
Group observations without assuming the number of clusters or their shape
Load libraries
from sklearn.cluster import MeanShift
Create meanshift object
cluster = MeanShift(n_jobs=-1)
Train model
model = cluster.fit(features_std)
Note on meanshift
With cluster_all=False, orphan observations are given the label -1
Clustering Using DBSCAN
Group observations into clusters of high density
Load libraries
from sklearn.cluster import DBSCAN
Create DBSCAN object
cluster = DBSCAN(n_jobs=-1)
Train model
model = cluster.fit(features_std)
DBSCAN has three main parameters to set:
eps
The maximum distance from an observation for another observation to be considered its neighbor.
min_samples
The minimum number of observations less than eps distance from an observation for it to be considered a core observation.
metric
The distance metric used by eps, for example minkowski or euclidean
Clustering Using Hierarchical Merging
Load libraries
from sklearn.cluster import AgglomerativeClustering
Create agglomerative clustering object
cluster = AgglomerativeClustering(n_clusters=3)
Train model
model = cluster.fit(features_std)
AgglomerativeClustering uses the linkage parameter to determine the merging strategy to minimize the following:
Variance of merged clusters (ward)

Average distance between observations from pairs of clusters (average)

Maximum distance between observations from pairs of clusters (complete)
MiniBatchKMeans works similarly to KMeans, with one significant difference: the batch_size parameter. batch_size controls the number of randomly selected observations in each batch.
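A minimal end-to-end sketch of the k-means recipe above, assuming the iris data as features (n_jobs has been removed from KMeans in recent scikit-learn versions, so it is omitted here):
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Standardize the features before clustering
features = load_iris().data
features_std = StandardScaler().fit_transform(features)

# Fit k-means with three clusters
cluster = KMeans(n_clusters=3, random_state=0)
model = cluster.fit(features_std)

# Cluster assignments for the training data and for a new observation
print(model.labels_[:10])
print(model.predict([[0.8, 0.8, 0.8, 0.8]]))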

Handling Categorical Data

Encoding Nominal Categorical Features
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer
Create one-hot encoder
one_hot = LabelBinarizer()
One-hot encode feature
one_hot.fit_transform(feature)
View feature classes
one_hot.classes_
Reverse the one-hot encoding
one_hot.inverse_transform(one_hot.transform(feature))
Create dummy variables from feature
pd.get_dummies(feature[:,0])
Create multiclass one-hot encoder
one_hot_multiclass = MultiLabelBinarizer()
One-hot encode multiclass feature
one_hot_multiclass.fit_transform(multiclass_feature)
See the classes with the classes_ attribute
one_hot_multiclass.classes_
Encoding Ordinal Categorical Features
dataframe["Score"].replace(mapper_dict)  # a dict with the categories as keys and numbers as values
Encoding Dictionaries of Features
from sklearn.feature_extraction import DictVectorizer
Create dictionary
data_dict = [{"Red": 2, "Blue": 4}, {"Red": 4, "Blue": 3}, {"Red": 1, "Yellow": 2}, {"Red": 2, "Yellow": 2}]
Create dictionary vectorizer
dictvectorizer = DictVectorizer(sparse=False)
Convert dictionary to feature matrix
features = dictvectorizer.fit_transform(data_dict)
Get feature names
feature_names = dictvectorizer.get_feature_names()
Imputing Missing Class Values
from sklearn.neighbors import KNeighborsClassifier
# Train KNN learner
clf = KNeighborsClassifier(3, weights='distance')

trained_model = clf.fit(X[:,1:], X[:,0])
Predict missing values' class
imputed_values = trained_model.predict(X_with_nan[:,1:])
Join column of predicted classes with their other features
X_with_imputed = np.hstack((imputed_values.reshape(-1,1), X_with_nan[:,1:]))
Join two feature matrices
np.vstack((X_with_imputed, X))
Use an imputer to fill in the most frequent value
imputer = Imputer(strategy='most_frequent', axis=0)
Handling Imbalanced Classes
RandomForestClassifier(class_weight="balanced")
Downsample the majority class
i_class0 = np.where(target == 0)[0]

i_class1 = np.where(target == 1)[0]
Number of observations in each class
n_class0 = len(i_class0)

n_class1 = len(i_class1)
For every observation of class 0, randomly sample from class 1 without replacement
i_class1_downsampled = np.random.choice(i_class1, size=n_class0, replace=False)
Join together class 0's target vector with the downsampled class 1's target vector
np.hstack((target[i_class0], target[i_class1_downsampled]))
Join together class 0's feature matrix with the downsampled class 1's feature matrix
np.vstack((features[i_class0,:], features[i_class1_downsampled,:]))[0:5]
Upsample the minority class
i_class0_upsampled = np.random.choice(i_class0, size=n_class1, replace=True)
Join together class 0's upsampled target vector with class 1's target vector
np.concatenate((target[i_class0_upsampled], target[i_class1]))
Join together class 0's upsampled feature matrix with class 1's feature matrix
np.vstack((features[i_class0_upsampled,:], features[i_class1,:]))[0:5]
A second strategy is to use a model evaluation metric better suited to imbalanced classes. Accuracy is often used for evaluating the performance of a model, but when imbalanced classes are present it can be ill suited. Better metrics, discussed in later chapters, are confusion matrices, precision, recall, F1 scores, and ROC curves.
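A minimal sketch of the downsampling recipe above, using a synthetic imbalanced dataset (the dataset construction here is illustrative):
import numpy as np
from sklearn.datasets import make_classification

# Imbalanced two-class data: roughly 10% class 0, 90% class 1
features, target = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=1)

i_class0 = np.where(target == 0)[0]
i_class1 = np.where(target == 1)[0]

# Downsample class 1 to the size of class 0, without replacement
i_class1_downsampled = np.random.choice(i_class1, size=len(i_class0), replace=False)

target_balanced = np.hstack((target[i_class0], target[i_class1_downsampled]))
features_balanced = np.vstack((features[i_class0, :], features[i_class1_downsampled, :]))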

Dimensionality Reduction Using Feature Extraction

Reducing Features Using Principal Components
from sklearn.decomposition import PCA

from sklearn.preprocessing import StandardScaler
Standardize the feature matrix
features = StandardScaler().fit_transform(digits.data)
Create a PCA that will retain 99% of variance
pca = PCA(n_components=0.99, whiten=True)
Conduct PCA
features_pca = pca.fit_transform(features)
Reducing Features When Data Is Linearly Inseparable
Use an extension of principal component analysis that uses kernels to allow for non-linear dimensionality reduction

from sklearn.decomposition import PCA, KernelPCA
Apply kernel PCA with a radial basis function (RBF) kernel
kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1)

features_kpca = kpca.fit_transform(features)
Reducing Features by Maximizing Class Separability
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
Create and run an LDA, then use it to transform the features
lda = LinearDiscriminantAnalysis(n_components=1)

features_lda = lda.fit(features, target).transform(features)
Amount of variance explained by each component
lda.explained_variance_ratio_
Non-negative matrix factorization (NMF) to reduce the dimensionality of the feature matrix
from sklearn.decomposition import NMF
Create, fit, and apply NMF
nmf = NMF(n_components=10, random_state=1)

features_nmf = nmf.fit_transform(features)
Reducing Features on Sparse Data (Truncated Singular Value Decomposition (TSVD))
from sklearn.decomposition import TruncatedSVD

from scipy.sparse import csr_matrix
Standardize feature matrix
features = StandardScaler().fit_transform(digits.data)
# Make sparse matrix
features_sparse = csr_matrix(features)
Create a TSVD
tsvd = TruncatedSVD(n_components=10)
Conduct TSVD on sparse matrix
features_sparse_tsvd = tsvd.fit(features_sparse).transform(features_sparse)
Sum of first three components' explained variance ratios
tsvd.explained_variance_ratio_[0:3].sum()
One major requirement of NMF is that, as the name implies, the feature matrix cannot contain negative values.
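A minimal sketch of the PCA recipe above on the digits data:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = load_digits()
features = StandardScaler().fit_transform(digits.data)

# Keep enough components to explain 99% of the variance
pca = PCA(n_components=0.99, whiten=True)
features_pca = pca.fit_transform(features)

print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_pca.shape[1])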

Trees and Forests

Training a Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
Create decision tree classifier object
decisiontree = DecisionTreeClassifier(random_state=0)
Train model
model = decisiontree.fit(features, target)
Predict observation's class
model.predict(observation)
Training a Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
Create decision tree regressor object
decisiontree = DecisionTreeRegressor(random_state=0)
Train model
model = decisiontree.fit(features, target)
Create decision tree regressor object using mean absolute error
decisiontree_mae = DecisionTreeRegressor(criterion="mae", random_state=0)
Visualizing a Decision Tree Model
from IPython.display import Image

import pydotplus

from sklearn import tree
Create DOT data
dot_data = tree.export_graphviz(decisiontree, out_file=None, feature_names=iris.feature_names, class_names=iris.target_names)
Draw graph
graph = pydotplus.graph_from_dot_data(dot_data)
Show graph
Image(graph.create_png())
Create PDF
graph.write_pdf("iris.pdf")
Create PNG
graph.write_png("iris.png")
Training a Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
Create random forest classifier object
randomforest = RandomForestClassifier(random_state=0, n_jobs=-1)
Create random forest classifier object using entropy
randomforest_entropy = RandomForestClassifier(criterion="entropy", random_state=0)
Training a Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
Create random forest regressor object
randomforest = RandomForestRegressor(random_state=0, n_jobs=-1)
Identifying Important Features in Random Forests
from sklearn.ensemble import RandomForestClassifier
Create random forest classifier object
randomforest = RandomForestClassifier(random_state=0, n_jobs=-1)
Calculate feature importances
importances = model.feature_importances_
Sort feature importances in descending order
indices = np.argsort(importances)[::-1]
Rearrange feature names so they match the sorted feature importances
names = [iris.feature_names[i] for i in indices]
Create plot
plt.figure()
Create plot title
plt.title("Feature Importance")
Add bars
plt.bar(range(features.shape[1]), importances[indices])
Add feature names as x-axis labels
plt.xticks(range(features.shape[1]), names, rotation=90)
Show plot
plt.show()
Selecting Important Features in Random Forests
from sklearn.feature_selection import SelectFromModel
Create random forest classifier
randomforest = RandomForestClassifier(random_state=0, n_jobs=-1)
Create object that selects features with importance greater than or equal to a threshold
selector = SelectFromModel(randomforest, threshold=0.3)
Create new feature matrix using selector
features_important = selector.fit_transform(features, target)
Train random forest using most important features
model = randomforest.fit(features_important, target)
Handling Imbalanced Classes
Train a decision tree or random forest model with class_weight="balanced"
Create random forest classifier object
randomforest = RandomForestClassifier(random_state=0, n_jobs=-1, class_weight="balanced")
Controlling Tree Size
Create decision tree classifier object
decisiontree = DecisionTreeClassifier(random_state=0, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0, max_leaf_nodes=None, min_impurity_decrease=0)
Improving Performance Through Boosting
from sklearn.ensemble import AdaBoostClassifier
Create adaboost tree classifier object
adaboost = AdaBoostClassifier(random_state=0)
Evaluating Random Forests with Out-of-Bag Errors
You need to evaluate a random forest model without using cross-validation
Create random forest classifier object
randomforest = RandomForestClassifier(random_state=0, n_estimators=1000, oob_score=True, n_jobs=-1)
OOB score of a random forest
randomforest.oob_score_
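A minimal sketch of the out-of-bag evaluation recipe above, assuming the iris data:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

features, target = load_iris(return_X_y=True)

# oob_score=True makes each tree score the observations it did not see during bootstrapping
randomforest = RandomForestClassifier(n_estimators=1000, oob_score=True, random_state=0, n_jobs=-1)
model = randomforest.fit(features, target)

print(model.oob_score_)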

Linear Regression

Fitting a Line
Load libraries
from sklearn.linear_model import LinearRegression
Create linear regression
regression = LinearRegression()
Fit the linear regression
model = regression.fit(features, target)
Handling Interactive Effects
You have a feature whose effect on the target variable depends on another feature.
Load libraries
from sklearn.preprocessing import PolynomialFeatures
Create interaction term
interaction = PolynomialFeatures(degree=3, include_bias=False, interaction_only=True)

features_interaction = interaction.fit_transform(features)
Create linear regression
regression = LinearRegression()
Fit the linear regression
model = regression.fit(features_interaction, target)
Fitting a Nonlinear Relationship
Create a polynomial regression by including polynomial features in a linear regression model
Load library
from sklearn.preprocessing import PolynomialFeatures
Create polynomial features x^2 and x^3
polynomial = PolynomialFeatures(degree=3, include_bias=False)

features_polynomial = polynomial.fit_transform(features)
Create linear regression
regression = LinearRegression()
Fit the linear regression
model = regression.fit(features_polynomial, target)
Reducing Variance with Regularization
Use a learning algorithm that includes a shrinkage penalty (also called regularization), such as ridge regression or lasso regression:
Load libraries
from sklearn.linear_model import Ridge
Create ridge regression with an alpha value
regression = Ridge(alpha=0.5)
Fit the linear regression
model = regression.fit(features_standardized, target)
Load library
from sklearn.linear_model import RidgeCV
Create ridge regression with three alpha values
regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])
Fit the linear regression
model_cv = regr_cv.fit(features_standardized, target)
View coefficients
model_cv.coef_
View alpha
model_cv.alpha_
Reducing Features with Lasso Regression
You want to simplify your linear regression model by reducing the number of features.
Load library
from sklearn.linear_model import Lasso
Create lasso regression with an alpha value
regression = Lasso(alpha=0.5)
Fit the linear regression
model = regression.fit(features_standardized, target)
Create lasso regression with a high alpha
regression_a10 = Lasso(alpha=10)

model_a10 = regression_a10.fit(features_standardized, target)
interaction_only=True tells PolynomialFeatures to return only interaction terms
PolynomialFeatures will add a feature containing ones, called a bias. We can prevent that with include_bias=False
Polynomial regression is an extension of linear regression that lets us model nonlinear relationships.
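A minimal sketch of polynomial regression, assuming a single-feature toy dataset:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy data with a quadratic relationship between feature and target
features = np.arange(10).reshape(-1, 1)
target = features.ravel() ** 2

# Add x^2 and x^3 terms, then fit an ordinary linear regression on them
polynomial = PolynomialFeatures(degree=3, include_bias=False)
features_polynomial = polynomial.fit_transform(features)

model = LinearRegression().fit(features_polynomial, target)
print(model.predict(polynomial.transform([[11]])))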
 

Loading Data

Loading a Sample Dataset
from sklearn import datasets

digits = datasets.load_digits()

features = digits.data

target = digits.target
Creating a Simulated Dataset for regression
from sklearn.datasets import make_regression

features, target, coefficients = make_regression(n_samples=100, n_features=3, n_informative=3, n_targets=1, noise=0.0, coef=True, random_state=1)
Creating a Simulated Dataset for classification
from sklearn.datasets import make_classification

features, target = make_classification(n_samples=100, n_features=3, n_informative=3, n_redundant=0, n_classes=2, weights=[.25, .75], random_state=1)
Creating a Simulated Dataset for clustering
from sklearn.datasets import make_blobs

features, target = make_blobs(n_samples=100, n_features=2, centers=3, cluster_std=0.5, shuffle=True, random_state=1)
Loading a CSV File
dataframe = pd.read_csv(data, sep=',')
Loading an Excel File
pd.read_excel(url, sheetname=0, header=1)

If we need to load multiple sheets, include them as a list.
Loading a JSON File
pd.read_json(url, orient='columns')

The key difference is the orient parameter, which indicates to pandas how the JSON file is structured. However, it might take some experimenting to figure out which argument (split, records, index, columns, or values) is the right one.
Convert semistructured JSON data into a pandas DataFrame
json_normalize
Querying a SQL Database
from sqlalchemy import create_engine

database_connection = create_engine('sqlite:///sample.db')

pd.read_sql_query('SELECT * FROM data', database_connection)
In addition, make_classification contains a weights parameter that allows us to simulate datasets with imbalanced classes, for example weights=[.25, .75].
For make_blobs, the centers parameter determines the number of clusters generated.

Naive Bayes

Training a Classifier for Continuous Features
Use a Gaussian naive Bayes classifier
Load libraries
from sklearn.naive_bayes import GaussianNB
Create Gaussian naive Bayes object
classifer = GaussianNB()
Train model
model = classifer.fit(features, target)
Create Gaussian naive Bayes object with prior probabilities of each class
clf = GaussianNB(priors=[0.25, 0.25, 0.5])
Training a Classifier for Discrete and Count Features
Given discrete or count data
Load libraries
from sklearn.naive_bayes import MultinomialNB

from sklearn.feature_extraction.text import CountVectorizer
Create bag of words
count = CountVectorizer()

bag_of_words = count.fit_transform(text_data)
Create feature matrix
features = bag_of_words.toarray()
Create multinomial naive Bayes object with prior probabilities of each class
classifer = MultinomialNB(class_prior=[0.25, 0.5])
Training a Naive Bayes Classifier for Binary Features
Load libraries
from sklearn.naive_bayes import BernoulliNB
Create Bernoulli naive Bayes object with prior probabilities of each class
classifer = BernoulliNB(class_prior=[0.25, 0.5])
Calibrating Predicted Probabilities
You want to calibrate the predicted probabilities from naive Bayes classifiers so they are interpretable.
Load libraries
from sklearn.calibration import CalibratedClassifierCV
Create calibrated cross-validation with sigmoid calibration
classifer_sigmoid = CalibratedClassifierCV(classifer, cv=2, method='sigmoid')
Calibrate probabilities
classifer_sigmoid.fit(features, target)
View calibrated probabilities
classifer_sigmoid.predict_proba(new_observation)
If class_prior is not specified, prior probabilities are learned from the data. However, if we want a uniform distribution to be used as the prior, we can set fit_prior=False.
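A minimal end-to-end sketch of the Gaussian naive Bayes recipe above, assuming the iris data:
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

features, target = load_iris(return_X_y=True)

# Train a Gaussian naive Bayes classifier
classifer = GaussianNB()
model = classifer.fit(features, target)

# Predict the class of a new observation
new_observation = [[4, 4, 4, 0.4]]
print(model.predict(new_observation))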

Logistic Regression

Training a Binary Classifier
from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import StandardScaler
Create logistic regression object
logistic_regression = LogisticRegression(random_state=0)
View predicted probabilities
model.predict_proba(new_observation)
Training a Multiclass Classifier
Create one-vs-rest logistic regression object
logistic_regression = LogisticRegression(random_state=0, multi_class="ovr")
Reducing Variance Through Regularization
Tune the regularization strength hyperparameter, C
Create logistic regression object with cross-validated C
logistic_regression = LogisticRegressionCV(penalty='l2', Cs=10, random_state=0, n_jobs=-1)
Training a Classifier on Very Large Data
Create logistic regression object
logistic_regression = LogisticRegression(random_state=0, solver="sag")
Handling Imbalanced Classes
Create target vector: 0 if class 0, otherwise 1
target = np.where((target == 0), 0, 1)
Create logistic regression object
logistic_regression = LogisticRegression(random_state=0, class_weight="balanced")
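A minimal sketch of the binary logistic regression recipe above, using standardized iris features restricted to two classes:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Keep only the first two iris classes to make the problem binary
iris = load_iris()
features = iris.data[:100, :]
target = iris.target[:100]

features_standardized = StandardScaler().fit_transform(features)

logistic_regression = LogisticRegression(random_state=0)
model = logistic_regression.fit(features_standardized, target)

new_observation = [[.5, .5, .5, .5]]
print(model.predict(new_observation))
print(model.predict_proba(new_observation))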

K-Nearest Neighbors

Finding an Observation's Nearest Neighbors
from sklearn.neighbors import NearestNeighbors
Create standardizer
standardizer = StandardScaler()
Standardize features
features_standardized = standardizer.fit_transform(features)
Two nearest neighbors
nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)
Create an observation
new_observation = [1, 1, 1, 1]
Find distances and indices of the observation's nearest neighbors
distances, indices = nearest_neighbors.kneighbors([new_observation])
View the nearest neighbors
features_standardized[indices]
Find two nearest neighbors based on euclidean distance
nearestneighbors_euclidean = NearestNeighbors(n_neighbors=2, metric='euclidean').fit(features_standardized)
Create a matrix indicating each observation's nearest neighbors
Find each observation's three nearest neighbors based on euclidean distance (including itself)
nearestneighbors_euclidean = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(features_standardized)
List of lists indicating each observation's 3 nearest neighbors
nearest_neighbors_with_self = nearestneighbors_euclidean.kneighbors_graph(features_standardized).toarray()
Remove 1's marking an observation as a nearest neighbor to itself
for i, x in enumerate(nearest_neighbors_with_self):

    x[i] = 0
View first observation's two nearest neighbors
nearest_neighbors_with_self[0]
Creating a K-Nearest Neighbor Classifier
Train a KNN classifier with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)
Identifying the Best Neighborhood Size
Load libraries
from sklearn.pipeline import Pipeline, FeatureUnion

from sklearn.model_selection import GridSearchCV
Create a pipeline
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])
Create space of candidate values
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]
Create grid search
classifier = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(features_standardized, target)
Best neighborhood size (k)
classifier.best_estimator_.get_params()["knn__n_neighbors"]
Creating a Radius-Based Nearest Neighbor Classifier
from sklearn.neighbors import RadiusNeighborsClassifier
Train a radius neighbors classifier
rnn = RadiusNeighborsClassifier(radius=.5, n_jobs=-1).fit(features_standardized, target)
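A minimal sketch of finding an observation's nearest neighbors, assuming the iris data:
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

features = load_iris().data
features_standardized = StandardScaler().fit_transform(features)

# Fit a nearest-neighbors model and query the two closest points to a new observation
nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)
new_observation = [1, 1, 1, 1]
distances, indices = nearest_neighbors.kneighbors([new_observation])

print(features_standardized[indices])
print(distances)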

Model Selection

Selecting Best Models Using Exhaustive Search
from sklearn.model_selection import GridSearchCV
Create range of candidate penalty hyperparameter values
penalty = ['l1', 'l2']
Create range of candidate regularization hyperparameter values
C = np.logspace(0, 4, 10)

numpy.logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None, axis=0)
Create dictionary of hyperparameter candidates
hyperparameters = dict(C=C, penalty=penalty)
Create grid search
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=0)
Fit grid search
best_model = gridsearch.fit(features, target)
Predict target vector
best_model.predict(features)
Selecting Best Models Using Randomized Search
Load libraries
from sklearn.model_selection import RandomizedSearchCV
Create range of candidate regularization penalty hyperparameter values
penalty = ['l1', 'l2']
Create distribution of candidate regularization hyperparameter values
from scipy.stats import uniform

C = uniform(loc=0, scale=4)
Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty)
Create randomized search
randomizedsearch = RandomizedSearchCV(logistic, hyperparameters, random_state=1, n_iter=100, cv=5, verbose=0, n_jobs=-1)
Fit randomized search
best_model = randomizedsearch.fit(features, target)
Predict target vector
best_model.predict(features)
Selecting Best Models from Multiple Learning Algorithms
Load libraries
from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline
Create a pipeline
pipe = Pipeline([("classifier", RandomForestClassifier())])
Create dictionary with candidate learning algorithms and their hyperparameters
search_space = [{"classifier": [LogisticRegression()], "classifier__penalty": ['l1', 'l2'], "classifier__C": np.logspace(0, 4, 10)}, {"classifier": [RandomForestClassifier()], "classifier__n_estimators": [10, 100, 1000], "classifier__max_features": [1, 2, 3]}]
Create grid search
gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=0)
Fit grid search
best_model = gridsearch.fit(features, target)
View best model
best_model.best_estimator_.get_params()["classifier"]
Predict target vector
best_model.predict(features)
Selecting Best Models When Preprocessing
Load libraries
from sklearn.pipeline import Pipeline, FeatureUnion
Create a preprocessing object that includes StandardScaler features and PCA
preprocess = FeatureUnion([("std", StandardScaler()), ("pca", PCA())])
Create a pipeline
pipe = Pipeline([("preprocess", preprocess), ("classifier", LogisticRegression())])
Create space of candidate values
search_space = [{"preprocess__pca__n_components": [1, 2, 3], "classifier__penalty": ["l1", "l2"], "classifier__C": np.logspace(0, 4, 10)}]
Create grid search
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1)
Fit grid search
best_model = clf.fit(features, target)
Speeding Up Model Selection with Parallelization
Use all the cores in your machine by setting n_jobs=-1

gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=-1, verbose=1)
Speeding Up Model Selection Using Algorithm-Specific Methods
If you are using a select number of learning algorithms, use scikit-learn's model-specific cross-validation hyperparameter tuning.
Create cross-validated logistic regression
logit = linear_model.LogisticRegressionCV(Cs=100)
Train model
logit.fit(features, target)
Evaluating Performance After Model Selection
Load libraries
from sklearn.model_selection import GridSearchCV, cross_val_score
Conduct nested cross-validation and output the average score
cross_val_score(gridsearch, features, target).mean()
In scikit-learn, many learning algorithms (e.g., ridge, lasso, and elastic net regression) have an algorithm-specific cross-validation method to take advantage of this.
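A minimal sketch of the exhaustive grid search recipe above, assuming the iris data (newer scikit-learn versions need a solver such as liblinear for an l1 penalty, so it is set explicitly here):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

features, target = load_iris(return_X_y=True)

logistic = LogisticRegression(max_iter=500, solver='liblinear')
hyperparameters = dict(C=np.logspace(0, 4, 10), penalty=['l1', 'l2'])

gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=0)
best_model = gridsearch.fit(features, target)

print(best_model.best_estimator_.get_params()['C'])
print(best_model.predict(features)[:5])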

Handling Dates and Times

Create strings
date_strings = np.array(['03-04-2005 11:35 PM', '23-05-2010 12:01 AM', '04-09-2009 09:09 PM'])
Convert to datetimes
[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p', errors="coerce") for date in date_strings]
Handling Time Zones
Create datetime
pd.Timestamp('2017-05-01 06:00:00', tz='Europe/London')
We can add a time zone to a previously created datetime
date_in_london = date.tz_localize('Europe/London')
Convert to a different time zone
date_in_london.tz_convert('Africa/Abidjan')
Apply tz_localize and tz_convert to every element
dates.dt.tz_localize('Africa/Abidjan')
Importing all_timezones
from pytz import all_timezones
Create a range of datetimes
dataframe['date'] = pd.date_range('1/1/2001', periods=100000, freq='H')
Select observations between two datetimes
dataframe[(dataframe['date'] > '2002-1-1 01:00:00') & (dataframe['date'] <= '2002-1-1 04:00:00')]
Breaking Up Date Data into Multiple Features
dataframe['year'] = dataframe['date'].dt.year

dataframe['month'] = dataframe['date'].dt.month

dataframe['day'] = dataframe['date'].dt.day

dataframe['hour'] = dataframe['date'].dt.hour

dataframe['minute'] = dataframe['date'].dt.minute
Calculate duration between features
pd.Series(delta.days for delta in (dataframe['Left'] - dataframe['Arrived']))
Show days of the week
dates.dt.weekday_name
Show days of the week as numbers (Monday is 0)
dates.dt.weekday
Creating a Lagged Feature (values lagged by one row)
dataframe["previous_days_stock_price"] = dataframe["stock_price"].shift(1)
Calculate rolling mean (moving average)
dataframe.rolling(window=2).mean()
Handling Missing Data in Time Series
Interpolate missing values
dataframe.interpolate()
Replace missing values with the last known value (i.e., forward-filling)
dataframe.ffill()
Replace missing values with the next known value (i.e., back-filling)
dataframe.bfill()
If we believe the line between the two known points is nonlinear
dataframe.interpolate(method="quadratic")
Interpolate, filling at most one gap in the forward direction
dataframe.interpolate(limit=1, limit_direction="forward")
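A minimal sketch of interpolation versus forward/back filling on a toy monthly time series (the data values are illustrative):
import pandas as pd
import numpy as np

time_index = pd.date_range("01/01/2010", periods=5, freq="M")
dataframe = pd.DataFrame({"Sales": [1.0, 2.0, np.nan, np.nan, 5.0]}, index=time_index)

print(dataframe.interpolate())  # fills the gap linearly: 3.0, 4.0
print(dataframe.ffill())        # repeats the last known value: 2.0, 2.0
print(dataframe.bfill())        # uses the next known value: 5.0, 5.0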

Handling Numerical Data

Min-max scaling
from sklearn import preprocessing
Create scaler
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))
Scale feature
minmax_scale.fit_transform(feature)
Standardizing a Feature
from sklearn import preprocessing
Create scaler
scaler = preprocessing.StandardScaler()
Transform the feature
standardized = scaler.fit_transform(x)
Normalizing Observations (rescale the values of each observation so it has unit norm)
from sklearn.preprocessing import Normalizer
Create normalizer
normalizer = Normalizer(norm="l2")
Transform feature matrix
normalizer.transform(features)

This type of rescaling is often used when we have many equivalent features (e.g., text classification)
Generating Polynomial and Interaction Features
from sklearn.preprocessing import PolynomialFeatures
Create PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
Create polynomial features
polynomial_interaction.fit_transform(features)
Transforming Features
from sklearn.preprocessing import FunctionTransformer

Does the same as pandas' apply
Detecting Outliers
from sklearn.covariance import EllipticEnvelope
Create detector
outlier_detector = EllipticEnvelope(contamination=.1)
Fit detector
outlier_detector.fit(features)
Predict outliers
outlier_detector.predict(features)
IQR for outlier detection
def indicies_of_outliers(x):

    q1, q3 = np.percentile(x, [25, 75])

    iqr = q3 - q1

    lower_bound = q1 - (iqr * 1.5)

    upper_bound = q3 + (iqr * 1.5)

    return np.where((x > upper_bound) | (x < lower_bound))
Handling Outliers
houses[houses['Bathrooms'] < 20]
Create feature based on boolean condition to mark outliers
houses["Outlier"] = np.where(houses["Bathrooms"] < 20, 0, 1)
Transform the feature to dampen the effect of the outlier
houses["Log_Of_Square_Feet"] = [np.log(x) for x in houses["Square_Feet"]]
Standardization if we have outliers
RobustScaler
Discretizating Features (binning)
from sklearn.preprocessing import Binarizer
Create binarizer
binarizer = Binarizer(18)
Transform feature
binarizer.fit_transform(age)
Break up numerical features according to multiple thresholds
np.digitize(age, bins=[20, 30, 64], right=True)  # right=True closes the right edge of each interval instead of the left
Grouping Observations Using Clustering
from sklearn.cluster import KMeans
Make k-means clusterer
clusterer = KMeans(3, random_state=0)
Fit clusterer
clusterer.fit(features)
Predict values
dataframe["group"] = clusterer.predict(features)
Keep only observations that are not (denoted by ~) missing
features[~np.isnan(features).any(axis=1)]
Drop missing observations using pandas
dataframe.dropna()
Predict the missing values in the feature matrix
features_knn_imputed = KNN(k=5, verbose=0).complete(standardized_features)
Use the Imputer module to fill in missing values
from sklearn.preprocessing import Imputer
Create imputer
mean_imputer = Imputer(strategy="mean", axis=0)
Impute values
features_mean_imputed = mean_imputer.fit_transform(features)
One option is to use fit to calculate the minimum and maximum values of the feature, then use transform to rescale the feature. The second option is to use fit_transform to do both operations at once. There is no mathematical difference between the two options, but there is sometimes a practical benefit to keeping the operations separate, because it allows us to apply the same transformation to different sets of the data.
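A minimal sketch of keeping fit and transform separate so the same scaling is applied to training and test data (the train/test arrays here are illustrative):
import numpy as np
from sklearn import preprocessing

x_train = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])
x_test = np.array([[-200.0], [50.0]])

minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))

# Learn the min and max from the training data only...
minmax_scale.fit(x_train)

# ...then apply exactly the same transformation to both sets
x_train_scaled = minmax_scale.transform(x_train)
x_test_scaled = minmax_scale.transform(x_test)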

Deep Learning

Preprocessing Data for Neural Networks
Load libraries
from sklearn import preprocessing
Create scaler
scaler = preprocessing.StandardScaler()
Transform the feature
features_standardized = scaler.fit_transform(features)
Designing a Neural Network
Load libraries
from keras import models

from keras import layers
Start neural network
network = models.Sequential()
Add fully connected layer with a ReLU activation function
network.add(layers.Dense(units=16, activation="relu", input_shape=(10,)))
Add fully connected layer with a ReLU activation function
network.add(layers.Dense(units=16, activation="relu"))
Add fully connected layer with a sigmoid activation function
network.add(layers.Dense(units=1, activation="sigmoid"))
Compile neural network
network.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])  # cross-entropy loss, RMSProp optimizer, accuracy metric
Training a Binary Classifier
Load libraries
from keras.datasets import imdb

from keras.preprocessing.text import Tokenizer

from keras import models

from keras import layers
Set the number of features we want
number_of_features = 1000
Start neural network
network = models.Sequential()
Add fully connected layer with a ReLU activation function
network.add(layers.Dense(units=16, activation="relu", input_shape=(number_of_features,)))
Add fully connected layer with a ReLU activation function
network.add(layers.Dense(units=16, activation="relu"))
Add fully connected layer with a sigmoid activation function
network.add(layers.Dense(units=1, activation="sigmoid"))
Compile neural network
network.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
Train neural network
history = network.fit(features_train, target_train, epochs=3, verbose=1, batch_size=100, validation_data=(features_test, target_test))  # features, target, number of epochs, print progress per epoch, observations per batch, test data
 

Model Evaluation

Cross-Validating Models
from sklearn.model_selection import KFold, cross_val_score

from sklearn.pipeline import make_pipeline
Create a pipeline that standardizes, then runs logistic regression
pipeline = make_pipeline(standardizer, logit)
Create k-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=1)
Conduct k-fold cross-validation
cv_results = cross_val_score(pipeline, features, target, cv=kf, scoring="accuracy", n_jobs=-1)  # pipeline, feature matrix, target vector, cross-validation technique, scoring metric, use all CPU cores
Calculate mean
cv_results.mean()
View score for all 10 folds
cv_results
Fit standardizer to training set
standardizer.fit(features_train)
Apply to both training and test sets
features_train_std = standardizer.transform(features_train)

features_test_std = standardizer.transform(features_test)
Creating a Baseline Regression Model
from sklearn.dummy import DummyRegressor
Create a dummy regressor
dummy = DummyRegressor(strategy='mean')
"Train" dummy regressor
dummy.fit(features_train, target_train)
Get R-squared score
dummy.score(features_test, target_test)
Regression
from sklearn.linear_model import LinearRegression
Train simple linear regression model
ols = LinearRegression()

ols.fit(features_train, target_train)
Get R-squared score
ols.score(features_test, target_test)
Create dummy regressor that predicts 20 for everything
clf = DummyRegressor(strategy='constant', constant=20)

clf.fit(features_train, target_train)
Creating a Baseline Classification Model
from sklearn.dummy import DummyClassifier
Create dummy classifier
dummy = DummyClassifier(strategy='uniform', random_state=1)
"Train" model
dummy.fit(features_train, target_train)
Get accuracy score
dummy.score(features_test, target_test)
Evaluating Binary Classifier Predictions
from sklearn.model_selection import cross_val_score

from sklearn.datasets import make_classification
Cross-validate model using accuracy
cross_val_score(logit, X, y, scoring="accuracy")
Cross-validate model using precision
cross_val_score(logit, X, y, scoring="precision")
Cross-validate model using recall
cross_val_score(logit, X, y, scoring="recall")
Cross-validate model using f1
cross_val_score(logit, X, y, scoring="f1")
Calculate metrics like accuracy and recall directly
from sklearn.metrics import accuracy_score
Calculate accuracy
accuracy_score(y_test, y_hat)
Evaluating Binary Classifier Thresholds
from sklearn.metrics import roc_curve, roc_auc_score
Get predicted probabilities
target_probabilities = logit.predict_proba(features_test)[:,1]
Create true and false positive rates
false_positive_rate, true_positive_rate, threshold = roc_curve(target_test, target_probabilities)
Plot ROC curve
plt.title("Receiver Operating Characteristic")

plt.plot(false_positive_rate, true_positive_rate)

plt.plot([0, 1], ls="--")

plt.plot([0, 0], [1, 0], c=".7"), plt.plot([1, 1], c=".7")

plt.ylabel("True Positive Rate")

plt.xlabel("False Positive Rate")

plt.show()
Evaluating Multiclass Classifier Predictions
cross_val_score(logit, features, target, scoring='f1_macro')
Visualizing a Classifier's Performance
Load libraries
import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.metrics import confusion_matrix
Create confusion matrix
matrix = confusion_matrix(target_test, target_predicted)
Create pandas dataframe
dataframe = pd.DataFrame(matrix, index=class_names, columns=class_names)
Create heatmap
sns.heatmap(dataframe, annot=True, cbar=None, cmap="Blues")

plt.title("Confusion Matrix"), plt.tight_layout()

plt.ylabel("True Class"), plt.xlabel("Predicted Class")

plt.show()
Evaluating Regression Models
Cross-validate the linear regression using (negative) MSE
cross_val_score(ols, features, target, scoring='neg_mean_squared_error')
Cross-validate the linear regression using R-squared
cross_val_score(ols, features, target, scoring='r2')
Evaluating Clustering Models
from sklearn.metrics import silhouette_score

from sklearn.cluster import KMeans
Cluster data using k-means to predict classes
model = KMeans(n_clusters=2, random_state=1).fit(features)
Get predicted classes
target_predicted = model.labels_
Evaluate model
silhouette_score(features, target_predicted)
Creating a Custom Evaluation Metric
from sklearn.metrics import make_scorer, r2_score

from sklearn.linear_model import Ridge
Create custom metric
def custom_metric(target_test, target_predicted):

    r2 = r2_score(target_test, target_predicted)

    return r2
Make scorer and define that higher scores are better
score = make_scorer(custom_metric, greater_is_better=True)
Create ridge regression object
classifier = Ridge()
Apply custom scorer
score(model, features_test, target_test)
Visualizing the Effect of Training Set Size
from sklearn.model_selection import learning_curve
Draw lines
plt.plot(train_sizes, train_mean, '--', color="#111111", label="Training score")

plt.plot(train_sizes, test_mean, color="#111111", label="Cross-validation score")
Draw bands
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")

plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, color="#DDDDDD")
Create plot
plt.title("Learning Curve")

plt.xlabel("Training Set Size"), plt.ylabel("Accuracy Score")

plt.legend(loc="best")

plt.tight_layout()

plt.show()
Creating a Text Report of Evaluation Metrics
from sklearn.metrics import classification_report
Create a classification report
print(classification_report(target_test, target_predicted, target_names=class_names))
Visualizing the Effect of Hyperparameter Values
Plot the validation curve
from sklearn.model_selection import validation_curve
Create range of values for parameter
param_range = np.arange(1, 250, 2)
Hyperparameter to examine
param_name="n_estimators"
Range of hyperparameter's values
param_range = np.arange(1, 250, 2)
Calculate accuracy on training and test set using range of parameter values
train_scores, test_scores = validation_curve(RandomForestClassifier(), features, target, param_name="n_estimators", param_range=param_range, cv=3, scoring="accuracy", n_jobs=-1)  # classifier, feature matrix, target vector, hyperparameter to examine, its range, number of folds, performance metric, use all computer cores
Plot mean accuracy scores for training and test sets
plt.plot(param_range, train_mean, label="Training score", color="black")

plt.plot(param_range, test_mean, label="Cross-validation score", color="dimgrey")
Plot accuracy bands for training and test sets
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color="gray")

plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color="gainsboro")
Create plot
plt.title("Validation Curve With Random Forest")

plt.xlabel("Number Of Trees")

plt.ylabel("Accuracy Score")

plt.tight_layout()

plt.legend(loc="best")

plt.show()
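A minimal sketch showing how the learning-curve quantities plotted above could be computed (the digits data and the score aggregation are illustrative):
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

features, target = load_digits(return_X_y=True)

# Cross-validated training and test scores for increasing training set sizes
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(), features, target,
    cv=5, scoring="accuracy", n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 5))

train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
test_mean, test_std = test_scores.mean(axis=1), test_scores.std(axis=1)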

Dimensionality Reduction Using Feature Selection

Thresholding Numerical Feature Variance
from sklearn.feature_selection import VarianceThreshold
Create thresholder
thresholder = VarianceThreshold(threshold=.5)
Create high variance feature matrix
features_high_variance = thresholder.fit_transform(features)
View variances
thresholder.fit(features).variances_
Features with low variance are likely less interesting (and useful) than features with high variance.
Variance thresholding will not work when feature sets contain different units.
If the features have been standardized (to mean zero and unit variance), then for obvious reasons variance thresholding will not work correctly.
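A minimal sketch of variance thresholding on a toy feature matrix:
from sklearn.feature_selection import VarianceThreshold

# The first two features are constant, so they have zero variance
features = [[0, 1, 0],
            [0, 1, 1],
            [0, 1, 0],
            [0, 1, 1],
            [0, 1, 0]]

thresholder = VarianceThreshold(threshold=.2)
features_high_variance = thresholder.fit_transform(features)

print(thresholder.variances_)   # per-feature variances
print(features_high_variance)   # only the third feature survives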

Handling Text

Strip whitespace
strip_whitespace = [string.strip() for string in text_data]
Remove periods
remove_periods = [string.replace(".", "") for string in strip_whitespace]
Parsing and Cleaning HTML
from bs4 import BeautifulSoup
Parse html
soup = BeautifulSoup(html, "lxml")
Find the div with the class "full_name", show text
soup.find("div", { "class" : "full_name" }).text
Removing Punctuation
import unicodedata

import sys
Create a dictionary of punctuation characters
punctuation = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
For each string, remove any punctuation characters
[string.translate(punctuation) for string in text_data]
Tokenizing Text (You have text and want to break it up into individual words)
from nltk.tokenize import word_tokenize, sent_tokenize
Tokenize into words
word_tokenize(string)
Tokenize into sentences (the text needs sentence-ending punctuation such as full stops)
sent_tokenize(string)
Removing Stop Words
from nltk.corpus import stopwords
Load stop words
stop_words = stopwords.words('english')
Remove stop words
[word for word in tokenized_words if word not in stop_words]
Stemming Words
from nltk.stem.porter import PorterStemmer
Create stemmer
porter = PorterStemmer()
Apply stemmer
[porter.stem(word) for word in tokenized_words]
Tagging Parts of Speech
from nltk import pos_tag
Filter words
[word for word, tag in text_tagged if tag in ['NN', 'NNS', 'NNP', 'NNPS']]
Tag each word in each tweet
for tweet in tweets:

    tweet_tag = nltk.pos_tag(word_tokenize(tweet))

    tagged_tweets.append([tag for word, tag in tweet_tag])
Use one-hot encoding to convert the tags into features
one_hot_multi = MultiLabelBinarizer()

one_hot_multi.fit_transform(tagged_tweets)
To examine the accuracy of our tagger, we split our text data into two parts
from nltk.corpus import brown
Looks at the word itself
from nltk.tag import UnigramTagger
Takes into account the previous word
from nltk.tag import BigramTagger
Takes into account the previous two words
from nltk.tag import TrigramTagger
Get some text from the Brown Corpus, broken into sentences
sentences = brown.tagged_sents(categories='news')
Split into 4000 sentences for training and 623 for testing
train = sentences[:4000]

test = sentences[4000:]
Create backoff tagger
unigram = UnigramTagger(train)

bigram = BigramTagger(train, backoff=unigram)

trigram = TrigramTagger(train, backoff=bigram)
Show accuracy
trigram.evaluate(test)
Encoding Text as a Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
Create the bag of words feature matrix
count = CountVectorizer()
Sparse matrix of bag of words
bag_of_words = count.fit_transform(text_data)
Turn sparse matrix into an array
bag_of_words.toarray()
Show feature (column) names
count.get_feature_names()
Create feature matrix with arguments
count_2gram = CountVectorizer(ngram_range=(1, 2), stop_words="english", vocabulary=['brazil'])

bag = count_2gram.fit_transform(text_data)
View the 1-grams and 2-grams
count_2gram.vocabulary_
Weighting Word Importance
from sklearn.feature_extraction.text import TfidfVectorizer
Create the tf-idf (term frequency-inverse document frequency) feature matrix
tfidf = TfidfVectorizer()

feature_matrix = tfidf.fit_transform(text_data)
Show feature names
tfidf.vocabulary_
You will have to download the set of stop words the first time
import nltk
nltk.download('stopwords')

Note that NLTK's stopwords assumes the tokenized words are all lowercased
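A minimal sketch of the bag-of-words and tf-idf recipes above on a tiny corpus (the example sentences are illustrative):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])

# Bag of words: raw term counts
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
print(bag_of_words.toarray())

# tf-idf: counts reweighted by how rare each word is across documents
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)
print(tfidf.vocabulary_)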

Support Vector Machines

Training a Linear Classifier
Load libraries
from sklearn.svm import LinearSVC
Standardize features
scaler = StandardScaler()

features_standardized = scaler.fit_transform(features)
Create support vector classifier
svc = LinearSVC(C=1.0)
Train model
model = svc.fit(features_standardized, target)
Plot data points and color using their class
color = ["black" if c == 0 else "lightgrey" for c in target]

plt.scatter(features_standardized[:,0], features_standardized[:,1], c=color)
Create the hyperplane
w = svc.coef_[0]

a = -w[0] / w[1]
Return evenly spaced numbers over a specified interval
xx = np.linspace(-2.5, 2.5)

yy = a * xx - (svc.intercept_[0]) / w[1]
Plot the hyperplane
plt.plot(xx, yy)

plt.axis("off"), plt.show()
Handling Linearly Inseparable Classes Using Kernels
Create a support vector machine with a radial basis function (RBF) kernel
svc = SVC(kernel="rbf", random_state=0, gamma=1, C=1)
Creating Predicted Probabilities
View predicted probabilities
model.predict_proba(new_observation)
Identifying Support Vectors
View support vectors
model.support_vectors_
Handling Imbalanced Classes
Increase the penalty for misclassifying the smaller class using class_weight
Create support vector classifier
svc = SVC(kernel="linear", class_weight="balanced", C=1.0, random_state=0)
A visualization of this recipe is on page 321 of the book.
In scikit-learn, the predicted probabilities must be generated when the model is being trained. We can do this by setting SVC's probability parameter to True; the model is then trained as usual, and predict_proba returns calibrated probabilities.
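A minimal sketch of getting predicted probabilities from an SVC, assuming the iris data:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

features, target = load_iris(return_X_y=True)
features_standardized = StandardScaler().fit_transform(features)

# probability=True makes SVC fit the extra calibration needed for predict_proba
svc = SVC(kernel="linear", probability=True, random_state=0)
model = svc.fit(features_standardized, target)

new_observation = [[.4, .4, .4, .4]]
print(model.predict_proba(new_observation))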

Data Wrangling

Creating a series
pd.Series(['Molly Mooney', 40, True], index=['Name', 'Age', 'Driver'])
Appending to a data frame
dataframe.append(new_person, ignore_index=True)
First lines of the data
dataframe.head(2)
Descriptive statistics
dataframe.describe()
Return row by index
dataframe.iloc[0]
Return row by name
dataframe.loc['Allen, Miss Elisabeth Walton']
Set index
dataframe = dataframe.set_index(dataframe['Name'])
Selecting Rows Based on Conditionals
dataframe[dataframe['Sex'] == 'female']
Replacing Values
dataframe['Sex'].replace("anterior", "posterior")
Replacing multiple values
dataframe['Sex'].replace(["female", "male"], ["Woman", "Man"])
Renaming Columns
dataframe.rename(columns={'PClass': 'Passenger Class'})
Minimum, max, sum, count
dataframe['Age'].min()
Finding Unique Values
dataframe['Sex'].unique()
Display all unique values with the number of times each value appears
dataframe['Sex'].value_counts()
Number of unique values
dataframe['PClass'].nunique()
Return booleans indicating whether a value is missing
dataframe[dataframe['Age'].isnull()]
Replace values with missing values (NaN)
dataframe['Sex'] = dataframe['Sex'].replace('male', np.nan)
Load data, set missing values
dataframe = pd.read_csv(url, na_values=[np.nan, 'NONE', -999])
Filling missing values
dataframe.fillna(value)
Deleting a Column
dataframe.drop(['Age', 'Sex'], axis=1).head(2)
Deleting a Row
dataframe[dataframe['Sex'] != 'male']

or use drop
Dropping Duplicate Rows
dataframe.drop_duplicates()
Dropping duplicate rows, taking into account only a subset of columns
dataframe.drop_duplicates(subset=['Sex'], keep='last')  # keep is an optional argument to keep the last observation instead of the first
Grouping Rows by Values
dataframe.groupby('Sex').mean()

dataframe.groupby(['Sex', 'Survived'])['Age'].mean()
Creating a date range
pd.date_range('06/06/2017', periods=100000, freq='30S')
Group rows by week
dataframe.resample('W').sum()
Group by two weeks
dataframe.resample('2W').mean()
Group by month
dataframe.resample('M', label='left').count()  # label='left' means the label returned is the first observation in the group
Looping Over a Column
for name in dataframe['Name'][0:2]:
Applying a Function Over All Elements in a Column
dataframe['Name'].apply(uppercase)
Applying a Function to Groups
dataframe.groupby('Sex').apply(lambda x: x.count())
Concatenating DataFrames by rows
pd.concat([dataframe_a, dataframe_b], axis=0)
Concatenating DataFrames by columns
pd.concat([dataframe_a, dataframe_b], axis=1)
Merging DataFrames
pd.merge(dataframe_employees, dataframe_sales, on='employee_id', how='outer')

how can also be left, right, or inner
If the tables have columns with different names
pd.merge(dataframe_employees, dataframe_sales, left_on='employee_id', right_on='employee_id')
replace can accept regular expressions
To have full functionality with NaN we need to import the NumPy library first
groupby needs to be paired with some operation we want to apply to each group, such as calculating an aggregate statistic
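A minimal sketch of groupby paired with an aggregate operation, on a toy dataframe (the column values are illustrative):
import pandas as pd

dataframe = pd.DataFrame({'Sex': ['female', 'male', 'female', 'male'],
                          'Age': [29, 35, 41, 22],
                          'Survived': [1, 0, 1, 0]})

# groupby on its own returns a lazy GroupBy object; pair it with an aggregation
print(dataframe.groupby('Sex')['Age'].mean())
print(dataframe.groupby(['Sex', 'Survived'])['Age'].count())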

Saving and Loading Trained Models

Saving and Loading a scikit-learn Model
Load libraries
from sklearn.externals import joblib
Save model as pickle file
joblib.dump(model, "model.pkl")
Load model from file
classifer = joblib.load("model.pkl")
Get scikit-learn version
scikit_version = joblib.__version__
Save model as pickle file with the library version in the filename
joblib.dump(model, "model_{version}.pkl".format(version=scikit_version))
Saving and Loading a Keras Model
Load libraries
from keras.models import load_model
Save neural network
network.save("model.h5")
Load neural network
network = load_model("model.h5")
When saving scikit-learn models, be aware that saved models might not be compatible between versions of scikit-learn; therefore, it can be helpful to include the version of scikit-learn used to train the model in the filename.