Machine Learning - Jalpa Tank Cheat Sheet (DRAFT)

Machine Learning (Python)

This is a draft cheat sheet. It is a work in progress and is not finished yet.

KNN Regression

from sklearn.neighbors import KNeighborsRegressor
import matplotlib.pyplot as plt
1. Define X and y
2. Find the k nearest neighbors
3. Find the average price of those neighbors
knn = KNeighborsRegressor(n_neighbors=n)
knn.fit(X, y)
knn.score(X_test, y_test)
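A minimal end-to-end sketch of the steps above on made-up size/price data; the values and k=3 are illustrative only:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

X = np.array([[50], [60], [70], [80], [90], [100]])   # feature: size in m^2 (toy data)
y = np.array([150, 180, 200, 240, 270, 310])          # target: price (toy data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

knn = KNeighborsRegressor(n_neighbors=3)   # step 2: use the 3 nearest neighbors
knn.fit(X_train, y_train)
print(knn.predict([[75]]))                 # step 3: average price of the 3 closest points
print(knn.score(X_test, y_test))           # R^2 on held-out data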

Categorical Variables (Ordinal & Nominal)

import category_encoders as ce
encoder = ce.OrdinalEncoder(mapping=[{'col': 'colname', 'mapping': {'1': 1, '2': 2}}])
encoder = ce.OrdinalEncoder(cols=['colname'])
encoder.fit(X)
X = encoder.transform(X)
Frequency Encoding
encoder = ce.CountEncoder(cols=['colname'])
One-Hot Encoding
encoder = ce.OneHotEncoder()
Target Encoding
encoder = ce.TargetEncoder()
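A hedged sketch of ordinal vs. one-hot encoding on a hypothetical 'size' column (the column name and mapping are illustrative):

import pandas as pd
import category_encoders as ce

X = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium']})

ordinal = ce.OrdinalEncoder(mapping=[{'col': 'size',
                                      'mapping': {'small': 1, 'medium': 2, 'large': 3}}])
ordinal.fit(X)
X_ord = ordinal.transform(X)       # 'size' becomes 1, 2, 3, 2 (order preserved)

onehot = ce.OneHotEncoder(cols=['size'])
X_hot = onehot.fit_transform(X)    # one 0/1 column per category (no order implied)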

Mean Absolute Error, R2 Score, Accuracy Score

from sklearn.metrics import mean_absolute_error, r2_score
from sklearn import metrics
from sklearn.metrics import accuracy_score
e = mean_absolute_error(y_true, predictions)   # y_true = y_train or y_test, matching the predictions
ep = e * 100 / y.mean()
r2_score(y_train, preds)
validation_e = accuracy_score(y_test, validation_predictions)
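A small sketch showing all three metrics on made-up values (the arrays are illustrative):

import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score, accuracy_score

y_test = np.array([100, 200, 300])
predictions = np.array([110, 190, 320])
e = mean_absolute_error(y_test, predictions)   # average absolute error, in the units of y
ep = e * 100 / y_test.mean()                   # error as a percentage of the mean target
r2 = r2_score(y_test, predictions)             # 1.0 would be a perfect fit

y_true_cls = [0, 1, 1, 0]                      # accuracy is for classification labels
validation_e = accuracy_score(y_true_cls, [0, 1, 0, 0])   # 0.75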
 

Decision Tree

from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
1. Define X and y
2. regr = DecisionTreeRegressor(random_state=1234, max_depth=int)
3. model = regr.fit(X, y)
4. model.predict(data)
Squared error
squared = (col - col.mean()) ** 2
squared = sum(squared) / n
Getting the threshold values
regr1.tree_
regr1.tree_.threshold
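A minimal sketch of fitting a tree and reading its split thresholds (toy data; max_depth=2 is illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 6.1])
regr1 = DecisionTreeRegressor(random_state=1234, max_depth=2)
model = regr1.fit(X, y)
print(model.predict([[2.5]]))
print(regr1.tree_.threshold)   # split thresholds per node; -2.0 marks leaf nodes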

train_test_split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=None, train_size=None, random_state=None, shuffle=True)
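A typical call with explicit values, assuming X and y are already defined (the size and seed are illustrative):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, shuffle=True)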

Methods

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
from pandas.api.types import is_string_dtype, is_object_dtype, is_numeric_dtype
DataFrame.dropna(axis=0, thresh=int, inplace=False) || lambda x: x.capitalize()
x.to_frame().T  # Convert Series to DataFrame (to_frame)
df.sort_values(by=colname, axis=int, ascending=True)
.astype(str)
df[colname].fillna(df[colname].median(), inplace=True)
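A hedged sketch exercising the DataFrame helpers above on a hypothetical three-column frame (column names and values are illustrative):

import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({'price': [100.0, None, 300.0],
                   'rooms': [2, 3, 4],
                   'city': ['oslo', 'bergen', 'oslo']})
df = df.dropna(axis=0, thresh=2)                          # keep rows with at least 2 non-NaN values
df['city'] = df['city'].apply(lambda x: x.capitalize())
df['price'] = df['price'].fillna(df['price'].median())    # fill missing price with the median (200.0)
df = df.sort_values(by='price', ascending=True)
df['price'] = df['price'].astype(str)
print(is_numeric_dtype(df['price']))                      # False after the cast
print(df['price'].to_frame().T)                           # Series -> one-row DataFrame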
 

Random Forests

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
rf.fit(X, y)
rf.score(X_train, y_train)
rf.oob_score_
rf.estimators_
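A self-contained sketch on random toy data; the data-generating lines exist only to make the snippet runnable:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))        # R^2 on the training data (optimistic)
print(rf.oob_score_)         # out-of-bag R^2, a built-in validation estimate
print(len(rf.estimators_))   # the 100 individual trees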

Calculating feature importance with rfpimp

from rfpimp import *
I = importances(rf, X_test, y_test)
plot_importances(I, color='#4575b4')

Hyper-parameters

Statisticians call the number of trees, and any other aspect of the model that affects its architecture, a hyper-parameter.
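A hedged sketch of tuning hyper-parameters with RandomizedSearchCV and sp_randint from the Methods box above; the parameter ranges, toy data, and n_iter are illustrative, not recommendations:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

rng = np.random.default_rng(1)
X_train = rng.random((100, 3))        # toy data so the snippet runs on its own
y_train = X_train.sum(axis=1)

param_dist = {'n_estimators': sp_randint(50, 300), 'max_depth': sp_randint(2, 10)}
search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_distributions=param_dist,
                            n_iter=10, cv=3, random_state=0)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)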

Train, Validate, Test

Roughly 15% test, 15% validation, 70% train (the second split takes 15% of the remaining dev set)
df_dev, df_test = train_test_split(df, test_size=0.15)
df_train, df_valid = train_test_split(df_dev, test_size=0.15)
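A quick sanity check of the two-stage split on a hypothetical 100-row DataFrame (the counts follow from the fractions above):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'x': range(100), 'y': range(100)})
df_dev, df_test = train_test_split(df, test_size=0.15, random_state=0)
df_train, df_valid = train_test_split(df_dev, test_size=0.15, random_state=0)
print(len(df_test), len(df_valid), len(df_train))   # 15, 13, 72: roughly 15/15/70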