
Scikit-Learn (Cyber-security) Cheat Sheet

This is a cheat sheet for Scikit-Learn.

Definition

Scikit-learn is an open-source Python library that implements a range of machine learning, preprocessing, cross-validation and visualization algorithms using a unified interface.

Splitting Data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
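A minimal variant, assuming X and y are array-like; test_size sets the held-out fraction explicitly (the default is 0.25):

from sklearn.model_selection import train_test_split
# hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)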

Handling Missing Data

import numpy as np
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(X[:, 1:3])
X[:, 1:3] = missingvalues.transform(X[:, 1:3])
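The fit and transform steps can also be combined; a minimal sketch under the same assumptions (columns 1-2 of X hold the values to impute):

import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])  # fit and transform in one call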

Linear Regression

from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression()
linear_reg.fit(X, y)

Decision Tree and Random Forest

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y)
regressor2 = RandomForestRegressor(n_estimators=100, random_state=0)
regressor2.fit(X, y)

Cross-Validation

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()
result = cross_validate(lr, X, y)
result['test_score']
Cross-validation estimates the effectiveness of a model by re-sampling the data and fitting and scoring the model over different iterations (folds).
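Extending the snippet above, cv makes the number of folds explicit, and averaging the per-fold scores gives a single estimate (aggregating with np.mean is an illustrative choice):

import numpy as np
result = cross_validate(lr, X, y, cv=5)  # 5-fold cross-validation
print(np.mean(result['test_score']))     # mean R^2 across the folds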

Pandas functions for importing Data

pd.read_csv(filename)
From a CSV file
pd.read_excel(filename)
From an Excel file
pd.read_sql(query, connection_object)
Read from a SQL table/database
pd.read_clipboard()
Takes the contents of your clipboard and passes it to read_table()

Visualization using Scikit-learn

from sklearn.metrics import plot_roc_curve
Importing "plot_roc_curve" to plot
svc_disp = plot_roc_curve(svc, X_test, y_test)
Plotting a Receiver Operating Characteristic (ROC) curve
metrics.plot_confusion_matrix
Plotting a confusion matrix
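A minimal end-to-end sketch, assuming a binary classification task. Note that plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator is the current equivalent:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import RocCurveDisplay
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
svc = SVC(random_state=0).fit(X_train, y_train)
RocCurveDisplay.from_estimator(svc, X_test, y_test)  # draws the ROC curve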

Clustering metrics

Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)
Homogeneity
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)
V-measure
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred)

Pandas Data Cleaning functions

pd.isnull()
Checks for null values; returns a Boolean array
pd.notnull()
Opposite of pd.isnull()
df.dropna()
Drop all rows that contain null values
df.dropna(axis=1)
Drop all columns that contain null values
df.fillna(x)
Replace all null values with x

Numpy Basic Functions

import numpy as np
Importing numpy
example = [0, 1, 2]
example = np.array(example)
array([0, 1, 2])
np.arange(1, 4)
array([1, 2, 3])
np.zeros((2, 2))
array([[0., 0.], [0., 0.]])
np.linspace(0, 10, 2)
array([ 0., 10.]), gives two evenly spaced values from 0 to 10
np.eye(2)
array([[1., 0.], [0., 1.]]), the 2x2 identity matrix
example.reshape(3, 1)
array([[0], [1], [2]])
 

Loading a Dataset from the Local Machine

import pandas as pd
data = pd.read_csv(pathname)
If the file is in the working directory, the file name alone can be passed.

Loading Data from Standard datasets

from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()

Encoding Categorical Variables

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
(The categorical_features argument was removed from OneHotEncoder; ColumnTransformer now selects the column to encode.)

Polynomial Regression

from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
This models the relation between X (independent) and y (dependent) not only linearly but also through the higher-order terms X^2, ..., X^n, where n is the degree we specify (see the sketch below).
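A minimal sketch of using the expanded features, assuming X and y are array-like; the polynomial terms are simply fed to an ordinary linear regression:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)               # adds squared and interaction terms
lin_reg = LinearRegression().fit(X_poly, y)
y_pred = lin_reg.predict(poly_reg.transform(X))  # transform new data the same way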

Evaluation of Regression Model Performance

R² = 1 - SS_res / SS_total
SS_res = Σ(y_i - ŷ_i)²
SS_total = Σ(y_i - ȳ)²
from sklearn.metrics import r2_score
r2_score(y_true, y_pred)
The greater the R² value, the better the model fits.
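A tiny worked example with illustrative numbers: for y_true = [1, 2, 3] and y_pred = [1, 2, 4], SS_res = 1 and SS_total = 2, so R² = 1 - 1/2 = 0.5:

from sklearn.metrics import r2_score
print(r2_score([1, 2, 3], [1, 2, 4]))  # 0.5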

Converting Dataframe to Matrix

data = pd.read_csv("data.csv")
X = data.iloc[:, :-1].values
y = data.iloc[:, 3].values
y is the dependent variable (here, column 3)

Feature Scaling

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Euclidean distance is dominated by features with larger magnitudes, so scaling puts all values on the same scale. Models that are not distance-based (e.g., trees) are largely insensitive to scaling, but distance-based ones (e.g., k-NN, SVM) need it.
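A common pattern is to bundle the scaler and the model in a Pipeline so the scaler is fitted on the training data only; a minimal sketch (the SVC estimator is an illustrative choice):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)         # scaler statistics come from the training data only
print(model.score(X_test, y_test))  # test rows are scaled with those same statistics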

SVR (Non-linear Regression Model)

from sklearn.svm import SVR
regressor = SVR(kernel='rbf')
regressor.fit(X, y)
y_prediction = regressor.predict(values)
The kernel is chosen based on the problem: if the relationship is linear, use kernel='linear'; if it is non-linear, choose 'poly' or 'rbf' (Gaussian).
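Note that predict expects a 2-D array, even for a single sample; a minimal sketch (6.5 is an illustrative input value for a one-feature model):

import numpy as np
y_prediction = regressor.predict(np.array([[6.5]]))  # shape (1, n_features)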

Some Classification Models

Logistic Regression
K-NN (k-nearest neighbours)
Support Vector Machine (SVM)
Naive Bayes
Decision Tree Classification
Random Forest Classification
All of these share the same fit/predict interface (see the sketch below).
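A minimal sketch with logistic regression (the synthetic dataset and split are illustrative; the other models plug into the same two lines):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out set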

Some Clustering Models

K-Means Clustering
Hierarchical Clustering
DBSCAN
These follow a similar interface via fit_predict (see the sketch below).
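A minimal k-means sketch (the blob dataset and n_clusters=3 are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster index for every sample
print(kmeans.cluster_centers_)   # coordinates of the three centroids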

Knowing about Data Information with Pandas

df.head(n)
First n rows of the DataFrame
df.tail(n)
Last n rows of the DataFrame
df.shape
Number of rows and columns
df.info()
Index, datatype and memory information
df.describe()
Summary statistics for numerical columns
 
