Scikit-Learn (Cyber-security) Cheat Sheet

This is a Cheat sheet for Scikit-Learn


Scikit-learn is an open source Python library that implements a range of machine learning, preprocessing, cross-validation and visualization algorithms using a unified interface.

Splitting Data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
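When no size is given, train_test_split holds out 25% of the samples. The sketch below is one way to set the test fraction explicitly and keep class proportions with stratify; the toy X and y are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, balanced binary labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 20% of the rows and keep the 50/50 class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)
print(X_train.shape, X_test.shape)
```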

Handling Missing Data

import numpy as np
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(X[:, 1:3])
X[:, 1:3] = missingvalues.transform(X[:, 1:3])
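A small worked example may make the effect of mean imputation easier to see. The array below is hypothetical; each NaN in columns 1 and 2 is replaced by the mean of the non-missing values in the same column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in column 1 and one in column 2.
X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 6.0],
              [3.0, 4.0, 9.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Fit and transform only the columns that contain missing values.
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
print(X)
```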

Linear Regression

from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression()
linear_reg.fit(X, y)

Decision Tree and Random forest

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y)
regressor2 = RandomForestRegressor(n_estimators=100, random_state=0)
regressor2.fit(X, y)
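To show both regressors end to end, here is a minimal sketch on synthetic data. The dataset from make_regression and the names tree and forest are assumptions made for illustration, not part of the original sheet.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (assumption: stands in for real data).
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Predict the first five rows with each model.
tree_pred = tree.predict(X[:5])
forest_pred = forest.predict(X[:5])
print(tree_pred, forest_pred)
```

An unpruned tree memorizes its training data, so its training score says little about generalization; the forest averages many trees to reduce that variance.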


Cross Validation

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()
result = cross_validate(lr, X, y)
Cross-validation estimates how well a model generalizes by repeatedly re-splitting the data, fitting on one part and scoring on the held-out part in each iteration.
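cross_validate returns a dictionary of arrays, one entry per fold. The sketch below, which assumes 5 folds and R^2 scoring, shows how the per-fold scores might be read and summarized.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, random_state=0)

# cv=5 and scoring="r2" are choices made for this example.
result = cross_validate(LinearRegression(), X, y, cv=5, scoring="r2")

scores = result["test_score"]  # one R^2 value per fold
print(sorted(result.keys()))
print(scores.mean(), scores.std())
```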

Pandas functions for importing Data

pd.read_csv(filename)
From a CSV file
pd.read_excel(filename)
From an Excel file
pd.read_sql(query, connection_object)
Read from a SQL table/database
pd.read_clipboard()
Takes the contents of your clipboard and passes it to read_csv()

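Visualization using Scikit-learn

from sklearn.metrics import RocCurveDisplay
Importing RocCurveDisplay to plot (it replaces plot_roc_curve, which was removed in scikit-learn 1.2)
svc_disp = RocCurveDisplay.from_estimator(svc, X_test, y_test)
Plotting Receiver Operating Characteristic (ROC) curve
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(svc, X_test, y_test)
Plotting Confusion Matrix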
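The numbers behind these two plots can also be computed without a plotting backend. This is a sketch under assumptions: the synthetic dataset and the default SVC are placeholders for whatever classifier svc refers to above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data (assumption for illustration).
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

svc = SVC(random_state=0).fit(X_train, y_train)

# Area under the ROC curve, from the signed distance to the decision boundary.
auc = roc_auc_score(y_test, svc.decision_function(X_test))
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, svc.predict(X_test))
print(auc)
print(cm)
```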

Clustering metrics

Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)
Homogeneity
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)
V-measure
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred)
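All three metrics compare groupings, not label values, so a clustering that recovers the true groups under different label names still scores 1.0. A minimal sketch with hypothetical labels:

```python
from sklearn.metrics import (
    adjusted_rand_score,
    homogeneity_score,
    v_measure_score,
)

y_true = [0, 0, 1, 1]
y_pred = [1, 1, 0, 0]  # same grouping, cluster names swapped

ari = adjusted_rand_score(y_true, y_pred)
hom = homogeneity_score(y_true, y_pred)
vm = v_measure_score(y_true, y_pred)
print(ari, hom, vm)
```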

Pandas Data Cleaning functions

pd.isnull()
Checks for null values, returns a Boolean array
pd.notnull()
Opposite of pd.isnull()
df.dropna()
Drop all rows that contain null values
df.dropna(axis=1)
Drop all columns that contain null values
df.fillna(x)
Replace all null values with x
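The cleaning operations described above can be tried on a tiny DataFrame. The frame df and its column names are hypothetical; the calls are standard pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single missing value in column "a".
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

null_mask = pd.isnull(df)        # Boolean frame, True where a value is missing
rows_kept = df.dropna()          # drops the row that contains NaN
cols_kept = df.dropna(axis=1)    # drops the column that contains NaN
filled = df.fillna(0)            # replaces NaN with 0

print(null_mask)
print(rows_kept.shape, list(cols_kept.columns))
```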

Numpy Basic Functions

import numpy as np
importing numpy
example = [0,1,2]
example = np.array(example)
array([0, 1, 2])
np.linspace(0, 5, 2)
array([0., 5.]), gives two evenly spaced values
np.eye(2)
array([[1., 0.], [0., 1.]]), 2*2 Identity Matrix

Loading Dataset from local Machine

import pandas as pd
data = pd.read_csv(pathname)
If the file is in the current working directory, the file name alone can be used as the path.

Loading Data from Standard datasets

from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
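The loaders return objects whose data and target attributes hold the feature matrix and the labels. A short sketch of inspecting them:

```python
from sklearn import datasets

iris = datasets.load_iris()
digits = datasets.load_digits()

# Feature matrix shape is (n_samples, n_features).
print(iris.data.shape, iris.target.shape)
print(digits.data.shape)
print(list(iris.target_names))
```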

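Encoding Categorical Variables

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
# categorical_features was removed in scikit-learn 0.22; pick the column with ColumnTransformer
onehotencoder = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
X = onehotencoder.fit_transform(X)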
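As an illustration of one-hot encoding a single categorical column, the sketch below uses a hypothetical array of protocol names and port numbers. ColumnTransformer is used here to select the column; sparse_threshold=0 is set so the result is always a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: column 0 is categorical, column 1 is numeric.
X = np.array([["tcp", 80], ["udp", 53], ["tcp", 443]], dtype=object)

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), [0])],  # encode column 0 only
    remainder="passthrough",             # keep the other columns unchanged
    sparse_threshold=0,                  # always return a dense array
)
X_encoded = ct.fit_transform(X)
print(X_encoded)
```

The encoded columns come first (one per category, in sorted order), followed by the passed-through columns.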

Polynomial Regression

from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
It models the relation between y (dependent) and not only X (independent) but also X^2, ..., X^n, where n is the degree we specify.
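PolynomialFeatures only expands the inputs; a linear model is still needed to fit them. A minimal sketch, assuming a hypothetical quadratic target y = 1 + 2x + 3x^2:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data that follows an exact quadratic.
X = np.arange(6, dtype=float).reshape(-1, 1)
y = 1 + 2 * X.ravel() + 3 * X.ravel() ** 2

poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)  # columns: 1, x, x^2

# Ordinary linear regression on the expanded features.
lin_reg = LinearRegression().fit(X_poly, y)
pred = lin_reg.predict(poly_reg.transform([[6.0]]))
print(pred)
```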

Evaluation of Regression Model Performance

R^2 = 1 - SS(residuals)/SS(total)
SS(residuals) = SUM (y_i - ŷ_i)^2
SS(total) = SUM (y_i - y_avg)^2
from sklearn.metrics import r2_score
The greater the R^2 value, the better the model fits; 1 is a perfect fit.
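The formula can be checked by hand against r2_score. The values below are hypothetical and chosen only to keep the arithmetic small.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```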

Converting Dataframe to Matrix

data = pd.read_csv("data.csv")
X = data.iloc[:, :-1].values
y = data.iloc[:, 3].values
y is Dependent parameter
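A self-contained sketch of the same conversion, reading from an in-memory string instead of a file. The column names and values are hypothetical, and the last column is assumed to be the dependent variable, so it is selected with -1 rather than a fixed index.

```python
import io

import pandas as pd

# Hypothetical CSV content; read_csv accepts file-like objects as well as paths.
csv_text = "f1,f2,f3,label\n1,2,3,0\n4,5,6,1\n"
data = pd.read_csv(io.StringIO(csv_text))

X = data.iloc[:, :-1].values  # every column except the last
y = data.iloc[:, -1].values   # last column as the dependent variable
print(X.shape, y)
```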

Feature Scaling

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Euclidean distance is dominated by features with larger magnitudes, so all features should be brought to the same scale. Scikit-learn estimators do not scale features on their own: tree-based models are insensitive to scale, but distance-based models such as K-NN, SVM and K-Means usually need scaled inputs.
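The scaler learns the mean and standard deviation from the training data only and reuses them on the test data. A small sketch with hypothetical values on two very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: column 0 in units, column 1 in thousands.
X_train = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
X_test = np.array([[2.0, 4000.0]])

sc_X = StandardScaler()
X_train_s = sc_X.fit_transform(X_train)  # learn mean/std, then scale
X_test_s = sc_X.transform(X_test)        # reuse the training mean/std

print(X_train_s)
print(X_test_s)
```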

SVR (Non-linear Regression model)

from sklearn.svm import SVR
regressor = SVR(kernel='rbf')
regressor.fit(X, y)
y_prediction = regressor.predict(values)
The kernel is chosen based on the problem. If the problem is linear, use kernel='linear'. If it is non-linear, choose either 'poly' or 'rbf' (Gaussian).
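SVR is sensitive to feature scale, so one common arrangement is to put a scaler in front of it. The sketch below is an assumption-laden illustration: the sine-shaped data is synthetic, and wrapping the model in make_pipeline is a design choice, not something the original sheet prescribes.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic non-linear data: y = sin(x) on [0, 5).
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()

# Scale the inputs, then fit an RBF-kernel SVR.
regressor = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
regressor.fit(X, y)

y_prediction = regressor.predict([[2.5]])
print(y_prediction, np.sin(2.5))
```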

Some Classification Models

Logistic Regression
K-NN (K-nearest neighbours)
Support Vector Machine (SVM)
Naive Bayes
Decision Tree Classification
Random Forest Classification
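Because all of these classifiers share the same fit/score interface, they can be compared in a loop. This is a sketch on the iris dataset with mostly default settings; the hyperparameters shown are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and record its accuracy on the held-out split.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```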

Some Clustering Models

K-Means Clustering
Hierarchical Clustering
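Both clustering models are available in sklearn.cluster. A minimal sketch on well-separated synthetic blobs; the centers and the choice of three clusters are assumptions made for the example.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three well-separated synthetic clusters (hypothetical centers).
centers = [[0, 0], [5, 5], [-5, 5]]
X, y_true = make_blobs(n_samples=150, centers=centers,
                       cluster_std=0.5, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Compare each clustering with the known grouping.
print(adjusted_rand_score(y_true, kmeans_labels))
print(adjusted_rand_score(y_true, hier_labels))
```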

Knowing about Data information with Pandas

df.head(n)
First n rows of the DataFrame
df.tail(n)
Last n rows of the DataFrame
df.shape
Number of rows and columns
df.info()
Index, Datatype and Memory information
df.describe()
Summary statistics for numerical columns
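The inspection steps described above can be tried on a small frame. The DataFrame and its column names are hypothetical; the calls are standard pandas.

```python
import pandas as pd

# Hypothetical traffic summary with two numeric columns.
df = pd.DataFrame({
    "port": [22, 80, 443, 8080],
    "bytes": [120.0, 300.5, 80.0, 42.0],
})

first = df.head(2)       # first 2 rows
last = df.tail(2)        # last 2 rows
shape = df.shape         # (rows, columns)
df.info()                # prints index, dtypes and memory usage
summary = df.describe()  # count, mean, std, min, quartiles, max

print(first)
print(last)
print(shape)
print(summary)
```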
