
Scikit-Learn (Cyber-security) Cheat Sheet

This is a cheat sheet for Scikit-Learn.

Definition

Scikit-learn is an open-source Python library that implements a range of machine learning, preprocessing, cross-validation and visualization algorithms using a unified interface.

Splitting Data

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
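A minimal variant, assuming X and y are array-like; test_size sets the held-out fraction explicitly (the default is 0.25):

from sklearn.model_selection import train_test_split
# hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)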

Handling Missing Data

import numpy as np
from sklearn.impute import SimpleImputer
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(X[:, 1:3])
X[:, 1:3] = missingvalues.transform(X[:, 1:3])
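The fit and transform steps can also be combined; a minimal sketch under the same assumptions (columns 1-2 of X hold the values to impute):

import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])  # fit and transform in one call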

Linear Regression

from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression()
linear_reg.fit(X, y)

Decision Tree and Random Forest

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y)
regressor2 = RandomForestRegressor(n_estimators=100, random_state=0)
regressor2.fit(X, y)

Cross-Validation

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()
result = cross_validate(lr, X, y)
result['test_score']
Cross-validation estimates the effectiveness of a model by re-sampling the data and fitting and scoring the model over different iterations (folds).
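Extending the snippet above, cv makes the number of folds explicit, and averaging the per-fold scores gives a single estimate (aggregating with np.mean is an illustrative choice):

import numpy as np
result = cross_validate(lr, X, y, cv=5)  # 5-fold cross-validation
print(np.mean(result['test_score']))     # mean R^2 across the folds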

Pandas functions for importing Data

pd.read_csv(filename)
From a CSV file
pd.read_excel(filename)
From an Excel file
pd.read_sql(query, connection_object)
Read from a SQL table/database
pd.read_clipboard()
Takes the contents of your clipboard and passes it to read_table()

Visualization using Scikit-learn

from sklearn.metrics import plot_roc_curve
Importing "plot_roc_curve" to plot
svc_disp = plot_roc_curve(svc, X_test, y_test)
Plotting a Receiver Operating Characteristic (ROC) curve
metrics.plot_confusion_matrix
Plotting a confusion matrix
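A minimal end-to-end sketch, assuming a binary classification task. Note that plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator is the current equivalent:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import RocCurveDisplay
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
svc = SVC(random_state=0).fit(X_train, y_train)
RocCurveDisplay.from_estimator(svc, X_test, y_test)  # draws the ROC curve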

Clustering metrics

Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)
Homogeneity
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)
V-measure
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred)

Pandas Data Cleaning functions

pd.isnull()
Checks for null values; returns a Boolean array
pd.notnull()
Opposite of pd.isnull()
df.dropna()
Drop all rows that contain null values
df.dropna(axis=1)
Drop all columns that contain null values
df.fillna(x)
Replace all null values with x

Numpy Basic Functions

import numpy as np
Importing numpy
example = [0, 1, 2]
example = np.array(example)
array([0, 1, 2])
np.arange(1, 4)
array([1, 2, 3])
np.zeros((2, 2))
array([[0., 0.], [0., 0.]])
np.linspace(0, 10, 2)
array([ 0., 10.]), gives two evenly spaced values from 0 to 10
np.eye(2)
array([[1., 0.], [0., 1.]]), the 2x2 identity matrix
example.reshape(3, 1)
array([[0], [1], [2]])
 

Loading a Dataset from the Local Machine

import pandas as pd
data = pd.read_csv(pathname)
If the file is in the working directory, the file name alone can be passed.

Loading Data from Standard datasets

from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()

Encoding Categorical Variables

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough')
X = ct.fit_transform(X)
(The categorical_features argument was removed from OneHotEncoder; ColumnTransformer now selects the column to encode.)

Polynomial Regression

from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)
This models the relation between X (independent) and y (dependent) not only linearly but also through the higher-order terms X^2, ..., X^n, where n is the degree we specify (see the sketch below).
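A minimal sketch of using the expanded features, assuming X and y are array-like; the polynomial terms are simply fed to an ordinary linear regression:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)               # adds squared and interaction terms
lin_reg = LinearRegression().fit(X_poly, y)
y_pred = lin_reg.predict(poly_reg.transform(X))  # transform new data the same way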

Evaluation of Regression Model Performance

R² = 1 - SS_res / SS_total
SS_res = Σ(y_i - ŷ_i)²
SS_total = Σ(y_i - ȳ)²
from sklearn.metrics import r2_score
r2_score(y_true, y_pred)
The greater the R² value, the better the model fits.
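A tiny worked example with illustrative numbers: for y_true = [1, 2, 3] and y_pred = [1, 2, 4], SS_res = 1 and SS_total = 2, so R² = 1 - 1/2 = 0.5:

from sklearn.metrics import r2_score
print(r2_score([1, 2, 3], [1, 2, 4]))  # 0.5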

Converting Dataframe to Matrix

data = pd.read_csv("data.csv")
X = data.iloc[:, :-1].values
y = data.iloc[:, 3].values
y is the dependent variable (here, column 3)

Feature Scaling

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Euclidean distance is dominated by features with larger magnitudes, so scaling puts all values on the same scale. Models that are not distance-based (e.g., trees) are largely insensitive to scaling, but distance-based ones (e.g., k-NN, SVM) need it.
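A common pattern is to bundle the scaler and the model in a Pipeline so the scaler is fitted on the training data only; a minimal sketch (the SVC estimator is an illustrative choice):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)         # scaler statistics come from the training data only
print(model.score(X_test, y_test))  # test rows are scaled with those same statistics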

SVR (Non-linear Regression Model)

from sklearn.svm import SVR
regressor = SVR(kernel='rbf')
regressor.fit(X, y)
y_prediction = regressor.predict(values)
The kernel is chosen based on the problem: if the relationship is linear, use kernel='linear'; if it is non-linear, choose 'poly' or 'rbf' (Gaussian).
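Note that predict expects a 2-D array, even for a single sample; a minimal sketch (6.5 is an illustrative input value for a one-feature model):

import numpy as np
y_prediction = regressor.predict(np.array([[6.5]]))  # shape (1, n_features)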

Some Classification Models

Logistic Regression
K-NN (k-nearest neighbours)
Support Vector Machine (SVM)
Naive Bayes
Decision Tree Classification
Random Forest Classification
All of these share the same fit/predict interface (see the sketch below).
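A minimal sketch with logistic regression (the synthetic dataset and split are illustrative; the other models plug into the same two lines):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out set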

Some Clustering Models

K-Means Clustering
Hierarchical Clustering
DBSCAN
These follow a similar interface via fit_predict (see the sketch below).
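A minimal k-means sketch (the blob dataset and n_clusters=3 are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # cluster index for every sample
print(kmeans.cluster_centers_)   # coordinates of the three centroids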

Knowing about Data Information with Pandas

df.head(n)
First n rows of the DataFrame
df.tail(n)
Last n rows of the DataFrame
df.shape
Number of rows and columns
df.info()
Index, datatype and memory information
df.describe()
Summary statistics for numerical columns
 
