
Python - Linear Regression Model Cheat Sheet (DRAFT)

Linear regression model in Python

This is a draft cheat sheet. It is a work in progress and is not finished yet.

TO START

# IMPORT DATA LIBRARIES
import pandas as pd
import numpy as np

# IMPORT VIS LIBRARIES
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# IMPORT MODELLING LIBRARIES
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from sklearn import metrics

PRELIMINARY OPERATIONS

df = pd.read_csv('data.csv')
read data
df.head()
check head df
df.info()
check info df
df.describe()
check stats df
df.columns
check col names

VISUALISE DATA

sns.pairplot(df)
pairplot
sns.distplot(df['Y'])
distribution plot (see deprecation note below)
sns.heatmap(df.corr(), annot=True)
heatmap with values
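Note: sns.distplot is deprecated in seaborn 0.11+. A minimal equivalent with the current API, assuming the same df and a numeric column 'Y':
sns.histplot(df['Y'], kde=True)
histogram with KDE overlay, the modern replacement for distplot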
 

TRAIN MODEL

CREATE X and y ------------------
X = df[['col1', 'col2', ...]]
create df features
y = df['col']
create df var to predict
SPLIT DATASET ------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
split df in train and test df
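A minimal sketch of a reproducible split, assuming the same X and y as above; random_state pins the shuffle seed so the split is identical on every run (the value 101 is an arbitrary choice):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)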
FIT THE MODEL ------------------
lm = LinearRegression()
instantiate model
lm.fit(X_train, y_train)
train/fit the model
SHOW RESULTS ------------------
lm.intercept_
show intercept
lm.coef_
show coeffi­cients
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coeff'])*
create coeff df
pd.DataFrame: pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False). data = values, index = row labels, columns = column labels. This is useful for interpreting the coefficients of the regression.
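A small follow-up sketch, assuming coeff_df as created above: each coefficient is the estimated change in y for a one-unit increase in that feature, holding the others fixed.
coeff_df.sort_values('Coeff', ascending=False)
rank coefficients (only comparable across features on a similar scale)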

MAKE PREDICTIONS

predictions = lm.predict(X_test)
create predictions
plt.scatter(y_test, predictions)*
plot predictions
sns.distplot((y_test - predictions), bins=50)*
distplot of residuals
scatter: this graph shows the difference between the actual values and the values predicted by the trained model. It should resemble a diagonal line as closely as possible.
distplot: this graph shows the distribution of the residual errors, that is, the actual values minus the predicted values; it should be as close to a normal distribution as possible. If not, consider changing the model!
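A minimal sketch of the scatter check with an explicit reference diagonal (the red dashed line marks predicted == actual; axis labels are added here for clarity):
plt.scatter(y_test, predictions)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # ideal fit line
plt.xlabel('Actual y')
plt.ylabel('Predicted y')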

EVALUATION METRICS

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
MAE is the easiest to understand, because it's the average error.
MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.
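Written out with plain numpy, as a sketch of what the sklearn calls above compute (same y_test and predictions):
errors = y_test - predictions
MAE = np.mean(np.abs(errors))   # average absolute error
MSE = np.mean(errors**2)        # average squared error
RMSE = np.sqrt(MSE)             # back in the units of y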