Introduction
Creating a row Vector |
np.array([1, 2, 3]) |
Creating a column Vector |
np.array([[1], [2], [3]]) |
Creating a Matrix |
np.array([[1, 2], [1, 2], [1, 2]]) |
Creating a Sparse Matrix |
from scipy import sparse |
|
sparse.csr_matrix(matrix)  # stores only the indices of the nonzero elements |
Select all elements of a vector |
vector[:] |
Select all rows and the second column |
matrix[:,1:2] |
View number of rows and columns |
matrix.shape |
View number of elements |
matrix.size |
View number of dimensions |
matrix.ndim |
Applying Operations to Elements |
add_100 = lambda i: i + 100 |
|
vectorized_add_100 = np.vectorize(add_100) |
|
vectorized_add_100(matrix) |
maximum value in an array |
np.max(matrix) |
minimum value in an array |
np.min(matrix) |
Return mean |
np.mean(matrix) |
Return variance |
np.var(matrix) |
Return standard deviation |
np.std(matrix) |
Reshaping Arrays |
matrix.reshape(2, 6) |
Transposing a Vector or Matrix |
matrix.T |
You need to transform a matrix into a one-dimensional array |
matrix.flatten() |
Return matrix rank (This corresponds to the maximal number of linearly independent columns of the matrix) |
np.linalg.matrix_rank(matrix) |
Calculating the Determinant |
np.linalg.det(matrix) |
Getting the Diagonal line of a Matrix |
matrix.diagonal(offset=1)  # offset shifts the diagonal up or down; it can be negative |
Return trace (sum of the diagonal elements) |
matrix.trace() |
Finding Eigenvalues and Eigenvectors |
eigenvalues, eigenvectors = np.linalg.eig(matrix) |
Calculating Dot Products (sum of the products of the elements of two vectors) |
np.dot(vector_a, vector_b) |
Add two matrices |
np.add(matrix_a, matrix_b) |
Subtract two matrices |
np.subtract(matrix_a, matrix_b) |
|
Alternatively, we can simply use the + and - operators |
Multiplying Matrices |
np.dot(matrix_a, matrix_b) |
|
Alternatively, in Python 3.5+ we can use the @ operator |
Multiply two matrices element-wise |
matrix_a * matrix_b |
Inverting a Matrix |
np.linalg.inv(matrix) |
Set seed for random value generation |
np.random.seed(0) |
Generate three random floats between 0.0 and 1.0 |
np.random.random(3) |
Generate three random integers between 0 and 10 |
np.random.randint(0, 11, 3) |
Draw three numbers from a normal distribution with mean 0.0 and standard deviation of 1.0 |
np.random.normal(0.0, 1.0, 3) |
Draw three numbers from a logistic distribution with mean 0.0 and scale of 1.0 |
np.random.logistic(0.0, 1.0, 3) |
Draw three numbers greater than or equal to 1.0 and less than 2.0 |
np.random.uniform(1.0, 2.0, 3) |
We select elements from matrices and vectors much like we do in R.
# Find the maximum element in each column
np.max(matrix, axis=0)  # -> array([7, 8, 9])
One useful argument in reshape is -1, which effectively means “as many as needed,” so reshape(1, -1) means one row and as many columns as needed:
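A minimal sketch tying these notes together, assuming the small 3x3 example matrix used above:

import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

np.max(matrix, axis=0)   # column-wise maxima -> array([7, 8, 9])
np.max(matrix, axis=1)   # row-wise maxima -> array([3, 6, 9])
matrix.reshape(1, -1)    # one row, as many columns as needed -> shape (1, 9)
matrix.reshape(-1)       # -1 alone flattens to one dimension -> shape (9,)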
Clustering
Clustering Using K-Means |
Load libraries |
from sklearn.cluster import KMeans |
Create k-mean object |
cluster = KMeans(n_clusters=3, random_state=0, n_jobs=-1) |
Train model |
model = cluster.fit(features_std) |
Predict observation's cluster |
model.predict(new_observation) |
View predicted classes |
model.labels_ |
Speeding Up K-Means Clustering |
Load libraries |
from sklearn.cluster import MiniBatchKMeans |
Create k-mean object |
cluster = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100) |
Train model |
model = cluster.fit(features_std) |
Clustering Using Meanshift |
group observations without assuming the number of clusters or their shape |
Load libraries |
from sklearn.cluster import MeanShift |
Create meanshift object |
cluster = MeanShift(n_jobs=-1) |
Train model |
model = cluster.fit(features_std) |
Note on mean shift |
With cluster_all=False, orphan observations are given the label -1 |
Clustering Using DBSCAN |
group observations into clusters of high density |
Load libraries |
from sklearn.cluster import DBSCAN |
Create DBSCAN object |
cluster = DBSCAN(n_jobs=-1) |
Train model |
model = cluster.fit(features_std) |
DBSCAN has three main parameters to set (see the sketch after this list): |
eps |
The maximum distance from an observation for another observation to be considered its neighbor. |
min_samples |
The minimum number of observations less than eps distance from an observation for it to be considered a core observation. |
metric |
The distance metric used by eps—for example, minkowski or euclidean |
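A hedged sketch putting the three parameters together; the eps and min_samples values and the simulated data are illustrative, not from the source:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

features, _ = make_blobs(n_samples=300, centers=3, random_state=1)
features_std = StandardScaler().fit_transform(features)

# eps, min_samples, and metric chosen for illustration only
dbscan = DBSCAN(eps=0.5, min_samples=5, metric="euclidean")
model = dbscan.fit(features_std)
model.labels_   # observations labeled -1 are treated as noise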
Clustering Using Hierarchical Merging |
Load libraries |
from sklearn.cluster import AgglomerativeClustering |
Create agglomerative clustering object |
cluster = AgglomerativeClustering(n_clusters=3) |
Train model |
model = cluster.fit(features_std) |
AgglomerativeClustering uses the linkage parameter to determine the merging strategy to minimize the following: |
Variance of merged clusters (ward) |
|
Average distance between observations from pairs of clusters (average) |
|
Maximum distance between observations from pairs of clusters (complete) |
MiniBatchKMeans works similarly to KMeans, with one significant difference: the batch_size parameter. batch_size controls the number of randomly selected observations in each batch.
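A small sketch of the two estimators side by side on simulated data; the batch_size value is illustrative:

from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

features, _ = make_blobs(n_samples=1000, centers=3, random_state=0)
features_std = StandardScaler().fit_transform(features)

# Same number of clusters; MiniBatchKMeans updates its centroids on random batches of 100 observations
kmeans = KMeans(n_clusters=3, random_state=0).fit(features_std)
minibatch = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100).fit(features_std)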
Handling Categorical Data
Encoding Nominal Categorical Features |
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer |
Create one-hot encoder |
one_hot = LabelBinarizer() |
One-hot encode feature |
one_hot.fit_transform(feature) |
View feature classes |
one_hot.classes_ |
reverse the one-hot encoding |
one_hot.inverse_transform(one_hot.transform(feature)) |
Create dummy variables from feature |
pd.get_dummies(feature[:,0]) |
Create multiclass one-hot encoder |
one_hot_multiclass = MultiLabelBinarizer() |
One-hot encode multiclass feature |
one_hot_multiclass.fit_transform(multiclass_feature) |
see the classes with the classes_ attribute |
one_hot_multiclass.classes_ |
Encoding Ordinal Categorical Features |
dataframe["Score"].replace(dic with categoricals as keys and numbers as values) |
Encoding Dictionaries of Features |
from sklearn.feature_extraction import DictVectorizer |
Create dictionary |
data_dict = [{"Red": 2, "Blue": 4}, {"Red": 4, "Blue": 3}, {"Red": 1, "Yellow": 2}, {"Red": 2, "Yellow": 2}] |
Create dictionary vectorizer |
dictvectorizer = DictVectorizer(sparse=False) |
Convert dictionary to feature matrix |
features = dictvectorizer.fit_transform(data_dict) |
Get feature names |
feature_names = dictvectorizer.get_feature_names() |
Imputing Missing Class Values |
from sklearn.neighbors import KNeighborsClassifier |
# Train KNN learner |
clf = KNeighborsClassifier(3, weights='distance') |
|
trained_model = clf.fit(X[:,1:], X[:,0]) |
Predict missing values' class |
imputed_values = trained_model.predict(X_with_nan[:,1:]) |
Join column of predicted class with their other features |
X_with_imputed = np.hstack((imputed_values.reshape(-1,1), X_with_nan[:,1:])) |
Join two feature matrices |
np.vstack((X_with_imputed, X)) |
Use imputer to fill in the most frequent value |
imputer = Imputer(strategy='most_frequent', axis=0) |
Handling Imbalanced Classes |
RandomForestClassifier(class_weight="balanced") |
downsample the majority class |
i_class0 = np.where(target == 0)[0] |
|
i_class1 = np.where(target == 1)[0] |
Number of observations in each class |
n_class0 = len(i_class0) |
|
n_class1 = len(i_class1) |
For every observation of class 0, randomly sample from class 1 without replacement |
i_class1_downsampled = np.random.choice(i_class1, size=n_class0, replace=False) |
Join together class 0's target vector with the downsampled class 1's target vector |
np.hstack((target[i_class0], target[i_class1_downsampled])) |
Join together class 0's feature matrix with the downsampled class 1's feature matrix |
np.vstack((features[i_class0,:], features[i_class1_downsampled,:]))[0:5] |
upsample the minority class |
i_class0_upsampled = np.random.choice(i_class0, size=n_class1, replace=True) |
Join together class 0's upsampled target vector with class 1's target vector |
np.concatenate((target[i_class0_upsampled], target[i_class1])) |
Join together class 0's upsampled feature matrix with class 1's feature matrix |
np.vstack((features[i_class0_upsampled,:], features[i_class1,:]))[0:5] |
A second strategy is to use a model evaluation metric better suited to imbalanced classes. Accuracy is often used as a metric for evaluating the performance of a model, but when imbalanced classes are present accuracy can be ill suited. Some better metrics we discuss in later chapters are confusion matrices, precision, recall, F1 scores, and ROC curves
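A hedged sketch of computing some of those metrics; target_test, target_predicted, and target_probabilities are assumed placeholders:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

# target_test, target_predicted, and target_probabilities are assumed to exist
confusion_matrix(target_test, target_predicted)
precision_score(target_test, target_predicted)
recall_score(target_test, target_predicted)
f1_score(target_test, target_predicted)
roc_auc_score(target_test, target_probabilities)   # probabilities for the positive class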
Dimensionality Reduction Using Feature Extraction
Reducing Features Using Principal Components |
from sklearn.decomposition import PCA |
|
from sklearn.preprocessing import StandardScaler |
Standardize the feature matrix |
features = StandardScaler().fit_transform(digits.data) |
Create a PCA that will retain 99% of variance |
pca = PCA(n_components=0.99, whiten=True) |
Conduct PCA |
features_pca = pca.fit_transform(features) |
Reducing Features When Data Is Linearly Inseparable |
Use an extension of principal component analysis that uses kernels to allow for non-linear dimensionality reduction |
|
from sklearn.decomposition import PCA, KernelPCA |
Apply kernel PCA with a radial basis function (RBF) kernel |
kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1) |
|
features_kpca = kpca.fit_transform(features) |
Reducing Features by Maximizing Class Separability |
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis |
Create and run an LDA, then use it to transform the features |
lda = LinearDiscriminantAnalysis(n_components=1) |
|
features_lda = lda.fit(features, target).transform(features) |
amount of variance explained by each component |
lda.explained_variance_ratio_ |
non-negative matrix factorization (NMF) to reduce the dimensionality of the feature matrix |
from sklearn.decomposition import NMF |
Create, fit, and apply NMF |
nmf = NMF(n_components=10, random_state=1) |
|
features_nmf = nmf.fit_transform(features) |
Reducing Features on Sparse Data (Truncated Singular Value Decomposition (TSVD)) |
from sklearn.decomposition import TruncatedSVD |
|
from scipy.sparse import csr_matrix |
Standardize feature matrix |
features = StandardScaler().fit_transform(digits.data) |
# Make sparse matrix |
features_sparse = csr_matrix(features) |
Create a TSVD |
tsvd = TruncatedSVD(n_components=10) |
Conduct TSVD on sparse matrix |
features_sparse_tsvd = tsvd.fit(features_sparse).transform(features_sparse) |
Sum of first three components' explained variance ratios |
tsvd.explained_variance_ratio_[0:3].sum() |
One major requirement of NMF is that, as the name implies, the feature matrix cannot contain negative values.
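A hedged illustration of working around that requirement: if the features contain negatives (e.g., after standardization), one option is to rescale them to a non-negative range first; MinMaxScaler is my choice here, not the source's:

from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

# Standardized features contain negatives, so NMF would raise an error;
# rescaling to [0, 1] is one (illustrative) way to make them usable
features_non_negative = MinMaxScaler().fit_transform(features)   # features assumed to exist
nmf = NMF(n_components=10, random_state=1)
features_nmf = nmf.fit_transform(features_non_negative)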
Trees and Forests
Training a Decision Tree Classifier |
from sklearn.tree import DecisionTreeClassifier |
Create decision tree classifier object |
decisiontree = DecisionTreeClassifier(random_state=0) |
Train model |
model = decisiontree.fit(features, target) |
Predict observation's class |
model.predict(observation) |
Training a Decision Tree Regressor |
from sklearn.tree import DecisionTreeRegressor |
Create decision tree regressor object |
decisiontree = DecisionTreeRegressor(random_state=0) |
Train model |
model = decisiontree.fit(features, target) |
Create decision tree regressor object using MAE |
decisiontree_mae = DecisionTreeRegressor(criterion="mae", random_state=0) |
Visualizing a Decision Tree Model |
from IPython.display import Image |
|
import pydotplus |
|
from sklearn import tree |
Create DOT data |
dot_data = tree.export_graphviz(decisiontree, out_file=None, feature_names=iris.feature_names, class_names=iris.target_names) |
Draw graph |
graph = pydotplus.graph_from_dot_data(dot_data) |
Show graph |
Image(graph.create_png()) |
Create PDF |
graph.write_pdf("iris.pdf") |
Create PNG |
graph.write_png("iris.png") |
Training a Random Forest Classifier |
from sklearn.ensemble import RandomForestClassifier |
Create random forest classifier object |
randomforest = RandomForestClassifier(random_state=0, n_jobs=-1) |
Create random forest classifier object using entropy |
randomforest_entropy = RandomForestClassifier( criterion="entropy", random_state=0) |
Training a Random Forest Regressor |
from sklearn.ensemble import RandomForestRegressor |
Create random forest regressor object |
randomforest = RandomForestRegressor(random_state=0, n_jobs=-1) |
Identifying Important Features in Random Forests |
from sklearn.ensemble import RandomForestClassifier |
Create random forest classifier object |
randomforest = RandomForestClassifier(random_state=0, n_jobs=-1) |
Calculate feature importances |
importances = model.feature_importances_ |
Sort feature importances in descending order |
indices = np.argsort(importances)[::-1] |
Rearrange feature names so they match the sorted feature importances |
names = [iris.feature_names[i] for i in indices] |
Create plot |
plt.figure() |
Create plot title |
plt.title("Feature Importance") |
Add bars |
plt.bar(range(features.shape[1]), importances[indices]) |
Add feature names as x-axis labels |
plt.xticks(range(features.shape[1]), names, rotation=90) |
Show plot |
plt.show() |
Selecting Important Features in Random Forests |
from sklearn.feature_selection import SelectFromModel |
Create random forest classifier |
randomforest = RandomForestClassifier(random_state=0, n_jobs=-1) |
Create object that selects features with importance greater than or equal to a threshold |
selector = SelectFromModel(randomforest, threshold=0.3) |
Create new feature matrix using selector |
features_important = selector.fit_transform(features, target) |
Train random forest using most important features |
model = randomforest.fit(features_important, target) |
Handling Imbalanced Classes |
Train a decision tree or random forest model with class_weight="balanced" |
Create random forest classifier object |
randomforest = RandomForestClassifier( random_state=0, n_jobs=-1, class_weight="balanced") |
Controlling Tree Size |
Create decision tree classifier object |
decisiontree = DecisionTreeClassifier(random_state=0, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0, max_leaf_nodes=None, min_impurity_decrease=0) |
Improving Performance Through Boosting |
from sklearn.ensemble import AdaBoostClassifier |
Create adaboost tree classifier object |
adaboost = AdaBoostClassifier(random_state=0) |
Evaluating Random Forests with Out-of-Bag Errors |
You need to evaluate a random forest model without using cross-validation |
Create random forest classifier object |
randomforest = RandomForestClassifier( random_state=0, n_estimators=1000, oob_score=True, n_jobs=-1) |
OOB score of a random forest |
randomforest.oob_score_ |
Linear Regression
Fitting a Line |
Load libraries |
from sklearn.linear_model import LinearRegression |
Create linear regression |
regression = LinearRegression() |
Fit the linear regression |
model = regression.fit(features, target) |
Handling Interactive Effects |
You have a feature whose effect on the target variable depends on another feature. |
Load libraries |
from sklearn.preprocessing import PolynomialFeatures |
Create interaction term |
interaction = PolynomialFeatures( degree=3, include_bias=False, interaction_only=True) |
|
features_interaction = interaction.fit_transform(features) |
Create linear regression |
regression = LinearRegression() |
Fit the linear regression |
model = regression.fit(features_interaction, target) |
Fitting a Nonlinear Relationship |
Create a polynomial regression by including polynomial features in a linear regression model |
Load library |
from sklearn.preprocessing import PolynomialFeatures |
Create polynomial features x^2 and x^3 |
polynomial = PolynomialFeatures(degree=3, include_bias=False) |
|
features_polynomial = polynomial.fit_transform(features) |
Create linear regression |
regression = LinearRegression() |
Fit the linear regression |
model = regression.fit(features_polynomial, target) |
Reducing Variance with Regularization |
Use a learning algorithm that includes a shrinkage penalty (also called regularization) like ridge regression and lasso regression: |
Load libraries |
from sklearn.linear_model import Ridge |
Create ridge regression with an alpha value |
regression = Ridge(alpha=0.5) |
Fit the linear regression |
model = regression.fit(features_standardized, target) |
Load library |
from sklearn.linear_model import RidgeCV |
Create ridge regression with three alpha values |
regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0]) |
Fit the linear regression |
model_cv = regr_cv.fit(features_standardized, target) |
View coefficients |
model_cv.coef_ |
View alpha |
model_cv.alpha_ |
Reducing Features with Lasso Regression |
You want to simplify your linear regression model by reducing the number of features. |
Load library |
from sklearn.linear_model import Lasso |
Create lasso regression with alpha value |
regression = Lasso(alpha=0.5) |
Fit the linear regression |
model = regression.fit(features_standardized, target) |
Create lasso regression with a high alpha |
regression_a10 = Lasso(alpha=10) |
|
model_a10 = regression_a10.fit(features_standardized, target) |
interaction_only=True tells PolynomialFeatures to only return interaction terms
By default, PolynomialFeatures adds a feature of ones called the bias. We can prevent that with include_bias=False
Polynomial regression is an extension of linear regression to allow us to model nonlinear relationships.
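A small sketch of what those arguments return for a single observation with two features (values are illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])   # one observation, two features x1 and x2

# degree=2, no bias column: x1, x2, x1^2, x1*x2, x2^2
PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
# -> [[2., 3., 4., 6., 9.]]

# interaction_only=True keeps only x1, x2, and the interaction x1*x2
PolynomialFeatures(degree=2, include_bias=False, interaction_only=True).fit_transform(X)
# -> [[2., 3., 6.]]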
|
|
Loading Data
Loading a Sample Dataset |
from sklearn import datasets |
|
digits = datasets.load_digits() |
|
features = digits.data |
|
target = digits.target |
Creating a Simulated Dataset for regression |
from sklearn.datasets import make_regression |
|
features, target, coefficients = make_regression(n_samples = 100, n_features = 3, n_informative = 3, n_targets = 1, noise = 0.0, coef = True, random_state = 1) |
Creating a Simulated Dataset for classification |
from sklearn.datasets import make_classification |
|
features, target = make_classification(n_samples = 100, n_features = 3, n_informative = 3, n_redundant = 0, n_classes = 2, weights = [.25, .75], random_state = 1) |
Creating a Simulated Dataset for clustering |
from sklearn.datasets import make_blobs |
|
features, target = make_blobs(n_samples = 100, n_features = 2, centers = 3, cluster_std = 0.5, shuffle = True, random_state = 1) |
Loading a CSV File |
dataframe = pd.read_csv(data, sep=',') |
Loading an Excel File |
pd.read_excel(url, sheetname=0, header=1) |
|
If we need to load multiple sheets, include them as a list. |
Loading a JSON File |
pd.read_json(url, orient='columns') |
|
The key difference is the orient parameter, which indicates to pandas how the JSON file is structured. However, it might take some experimenting to figure out which argument (split, records, index, columns, and values) is the right one. |
convert semistructured JSON data into a pandas DataFrame |
json_normalize |
Querying a SQL Database |
from sqlalchemy import create_engine |
|
database_connection = create_engine('sqlite:///sample.db') |
|
pd.read_sql_query('SELECT * FROM data', database_connection) |
In addition, make_classification contains a weights parameter that allows us to simulate datasets with imbalanced classes. For example, weights = [.25,.75]
For make_blobs, the centers parameter determines the number of clusters generated.
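A brief sketch checking both arguments on simulated data; the inspection calls are mine, not the source's:

import numpy as np
from sklearn.datasets import make_classification, make_blobs

# weights=[.25, .75] -> roughly 25% of observations in class 0 and 75% in class 1
features, target = make_classification(n_samples=100, n_features=3, n_informative=3,
                                        n_redundant=0, weights=[.25, .75], random_state=1)
np.bincount(target)   # class counts, roughly array([25, 75])

# centers=3 -> three clusters, so target takes the values 0, 1, 2
features, target = make_blobs(n_samples=100, n_features=2, centers=3, random_state=1)
np.unique(target)     # array([0, 1, 2])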
Naive Bayes
Training a Classifier for Continuous Features |
Use a Gaussian naive Bayes classifier |
Load libraries |
from sklearn.naive_bayes import GaussianNB |
Create Gaussian Naive Bayes object |
classifer = GaussianNB() |
Train model |
model = classifer.fit(features, target) |
Create Gaussian Naive Bayes object with prior probabilities of each class |
clf = GaussianNB(priors=[0.25, 0.25, 0.5]) |
Training a Classifier for Discrete and Count Features |
Given discrete or count data |
Load libraries |
from sklearn.naive_bayes import MultinomialNB |
|
from sklearn.feature_extraction.text import CountVectorizer |
Create bag of words |
count = CountVectorizer() |
|
bag_of_words = count.fit_transform(text_data) |
Create feature matrix |
features = bag_of_words.toarray() |
Create multinomial naive Bayes object with prior probabilities of each class |
classifer = MultinomialNB(class_prior=[0.25, 0.5]) |
Training a Naive Bayes Classifier for Binary Features |
Load libraries |
from sklearn.naive_bayes import BernoulliNB |
Create Bernoulli Naive Bayes object with prior probabilities of each class |
classifer = BernoulliNB(class_prior=[0.25, 0.5]) |
Calibrating Predicted Probabilities |
You want to calibrate the predicted probabilities from naive Bayes classifiers so they are interpretable. |
Load libraries |
from sklearn.calibration import CalibratedClassifierCV |
Create calibrated cross-validation with sigmoid calibration |
classifer_sigmoid = CalibratedClassifierCV(classifer, cv=2, method='sigmoid') |
Calibrate probabilities |
classifer_sigmoid.fit(features, target) |
View calibrated probabilities |
classifer_sigmoid.predict_proba(new_observation) |
If class_prior is not specified, prior probabilities are learned using the data. However, if we want a uniform distribution to be used as the prior, we can set fit_prior=False.
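A minimal sketch of the two options; the prior values are illustrative:

from sklearn.naive_bayes import MultinomialNB

# Explicit priors for a two-class problem
clf_prior = MultinomialNB(class_prior=[0.25, 0.75])

# Uniform prior instead of priors learned from the data
clf_uniform = MultinomialNB(class_prior=None, fit_prior=False)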
Logistic Regression
Training a Binary Classifier |
from sklearn.linear_model import LogisticRegression |
|
from sklearn.preprocessing import StandardScaler |
Create logistic regression object |
logistic_regression = LogisticRegression(random_state=0) |
View predicted probabilities |
model.predict_proba(new_observation) |
Training a Multiclass Classifier |
Create one-vs-rest logistic regression object |
logistic_regression = LogisticRegression(random_state=0, multi_class="ovr") |
Reducing Variance Through Regularization |
Tune the regularization strength hyperparameter, C |
Create logistic regression cross-validation object |
logistic_regression = LogisticRegressionCV( penalty='l2', Cs=10, random_state=0, n_jobs=-1) |
Training a Classifier on Very Large Data |
Create logistic regression object |
logistic_regression = LogisticRegression(random_state=0, solver="sag") |
Handling Imbalanced Classes |
Create target vector indicating if class 0, otherwise 1 |
target = np.where((target == 0), 0, 1) |
Create logistic regression object |
logistic_regression = LogisticRegression(random_state=0, class_weight="balanced") |
K-Nearest Neighbors
Finding an Observation’s Nearest Neighbors |
from sklearn.neighbors import NearestNeighbors |
Create standardizer |
standardizer = StandardScaler() |
Standardize features |
features_standardized = standardizer.fit_transform(features) |
Two nearest neighbors |
nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized) |
Create an observation |
new_observation = [ 1, 1, 1, 1] |
Find distances and indices of the observation's nearest neighbors |
distances, indices = nearest_neighbors.kneighbors([new_observation]) |
View the nearest neighbors |
features_standardized[indices] |
Find two nearest neighbors based on euclidean distance |
nearestneighbors_euclidean = NearestNeighbors( n_neighbors=2, metric='euclidean').fit(features_standardized) |
create a matrix indicating each observation’s nearest neighbors |
Find each observation's three nearest neighbors based on euclidean distance (including itself) |
nearestneighbors_euclidean = NearestNeighbors( n_neighbors=3, metric="euclidean").fit(features_standardized) |
List of lists indicating each observation's 3 nearest neighbors |
nearest_neighbors_with_self = nearestneighbors_euclidean.kneighbors_graph( features_standardized).toarray() |
Remove 1's marking an observation is a nearest neighbor to itself |
for i, x in enumerate(nearest_neighbors_with_self): |
|
x[i] = 0 |
View first observation's two nearest neighbors |
nearest_neighbors_with_self[0] |
Creating a K-Nearest Neighbor Classifier |
Train a KNN classifier with 5 neighbors |
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y) |
Identifying the Best Neighborhood Size |
Load libraries |
from sklearn.pipeline import Pipeline, FeatureUnion |
|
from sklearn.model_selection import GridSearchCV |
Create a pipeline |
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)]) |
Create space of candidate values |
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}] |
Create grid search |
classifier = GridSearchCV( pipe, search_space, cv=5, verbose=0).fit(features_standardized, target) |
Best neighborhood size (k) |
classifier.best_estimator_.get_params()["knn__n_neighbors"] |
Creating a Radius-Based Nearest Neighbor Classifier |
from sklearn.neighbors import RadiusNeighborsClassifier |
Train a radius neighbors classifier |
rnn = RadiusNeighborsClassifier( radius=.5, n_jobs=-1).fit(features_standardized, target) |
Model Selection
Selecting Best Models Using Exhaustive Search |
from sklearn.model_selection import GridSearchCV |
Create range of candidate penalty hyperparameter values |
penalty = ['l1', 'l2'] |
Create range of candidate regularization hyperparameter values |
C = np.logspace(0, 4, 10) |
|
numpy.logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None, axis=0) |
Create dictionary hyperparameter candidates |
hyperparameters = dict(C=C, penalty=penalty) |
Create grid search |
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, verbose=0) |
Fit grid search |
best_model = gridsearch.fit(features, target) |
Predict target vector |
best_model.predict(features) |
Selecting Best Models Using Randomized Search |
Load libraries |
from sklearn.model_selection import RandomizedSearchCV |
Create range of candidate regularization penalty hyperparameter values |
penalty = ['l1', 'l2'] |
Create distribution of candidate regularization hyperparameter values |
from scipy.stats import uniform |
|
C = uniform(loc=0, scale=4) |
Create hyperparameter options |
hyperparameters = dict(C=C, penalty=penalty) |
Create randomized search |
randomizedsearch = RandomizedSearchCV( logistic, hyperparameters, random_state=1, n_iter=100, cv=5, verbose=0, n_jobs=-1) |
Fit randomized search |
best_model = randomizedsearch.fit(features, target) |
Predict target vector |
best_model.predict(features) |
Selecting Best Models from Multiple Learning Algorithms |
Load libraries |
from sklearn.model_selection import GridSearchCV |
|
from sklearn.pipeline import Pipeline |
Create a pipeline |
pipe = Pipeline([("classifier", RandomForestClassifier())]) |
Create dictionary with candidate learning algorithms and their hyperparameters |
search_space = [{"classifier": [LogisticRegression()], "classifier__penalty": ['l1', 'l2'], "classifier__C": np.logspace(0, 4, 10)}, {"classifier": [RandomForestClassifier()], "classifier__n_estimators": [10, 100, 1000], "classifier__max_features": [1, 2, 3]}] |
Create grid search |
gridsearch = GridSearchCV(pipe, search_space, cv=5, verbose=0) |
Fit grid search |
best_model = gridsearch.fit(features, target) |
View best model |
best_model.best_estimator_.get_params()["classifier"] |
Predict target vector |
best_model.predict(features) |
Selecting Best Models When Preprocessing |
Load libraries |
from sklearn.pipeline import Pipeline, FeatureUnion |
Create a preprocessing object that includes StandardScaler features and PCA |
preprocess = FeatureUnion([("std", StandardScaler()), ("pca", PCA())]) |
Create a pipeline |
pipe = Pipeline([("preprocess", preprocess), ("classifier", LogisticRegression())]) |
Create space of candidate values |
search_space = [{"preprocess__pca__n_components": [1, 2, 3], "classifier__penalty": ["l1", "l2"], "classifier__C": np.logspace(0, 4, 10)}] |
Create grid search |
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0, n_jobs=-1) |
Fit grid search |
best_model = clf.fit(features, target) |
Speeding Up Model Selection with Parallelization |
Use all the cores in your machine by setting n_jobs=-1 |
|
gridsearch = GridSearchCV(logistic, hyperparameters, cv=5, n_jobs=-1, verbose=1) |
Speeding Up Model Selection Using Algorithm-Specific Methods |
If you are using a select number of learning algorithms, use scikit-learn's model-specific cross-validation hyperparameter tuning. |
Create cross-validated logistic regression |
logit = linear_model.LogisticRegressionCV(Cs=100) |
Train model |
logit.fit(features, target) |
Evaluating Performance After Model Selection |
Load libraries |
from sklearn.model_selection import GridSearchCV, cross_val_score |
Conduct nested cross-validation and output the average score |
cross_val_score(gridsearch, features, target).mean() |
In scikit-learn, many learning algorithms (e.g., ridge, lasso, and elastic net regression) have an algorithm-specific cross-validation method to take advantage of this.
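A hedged sketch with LassoCV and ElasticNetCV; the alpha grid is illustrative, and features/target are assumed to exist:

from sklearn.linear_model import LassoCV, ElasticNetCV

# Lasso with its built-in cross-validation over candidate alphas
lasso_cv = LassoCV(alphas=[0.1, 1.0, 10.0]).fit(features, target)
lasso_cv.alpha_   # best regularization strength found

# Elastic net equivalent
enet_cv = ElasticNetCV(alphas=[0.1, 1.0, 10.0]).fit(features, target)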
Handling Dates and Times
Create strings |
date_strings = np.array(['03-04-2005 11:35 PM', '23-05-2010 12:01 AM', '04-09-2009 09:09 PM']) |
Convert to datetimes |
[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p', errors="coerce") for date in date_strings] |
Handling Time Zones |
Create datetime |
pd.Timestamp('2017-05-01 06:00:00', tz='Europe/London') |
We can add a time zone to a previously created datetime |
date_in_london = date.tz_localize('Europe/London') |
convert to a different time zone |
date_in_london.tz_convert('Africa/Abidjan') |
tz_localize and tz_convert to every element |
dates.dt.tz_localize('Africa/Abidjan') |
importing all_timezones |
from pytz import all_timezones |
Create datetimes range |
dataframe['date'] = pd.date_range('1/1/2001', periods=100000, freq='H') |
Select observations between two datetimes |
dataframe[(dataframe['date'] > '2002-1-1 01:00:00') & (dataframe['date'] <= '2002-1-1 04:00:00')] |
Breaking Up Date Data into Multiple Features |
dataframe['year'] = dataframe['date'].dt.year |
|
dataframe['month'] = dataframe['date'].dt.month |
|
dataframe['day'] = dataframe['date'].dt.day |
|
dataframe['hour'] = dataframe['date'].dt.hour |
|
dataframe['minute'] = dataframe['date'].dt.minute |
Calculate duration between features |
pd.Series(delta.days for delta in (dataframe['Left'] - dataframe['Arrived'])) |
Show days of the week |
dates.dt.weekday_name |
Show days of the week as numbers (Monday is 0) |
dates.dt.weekday |
Creating a Lagged Feature (Lagged values by one row) |
dataframe["previous_days_stock_price"] = dataframe["stock_price"].shift(1) |
Calculate rolling mean or moving average |
dataframe.rolling(window=2).mean() |
Handling Missing Data in Time Series |
Interpolate missing values |
dataframe.interpolate() |
replace missing values with the last known value (i.e., forward-filling) |
dataframe.ffill() |
Replace missing values with the next known value (i.e., backfilling) |
dataframe.bfill() |
If we believe the line between the two known points is nonlinear |
dataframe.interpolate(method="quadratic") |
Interpolate missing values |
dataframe.interpolate(limit=1, limit_direction="forward") |
Handling Numerical Data
Min Max scaler |
from sklearn import preprocessing |
Create scaler |
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1)) |
Scale feature |
minmax_scale.fit_transform(feature) |
Standardizing a Feature |
from sklearn import preprocessing |
Create scaler |
scaler = preprocessing.StandardScaler() |
Transform the feature |
standardized = scaler.fit_transform(x) |
Normalizing Observations (rescaling each observation to have unit norm) |
from sklearn.preprocessing import Normalizer |
Create normalizer |
normalizer = Normalizer(norm="l2") |
Transform feature matrix |
normalizer.transform(features) |
|
This type of rescaling is often used when we have many equivalent features (e.g., text classification) |
Generating Polynomial and Interaction Features |
from sklearn.preprocessing import PolynomialFeatures |
Create PolynomialFeatures object |
polynomial_interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False) |
Create polynomial features |
polynomial_interaction.fit_transform(features) |
Transforming Features |
from sklearn.preprocessing import FunctionTransformer |
|
FunctionTransformer applies a function to a set of features, much like pandas' apply |
Detecting Outliers |
from sklearn.covariance import EllipticEnvelope |
Create detector |
outlier_detector = EllipticEnvelope(contamination=.1) |
Fit detector |
outlier_detector.fit(features) |
Predict outliers |
outlier_detector.predict(features) |
IQR for outlier detection |
def indices_of_outliers(x): |
|
q1, q3 = np.percentile(x, [25, 75]) |
|
iqr = q3 - q1 |
|
lower_bound = q1 - (iqr * 1.5) |
|
upper_bound = q3 + (iqr * 1.5) |
|
return np.where((x > upper_bound) | (x < lower_bound)) |
Handling Outliers |
houses[houses['Bathrooms'] < 20] |
Create feature based on boolean condition to detect outliers |
houses["Outlier"] = np.where(houses["Bathrooms"] < 20, 0, 1) |
Transform the feature to dampen the effect of the outlier |
houses["Log_Of_Square_Feet"] = [np.log(x) for x in houses["Square_Feet"]] |
Standardization if we have outliers |
RobustScaler |
Discretizating Features (binning) |
from sklearn.preprocessing import Binarizer |
Create binarizer |
binarizer = Binarizer(18) |
Transform feature |
binarizer.fit_transform(age) |
break up numerical features according to multiple thresholds |
np.digitize(age, bins=[20, 30, 64], right=True)  # right=True closes each interval on the right edge instead of the left |
Grouping Observations Using Clustering |
from sklearn.cluster import KMeans |
Make k-means clusterer |
clusterer = KMeans(3, random_state=0) |
Fit clusterer |
clusterer.fit(features) |
Predict values |
dataframe["group"] = clusterer.predict(features) |
Keep only observations that are not (denoted by ~) missing |
features[~np.isnan(features).any(axis=1)] |
drop missing observations using pandas |
dataframe.dropna() |
Predict the missing values in the feature matrix |
features_knn_imputed = KNN(k=5, verbose=0).complete(standardized_features) |
Imputer module to fill in missing values |
from sklearn.preprocessing import Imputer |
Create imputer |
mean_imputer = Imputer(strategy="mean", axis=0) |
Impute values |
features_mean_imputed = mean_imputer.fit_transform(features) |
One option is to use fit to calculate the minimum and maximum values of the feature, then use transform to rescale the feature. The second option is to use fit_transform to do both operations at once. There is no mathematical difference between the two options, but there is sometimes a practical benefit to keeping the operations separate, because it allows us to apply the same transformation to different sets of the data.
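A short sketch of the "keep them separate" option applied to a train/test split; the variable names are assumed:

from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))

# Learn the minimum and maximum on the training data only...
scaler.fit(features_train)

# ...then apply that same transformation to both sets
features_train_scaled = scaler.transform(features_train)
features_test_scaled = scaler.transform(features_test)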
Deep Learning
Preprocessing Data for Neural Networks |
Load libraries |
from sklearn import preprocessing |
Create scaler |
scaler = preprocessing.StandardScaler() |
Transform the feature |
features_standardized = scaler.fit_transform(features) |
Designing a Neural Network |
Load libraries |
from keras import models |
|
from keras import layers |
Start neural network |
network = models.Sequential() |
Add fully connected layer with a ReLU activation function |
network.add(layers.Dense(units=16, activation="relu", input_shape=(10,))) |
Add fully connected layer with a ReLU activation function |
network.add(layers.Dense(units=16, activation="relu")) |
Add fully connected layer with a sigmoid activation function |
network.add(layers.Dense(units=1, activation="sigmoid")) |
Compile neural network |
network.compile(loss="binary_crossentropy", # Cross-entropy optimizer="rmsprop", # Root Mean Square Propagation metrics=["accuracy"]) # Accuracy performance metric |
Training a Binary Classifier |
Load libraries |
from keras.datasets import imdb |
|
from keras.preprocessing.text import Tokenizer |
|
from keras import models |
|
from keras import layers |
Set the number of features we want |
number_of_features = 1000 |
Start neural network |
network = models.Sequential() |
Add fully connected layer with a ReLU activation function |
network.add(layers.Dense(units=16, activation="relu", input_shape=( number_of_features,))) |
Add fully connected layer with a ReLU activation function |
network.add(layers.Dense(units=16, activation="relu")) |
Add fully connected layer with a sigmoid activation function |
network.add(layers.Dense(units=1, activation="sigmoid")) |
Compile neural network |
network.compile(loss="binary_crossentropy", # Cross-entropy optimizer="rmsprop", # Root Mean Square Propagation metrics=["accuracy"]) |
Train neural network |
history = network.fit(features_train, target_train, epochs=3, verbose=1, batch_size=100, validation_data=(features_test, target_test))  # 3 epochs, batches of 100 observations, print progress, evaluate on the test data |
|
|
Model Evaluation
Cross-Validating Models |
from sklearn.model_selection import KFold, cross_val_score |
|
from sklearn.pipeline import make_pipeline |
Create a pipeline that standardizes, then runs logistic regression |
pipeline = make_pipeline(standardizer, logit) |
Create k-Fold cross-validation |
kf = KFold(n_splits=10, shuffle=True, random_state=1) |
Conduct k-fold cross-validation |
cv_results = cross_val_score(pipeline, features, target, cv=kf, scoring="accuracy", n_jobs=-1)  # k-fold cross-validation, accuracy metric, all CPU cores |
Calculate mean |
cv_results.mean() |
View score for all 10 folds |
cv_results |
Fit standardizer to training set |
standardizer.fit(features_train) |
Apply to both training and test sets |
features_train_std = standardizer.transform(features_train) |
|
features_test_std = standardizer.transform(features_test) |
Creating a Baseline Regression Model |
from sklearn.dummy import DummyRegressor |
Create a dummy regressor |
dummy = DummyRegressor(strategy='mean') |
"Train" dummy regressor |
dummy.fit(features_train, target_train) |
Get R-squared score |
dummy.score(features_test, target_test) |
Regression |
from sklearn.linear_model import LinearRegression |
Train simple linear regression model |
ols = LinearRegression() |
|
ols.fit(features_train, target_train) |
Get R-squared score |
ols.score(features_test, target_test) |
Create dummy regressor that predicts 20's for everything |
clf = DummyRegressor(strategy='constant', constant=20) |
|
clf.fit(features_train, target_train) |
Creating a Baseline Classification Model |
from sklearn.dummy import DummyClassifier |
Create dummy classifier |
dummy = DummyClassifier(strategy='uniform', random_state=1) |
"Train" model |
dummy.fit(features_train, target_train) |
Get accuracy score |
dummy.score(features_test, target_test) |
Evaluating Binary Classifier Predictions |
from sklearn.model_selection import cross_val_score |
|
from sklearn.datasets import make_classification |
Cross-validate model using accuracy |
cross_val_score(logit, X, y, scoring="accuracy") |
Cross-validate model using precision |
cross_val_score(logit, X, y, scoring="precision") |
Cross-validate model using recall |
cross_val_score(logit, X, y, scoring="recall") |
Cross-validate model using f1 |
cross_val_score(logit, X, y, scoring="f1") |
Calculate metrics like accuracy and recall directly |
from sklearn.metrics import accuracy_score |
Calculate accuracy |
accuracy_score(y_test, y_hat) |
Evaluating Binary Classifier Thresholds |
from sklearn.metrics import roc_curve, roc_auc_score |
Get predicted probabilities |
target_probabilities = logit.predict_proba(features_test)[:,1] |
Create true and false positive rates |
false_positive_rate, true_positive_rate, threshold = roc_curve(target_test, target_probabilities) |
Plot ROC curve |
plt.title("Receiver Operating Characteristic") |
|
plt.plot(false_positive_rate, true_positive_rate) |
|
plt.plot([0, 1], ls="--") |
|
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7") |
|
plt.ylabel("True Positive Rate") |
|
plt.xlabel("False Positive Rate") |
|
plt.show() |
Evaluating Multiclass Classifier Predictions |
cross_val_score(logit, features, target, scoring='f1_macro') |
Visualizing a Classifier’s Performance |
libraries |
import matplotlib.pyplot as plt |
|
import seaborn as sns |
|
from sklearn.metrics import confusion_matrix |
Create confusion matrix |
matrix = confusion_matrix(target_test, target_predicted) |
Create pandas dataframe |
dataframe = pd.DataFrame(matrix, index=class_names, columns=class_names) |
Create heatmap |
sns.heatmap(dataframe, annot=True, cbar=None, cmap="Blues") |
|
plt.title("Confusion Matrix"), plt.tight_layout() |
|
plt.ylabel("True Class"), plt.xlabel("Predicted Class") |
|
plt.show() |
Evaluating Regression Models |
Cross-validate the linear regression using (negative) MSE |
cross_val_score(ols, features, target, scoring='neg_mean_squared_error') |
Cross-validate the linear regression using R-squared |
cross_val_score(ols, features, target, scoring='r2') |
Evaluating Clustering Models |
from sklearn.metrics import silhouette_score |
|
from sklearn.cluster import KMeans |
Cluster data using k-means to predict classes |
model = KMeans(n_clusters=2, random_state=1).fit(features) |
Get predicted classes |
target_predicted = model.labels_ |
Evaluate model |
silhouette_score(features, target_predicted) |
Creating a Custom Evaluation Metric |
from sklearn.metrics import make_scorer, r2_score |
|
from sklearn.linear_model import Ridge |
Create custom metric |
def custom_metric(target_test, target_predicted): |
|
r2 = r2_score(target_test, target_predicted) |
|
return r2 |
Make scorer and define that higher scores are better |
score = make_scorer(custom_metric, greater_is_better=True) |
Create ridge regression object |
classifier = Ridge() |
Apply custom scorer |
score(model, features_test, target_test) |
Visualizing the Effect of Training Set Size |
from sklearn.model_selection import learning_curve |
Draw lines |
plt.plot(train_sizes, train_mean, '--', color="#111111", label="Training score") |
|
plt.plot(train_sizes, test_mean, color="#111111", label="Cross-validation score") |
Draw bands |
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD") |
|
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, color="#DDDDDD") |
Create plot |
plt.title("Learning Curve") |
|
plt.xlabel("Training Set Size"), plt.ylabel("Accuracy Score"), |
|
plt.legend(loc="best") |
|
plt.tight_layout() |
|
plt.show() |
Creating a Text Report of Evaluation Metrics |
from sklearn.metrics import classification_report |
Create a classification report |
print(classification_report(target_test, target_predicted, target_names=class_names)) |
Visualizing the Effect of Hyperparameter Values |
Plot the validation curve |
from sklearn.model_selection import validation_curve |
Create range of values for parameter |
param_range = np.arange(1, 250, 2) |
Hyperparameter to examine |
param_name="n_estimators", |
Calculate accuracy on training and test set using range of parameter values |
train_scores, test_scores = validation_curve(RandomForestClassifier(), features, target, param_name="n_estimators", param_range=param_range, cv=3, scoring="accuracy", n_jobs=-1)  # accuracy over the hyperparameter range, 3 folds, all CPU cores |
Plot mean accuracy scores for training and test sets |
plt.plot(param_range, train_mean, label="Training score", color="black") |
|
plt.plot(param_range, test_mean, label="Cross-validation score", color="dimgrey") |
Plot accuracy bands for training and test sets |
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color="gray") |
|
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color="gainsboro") |
Create plot |
plt.title("Validation Curve With Random Forest") |
|
plt.xlabel("Number Of Trees") |
|
plt.ylabel("Accuracy Score") |
|
plt.tight_layout() |
|
plt.legend(loc="best") |
|
plt.show() |
Dimensionality Reduction Using Feature Selection
Thresholding Numerical Feature Variance |
from sklearn.feature_selection import VarianceThreshold |
Create thresholder |
thresholder = VarianceThreshold(threshold=.5) |
Create high variance feature matrix |
features_high_variance = thresholder.fit_transform(features) |
View variances |
thresholder.fit(features).variances_ |
features with low variance are likely less interesting (and useful) than features with high variance.
variance thresholding will not work when feature sets contain features measured in different units
If the features have been standardized (to mean zero and unit variance), then for obvious reasons variance thresholding will not work correctly
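A brief sketch of why: after standardization every feature has variance of roughly one, so a variance threshold no longer discriminates (features is assumed to exist):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

features_std = StandardScaler().fit_transform(features)

# Every standardized feature has variance ~1, so a 0.5 threshold drops nothing
VarianceThreshold(threshold=.5).fit(features_std).variances_
# -> array of values all approximately equal to 1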
Handling Text
Strip whitespace |
strip_whitespace = [string.strip() for string in text_data] |
Remove periods |
remove_periods = [string.replace(".", "") for string in strip_whitespace] |
Parsing and Cleaning HTML |
from bs4 import BeautifulSoup |
Parse html |
soup = BeautifulSoup(html, "lxml") |
Find the div with the class "full_name", show text |
soup.find("div", { "class" : "full_name" }).text |
Removing Punctuation |
import unicodedata |
|
import sys |
Create a dictionary of punctuation characters |
punctuation = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P')) |
For each string, remove any punctuation characters |
[string.translate(punctuation) for string in text_data] |
Tokenizing Text (You have text and want to break it up into individual words) |
from nltk.tokenize import word_tokenize |
Tokenize words |
word_tokenize(string) |
Tokenize sentences (the string needs sentence-ending punctuation) |
sent_tokenize(string) |
Removing Stop Words |
from nltk.corpus import stopwords |
Load stop words |
stop_words = stopwords.words('english') |
Remove stop words |
[word for word in tokenized_words if word not in stop_words] |
Stemming Words |
from nltk.stem.porter import PorterStemmer |
Create stemmer |
porter = PorterStemmer() |
Apply stemmer |
[porter.stem(word) for word in tokenized_words] |
Tagging Parts of Speech |
from nltk import pos_tag |
Filter words |
[word for word, tag in text_tagged if tag in ['NN','NNS','NNP','NNPS'] ] |
Tag each word and each tweet |
for tweet in tweets: |
|
tweet_tag = nltk.pos_tag(word_tokenize(tweet)) |
|
tagged_tweets.append([tag for word, tag in tweet_tag]) |
Use one-hot encoding to convert the tags into features |
one_hot_multi = MultiLabelBinarizer() |
|
one_hot_multi.fit_transform(tagged_tweets) |
To examine the accuracy of our tagger, we split our text data into two parts |
from nltk.corpus import brown |
looks at the word itself |
from nltk.tag import UnigramTagger |
takes into account the previous word |
from nltk.tag import BigramTagger |
takes into account the previous two words |
from nltk.tag import TrigramTagger |
Get some text from the Brown Corpus, broken into sentences |
sentences = brown.tagged_sents(categories='news') |
Split into 4000 sentences for training and 623 for testing |
train = sentences[:4000] |
|
test = sentences[4000:] |
Create backoff tagger |
unigram = UnigramTagger(train) |
|
bigram = BigramTagger(train, backoff=unigram) |
|
trigram = TrigramTagger(train, backoff=bigram) |
Show accuracy |
trigram.evaluate(test) |
Encoding Text as a Bag of Words |
from sklearn.feature_extraction.text import CountVectorizer |
Create the bag of words feature matrix |
count = CountVectorizer() |
Sparse matrix of bag of words |
bag_of_words = count.fit_transform(text_data) |
Turn sparse matrix into array |
bag_of_words.toarray() |
Show feature (column) names |
count.get_feature_names() |
Create feature matrix with arguments |
count_2gram = CountVectorizer(ngram_range=(1,2), stop_words="english", vocabulary=['brazil']) |
|
bag = count_2gram.fit_transform(text_data) |
View the 1-grams and 2-grams |
count_2gram.vocabulary_ |
Weighting Word Importance |
from sklearn.feature_extraction.text import TfidfVectorizer |
Create the tf-idf (term frequency-inverse document frequency) feature matrix |
tfidf = TfidfVectorizer() |
|
feature_matrix = tfidf.fit_transform(text_data) |
Show feature names |
tfidf.vocabulary_ |
You will have to download the set of stop words the first time
import nltk
nltk.download('stopwords')
Note that NLTK’s stopwords assumes the tokenized words are all lowercased
Support Vector Machines
Training a Linear Classifier |
Load libraries |
from sklearn.svm import LinearSVC |
Standardize features |
scaler = StandardScaler() |
|
features_standardized = scaler.fit_transform(features) |
Create support vector classifier |
svc = LinearSVC(C=1.0) |
Train model |
model = svc.fit(features_standardized, target) |
Plot data points and color using their class |
color = ["black" if c == 0 else "lightgrey" for c in target] |
|
plt.scatter(features_standardized[:,0], features_standardized[:,1], c=color) |
Create the hyperplane |
w = svc.coef_[0] |
|
a = -w[0] / w[1] |
Return evenly spaced numbers over a specified interval. |
xx = np.linspace(-2.5, 2.5) |
|
yy = a * xx - (svc.intercept_[0]) / w[1] |
Plot the hyperplane |
plt.plot(xx, yy) |
|
plt.axis("off"), plt.show() |
Handling Linearly Inseparable Classes Using Kernels |
Create a support vector machine with a radial basis function kernel |
svc = SVC(kernel="rbf", random_state=0, gamma=1, C=1) |
Creating Predicted Probabilities |
View predicted probabilities |
model.predict_proba(new_observation) |
Identifying Support Vectors |
View support vectors |
model.support_vectors_ |
Handling Imbalanced Classes |
Increase the penalty for misclassifying the smaller class using class_weight |
Create support vector classifier |
svc = SVC(kernel="linear", class_weight="balanced", C=1.0, random_state=0) |
See the visualization on page 321 of the book |
In scikit-learn, the predicted probabilities must be generated when the model is being trained. We can do this by setting SVC’s probability to True. Then use the same method
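A minimal sketch, assuming standardized features and hyperparameters chosen for illustration:

from sklearn.svm import SVC

# probability=True makes scikit-learn fit the extra calibration needed for predict_proba
svc = SVC(kernel="linear", probability=True, random_state=0)
model = svc.fit(features_standardized, target)
model.predict_proba(new_observation)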
Data Wrangling
Creating a series |
pd.Series(['Molly Mooney', 40, True], index=['Name','Age','Driver']) |
Appending to a data frame |
dataframe.append(new_person, ignore_index=True) |
First lines of the data |
dataframe.head(2) |
descriptive statistics |
dataframe.describe() |
Return row by index |
dataframe.iloc[0] |
Return row by name |
dataframe.loc['Allen, Miss Elisabeth Walton'] |
Set index |
dataframe = dataframe.set_index(dataframe['Name']) |
Selecting Rows Based on Conditionals |
dataframe[dataframe['Sex'] == 'female'] |
Replacing Values |
dataframe['Sex'].replace("anterior", "posterior") |
Replacing multiple values |
dataframe['Sex'].replace(["female", "male"], ["Woman", "Man"]) |
Renaming Columns |
dataframe.rename(columns={'PClass': 'Passenger Class'}) |
Minimum, max, sum, count |
dataframe['Age'].min() |
Finding Unique Values |
dataframe['Sex'].unique() |
display all unique values with the number of times each value appears |
dataframe['Sex'].value_counts() |
number of unique values |
dataframe['PClass'].nunique() |
return booleans indicating whether a value is missing |
dataframe[dataframe['Age'].isnull()] |
Replace a value with a missing value (NaN) |
dataframe['Sex'] = dataframe['Sex'].replace('male', np.nan) |
Load data, set missing values |
dataframe = pd.read_csv(url, na_values=[np.nan, 'NONE', -999]) |
Filling missing values |
dataframe.fillna(value) |
Deleting a Column |
dataframe.drop(['Age', 'Sex'], axis=1).head(2) |
Deleting a Row |
dataframe[dataframe['Sex'] != 'male'] |
|
or use drop |
Dropping Duplicate Rows |
dataframe.drop_duplicates() |
Dropping Duplicate Rows, taking into account only a subset of columns |
dataframe.drop_duplicates(subset=['Sex'], keep='last')  # optional keep='last' keeps the last observation instead of the first |
Grouping Rows by Values |
dataframe.groupby('Sex').mean() |
|
dataframe.groupby(['Sex','Survived'])['Age'].mean() |
creating a date range |
pd.date_range('06/06/2017', periods=100000, freq='30S') |
Group rows by week |
dataframe.resample('W').sum() |
Group by two weeks |
dataframe.resample('2W').mean() |
Group by month |
dataframe.resample('M', label='left').count()  # label='left' labels each group with the left edge (start) of its interval |
Looping Over a Column |
for name in dataframe['Name'][0:2]: print(name.upper()) |
Applying a Function Over All Elements in a Column |
dataframe['Name'].apply(uppercase) |
Applying a Function to Groups |
dataframe.groupby('Sex').apply(lambda x: x.count()) |
Concatenating DataFrames by rows |
pd.concat([dataframe_a, dataframe_b], axis=0) |
Concatenating DataFrames by columns |
pd.concat([dataframe_a, dataframe_b], axis=1) |
Merging DataFrames |
pd.merge(dataframe_employees, dataframe_sales, on='employee_id', how='outer') |
|
left or right or inner |
if the tables have columns with different names |
pd.merge(dataframe_employees, dataframe_sales, left_on='employee_id', right_on='employee_id') |
replace also accepts regular expressions
To have full functionality with NaN we need to import the NumPy library first
groupby needs to be paired with some operation we want to apply to each group, such as calculating an aggregate statistic
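Brief sketches of those three notes; the regular expression and column names are illustrative, following the Titanic-style examples above:

import numpy as np
import pandas as pd

# replace with a regular expression: strip a trailing asterisk from names (illustrative pattern)
dataframe['Name'].replace(r'\*$', '', regex=True)

# NaN comes from NumPy, so NumPy must be imported before using np.nan
dataframe['Sex'] = dataframe['Sex'].replace('male', np.nan)

# groupby paired with an aggregate statistic
dataframe.groupby('Sex')['Age'].mean()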
Saving and Loading Trained Models
Saving and Loading a scikit-learn Model |
Load libraries |
from sklearn.externals import joblib |
Save model as pickle file |
joblib.dump(model, "model.pkl") |
Load model from file |
classifer = joblib.load("model.pkl") |
Get scikit-learn version |
scikit_version = joblib.__version__ |
Save model as pickle file |
joblib.dump(model, "model_{version}.pkl".format(version=scikit_version)) |
Saving and Loading a Keras Model |
Load libraries |
from keras.models import load_model |
Save neural network |
network.save("model.h5") |
Load neural network |
network = load_model("model.h5") |
When saving scikit-learn models, be aware that saved models might not be compatible between versions of scikit-learn; therefore, it can be helpful to include the version of scikit-learn used in the model in the filename
|