
# Data Science 101 Cheat Sheet (DRAFT) by bhaskar

A cheat sheet of core data science topics.

This is a draft cheat sheet. It is a work in progress and is not finished yet.

### Linear Regression Cheat Sheet

- **Overview:** Linear regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the independent variables and the dependent variable.
- **Simple Linear Regression:** A single independent variable (x) and a dependent variable (y) related by y = mx + c, where m is the slope and c is the intercept.
- **Multiple Linear Regression:** More than one independent variable (x1, x2, x3, etc.) and a dependent variable (y) related by y = b0 + b1x1 + b2x2 + ... + bnxn, where b0 is the intercept and b1, b2, ..., bn are the coefficients.
- **Assumptions:** Linearity: there is a linear relationship between the independent and dependent variables. Independence: the observations are independent of each other. Homoscedasticity: the variance of the residuals is constant across all levels of the independent variables. Normality: the residuals are normally distributed. No multicollinearity: the independent variables are not highly correlated with each other.
- **Fitting the Model:** Find the best-fitting line that minimizes the sum of squared residuals (differences between predicted and actual values), typically using the method of least squares.
- **Interpreting Coefficients:** The intercept (b0) is the expected value of the dependent variable when all independent variables are zero. Each coefficient (b1, b2, ..., bn) is the change in the dependent variable for a one-unit change in the corresponding independent variable, holding the other variables constant.
- **Evaluating Model Performance:** R-squared (R²): the proportion of variance in the dependent variable explained by the independent variables; higher is better. Adjusted R-squared: R² adjusted for the number of predictors. Root Mean Squared Error (RMSE): the average prediction error; lower is better. Residual analysis: plot residuals to check for patterns or outliers that violate the assumptions.
- **Handling Nonlinearity:** Polynomial regression adds polynomial terms (e.g., x², x³) to capture nonlinear relationships. Logarithmic transformation takes the log of the dependent or independent variables to handle exponential growth or decay.
- **Dealing with Multicollinearity:** Identify highly correlated independent variables with correlation matrices or variance inflation factor (VIF) analysis, then remove one of the correlated variables or combine them into a single variable.
- **Regularization:** Ridge regression adds a penalty term to the sum of squared residuals to shrink the coefficients, reducing the impact of multicollinearity. Lasso regression uses a penalty that can shrink coefficients to zero, effectively performing feature selection.
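A minimal sketch of ordinary least squares plus the Ridge and Lasso variants mentioned above, assuming scikit-learn and NumPy are available; the synthetic data, train/test split, and penalty strengths are illustrative assumptions, not part of the cheat sheet.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Toy data: y depends linearly on two features plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # ordinary least squares
pred = model.predict(X_test)

print("intercept (b0):", model.intercept_)
print("coefficients (b1, b2):", model.coef_)
print("R^2:", r2_score(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)

# Regularized variants from the section above (alpha sets the penalty strength).
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
```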

### Logistic Regression Cheat Sheet

- **Overview:** Logistic regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. It is primarily used for binary classification, where the dependent variable takes one of two categories.
- **Binary Logistic Regression:** A binary dependent variable (y) and one or more independent variables (x1, x2, x3, etc.). The model estimates the probability that the dependent variable belongs to a specific category.
- **Logistic Regression Equation:** p = 1 / (1 + e^(-z)), where p is the probability of the event occurring and z is the linear combination of the independent variables and their coefficients.
- **Link Function:** The logistic (sigmoid) function maps the linear combination of independent variables to a probability between 0 and 1.
- **Estimating Coefficients:** Coefficients are estimated by maximum likelihood estimation, which finds the values that maximize the likelihood of the observed data given the model. Each coefficient is a log-odds ratio: the change in the log-odds of the event for a one-unit change in the independent variable.
- **Interpreting Coefficients:** Exponentiating a coefficient gives an odds ratio, the change in the odds of the event for a one-unit change in the independent variable. Odds ratios greater than 1 indicate a positive association; less than 1, a negative association.
- **Evaluating Model Performance:** Accuracy: the proportion of correctly classified instances. Confusion matrix: a table of true positives, true negatives, false positives, and false negatives. Precision: TP / (TP + FP). Recall (sensitivity): TP / (TP + FN). Specificity: TN / (TN + FP). F1 score: combines precision and recall to balance their importance.
- **Regularization:** Ridge (L2) regularization adds a penalty term to the loss function that shrinks the coefficients, reducing overfitting. Lasso (L1) regularization can shrink coefficients to zero, effectively performing feature selection.
- **Multiclass Logistic Regression:** Extends binary logistic regression to more than two categories. One-vs-Rest (OvR, also called One-vs-All) trains a separate binary model for each class against the rest.
- **Dealing with Imbalanced Data:** Assign higher class weights to the minority class during training, or use resampling (oversampling the minority class or undersampling the majority class) to create a balanced dataset.
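A minimal sketch assuming scikit-learn: an L2-regularized logistic regression with balanced class weights, with the coefficients exponentiated into odds ratios; the generated dataset and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Imbalanced toy dataset (80% / 20% classes), illustrative only.
X, y = make_classification(n_samples=500, n_features=4, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# penalty="l2" is Ridge-style regularization; class_weight="balanced" addresses imbalance.
clf = LogisticRegression(penalty="l2", C=1.0, class_weight="balanced").fit(X_train, y_train)

odds_ratios = np.exp(clf.coef_[0])          # exponentiated coefficients
print("odds ratios:", odds_ratios)
print(classification_report(y_test, clf.predict(X_test)))  # precision, recall, F1
```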

### k-Nearest Neighbors Cheat Sheet

- **Overview:** k-Nearest Neighbors is a non-parametric, instance-based machine learning algorithm used for classification and regression. It predicts the class or value of a new data point from the majority vote or average of its k nearest neighbors in feature space.
- **Choosing k:** k is the number of nearest neighbors considered when making a prediction. A small k (e.g., 1) may overfit, while a large k may oversimplify and lose local patterns. The optimal k is typically found by hyperparameter tuning with techniques such as cross-validation.
- **Distance Metrics:** Euclidean distance: the straight-line distance between two points in feature space. Manhattan distance: the sum of absolute differences between coordinates. Other metrics such as Minkowski, cosine, and Hamming distance can be used depending on the data type and problem domain.
- **Feature Scaling:** Scaling features before applying k-NN is crucial because the algorithm is sensitive to feature scale. Standardization (mean 0, standard deviation 1) or normalization (scaling to a range, e.g., min-max scaling) are commonly used.
- **Handling Categorical Features:** Categorical features must be encoded numerically before applying k-NN. One-hot encoding creates binary dummy variables for each category; label encoding assigns each category a unique numerical label.
- **Classifying New Instances:** For classification, the class of a new instance is the majority class among its k nearest neighbors. Voting mechanisms include a simple majority vote, a weighted vote (based on distance or confidence), or a distance-weighted vote (inverse of distance).
- **Regression with k-NN:** For regression, the predicted value is typically the average (mean or median) of the target values of the k nearest neighbors.
- **Model Evaluation:** Accuracy for classification tasks; Mean Squared Error (MSE) for regression tasks; cross-validation to assess performance by splitting the data into multiple folds.
- **Curse of Dimensionality:** As the number of features grows, the feature space becomes increasingly sparse and k-NN less effective. Feature selection or dimensionality reduction (e.g., Principal Component Analysis) can help.
- **Advantages and Limitations:** Advantages: simplicity, no assumptions about the data distribution, and the ability to capture complex patterns. Limitations: computationally expensive for large datasets, sensitive to feature scaling, and poor handling of missing values.
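A minimal sketch assuming scikit-learn: features are standardized inside a pipeline, and k plus the voting scheme are tuned with cross-validation; the dataset and candidate values of k are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Scale features first (k-NN is distance-based), then classify.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Tune k and the voting scheme with 5-fold cross-validation.
grid = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9],
                "kneighborsclassifier__weights": ["uniform", "distance"]},
    cv=5,
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```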

### Support Vector Machines Cheat Sheet

- **Overview:** Support Vector Machines are a supervised machine learning algorithm used for classification and regression. An SVM finds an optimal hyperplane that maximally separates (or fits) the data points in feature space.
- **Linear SVM:** Constructs a linear decision boundary between classes and maximizes the margin: the perpendicular distance between the decision boundary and the nearest data points (the support vectors).
- **Kernel Trick:** Lets SVMs handle non-linearly separable data efficiently by implicitly mapping the data to a higher-dimensional feature space. Common kernels: Linear, Polynomial, Radial Basis Function (RBF), and Sigmoid.
- **Soft Margin SVM:** Allows some misclassification to obtain a more flexible decision boundary. A regularization parameter (C) controls the trade-off between maximizing the margin and minimizing misclassification.
- **Choosing the Right Kernel:** Linear: suitable for linearly separable data or when the number of features is large compared to the number of samples. Polynomial: suitable for problems of intermediate complexity with higher-order polynomial relationships. RBF: suitable for complex, non-linear relationships; the most commonly used kernel. Sigmoid: suitable for problems influenced by logistic regression or neural networks.
- **Model Training and Optimization:** Training solves a quadratic programming problem to find the optimal hyperplane. The optimization can be computationally expensive for large datasets, but techniques such as Sequential Minimal Optimization improve efficiency.
- **Tuning Parameters:** C (regularization parameter) controls the trade-off between misclassification and the width of the margin; a smaller C allows more misclassification, a larger C enforces stricter classification. Gamma (kernel coefficient) influences the shape of the decision boundary; a higher gamma produces a more complex boundary.
- **Multi-Class Classification:** One-vs-Rest (OvR) trains a separate binary classifier for each class against the rest; One-vs-One (OvO) trains a binary classifier for every pair of classes.
- **Handling Imbalanced Data:** Class imbalance can hurt SVM performance; resampling (undersampling or oversampling) and adjusting class weights help address it.
- **Advantages and Limitations:** Advantages: effective in high-dimensional spaces, robust against overfitting, and suitable for both linear and non-linear classification. Limitations: computationally intensive for large datasets, sensitive to hyperparameter tuning, and complex models are hard to interpret.
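A minimal sketch of an RBF-kernel soft-margin SVM with C and gamma tuned by grid search, assuming scikit-learn; the dataset and parameter grid are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# RBF-kernel SVM; scaling matters because the kernel depends on distances.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", class_weight="balanced"))

grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]},
    cv=5,
)
grid.fit(X_train, y_train)
print("best C / gamma:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```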

### Decision Tree Cheat Sheet

- **Overview:** Decision Trees are a supervised machine learning algorithm used for classification and regression. They learn a hierarchical structure of decisions and conditions from the data to make predictions.
- **Tree Construction:** Trees are built by a top-down, recursive partitioning process called recursive binary splitting. At each node the algorithm selects the feature that best splits the data according to a criterion (e.g., information gain, Gini impurity).
- **Splitting Criteria:** Information gain measures the reduction in entropy (increase in information) achieved by splitting on a feature. Gini impurity measures the probability of misclassifying a randomly chosen element if it were labeled randomly according to the class distribution.
- **Handling Continuous and Categorical Features:** Continuous features are split using threshold values; for categorical features, each category forms a separate branch.
- **Tree Pruning:** Pruning reduces the complexity of the tree to avoid overfitting. Pre-pruning sets constraints during construction (tree depth, minimum samples per leaf, maximum number of leaf nodes). Post-pruning removes or collapses branches that provide little information gain or minimal performance improvement.
- **Handling Missing Values:** Treat missing values as a separate category or impute them before tree construction.
- **Handling Imbalanced Data:** Imbalanced class distributions can bias the tree; class weighting, undersampling, or oversampling can help.
- **Feature Importance:** Trees provide feature importance scores based on how much each feature contributes to the split decisions, measured by the total reduction in impurity or total information gain associated with the feature.
- **Ensemble Methods:** Random Forest: an ensemble of trees, each trained on a random subset of the data with replacement; it reduces overfitting and improves performance. Gradient Boosting: builds an ensemble by sequentially adding trees, each correcting the mistakes of the previous ones.
- **Advantages and Limitations:** Advantages: easy to understand and interpret, handles both numerical and categorical data, and can capture non-linear relationships. Limitations: prone to overfitting, sensitive to small changes in the data, and may generalize poorly if the tree structure is too complex.
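A minimal sketch of a pre-pruned tree with Gini splitting, feature importances, and a printout of the learned rules, assuming scikit-learn; the depth and leaf-size limits are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0
)

# Pre-pruning via max_depth and min_samples_leaf; Gini impurity is the split criterion.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print("feature importances:", dict(zip(data.feature_names, tree.feature_importances_)))
print(export_text(tree, feature_names=list(data.feature_names)))  # human-readable rules
```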

### Random Forest Cheat Sheet

- **Overview:** Random Forest is an ensemble learning algorithm that combines multiple decision trees to make predictions. It is used for both classification and regression and improves on the performance and robustness of individual decision trees.
- **Ensemble of Decision Trees:** The ensemble is built by constructing decision trees on random subsets of the training data (bootstrap sampling). Each tree is trained independently; predictions come from majority voting (classification) or averaging (regression) of the individual tree predictions.
- **Random Feature Subsets:** In addition to random data subsets, Random Forest considers a random subset of features at each node when constructing the trees. This randomness reduces the correlation between trees and promotes diversity, improving generalization.
- **Building Decision Trees:** Each tree is built from a subset of the training data and a subset of the available features, following the usual recursive binary splitting based on criteria such as information gain or Gini impurity.
- **Feature Importance:** Importance reflects how much each feature contributes to the ensemble's predictive performance, e.g., the average decrease in impurity or in a split criterion (such as the Gini index) caused by the feature.
- **Out-of-Bag (OOB) Error:** The out-of-bag samples (those not included in a tree's bootstrap sample) estimate the model's performance without cross-validation. The OOB error is a good estimate of generalization performance and can be used for model evaluation and hyperparameter tuning.
- **Hyperparameter Tuning:** Important hyperparameters include the number of trees (n_estimators), maximum tree depth (max_depth), minimum samples required to split a node (min_samples_split), and maximum number of features considered per split (max_features).
- **Handling Imbalanced Data:** Adjust class weights during tree construction or use sampling techniques such as oversampling the minority class or undersampling the majority class.
- **Advantages and Limitations:** Advantages: robust to overfitting, handles high-dimensional data, provides feature importance, and performs well on many types of problems. Limitations: needs more computational resources than a single decision tree, can be slower to train and predict, and may not perform well on extremely imbalanced datasets.
- **Applications:** Classification tasks such as image recognition, text classification, and fraud detection, and regression tasks such as predicting housing prices or stock market trends.
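A minimal sketch with bootstrap sampling, random feature subsets, and the out-of-bag score, assuming scikit-learn; the hyperparameter values are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,      # number of trees
    max_depth=None,        # grow trees fully (tune to control complexity)
    max_features="sqrt",   # random feature subset considered at each split
    oob_score=True,        # estimate generalization from out-of-bag samples
    class_weight="balanced",
    random_state=0,
)
forest.fit(X_train, y_train)

print("OOB score:", forest.oob_score_)
print("test accuracy:", forest.score(X_test, y_test))
print("largest feature importances:", sorted(forest.feature_importances_, reverse=True)[:5])
```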

### Naive Bayes Cheat Sheet

- **Overview:** Naive Bayes is a probabilistic machine learning algorithm based on Bayes' theorem with the assumption of independence between features. It is primarily used for classification and is efficient, simple, and often works well in practice.
- **Bayes' Theorem:** Calculates the posterior probability of a class given the observed evidence: P(Class | Features) = (P(Features | Class) * P(Class)) / P(Features).
- **Assumption of Feature Independence:** Naive Bayes assumes the features are conditionally independent given the class label, a simplifying assumption that keeps the calculations tractable. Although rarely true in reality, Naive Bayes can still perform well in practice.
- **Types of Naive Bayes Classifiers:** Gaussian Naive Bayes assumes a Gaussian distribution for continuous features and estimates the mean and variance for each class. Multinomial Naive Bayes suits discrete features, typically text classification where features are word frequencies. Bernoulli Naive Bayes is similar to multinomial but assumes binary features (presence or absence).
- **Feature Probability Estimation:** Gaussian NB estimates the mean and variance per class for continuous features; Multinomial NB estimates the probability of each feature occurring in each class; Bernoulli NB estimates the probability of each feature being present in each class.
- **Handling Zero Probabilities:** A feature that never occurs with a class in the training set yields a zero probability. Laplace (add-one) smoothing avoids zero probabilities.
- **Handling Continuous Features:** Gaussian Naive Bayes assumes a Gaussian distribution; alternatively, continuous features can be discretized into bins or transformed into categorical variables before using Naive Bayes.
- **Text Classification with Naive Bayes:** Commonly used for tasks such as spam detection or sentiment analysis. Text is typically preprocessed by tokenization, stop-word removal, and representations such as TF-IDF or Bag-of-Words before applying Naive Bayes.
- **Advantages and Limitations:** Advantages: simplicity, efficiency, and good handling of high-dimensional data. Limitations: the strong independence assumption may not hold, sensitivity to irrelevant features, and difficulty with rare or unseen feature combinations.
- **Handling Imbalanced Data:** Class weighting or resampling (undersampling or oversampling) can help when the class distribution is skewed.
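A minimal text-classification sketch with TF-IDF features, multinomial Naive Bayes, and Laplace smoothing, assuming scikit-learn; the tiny spam/ham corpus and its labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus (hypothetical examples, not a real dataset).
texts = ["win a free prize now", "meeting rescheduled to friday",
         "free lottery winner claim prize", "project status report attached"]
labels = ["spam", "ham", "spam", "ham"]

# TF-IDF features + multinomial NB with Laplace (add-one) smoothing.
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["claim your free prize", "status meeting on friday"]))
```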

### Principal Component Analysis Cheat Sheet

- **Overview:** PCA is a dimensionality reduction technique that transforms a high-dimensional dataset into a lower-dimensional space. It identifies the principal components: orthogonal directions that capture the maximum variance in the data.
- **Variance and Covariance:** PCA is based on the variance-covariance matrix or the correlation matrix of the dataset. Variance measures the spread of data along an axis; covariance measures the relationship between two variables.
- **Steps in PCA:** 1. Standardize the data (PCA works best with standardized data so variables carry equal weight). 2. Compute the covariance or correlation matrix, which captures the relationships between variables. 3. Compute the eigenvectors and eigenvalues: the eigenvectors are the principal components, and the corresponding eigenvalues indicate how much variance each component explains. 4. Select the number of principal components that explain most of the variance. 5. Transform the data by projecting it onto the selected components to obtain the lower-dimensional representation.
- **Explained Variance and Scree Plot:** The explained variance ratio is the proportion of variance explained by each component. A scree plot visualizes this ratio per component and helps decide how many components to retain.
- **Dimensionality Reduction and Reconstruction:** PCA reduces dimensionality by keeping a subset of components; the original data can be approximately reconstructed by projecting the lower-dimensional representation back into the original feature space.
- **Applications:** Dimensionality reduction (visualizing high-dimensional data, reducing noise, eliminating redundant or correlated features), data compression (retaining only the most important components), and feature extraction (deriving meaningful features for subsequent analysis).
- **Interpretation of Principal Components:** Components are linear combinations of the original features. The direction of a component represents the most significant variation in the data; the magnitude of a feature's loading indicates its contribution to that component.
- **Assumptions and Limitations:** PCA assumes linear relationships between variables and requires variables to be continuous or approximately continuous. It may not suit datasets with nonlinear relationships or cases where interpretability of individual features is essential.
- **Extensions:** Kernel PCA allows nonlinear transformations of the data. Sparse PCA encourages sparsity in the loadings for a more interpretable representation.
- **Implementation and Libraries:** PCA is implemented in many programming languages; commonly used libraries include scikit-learn (Python), caret (R), and NumPy (Python) for numerical computations.
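A minimal sketch of the standardize/fit/transform steps plus reconstruction, assuming scikit-learn; the dataset and the choice of two components are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Steps 1-5 above: standardize, fit PCA (eigendecomposition handled internally),
# keep 2 components, and project the data.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("loadings (components x features):\n", pca.components_)

# Approximate reconstruction back into the original feature space.
X_reconstructed = pca.inverse_transform(X_2d)
```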

### Cluster Analysis Cheat Sheet

- **Overview:** Cluster analysis is an unsupervised learning technique that groups similar objects or data points into clusters based on their characteristics or proximity. It helps discover hidden patterns, similarities, or structures within the data.
- **Types of Cluster Analysis:** Hierarchical clustering builds a hierarchy of clusters by recursively merging or splitting clusters according to a similarity measure. K-means clustering divides the data into a predetermined number (k) of non-overlapping clusters by minimizing the within-cluster sum of squares. Density-based clustering groups points by density and identifies regions of higher density as clusters. Model-based clustering assumes a specific statistical model per cluster and estimates the model parameters to assign points to clusters.
- **Similarity and Distance Measures:** Cluster analysis relies on similarity or distance measures to determine proximity between points; common choices are Euclidean distance, Manhattan distance, and cosine similarity.
- **Hierarchical Clustering:** Agglomerative (bottom-up) starts with each point as its own cluster and iteratively merges the closest pairs of clusters until all points belong to a single cluster. Divisive (top-down) starts with all points in one cluster and recursively splits clusters until each point is its own cluster.
- **K-means Clustering:** Randomly initializes k centroids, assigns each point to the nearest centroid, recalculates the centroids as the mean of the assigned points, and repeats until convergence. The choice of k is important and can strongly affect the results.
- **Density-based Clustering (DBSCAN):** Density-Based Spatial Clustering of Applications with Noise groups points by density and identifies core points, border points, and noise points. Clusters are dense regions separated by sparser areas, and the number of clusters does not need to be specified in advance.
- **Model-based Clustering (Gaussian Mixture Models):** GMMs assume the data points are generated from a mixture of Gaussian distributions; the model estimates the parameters of those distributions and assigns points to clusters based on likelihood.
- **Evaluation of Clustering:** Internal evaluation uses intrinsic criteria such as the silhouette coefficient or within-cluster sum of squares. External evaluation compares the results to a known ground truth, if available, using criteria like purity or F-measure.
- **Handling Missing Data and Outliers:** Handle missing data with imputation before clustering. Outliers can significantly distort results; outlier detection or preprocessing methods can mitigate their influence.
- **Visualization of Clustering Results:** Dimensionality reduction (PCA, t-SNE) can project high-dimensional results into a lower-dimensional space; scatter plots, heatmaps, and dendrograms give insight into the clustering structure.
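A minimal sketch comparing k-means and DBSCAN with a silhouette check, assuming scikit-learn; the synthetic blobs and the eps/min_samples values are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)
X = StandardScaler().fit_transform(X)   # distance-based methods benefit from scaling

# K-means: k must be chosen in advance.
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print("k-means silhouette:", silhouette_score(X, kmeans_labels))

# DBSCAN: no k needed; eps and min_samples control density. Label -1 marks noise points.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print("DBSCAN clusters found:", len(set(db_labels) - {-1}))
```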

### Convolutional Neural Networks Cheat Sheet

- **Overview:** CNNs are a type of neural network designed for processing grid-like data such as images. They use convolution to extract relevant features from the input.
- **Convolutional Layers:** Perform the main feature extraction. Each layer consists of multiple filters (kernels) that scan the input through convolution operations: a sliding window performs element-wise multiplication and summation to produce feature maps.
- **Pooling Layers:** Reduce the spatial dimensions of the feature maps, cutting computational complexity and providing spatial invariance. Common types are max pooling (selecting the maximum value in each pooling region) and average pooling (taking the average).
- **Activation Functions:** Introduce non-linearity and let the network model complex relationships. ReLU (Rectified Linear Unit) is the common choice in CNNs, promoting faster convergence and avoiding the vanishing gradient problem.
- **Fully Connected Layers:** Also called dense layers; each neuron connects to every neuron in the previous layer. They combine the learned features to produce the final classification or regression output.
- **Loss Functions:** Quantify the difference between predicted outputs and true labels. Common choices are Mean Squared Error (MSE) for regression and cross-entropy for classification.
- **Training Techniques:** CNNs are trained with backpropagation and gradient-descent optimizers. Dropout (randomly deactivating neurons during training) and batch normalization (normalizing layer inputs to accelerate training) improve generalization and performance.
- **Data Augmentation:** Increases the diversity of the training data with transformations such as rotations, translations, flips, or scaling, improving generalization and reducing overfitting.
- **Transfer Learning:** Reuses CNNs pretrained on large datasets and adapts them to new tasks or smaller datasets. Pretrained models such as VGGNet and ResNet allow transfer of learned features to new applications.
- **Object Localization and Detection:** CNNs can be extended to localization and detection tasks with techniques such as bounding-box regression and region proposal networks (RPN).
- **Semantic Segmentation:** Assigns a label to each pixel or region of an image for detailed object-level understanding; Fully Convolutional Networks (FCNs) are commonly used.
- **Hardware Acceleration:** CNNs benefit from specialized hardware such as GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units) for faster training and inference.
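A minimal CNN sketch, assuming TensorFlow/Keras is installed; the 28x28 grayscale input shape, layer sizes, and 10-class output are illustrative assumptions (roughly MNIST-shaped data), not a prescribed architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=3, activation="relu"),  # convolution + ReLU
    layers.MaxPooling2D(pool_size=2),                     # max pooling
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dropout(0.5),                                  # regularization
    layers.Dense(10, activation="softmax"),               # fully connected output
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",     # cross-entropy for classification
              metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, epochs=5, validation_split=0.1)  # once data is loaded
```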

### Generative Adversarial Networks Cheat Sheet

- **GAN Basics:** GANs are a class of deep learning models with two components: a generator and a discriminator. The generator learns to produce synthetic data samples that resemble real data, while the discriminator tries to distinguish between real and fake samples.
- **Generator:** Takes random noise as input and generates synthetic samples. It typically consists of one or more neural network layers, often using transposed convolutions for upsampling.
- **Discriminator:** Takes a sample as input and estimates the probability that it is real. It typically consists of one or more neural network layers, often convolutional for feature extraction.
- **Adversarial Training:** The two networks are trained adversarially: the generator tries to produce samples that fool the discriminator, while the discriminator aims to classify real and fake samples correctly.
- **Loss Functions:** The generator and discriminator use different losses. The generator's loss encourages the discriminator to classify generated samples as real; the discriminator's loss penalizes misclassifying real and fake samples.
- **Mode Collapse:** Occurs when the generator produces limited, repetitive samples and fails to capture the diversity of the real data distribution. Techniques such as minibatch discrimination and feature matching can help alleviate it.
- **Deep Convolutional GAN (DCGAN):** A popular architecture that uses convolutional neural networks for both generator and discriminator, leveraging convolutional and transposed convolutional layers to discriminate and generate images.
- **Conditional GAN (cGAN):** Adds extra information (such as class labels) to guide generation; both generator and discriminator receive the conditional information along with the noise or sample.
- **Evaluation of GANs:** Challenging because there is no single objective function to optimize. Common methods include visual inspection, Inception Score, Fréchet Inception Distance (FID), and precision/recall curves.
- **Unsupervised Representation Learning:** GANs can learn meaningful representations without explicit labels; trained on a large unlabeled dataset, the generator can capture and generate high-level features.
- **Variational Autoencoder (VAE) vs. GAN:** Both are generative models but differ in principle: VAEs focus on learning latent representations and reconstruction, while GANs emphasize generating realistic samples.
- **Applications:** Image synthesis and generation, style transfer and image-to-image translation, data augmentation and synthesis for training other models, and text-to-image synthesis.
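A minimal GAN sketch, assuming TensorFlow/Keras; the fully connected generator and discriminator, the 784-dimensional flattened images in [0, 1], and the layer sizes are illustrative assumptions rather than a canonical architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

latent_dim = 64

generator = models.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(784, activation="sigmoid"),   # fake flattened image
])

discriminator = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),     # P(sample is real)
])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(real_images):
    """One adversarial update: discriminator then generator."""
    batch = tf.shape(real_images)[0]
    noise = tf.random.normal((batch, latent_dim))
    with tf.GradientTape() as d_tape, tf.GradientTape() as g_tape:
        fake_images = generator(noise, training=True)
        real_pred = discriminator(real_images, training=True)
        fake_pred = discriminator(fake_images, training=True)
        # Discriminator wants real -> 1, fake -> 0; generator wants fake -> 1.
        d_loss = bce(tf.ones_like(real_pred), real_pred) + bce(tf.zeros_like(fake_pred), fake_pred)
        g_loss = bce(tf.ones_like(fake_pred), fake_pred)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss
```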

### Time Series Forecasting Cheat Sheet

- **Time Series Basics:** A time series is a sequence of observations collected over time, typically at regular intervals. It exhibits temporal dependencies, trends, and seasonality, and may contain noise.
- **Stationarity:** A stationary series has constant mean, variance, and autocovariance over time. Stationarity is desirable for accurate forecasting.
- **Trends and Seasonality:** Trend is the long-term upward or downward movement in a series; seasonality is a pattern that repeats at fixed intervals. Identifying and handling both is important for accurate forecasting.
- **ACF and PACF:** The autocorrelation function (ACF) measures the correlation between a series and its lagged values; the partial autocorrelation function (PACF) does the same while excluding the intermediate lags. They help identify the order of the autoregressive (AR) and moving average (MA) components in time series models.
- **Time Series Models:** Autoregressive Integrated Moving Average (ARIMA): a linear model combining AR and MA components for stationary series. Seasonal ARIMA (SARIMA): extends ARIMA to seasonal data. Exponential smoothing: assigns exponentially decreasing weights to past observations. Prophet: an additive regression model capturing trend, seasonality, and holiday effects. Vector Autoregression (VAR): a multivariate model capturing relationships between variables.
- **Machine Learning for Time Series:** Regression models (linear regression, random forest, SVM, gradient boosting) with appropriate feature engineering; Long Short-Term Memory (LSTM) networks, a type of recurrent neural network suited to sequential data; CNNs applied by treating the series as an image-like input.
- **Feature Engineering:** Lagged variables (lagged versions of the target or other relevant variables), rolling statistics (rolling mean, standard deviation, or other statistics over a window), seasonal features (day of week, month, or other seasonal patterns), and Fourier transforms (convert the series to the frequency domain to identify periodic components).
- **Validation and Evaluation Metrics:** Split the series into training, validation, and test sets. Common metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and symmetric MAPE (sMAPE).
- **Cross-Validation for Time Series:** Use rolling-window or expanding-window cross-validation to simulate the real-time forecasting scenario.
- **Ensemble Methods:** Combine forecasts from multiple models or model configurations to improve accuracy and robustness, e.g., model averaging, weighted averaging, and stacking.
- **Outliers and Anomalies:** Identify and handle outliers and anomalies so they do not distort the forecast; techniques include moving averages, median filtering, and statistical tests.
- **Handling Missing Data:** Fill missing values with interpolation, mean imputation, or model-based imputation.
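A minimal ARIMA sketch, assuming pandas and statsmodels are installed; the synthetic monthly series and the (2, 1, 1) order are illustrative assumptions (in practice the order would come from ACF/PACF inspection or automated selection).

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series: trend + yearly seasonality + noise (illustrative only).
rng = np.random.default_rng(0)
t = np.arange(120)
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=2, size=t.size)
series = pd.Series(y, index=pd.date_range("2015-01-01", periods=t.size, freq="MS"))

train, test = series[:-12], series[-12:]          # hold out the last year

model = ARIMA(train, order=(2, 1, 1)).fit()       # (p, d, q) assumed for illustration
forecast = model.forecast(steps=len(test))

mae = np.mean(np.abs(forecast.values - test.values))
rmse = np.sqrt(np.mean((forecast.values - test.values) ** 2))
print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}")
```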

### Hyperparameter Tuning Cheat Sheet

- **What are Hyperparameters?** Configuration settings that are not learned from the data but are set before training. They control the behavior and performance of machine learning models.
- **Tuning Techniques:** Grid search exhaustively tries all combinations of hyperparameters within predefined ranges. Random search randomly samples hyperparameters from predefined ranges, allowing more efficient exploration. Bayesian optimization uses prior knowledge and statistical methods to search the space intelligently. Genetic algorithms mimic natural selection to evolve a population of configurations over multiple iterations. Automated libraries such as Optuna, Hyperopt, or scikit-learn's GridSearchCV and RandomizedSearchCV can automate the process (see the sketch after this list).
- **Hyperparameters to Consider:** Learning rate (step size during training); number of hidden units/layers (complexity and capacity of neural networks); regularization parameters (trade-off between model complexity and overfitting); batch size (samples processed before each weight update); dropout rate (probability of dropping units during training to prevent overfitting); activation functions (sigmoid, tanh, ReLU, Leaky ReLU); optimizer (SGD, Adam, RMSprop); number of trees and tree depth (for ensembles such as Random Forest or Gradient Boosting); kernel type and parameters (for kernel methods such as SVM).
- **Define Hyperparameter Ranges:** Establish reasonable ranges based on prior knowledge, the literature, or experimentation, and consider the scale and distribution (linear, logarithmic) that makes sense for each hyperparameter.
- **Sequential vs. Parallel Tuning:** Sequential tuning explores combinations one at a time, so feedback from each trial can inform the next; parallel tuning evaluates many configurations simultaneously, making efficient use of computational resources.
- **Evaluate and Compare Models:** Define an evaluation metric (e.g., accuracy, F1-score, mean squared error) that reflects the performance of interest, and record each configuration's performance for later comparison.
- **Cross-Validation:** Use techniques such as k-fold cross-validation to estimate the generalization performance of different configurations. Never tune hyperparameters on the test set; doing so leads to overfitting and biased performance estimates.
- **Early Stopping:** Monitor a validation metric during training and stop early if performance deteriorates consistently. This prevents overfitting and saves computational resources.
- **Feature Selection and Dimensionality Reduction:** Consider feature selection or dimensionality reduction algorithms (e.g., PCA) as part of the tuning process; they influence model performance and efficiency.
- **Domain Knowledge:** Use domain knowledge to narrow the search space and focus on the hyperparameters likely to have a significant impact.
- **Regularization:** Treat regularization strength (e.g., the L1 or L2 penalty) as a hyperparameter to tune; regularization controls model complexity and prevents overfitting.
- **Documentation and Reproducibility:** Record the hyperparameter configurations, evaluation metrics, and other relevant details for reproducibility, and document the lessons learned during tuning.
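A minimal randomized-search sketch, assuming scikit-learn and SciPy are available; the log-uniform range for C, the F1 scoring choice, and the number of iterations are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# Sample the regularization strength C on a log scale; select by F1 with 5-fold CV.
search = RandomizedSearchCV(
    pipe,
    param_distributions={"logisticregression__C": loguniform(1e-3, 1e3)},
    n_iter=20,
    scoring="f1",
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)
print("best C:", search.best_params_)
print("CV F1:", search.best_score_)
print("held-out test score:", search.score(X_test, y_test))
```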

### Model Evaluation and Metrics Cheat Sheet

- **Confusion Matrix:** A table summarizing the performance of a classification model: the counts of true positives, true negatives, false positives, and false negatives.
- **Accuracy:** The proportion of correct predictions over the total number of predictions. Accuracy = (TP + TN) / (TP + TN + FP + FN).
- **Precision:** The proportion of true positive predictions over all positive predictions. Precision = TP / (TP + FP).
- **Recall (Sensitivity, True Positive Rate):** The proportion of true positive predictions over all actual positives. Recall = TP / (TP + FN).
- **Specificity (True Negative Rate):** The proportion of true negative predictions over all actual negatives. Specificity = TN / (TN + FP).
- **F1-Score:** The harmonic mean of precision and recall. F1 = 2 * (Precision * Recall) / (Precision + Recall).
- **Receiver Operating Characteristic (ROC) Curve:** A plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various classification thresholds, illustrating the trade-off between sensitivity and specificity.
- **Area Under the ROC Curve (AUC-ROC):** A measure of the overall performance of a binary classifier. AUC-ROC ranges from 0 to 1, with higher values indicating better performance.
- **Mean Squared Error (MSE):** The average of the squared differences between predicted and actual values. MSE = (1/n) * Σ(y_pred - y_actual)².
- **Root Mean Squared Error (RMSE):** The square root of the mean squared error. RMSE = √MSE.
- **Mean Absolute Error (MAE):** The average of the absolute differences between predicted and actual values. MAE = (1/n) * Σ|y_pred - y_actual|.
- **R-squared (Coefficient of Determination):** Measures how well the regression model fits the data. R-squared ranges from 0 to 1, with higher values indicating a better fit.
- **Mean Absolute Percentage Error (MAPE):** The average percentage difference between predicted and actual values. MAPE = (1/n) * Σ(|y_pred - y_actual| / y_actual) * 100.
- **Cross-Validation:** A technique to assess performance on unseen data by splitting it into multiple folds; it helps estimate generalization performance and mitigate issues such as overfitting.
- **Bias-Variance Trade-off:** Bias is the error introduced by approximating a real-world problem with a simplified model; variance is the model's sensitivity to fluctuations in the training data. Balancing bias and variance is crucial for models that generalize well.
- **Overfitting and Underfitting:** Overfitting: the model performs well on training data but poorly on unseen data. Underfitting: the model is too simple to capture the underlying patterns. Regularization and proper model-complexity selection help address both.
- **Feature Importance:** Techniques such as feature importance scores, permutation importance, or SHAP values identify the most influential features in a model.
- **Model Selection:** Compare and select models based on evaluation metrics, cross-validation results, and domain-specific considerations; avoid selecting a model on a single metric without considering the context.
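A minimal sketch that computes several of the metrics above on a held-out test set, assuming scikit-learn; the dataset and classifier are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))        # TP / TN / FP / FN counts
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class
print("AUC-ROC:", roc_auc_score(y_test, y_prob))

# k-fold cross-validation on the full dataset.
print("5-fold accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```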