\documentclass[10pt,a4paper]{article}

% Packages
\usepackage{fancyhdr}           % For header and footer
\usepackage{multicol}           % Allows multicols in tables
\usepackage{tabularx}           % Intelligent column widths
\usepackage{tabulary}           % Used in header and footer
\usepackage{hhline}             % Border under tables
\usepackage{graphicx}           % For images
\usepackage{xcolor}             % For hex colours
%\usepackage[utf8x]{inputenc}   % For unicode character support
\usepackage[T1]{fontenc}        % Without this we get weird character replacements
\usepackage{colortbl}           % For coloured tables
\usepackage{setspace}           % For line height
\usepackage{lastpage}           % Needed for total page number
\usepackage{seqsplit}           % Splits long words.
%\usepackage{opensans}          % Can't make this work so far. Shame. Would be lovely.
\usepackage[normalem]{ulem}     % For underlining links
% Most of the following are not required for the majority
% of cheat sheets but are needed for some symbol support.
\usepackage{amsmath}            % Symbols
\usepackage{MnSymbol}           % Symbols
\usepackage{wasysym}            % Symbols
%\usepackage[english,german,french,spanish,italian]{babel} % Languages

% Document Info
\author{sree017}
\pdfinfo{
  /Title (dl-cheat-sheet.pdf)
  /Creator (Cheatography)
  /Author (sree017)
  /Subject (DL Cheat Sheet)
}

% Lengths and widths
\addtolength{\textwidth}{6cm}
\addtolength{\textheight}{-1cm}
\addtolength{\hoffset}{-3cm}
\addtolength{\voffset}{-2cm}
\setlength{\tabcolsep}{0.2cm}    % Space between columns
\setlength{\headsep}{-12pt}      % Reduce space between header and content
\setlength{\headheight}{85pt}    % If less, LaTeX automatically increases it
\renewcommand{\footrulewidth}{0pt} % Remove footer line
\renewcommand{\headrulewidth}{0pt} % Remove header line
\renewcommand{\seqinsert}{\ifmmode\allowbreak\else\-\fi} % Hyphens in seqsplit

% These two commands together give roughly
% the right line height in the tables
\renewcommand{\arraystretch}{1.3}
\onehalfspacing

% Commands
\newcommand{\SetRowColor}[1]{\noalign{\gdef\RowColorName{#1}}\rowcolor{\RowColorName}} % Shortcut for row colour
\newcommand{\mymulticolumn}[3]{\multicolumn{#1}{>{\columncolor{\RowColorName}}#2}{#3}} % For coloured multi-cols
\newcolumntype{x}[1]{>{\raggedright}p{#1}} % New column types for ragged-right paragraph columns
\newcommand{\tn}{\tabularnewline} % Required as custom column type in use

% Font and Colours
\definecolor{HeadBackground}{HTML}{333333}
\definecolor{FootBackground}{HTML}{666666}
\definecolor{TextColor}{HTML}{333333}
\definecolor{DarkBackground}{HTML}{A3A3A3}
\definecolor{LightBackground}{HTML}{F3F3F3}
\renewcommand{\familydefault}{\sfdefault}
\color{TextColor}

% Header and Footer
\pagestyle{fancy}
\fancyhead{} % Set header to blank
\fancyfoot{} % Set footer to blank
\fancyhead[L]{
\noindent \begin{multicols}{3}
\begin{tabulary}{5.8cm}{C}
  \SetRowColor{DarkBackground}
  \vspace{-7pt}
  {\parbox{\dimexpr\textwidth-2\fboxsep\relax}{\noindent
    \hspace*{-6pt}\includegraphics[width=5.8cm]{/web/www.cheatography.com/public/images/cheatography_logo.pdf}}
  }
\end{tabulary}
\columnbreak
\begin{tabulary}{11cm}{L}
  \vspace{-2pt}\large{\bf{\textcolor{DarkBackground}{\textrm{DL Cheat Sheet}}}} \\
  \normalsize{by \textcolor{DarkBackground}{sree017} via \textcolor{DarkBackground}{\uline{cheatography.com/126402/cs/24602/}}}
\end{tabulary}
\end{multicols}}

\fancyfoot[L]{ \footnotesize
\noindent \begin{multicols}{3}
\begin{tabulary}{5.8cm}{LL}
  \SetRowColor{FootBackground}
  \mymulticolumn{2}{p{5.377cm}}{\bf\textcolor{white}{Cheatographer}} \\
  \vspace{-2pt}sree017 \\
  \uline{cheatography.com/sree017} \\
\end{tabulary}
\vfill
\columnbreak
\begin{tabulary}{5.8cm}{L}
  \SetRowColor{FootBackground}
  \mymulticolumn{1}{p{5.377cm}}{\bf\textcolor{white}{Cheat Sheet}} \\
  \vspace{-2pt}Published 3rd October, 2020.\\
  Updated 3rd October, 2020.\\
  Page {\thepage} of \pageref{LastPage}.
\end{tabulary}
\vfill
\columnbreak
\begin{tabulary}{5.8cm}{L}
  \SetRowColor{FootBackground}
  \mymulticolumn{1}{p{5.377cm}}{\bf\textcolor{white}{Sponsor}} \\
  \SetRowColor{white}
  \vspace{-5pt}
  %\includegraphics[width=48px,height=48px]{dave.jpeg}
  Measure your website readability!\\
  www.readability-score.com
\end{tabulary}
\end{multicols}}

\begin{document}
\raggedright
\raggedcolumns

% Set font size to small. Switch to any value
% from this page to resize cheat sheet text:
% www.emerson.emory.edu/services/latex/latex_169.html
\footnotesize % Small font.

\begin{multicols*}{2}

\begin{tabularx}{8.4cm}{X}
\SetRowColor{DarkBackground}
\mymulticolumn{1}{x{8.4cm}}{\bf\textcolor{white}{Functions}} \tn
\SetRowColor{LightBackground}
\mymulticolumn{1}{x{8.4cm}}{Activation Functions: \newline
Activation functions help determine the output of a neural network. A function of this type is attached to each neuron in the network and decides whether that neuron should be activated, based on whether the neuron's input is relevant for the model's prediction. \newline
Activation functions also introduce non-linearity into the network, which lets it learn complex data, compute and approximate almost any function, and provide accurate predictions. \newline
\newline
Sigmoid function: an activation function of the form $f(x) = 1 / (1 + e^{-x})$. Its range is between 0 and 1 and it is an S-shaped curve that is easy to understand. \newline
Advantages: smooth gradient; output values bounded between 0 and 1; clear predictions, i.e. values very close to 1 or 0. \newline
Disadvantages: prone to vanishing gradients; the output is not zero-centred; the exponential is relatively expensive to compute. \newline
\newline
tanh function: the hyperbolic tangent, $f(x) = (1 - e^{-2x}) / (1 + e^{-2x})$. Its output is zero-centred because its range is between -1 and 1, which makes optimisation easier; in practice it is usually preferred over the sigmoid function. \newline
\newline
ReLU function: it has become very popular in the past couple of years and has been reported to converge about six times faster than tanh. It is $R(x) = \max(0, x)$, i.e. if $x < 0$ then $R(x) = 0$, and if $x \geq 0$ then $R(x) = x$. \newline
Advantages: no gradient saturation when the input is positive; faster to compute. \newline
Disadvantages: 1) When the input is negative, ReLU is completely inactive (``dying ReLU''). In the forward pass this is not a problem, but during backpropagation the gradient for negative inputs is exactly zero, so those units stop learning; this is the same vanishing-gradient issue as with sigmoid and tanh. 2) The output of ReLU is either 0 or a positive number, so ReLU is not a zero-centred function. \newline
\newline
Leaky ReLU function: to solve the dying-ReLU problem, the negative half of ReLU is set to $0.01x$ instead of 0. \newline
\newline
ELU (Exponential Linear Units) function: same advantages as ReLU, with no dead-unit issue. \newline
\newline
Other activation functions: Softmax, PReLU, Swish, Maxout, SoftPlus.} \tn
\hhline{>{\arrayrulecolor{DarkBackground}}-}
\end{tabularx}
\par\addvspace{1.3em}
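\noindent \emph{Illustrative example (not part of the original sheet): a minimal NumPy sketch of the activation functions above; the function names are my own.}
\begin{verbatim}
import numpy as np

def sigmoid(x):             # range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                # range (-1, 1)
    return np.tanh(x)

def relu(x):                # 0 for x < 0, else x
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):  # small slope for x < 0
    return np.where(x > 0, x, a * x)

def elu(x, a=1.0):          # smooth, no dead units
    return np.where(x > 0, x, a * (np.exp(x) - 1))

def softmax(x):             # scores -> probabilities
    e = np.exp(x - np.max(x))
    return e / e.sum()
\end{verbatim}
\par\addvspace{1.3em}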
\begin{tabularx}{8.4cm}{X}
\SetRowColor{DarkBackground}
\mymulticolumn{1}{x{8.4cm}}{\bf\textcolor{white}{Optimizers}} \tn
\SetRowColor{LightBackground}
\mymulticolumn{1}{x{8.4cm}}{Optimizers are algorithms or methods used to change the attributes of the neural network, such as the weights and the learning rate, in order to reduce the losses. They solve optimisation problems by minimising a function. \newline
\newline
Gradient descent (GD): an iterative optimisation algorithm that reduces the cost function and helps models make accurate predictions. \newline
\newline
Batch gradient descent: the entire dataset is used to compute the gradient of the cost function for each iteration, and then the weights are updated. \newline
\newline
SGD (stochastic gradient descent): a single data point (example) is used to calculate the gradient and update the weights at every iteration. \newline
\newline
Mini-batch gradient descent: a variation of stochastic gradient descent where a mini-batch of samples is used instead of a single training example. It is widely used, converges faster and is more stable. Because each batch contains several different samples, the noise (the variance of the weight updates) is reduced. \newline
\newline
Momentum: one disadvantage of SGD is that its update direction depends entirely on the current batch, so its updates are very unstable. A simple way to solve this is to introduce momentum, which simulates the inertia of a moving object: the direction of the previous update is retained to a certain extent, while the current gradient fine-tunes the final update direction. This increases stability, allows faster learning and gives some ability to escape local optima. \newline
\newline
Adagrad: an algorithm for gradient-based optimisation that adapts the learning rate to the parameters, using lower learning rates for parameters associated with frequently occurring features and higher learning rates for parameters associated with infrequent features. \newline
\newline
Adadelta: an extension of Adagrad that also tries to reduce Adagrad's aggressive, monotonically decreasing learning rate. \newline
\newline
RMSProp: keeps an exponentially decaying average of squared gradients and divides the learning rate by its root, which also counters Adagrad's rapidly shrinking learning rate. \newline
\newline
Adam (Adaptive Moment Estimation): can be viewed as a combination of Adagrad and RMSProp; Adagrad works well on sparse gradients and RMSProp works well in online and non-stationary settings, respectively. \newline
Adam scales the learning rate using an exponential moving average of the gradients, instead of the simple average used in Adagrad, and keeps an exponentially decaying average of past gradients. \newline
Adam is computationally efficient and has a very small memory requirement. \newline
It is one of the most popular and well-known gradient descent optimisation algorithms.} \tn
\hhline{>{\arrayrulecolor{DarkBackground}}-}
\end{tabularx}
\par\addvspace{1.3em}
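\noindent \emph{Illustrative example (not part of the original sheet): single parameter-update steps for SGD with momentum and Adam, assuming NumPy; variable names are my own.}
\begin{verbatim}
import numpy as np

# One SGD-with-momentum step: v keeps part of
# the previous update direction (inertia).
def sgd_momentum(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + lr * grad
    return w - v, v

# One Adam step: moving averages of the gradient
# (m) and of the squared gradient (s), with bias
# correction, scale the learning rate per weight.
def adam(w, grad, m, s, t, lr=0.001,
         b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    s_hat = s / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s
\end{verbatim}
\par\addvspace{1.3em}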
\begin{tabularx}{8.4cm}{X}
\SetRowColor{DarkBackground}
\mymulticolumn{1}{x{8.4cm}}{\bf\textcolor{white}{Loss Functions}} \tn
\SetRowColor{LightBackground}
\mymulticolumn{1}{x{8.4cm}}{Loss functions quantify the error between true and predicted values; training aims to minimise them. \newline
\newline
L1 loss function: minimises the error as the sum of all absolute differences between the true values and the predicted values. \newline
$L1 = \sum_{i=1}^{n} |y_{\mathrm{true}} - y_{\mathrm{pred}}|$ \newline
\newline
L2 loss function: minimises the error as the sum of all squared differences between the true values and the predicted values. \newline
$L2 = \sum_{i=1}^{n} (y_{\mathrm{true}} - y_{\mathrm{pred}})^2$ \newline
\newline
Huber loss: often used in regression problems. Compared with the L2 loss, it is less sensitive to outliers, because it is a piecewise function: when the residual is too large, the loss becomes a linear function of the residual. \newline
\newline
Hinge loss: often used for binary classification problems, e.g. with ground truth $t = 1$ or $-1$ and predicted value $y = wx + b$. \newline
\newline
Cross-entropy loss: used to define a loss function in machine learning and optimisation. Also called the log loss, it measures the performance of a classification model whose output is a probability value between 0 and 1. \newline
\newline
Sigmoid cross-entropy loss \newline
Softmax cross-entropy loss} \tn
\hhline{>{\arrayrulecolor{DarkBackground}}-}
\end{tabularx}
\par\addvspace{1.3em}
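\noindent \emph{Illustrative example (not part of the original sheet): the losses above written in NumPy; the cross-entropy shown is the binary (log loss) form and the function names are my own.}
\begin{verbatim}
import numpy as np

def l1_loss(y_true, y_pred):
    return np.sum(np.abs(y_true - y_pred))

def l2_loss(y_true, y_pred):
    return np.sum((y_true - y_pred) ** 2)

def hinge_loss(t, y):        # t in {-1, +1}
    return np.mean(np.maximum(0, 1 - t * y))

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p)
                    + (1 - y) * np.log(1 - p))
\end{verbatim}
\par\addvspace{1.3em}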
\begin{tabularx}{8.4cm}{X}
\SetRowColor{DarkBackground}
\mymulticolumn{1}{x{8.4cm}}{\bf\textcolor{white}{CNN}} \tn
\SetRowColor{LightBackground}
\mymulticolumn{1}{x{8.4cm}}{Convolutional Neural Networks (ConvNets or CNNs) are a category of neural networks that have proven very effective in areas such as image recognition and classification. \newline
\newline
Different types of layers in a CNN: \newline
\newline
1. Input layer: holds the raw input image, e.g. width 32, height 32 and depth 3. \newline
\newline
2. Convolution layer: computes the output volume by computing dot products between all filters and image patches. \newline
\newline
3. Activation function layer: applies an element-wise activation function to the output of the convolution layer. \newline
\newline
4. Pooling layer: periodically inserted within the network; its main function is to reduce the size of the volume, which makes the computation faster, reduces memory use and also prevents overfitting. Two common types of pooling layers are max pooling and average pooling. \newline
\newline
5. Fully connected layer: a regular neural-network layer that takes input from the previous layer, computes the class scores and outputs a 1-D array of size equal to the number of classes. \newline
\newline
Pooling, padding and filtering operations in a CNN \newline
\newline
Pooling: a down-sampling operation, typically applied after a convolution layer, which provides some spatial invariance and reduces the spatial dimensions of the CNN. \newline
It creates a pooled feature map by sliding a filter matrix over the input matrix. Max and average pooling are special kinds of pooling in which the maximum and the average value are taken, respectively. \newline
\newline
Pooling layers are used to reduce the dimensions of the feature maps. They therefore reduce the number of parameters to learn and the amount of computation performed in the network. \newline
The pooling layer summarises the features present in a region of the feature map generated by a convolution layer, so further operations are performed on summarised features instead of precisely positioned ones. This makes the model more robust to variations in the position of the features in the input image. \newline
\newline
Padding: the process of adding layers of zeros around the input image, to avoid problems such as the image shrinking every time a convolution is performed, and the pixels on the corners and edges being used much less than those in the middle. \newline
Valid padding: no padding at all; the input image is left in its unaltered shape. \newline
Same padding: `p' padding layers are added so that the output image has the same dimensions as the input image. \newline
\newline
Output size with padding: $(n + 2p - f)/s + 1$, where $n$ = input size, $p$ = padding, $f$ = filter size, $s$ = stride; e.g. $n=6$, $p=1$, $f=3$, $s=1$ gives $(6 + 2 - 3)/1 + 1 = 6$. \newline
Without padding: $(n - f)/s + 1$; e.g. $n=5$, $f=3$ gives $3$ with stride 1 and $2$ with stride 2.} \tn
\hhline{>{\arrayrulecolor{DarkBackground}}-}
\end{tabularx}
\par\addvspace{1.3em}
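\noindent \emph{Illustrative example (not part of the original sheet): a small helper that applies the output-size formula above; the function name is my own.}
\begin{verbatim}
def conv_output_size(n, f, p=0, s=1):
    # (n + 2p - f) // s + 1
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3, p=1, s=1))  # 6
print(conv_output_size(5, 3, p=0, s=1))  # 3
print(conv_output_size(5, 3, p=0, s=2))  # 2
\end{verbatim}
\par\addvspace{1.3em}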
\begin{tabularx}{8.4cm}{X}
\SetRowColor{DarkBackground}
\mymulticolumn{1}{x{8.4cm}}{\bf\textcolor{white}{CNN Architectures}} \tn
\SetRowColor{LightBackground}
\mymulticolumn{1}{x{8.4cm}}{LeNet: a very efficient 7-level convolutional neural network for handwritten character recognition ($32 \times 32$ pixel grayscale input, tanh activation functions and softmax at the last FC layer). \newline
Image -\textgreater{} Conv($5 \times 5$) -\textgreater{} Average Pool($2 \times 2$) -\textgreater{} Conv($5 \times 5$) -\textgreater{} Average Pool($2 \times 2$) -\textgreater{} Conv($5 \times 5$) -\textgreater{} FC -\textgreater{} FC \newline
\newline
AlexNet: winner of the 2012 ImageNet competition. \newline
\newline
Inception: also known as GoogLeNet, a 22-layer network. There are four parallel channels in each inception module, and their outputs are concatenated at the end of the module. \newline
\newline
ImageNet: the dataset has 1,000 image categories representing object classes that we encounter in day-to-day life, such as species of dogs, cats, various household objects, vehicle types, and much more. \newline
\newline
ResNet: also called Residual Neural Network. This architecture introduced the concept of ``skip connections''. Typically, the input passes through two linear transformations with ReLU activation; a residual network additionally copies the input directly to the output of the second transformation and sums the two before the final ReLU. \newline
\newline
VGG: VGG-16 is a simpler architecture model since it does not use many hyperparameters. It always uses $3 \times 3$ filters with a stride of 1 and SAME padding in the convolution layers, and $2 \times 2$ pooling layers with a stride of 2. \newline
Three fully connected layers follow the VGG convolutional layers. The width of the network starts at the small value of 64 and increases by a factor of 2 after every sub-sampling/pooling layer. It achieves a top-5 accuracy of 92.3\% on ImageNet.} \tn
\hhline{>{\arrayrulecolor{DarkBackground}}-}
\end{tabularx}
\par\addvspace{1.3em}
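\noindent \emph{Illustrative sketch (not part of the original sheet): the LeNet-style stack described above written with PyTorch layers, assuming the classic LeNet-5 channel sizes (6, 16, 120, 84); treat it as an approximation, not the exact original network.}
\begin{verbatim}
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # 32 -> 28
    nn.Tanh(),
    nn.AvgPool2d(2),                   # 28 -> 14
    nn.Conv2d(6, 16, kernel_size=5),   # 14 -> 10
    nn.Tanh(),
    nn.AvgPool2d(2),                   # 10 -> 5
    nn.Conv2d(16, 120, kernel_size=5), # 5  -> 1
    nn.Tanh(),
    nn.Flatten(),
    nn.Linear(120, 84),
    nn.Tanh(),
    nn.Linear(84, 10),                 # class scores
)
\end{verbatim}
\par\addvspace{1.3em}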
\begin{tabularx}{8.4cm}{X}
\SetRowColor{DarkBackground}
\mymulticolumn{1}{x{8.4cm}}{\bf\textcolor{white}{Object Detection}} \tn
\SetRowColor{LightBackground}
\mymulticolumn{1}{x{8.4cm}}{Haar Cascade: a machine learning object detection algorithm used to identify objects in an image or video, based on the concept of features. \newline
It has 4 stages: Haar feature selection, creating integral images, AdaBoost training, and cascading classifiers. \newline
It is well known for being able to detect faces and body parts in an image, but it can be trained to identify almost any object. \newline
\newline
R-CNN: to bypass the problem of selecting a huge number of regions, Ross Girshick et al.\ proposed a method that uses selective search to extract just 2000 regions from the image, called region proposals. Instead of trying to classify a huge number of regions, you can then work with only 2000. \newline
\newline
Problems with R-CNN: \newline
- It still takes a huge amount of time to train the network, as 2000 region proposals have to be classified per image. \newline
- It cannot run in real time, since it takes around 47 seconds per test image. \newline
- The selective search algorithm is fixed, so no learning happens at that stage, which can lead to bad candidate region proposals. \newline
\newline
Faster R-CNN: it has two networks: a region proposal network (RPN) for generating region proposals, and a network that uses these proposals to detect objects. The main difference from Fast R-CNN is that the latter uses selective search to generate region proposals. The time cost of generating region proposals is much smaller with the RPN than with selective search, because the RPN shares most of its computation with the object detection network. In brief, the RPN ranks region boxes (called anchors) and proposes the ones most likely to contain objects. \newline
\newline
Anchors play a very important role in Faster R-CNN. An anchor is a box; in the default configuration of Faster R-CNN there are nine anchors at each position of an image. \newline
\newline
The output of the region proposal network is a set of boxes/proposals that will be examined by a classifier and a regressor to eventually check for the occurrence of objects. More precisely, the RPN predicts the possibility of an anchor being background or foreground, and refines the anchor. \newline
\newline
DarkNet: a framework used to train neural networks; it is open source, written in C/CUDA, and serves as the basis for YOLO. Darknet is also used as the framework for training YOLO, meaning it defines the architecture of the network. \newline
Clone the repo locally and you have it. To compile it, run make; but first, if you intend to use the GPU capability, edit the first two lines of the Makefile to tell it to compile for GPU usage with CUDA drivers. \newline
\newline
YOLO: You Only Look Once. \newline
YOLO is a network ``inspired by'' GoogLeNet. It has 24 convolutional layers working as feature extractors and two dense layers for making the predictions. \newline
The framework it is built on is called Darknet, a neural network framework created by the first author of the YOLO paper. \newline
Core concept of YOLO: the algorithm divides the image into a grid of cells; for each cell, bounding boxes and their scores are predicted, alongside class probabilities. The confidence is given in terms of IoU (intersection over union), a metric that measures how much the detected object overlaps the ground truth as a fraction of the total area spanned by the two together (the union). \newline
YOLOv3: 53 convolutional layers. \newline
\newline
Mask R-CNN architecture: Mask R-CNN was proposed by Kaiming He et al.\ in 2017. It is very similar to Faster R-CNN except that there is another branch to predict a segmentation mask. The region proposal generation stage is the same in both architectures; the second stage, which works in parallel, predicts the class, generates the bounding box and outputs a binary mask for each RoI. \newline
\newline
Applications: \newline
Due to its additional capability to generate segmentation masks, Mask R-CNN is used in many computer vision applications such as human pose estimation, self-driving cars, drone image mapping, etc.} \tn
\hhline{>{\arrayrulecolor{DarkBackground}}-}
\end{tabularx}
\par\addvspace{1.3em}

% That's all folks
\end{multicols*}
\end{document}