\documentclass[10pt,a4paper]{article}

% Packages
\usepackage{fancyhdr} % For header and footer
\usepackage{multicol} % Allows multicols in tables
\usepackage{tabularx} % Intelligent column widths
\usepackage{tabulary} % Used in header and footer
\usepackage{hhline} % Border under tables
\usepackage{graphicx} % For images
\usepackage{xcolor} % For hex colours
%\usepackage[utf8x]{inputenc} % For unicode character support
\usepackage[T1]{fontenc} % Without this we get weird character replacements
\usepackage{colortbl} % For coloured tables
\usepackage{setspace} % For line height
\usepackage{lastpage} % Needed for total page number
\usepackage{seqsplit} % Splits long words.
%\usepackage{opensans} % Can't make this work so far. Shame. Would be lovely.
\usepackage[normalem]{ulem} % For underlining links
% Most of the following are not required for the majority
% of cheat sheets but are needed for some symbol support.
\usepackage{amsmath} % Symbols
\usepackage{MnSymbol} % Symbols
\usepackage{wasysym} % Symbols
%\usepackage[english,german,french,spanish,italian]{babel} % Languages

% Document Info
\author{Netsuiw}
\pdfinfo{
  /Title (deep-learning-quiz-1-cheat-sheet.pdf)
  /Creator (Cheatography)
  /Author (Netsuiw)
  /Subject (Deep Learning Quiz 1 Cheat Sheet)
}

% Lengths and widths
\addtolength{\textwidth}{6cm}
\addtolength{\textheight}{-1cm}
\addtolength{\hoffset}{-3cm}
\addtolength{\voffset}{-2cm}
\setlength{\tabcolsep}{0.2cm} % Space between columns
\setlength{\headsep}{-12pt} % Reduce space between header and content
\setlength{\headheight}{85pt} % If less, LaTeX automatically increases it
\renewcommand{\footrulewidth}{0pt} % Remove footer line
\renewcommand{\headrulewidth}{0pt} % Remove header line
\renewcommand{\seqinsert}{\ifmmode\allowbreak\else\-\fi} % Hyphens in seqsplit
% These two commands together give roughly
% the right line height in the tables
\renewcommand{\arraystretch}{1.3}
\onehalfspacing

% Commands
\newcommand{\SetRowColor}[1]{\noalign{\gdef\RowColorName{#1}}\rowcolor{\RowColorName}} % Shortcut for row colour
\newcommand{\mymulticolumn}[3]{\multicolumn{#1}{>{\columncolor{\RowColorName}}#2}{#3}} % For coloured multi-cols
\newcolumntype{x}[1]{>{\raggedright}p{#1}} % New column types for ragged-right paragraph columns
\newcommand{\tn}{\tabularnewline} % Required as custom column type in use

% Font and Colours
\definecolor{HeadBackground}{HTML}{333333}
\definecolor{FootBackground}{HTML}{666666}
\definecolor{TextColor}{HTML}{333333}
\definecolor{DarkBackground}{HTML}{A3A3A3}
\definecolor{LightBackground}{HTML}{F3F3F3}
\renewcommand{\familydefault}{\sfdefault}
\color{TextColor}

% Header and Footer
\pagestyle{fancy}
\fancyhead{} % Set header to blank
\fancyfoot{} % Set footer to blank
\fancyhead[L]{
\noindent \begin{multicols}{3}
\begin{tabulary}{5.8cm}{C}
  \SetRowColor{DarkBackground}
  \vspace{-7pt}
  {\parbox{\dimexpr\textwidth-2\fboxsep\relax}{\noindent
    \hspace*{-6pt}\includegraphics[width=5.8cm]{/web/www.cheatography.com/public/images/cheatography_logo.pdf}}
  }
\end{tabulary}
\columnbreak
\begin{tabulary}{11cm}{L}
  \vspace{-2pt}\large{\bf{\textcolor{DarkBackground}{\textrm{Deep Learning Quiz 1 Cheat Sheet}}}} \\
  \normalsize{by \textcolor{DarkBackground}{Netsuiw} via \textcolor{DarkBackground}{\uline{cheatography.com/200763/cs/42404/}}}
\end{tabulary}
\end{multicols}}
\fancyfoot[L]{ \footnotesize
\noindent \begin{multicols}{3}
\begin{tabulary}{5.8cm}{LL}
  \SetRowColor{FootBackground}
  \mymulticolumn{2}{p{5.377cm}}{\bf\textcolor{white}{Cheatographer}} \\
  \vspace{-2pt}Netsuiw \\
  \uline{cheatography.com/netsuiw} \\
\end{tabulary}
\vfill
\columnbreak
\begin{tabulary}{5.8cm}{L}
  \SetRowColor{FootBackground}
  \mymulticolumn{1}{p{5.377cm}}{\bf\textcolor{white}{Cheat Sheet}} \\
  \vspace{-2pt}Not Yet Published.\\
  Updated 15th February, 2024.\\
  Page {\thepage} of \pageref{LastPage}.
\end{tabulary}
\vfill
\columnbreak
\begin{tabulary}{5.8cm}{L}
  \SetRowColor{FootBackground}
  \mymulticolumn{1}{p{5.377cm}}{\bf\textcolor{white}{Sponsor}} \\
  \SetRowColor{white}
  \vspace{-5pt}
  %\includegraphics[width=48px,height=48px]{dave.jpeg}
  Measure your website readability!\\
  www.readability-score.com
\end{tabulary}
\end{multicols}}

\begin{document}
\raggedright
\raggedcolumns

% Set font size to small. Switch to any value
% from this page to resize cheat sheet text:
% www.emerson.emory.edu/services/latex/latex_169.html
\footnotesize % Small font.

\begin{multicols*}{4}

\begin{tabularx}{3.833cm}{x{1.47619 cm} x{1.95681 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Supervised Learning}} \tn
% Row 0
\SetRowColor{LightBackground}
Mapping from inputs to outputs & Needs paired examples (x\_i, y\_i) to learn from \tn
% Row Count 2 (+ 2)
% Row 1
\SetRowColor{white}
Examples are & Regression, Text Classification, Image Classification, etc. \tn
% Row Count 5 (+ 3)
% Row 2
\SetRowColor{LightBackground}
Normally in the form of & Input -\textgreater{} relate a family of equations to the input -\textgreater{} output prediction \tn
% Row Count 8 (+ 3)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Double descent \& COD}} \tn
% Row 0
\SetRowColor{LightBackground}
Double descent is the phenomenon where & the test error rises as the training error nears zero, then drops sharply again as model capacity keeps growing \tn
% Row Count 6 (+ 6)
% Row 1
\SetRowColor{white}
The tendency of high-dimensional space to overwhelm the number of data points is called the curse of dimensionality & Two data points sampled from a standard normal are at right angles to each other with high likelihood \tn
% Row Count 12 (+ 6)
% Row 2
\SetRowColor{LightBackground}
But the distance of random samples from the origin is roughly constant, and most of the volume of a high-dimensional orange is in the peel, not in the pulp & The volume of a diameter-one hypersphere tends to zero, and for random points sampled uniformly in a hypercube the ratio of nearest to farthest distance becomes close to one \tn
% Row Count 20 (+ 8)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.40753 cm} x{2.02547 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Loss/Cost function and Train/Test}} \tn
% Row 0
\SetRowColor{LightBackground}
Measurement of & how badly a model performs \tn
% Row Count 2 (+ 2)
% Row 1
\SetRowColor{white}
Trains on pairs of data & Find the argmin of this loss function \tn
% Row Count 4 (+ 2)
% Row 2
\SetRowColor{LightBackground}
Test on a separate set of data & Measure the loss there to see the model's generalizing power \tn
% Row Count 7 (+ 3)
% Row 3
\SetRowColor{white}
Different loss functions are & Squared loss, log-likelihood, ramp loss, etc. \tn
% Row Count 9 (+ 2)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}
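% Illustrative formula; the notation f[x,phi] for the model and L[phi] for the loss is assumed here rather than defined on the sheet.
For example, writing the model as $f[x_i,\phi]$, the squared loss over the training pairs and the fitted parameters are
\[
L[\phi]=\sum_{i}\bigl(f[x_i,\phi]-y_i\bigr)^{2},
\qquad
\hat{\phi}=\operatorname*{argmin}_{\phi}\,L[\phi] .
\]
\par\addvspace{1.3em}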
\begin{tabularx}{3.833cm}{X}
\SetRowColor{DarkBackground}
\mymulticolumn{1}{x{3.833cm}}{\bf\textcolor{white}{Counting number of parameters}} \tn
\SetRowColor{LightBackground}
\mymulticolumn{1}{p{3.833cm}}{\vspace{1px}\centerline{\includegraphics[width=5.1cm]{/web/www.cheatography.com/public/uploads/netsuiw_1708027310_Screenshot from 2024-02-16 02-52-38.png}}} \tn
\hhline{>{\arrayrulecolor{DarkBackground}}-}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Initialization}} \tn
% Row 0
\SetRowColor{LightBackground}
If at initialization the variance is too small or too big & then the activations can shrink or grow until floating-point errors occur \tn
% Row Count 3 (+ 3)
% Row 1
\SetRowColor{white}
So we want the variance to stay the same in the forward and backward pass & He initialization does this by setting the weight variance to 2/D\_h, so the variance at layer k+1 (forward) or layer k (backward) matches that of the layer before it \tn
% Row Count 9 (+ 6)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}
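% Sketch of He initialization; the Gaussian form and the symbol Omega for the weights are assumed notation.
A common way to write the rule above, where D\_h is the number of units feeding into the layer (ReLU networks):
\[
\Omega_{ij}\sim \mathcal{N}\!\left(0,\ \frac{2}{D_h}\right),
\]
which keeps the variance of the activations roughly constant from layer to layer.
\par\addvspace{1.3em}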
\begin{tabularx}{3.833cm}{X}
\SetRowColor{DarkBackground}
\mymulticolumn{1}{x{3.833cm}}{\bf\textcolor{white}{Counting number of parameters}} \tn
\SetRowColor{LightBackground}
\mymulticolumn{1}{p{3.833cm}}{\vspace{1px}\centerline{\includegraphics[width=5.1cm]{/web/www.cheatography.com/public/uploads/netsuiw_1708029820_Screenshot from 2024-02-16 03-42-48.png}}} \tn
\hhline{>{\arrayrulecolor{DarkBackground}}-}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Regularization techniques}} \tn
% Row 0
\SetRowColor{LightBackground}
Explicit regularization is the addition of a regularizing term to the loss function & This is also known as the prior in the probabilistic view. Normally L2 regularization is used, where the squared weights are added to the loss and scaled by a regularization coefficient \tn
% Row Count 9 (+ 9)
% Row 1
\SetRowColor{white}
Implicit regularization is the natural tendency of optimization algorithms and other aspects of the training process & to improve the generalization performance of a model even without explicitly adding regularization, e.g. SGD, whose mini-batches are a source of randomness \tn
% Row Count 18 (+ 9)
% Row 2
\SetRowColor{LightBackground}
Early stopping is the practice of stopping training early so that the weights (which start small) do not overfit & Ensembling combines several different models and averages their outputs (by mean or median). Resampling a different subset of the data for each model is called bagging \tn
% Row Count 25 (+ 7)
% Row 3
\SetRowColor{white}
Dropout is the technique of killing random units. It can eliminate kinks in the function that are far from the data and do not contribute to the training loss & Adding noise can also improve generalization \tn
% Row Count 33 (+ 8)
\end{tabularx}
\par\addvspace{1.3em}
\vfill
\columnbreak

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Regularization techniques (cont)}} \tn
% Row 4
\SetRowColor{LightBackground}
Bayesian inference can also be used to provide more information (via priors) & Transfer learning, multi-task learning, self-supervised learning, and data augmentation can also be used to improve generalization \tn
% Row Count 7 (+ 7)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}
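% Sketch of the L2-regularized objective; lambda as the regularization coefficient is assumed notation.
The explicit L2 regularization described above can be written as
\[
\tilde{L}[\phi]=L[\phi]+\lambda\sum_{j}\phi_{j}^{2},
\]
where a larger $\lambda$ pulls the weights more strongly towards zero.
\par\addvspace{1.3em}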
\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Bias Variance Tradeoff}} \tn
% Row 0
\SetRowColor{LightBackground}
Variance is the uncertainty in the fitted model due to the choice of training set & Bias is a systematic deviation from the true function we are modeling, due to limitations of our model \tn
% Row Count 6 (+ 6)
% Row 1
\SetRowColor{white}
Noise is the inherent uncertainty in the true mapping from input to output & Variance can be reduced by adding more data points \tn
% Row Count 10 (+ 4)
% Row 2
\SetRowColor{LightBackground}
Bias can be reduced by making the model more complex & But reducing one tends to increase the other, since a more complex model can overfit, which means more variance \tn
% Row Count 15 (+ 5)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Momentum \& Adam}} \tn
% Row 0
\SetRowColor{LightBackground}
Momentum is a weighted sum of the current gradient and the previous gradients & We can think of momentum as a prediction of where we are stepping \tn
% Row Count 4 (+ 4)
% Row 1
\SetRowColor{white}
Normalizing the gradients can leave us stuck if we do not land on the optimal point exactly & Adam prevents that by computing the mean and the pointwise squared gradients with momentum, and by moderating (bias-correcting) them near the start of the sequence \tn
% Row Count 11 (+ 7)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}
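% Sketch of the momentum and Adam updates; alpha (step size), beta, gamma (decay rates) and epsilon are assumed notation.
One standard way to write the updates above, with gradient $g_t$ at step $t$:
\[
m_{t+1}=\beta\,m_{t}+(1-\beta)\,g_{t},
\qquad
v_{t+1}=\gamma\,v_{t}+(1-\gamma)\,g_{t}^{2},
\]
\[
\phi_{t+1}=\phi_{t}-\alpha\,\frac{m_{t+1}/(1-\beta^{t+1})}{\sqrt{v_{t+1}/(1-\gamma^{t+1})}+\epsilon} .
\]
Momentum alone uses only the first term; Adam adds the squared-gradient term and the bias corrections (the divisions by $1-\beta^{t+1}$ and $1-\gamma^{t+1}$) that moderate the early steps.
\par\addvspace{1.3em}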
\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Stochastic Gradient Descent}} \tn
% Row 0
\SetRowColor{LightBackground}
Gradient descent might be slow & And not all gradients are needed to find the optimal point \tn
% Row Count 3 (+ 3)
% Row 1
\SetRowColor{white}
Compute the gradient based on only a subset of points – a mini-batch & Work through the dataset sampling without replacement \tn
% Row Count 7 (+ 4)
% Row 2
\SetRowColor{LightBackground}
One pass through the data is called an epoch & This can escape from local minima, but adds noise. It still uses all the data equally \tn
% Row Count 11 (+ 4)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Backpropagation}} \tn
% Row 0
\SetRowColor{LightBackground}
Two passes are done: a forward pass and a backward pass & The forward pass computes the activations at each layer and the intermediate values, and how they feed into the loss \tn
% Row Count 7 (+ 7)
% Row 1
\SetRowColor{white}
At that point we do not yet know the gradients, so the parameters cannot be updated (units form a dependency chain in the update) & The backward pass then calculates the gradients of the loss function, working through the network in reverse \tn
% Row Count 13 (+ 6)
% Row 2
\SetRowColor{LightBackground}
This is very efficient but memory-hungry & Another problem is splitting the computation process apart (i.e. when parts live on different computers) \tn
% Row Count 19 (+ 6)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Gradient Descent}} \tn
% Row 0
\SetRowColor{LightBackground}
Gradient descent finds the optimal point (for a convex function) & by stepping towards it, i.e., moving against the calculated gradient \tn
% Row Count 4 (+ 4)
% Row 1
\SetRowColor{white}
So the derivative of the loss function w.r.t. the parameters is calculated & and then the parameters are updated by subtracting it. A learning rate is applied to speed up or slow down the steps \tn
% Row Count 9 (+ 5)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}
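% Sketch of the gradient-descent and SGD steps; alpha is the learning rate and B_t a sampled mini-batch (assumed notation).
The update described above, in its full-batch and mini-batch (SGD) forms, where $\ell_i$ is the loss on example $i$:
\[
\phi_{t+1}=\phi_{t}-\alpha\,\frac{\partial L}{\partial\phi}\bigg|_{\phi_{t}},
\qquad
\phi_{t+1}=\phi_{t}-\alpha\sum_{i\in B_{t}}\frac{\partial\ell_{i}}{\partial\phi}\bigg|_{\phi_{t}} .
\]
\par\addvspace{1.3em}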
\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Deep neural networks}} \tn
% Row 0
\SetRowColor{LightBackground}
Simply neural networks with more than one hidden layer & Better than simply feeding the output of one shallow network into another (fewer parameters, more regions) \tn
% Row Count 6 (+ 6)
% Row 1
\SetRowColor{white}
Basically the outputs from the hidden units & go into another hidden layer as inputs \tn
% Row Count 8 (+ 2)
% Row 2
\SetRowColor{LightBackground}
Also obeys the universal approximation theorem & The difference from a shallow network is more regions per parameter \tn
% Row Count 12 (+ 4)
% Row 3
\SetRowColor{white}
The hyperparameters are K, the depth (number of hidden layers), and D\_i, the number of units at layer i & There exist problems where shallow networks would need far too many units to approximate well \tn
% Row Count 17 (+ 5)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Convolutional networks}} \tn
% Row 0
\SetRowColor{LightBackground}
Parameters only look at local image patches, and so parameters are shared across the image & The convolution operation takes weighted combinations of nearby inputs \tn
% Row Count 4 (+ 4)
% Row 1
\SetRowColor{white}
Stride = shift by k positions for each output, Kernel size = weight a different number of inputs for each output, Dilated or atrous convolutions = intersperse kernel values with zeros & A larger stride decreases the output size, a larger kernel combines information from a larger area, and dilated convolutions combine information over a larger area while using few parameters \tn
% Row Count 14 (+ 10)
% Row 2
\SetRowColor{LightBackground}
But we do not want to lose information: this is handled by applying several convolutions and stacking them in channels (feature maps) & The receptive field is the region in the input space that a particular CNN feature is affected by \tn
% Row Count 20 (+ 6)
% Row 3
\SetRowColor{white}
The benefit of a CNN is a better inductive bias: forcing the network to process each location similarly, share information, search within a smaller family of input/output mappings, etc. & Downsampling reduces the number of positions in the data (max pooling, i.e. taking the max, is the most common), while upsampling increases it \tn
% Row Count 28 (+ 8)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Reinforcement Learning}} \tn
% Row 0
\SetRowColor{LightBackground}
Create a set of states, actions, and rewards & The goal is to maximize reward by reaching the right states \tn
% Row Count 3 (+ 3)
% Row 1
\SetRowColor{white}
No dataset is involved & Experience is received from the world as it is built up and explored \tn
% Row Count 6 (+ 3)
% Row 2
\SetRowColor{LightBackground}
Examples are & Chess, video games, etc. \tn
% Row Count 8 (+ 2)
% Row 3
\SetRowColor{white}
Flaws are that it is & stochastic; has the temporal credit assignment problem, i.e., was the reward achieved by this move or by past moves; and the \seqsplit{exploration-exploitation} trade-off, i.e. when to explore and when not to \tn
% Row Count 17 (+ 9)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Shallow neural network}} \tn
% Row 0
\SetRowColor{LightBackground}
Use nonlinear activation functions & to mold the family of functions onto the dataset \tn
% Row Count 2 (+ 2)
% Row 1
\SetRowColor{white}
Common activation functions are & ReLU, sigmoid/softmax (as the final layer), tanh (similar in shape to sigmoid), etc. \tn
% Row Count 6 (+ 4)
% Row 2
\SetRowColor{LightBackground}
Normally a set of linear functions is computed and the activation function transforms them (this is the hidden layer) & so that a specific hidden unit is activated or not depending on that function \tn
% Row Count 11 (+ 5)
% Row 3
\SetRowColor{white}
Called shallow since there is only one hidden layer & The universal approximation theorem states that with enough hidden units it can approximate any continuous function on a compact subset \tn
% Row Count 18 (+ 7)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Maximum likelihood}} \tn
% Row 0
\SetRowColor{LightBackground}
Points in a dataset can be seen as coming from an underlying distribution & The main idea of using the likelihood function is to estimate this distribution \tn
% Row Count 4 (+ 4)
% Row 1
\SetRowColor{white}
The model predicts a conditional probability Pr(y|x)=Pr(y|θ)=Pr(y|f{[}x,ϕ{]}) & Here the loss function aims to give the correct outputs high probability \tn
% Row Count 8 (+ 4)
% Row 2
\SetRowColor{LightBackground}
So we find the argmax over ϕ (or the argmin if we negate the objective function) & The product can be a very small value, so the log is taken to turn it into a summation \tn
% Row Count 12 (+ 4)
% Row 3
\SetRowColor{white}
Softmax is used in the case of multiclass classification & It converts a vector of K real numbers into a probability distribution over K possible outcomes \tn
% Row Count 17 (+ 5)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}
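% Sketch of the softmax and the maximum-likelihood objective; z denotes the vector of K network outputs (assumed notation).
The softmax and the resulting training objective from the table can be written as
\[
\mathrm{softmax}_{k}(z)=\frac{e^{z_{k}}}{\sum_{j=1}^{K}e^{z_{j}}},
\qquad
\hat{\phi}=\operatorname*{argmin}_{\phi}\Bigl(-\sum_{i}\log \Pr\bigl(y_{i}\,|\,f[x_{i},\phi]\bigr)\Bigr) .
\]
\par\addvspace{1.3em}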
\begin{tabularx}{3.833cm}{x{1.64784 cm} x{1.78516 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Unsupervised Learning}} \tn
% Row 0
\SetRowColor{LightBackground}
Learning from a dataset without any labels & So the dataset is organized in an input-only fashion \tn
% Row Count 3 (+ 3)
% Row 1
\SetRowColor{white}
Examples are & Clustering, outlier finding, generating examples, filling in missing data \tn
% Row Count 7 (+ 4)
% Row 2
\SetRowColor{LightBackground}
There are generative models & like generative adversarial networks \tn
% Row Count 9 (+ 2)
% Row 3
\SetRowColor{white}
Also probabilistic generative models & which learn the distribution over the data. Examples are autoencoders, normalizing flows, and diffusion models \tn
% Row Count 14 (+ 5)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

% That's all folks
\end{multicols*}

\end{document}