\documentclass[10pt,a4paper]{article}

% Packages
\usepackage{fancyhdr} % For header and footer
\usepackage{multicol} % Allows multicols in tables
\usepackage{tabularx} % Intelligent column widths
\usepackage{tabulary} % Used in header and footer
\usepackage{hhline} % Border under tables
\usepackage{graphicx} % For images
\usepackage{xcolor} % For hex colours
%\usepackage[utf8x]{inputenc} % For unicode character support
\usepackage[T1]{fontenc} % Without this we get weird character replacements
\usepackage{colortbl} % For coloured tables
\usepackage{setspace} % For line height
\usepackage{lastpage} % Needed for total page number
\usepackage{seqsplit} % Splits long words.
%\usepackage{opensans} % Can't make this work so far. Shame. Would be lovely.
\usepackage[normalem]{ulem} % For underlining links
% Most of the following are not required for the majority
% of cheat sheets but are needed for some symbol support.
\usepackage{amsmath} % Symbols
\usepackage{MnSymbol} % Symbols
\usepackage{wasysym} % Symbols
%\usepackage[english,german,french,spanish,italian]{babel} % Languages

% Document Info
\author{Netsuiw}
\pdfinfo{
  /Title (deep-learning-quiz-1-cheat-sheet.pdf)
  /Creator (Cheatography)
  /Author (Netsuiw)
  /Subject (Deep Learning Quiz 1 Cheat Sheet)
}

% Lengths and widths
\addtolength{\textwidth}{6cm}
\addtolength{\textheight}{-1cm}
\addtolength{\hoffset}{-3cm}
\addtolength{\voffset}{-2cm}
\setlength{\tabcolsep}{0.2cm} % Space between columns
\setlength{\headsep}{-12pt} % Reduce space between header and content
\setlength{\headheight}{85pt} % If less, LaTeX automatically increases it
\renewcommand{\footrulewidth}{0pt} % Remove footer line
\renewcommand{\headrulewidth}{0pt} % Remove header line
\renewcommand{\seqinsert}{\ifmmode\allowbreak\else\-\fi} % Hyphens in seqsplit
% These two commands together give roughly
% the right line height in the tables
\renewcommand{\arraystretch}{1.3}
\onehalfspacing

% Commands
\newcommand{\SetRowColor}[1]{\noalign{\gdef\RowColorName{#1}}\rowcolor{\RowColorName}} % Shortcut for row colour
\newcommand{\mymulticolumn}[3]{\multicolumn{#1}{>{\columncolor{\RowColorName}}#2}{#3}} % For coloured multi-cols
\newcolumntype{x}[1]{>{\raggedright}p{#1}} % New column types for ragged-right paragraph columns
\newcommand{\tn}{\tabularnewline} % Required as custom column type in use

% Font and Colours
\definecolor{HeadBackground}{HTML}{333333}
\definecolor{FootBackground}{HTML}{666666}
\definecolor{TextColor}{HTML}{333333}
\definecolor{DarkBackground}{HTML}{A3A3A3}
\definecolor{LightBackground}{HTML}{F3F3F3}
\renewcommand{\familydefault}{\sfdefault}
\color{TextColor}

% Header and Footer
\pagestyle{fancy}
\fancyhead{} % Set header to blank
\fancyfoot{} % Set footer to blank
\fancyhead[L]{
\noindent \begin{multicols}{3}
\begin{tabulary}{5.8cm}{C}
  \SetRowColor{DarkBackground}
  \vspace{-7pt}
  {\parbox{\dimexpr\textwidth-2\fboxsep\relax}{\noindent
    \hspace*{-6pt}\includegraphics[width=5.8cm]{/web/www.cheatography.com/public/images/cheatography_logo.pdf}}
  }
\end{tabulary}
\columnbreak
\begin{tabulary}{11cm}{L}
  \vspace{-2pt}\large{\bf{\textcolor{DarkBackground}{\textrm{Deep Learning Quiz 1 Cheat Sheet}}}} \\
  \normalsize{by \textcolor{DarkBackground}{Netsuiw} via \textcolor{DarkBackground}{\uline{cheatography.com/200763/cs/42404/}}}
\end{tabulary}
\end{multicols}}
\fancyfoot[L]{ \footnotesize
\noindent \begin{multicols}{3}
\begin{tabulary}{5.8cm}{LL}
  \SetRowColor{FootBackground}
  \mymulticolumn{2}{p{5.377cm}}{\bf\textcolor{white}{Cheatographer}} \\
  \vspace{-2pt}Netsuiw \\
  \uline{cheatography.com/netsuiw} \\
\end{tabulary}
\vfill
\columnbreak
\begin{tabulary}{5.8cm}{L}
  \SetRowColor{FootBackground}
  \mymulticolumn{1}{p{5.377cm}}{\bf\textcolor{white}{Cheat Sheet}} \\
  \vspace{-2pt}Not Yet Published.\\
  Updated 15th February, 2024.\\
  Page {\thepage} of \pageref{LastPage}.
\end{tabulary}
\vfill
\columnbreak
\begin{tabulary}{5.8cm}{L}
  \SetRowColor{FootBackground}
  \mymulticolumn{1}{p{5.377cm}}{\bf\textcolor{white}{Sponsor}} \\
  \SetRowColor{white}
  \vspace{-5pt}
  %\includegraphics[width=48px,height=48px]{dave.jpeg}
  Measure your website readability!\\
  www.readability-score.com
\end{tabulary}
\end{multicols}}

\begin{document}
\raggedright
\raggedcolumns

% Set font size to small. Switch to any value
% from this page to resize cheat sheet text:
% www.emerson.emory.edu/services/latex/latex_169.html
\footnotesize % Small font.

\begin{multicols*}{4}

\begin{tabularx}{3.833cm}{x{1.47619 cm} x{1.95681 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Supervised Learning}} \tn
% Row 0
\SetRowColor{LightBackground}
Mapping from inputs to outputs & Needs paired examples (x\_i, y\_i) to learn from \tn
% Row Count 2 (+ 2)
% Row 1
\SetRowColor{white}
Examples are & Regression, Text Classification, Image Classification, etc. \tn
% Row Count 5 (+ 3)
% Row 2
\SetRowColor{LightBackground}
Normally in the form of & Input -\textgreater{} relate a family of equations to the input -\textgreater{} output prediction \tn
% Row Count 8 (+ 3)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Double descent \& COD}} \tn
% Row 0
\SetRowColor{LightBackground}
Double descent is the phenomenon where & the test error rises as the training error nears zero, then drops sharply again as model capacity keeps growing \tn
% Row Count 6 (+ 6)
% Row 1
\SetRowColor{white}
The tendency of high-dimensional space to overwhelm the number of data points is called the curse of dimensionality & Two data points sampled from a standard normal are at right angles to each other with high likelihood \tn
% Row Count 12 (+ 6)
% Row 2
\SetRowColor{LightBackground}
But the distance of random samples from the origin is roughly constant, and most of the volume of a high-dimensional orange is in the peel, not in the pulp & The volume of a diameter-one hypersphere tends to zero, and for random points sampled uniformly in a hypercube the ratio of nearest to farthest distance becomes close to one \tn
% Row Count 20 (+ 8)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.40753 cm} x{2.02547 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Loss/Cost function and Train/Test}} \tn
% Row 0
\SetRowColor{LightBackground}
Measurement of & how badly a model performs \tn
% Row Count 2 (+ 2)
% Row 1
\SetRowColor{white}
Trains on pairs of data & Find the argmin of this loss function \tn
% Row Count 4 (+ 2)
% Row 2
\SetRowColor{LightBackground}
Test on a separate set of data & Measure the loss there to see the model's generalizing power \tn
% Row Count 7 (+ 3)
% Row 3
\SetRowColor{white}
Different loss functions are & Squared loss, log-likelihood, ramp loss, etc. \tn
% Row Count 9 (+ 2)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}
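% Illustrative formula; the notation f[x,phi] for the model and L[phi] for the loss is assumed here rather than defined on the sheet.
For example, writing the model as $f[x_i,\phi]$, the squared loss over the training pairs and the fitted parameters are
\[
L[\phi]=\sum_{i}\bigl(f[x_i,\phi]-y_i\bigr)^{2},
\qquad
\hat{\phi}=\operatorname*{argmin}_{\phi}\,L[\phi] .
\]
\par\addvspace{1.3em}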
\begin{tabularx}{3.833cm}{X}
\SetRowColor{DarkBackground}
\mymulticolumn{1}{x{3.833cm}}{\bf\textcolor{white}{Counting number of parameters}} \tn
\SetRowColor{LightBackground}
\mymulticolumn{1}{p{3.833cm}}{\vspace{1px}\centerline{\includegraphics[width=5.1cm]{/web/www.cheatography.com/public/uploads/netsuiw_1708027310_Screenshot from 2024-02-16 02-52-38.png}}} \tn
\hhline{>{\arrayrulecolor{DarkBackground}}-}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Initialization}} \tn
% Row 0
\SetRowColor{LightBackground}
If at initialization the variance is too small or too big & then the activations can shrink or grow until floating-point errors occur \tn
% Row Count 3 (+ 3)
% Row 1
\SetRowColor{white}
So we want the variance to stay the same in the forward and backward pass & He initialization does this by setting the weight variance to 2/D\_h, so the variance at layer k+1 (forward) or layer k (backward) matches that of the layer before it \tn
% Row Count 9 (+ 6)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}
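% Sketch of He initialization; the Gaussian form and the symbol Omega for the weights are assumed notation.
A common way to write the rule above, where D\_h is the number of units feeding into the layer (ReLU networks):
\[
\Omega_{ij}\sim \mathcal{N}\!\left(0,\ \frac{2}{D_h}\right),
\]
which keeps the variance of the activations roughly constant from layer to layer.
\par\addvspace{1.3em}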
\begin{tabularx}{3.833cm}{X}
\SetRowColor{DarkBackground}
\mymulticolumn{1}{x{3.833cm}}{\bf\textcolor{white}{Counting number of parameters}} \tn
\SetRowColor{LightBackground}
\mymulticolumn{1}{p{3.833cm}}{\vspace{1px}\centerline{\includegraphics[width=5.1cm]{/web/www.cheatography.com/public/uploads/netsuiw_1708029820_Screenshot from 2024-02-16 03-42-48.png}}} \tn
\hhline{>{\arrayrulecolor{DarkBackground}}-}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Regularization techniques}} \tn
% Row 0
\SetRowColor{LightBackground}
Explicit regularization is the addition of a regularizing term to the loss function & This is also known as the prior in the probabilistic view. Normally L2 regularization is used, where the squared weights are added to the loss and scaled by a regularization coefficient \tn
% Row Count 9 (+ 9)
% Row 1
\SetRowColor{white}
Implicit regularization is the natural tendency of optimization algorithms and other aspects of the training process & to improve the generalization performance of a model even without explicitly adding regularization, e.g. SGD, whose mini-batches are a source of randomness \tn
% Row Count 18 (+ 9)
% Row 2
\SetRowColor{LightBackground}
Early stopping is the practice of stopping training early so that the weights (which start small) do not overfit & Ensembling combines several different models and averages their outputs (by mean or median). Resampling a different subset of the data for each model is called bagging \tn
% Row Count 25 (+ 7)
% Row 3
\SetRowColor{white}
Dropout is the technique of killing random units. It can eliminate kinks in the function that are far from the data and do not contribute to the training loss & Adding noise can also improve generalization \tn
% Row Count 33 (+ 8)
\end{tabularx}
\par\addvspace{1.3em}
\vfill
\columnbreak

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Regularization techniques (cont)}} \tn
% Row 4
\SetRowColor{LightBackground}
Bayesian inference can also be used to provide more information (via priors) & Transfer learning, multi-task learning, self-supervised learning, and data augmentation can also be used to improve generalization \tn
% Row Count 7 (+ 7)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}
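% Sketch of the L2-regularized objective; lambda as the regularization coefficient is assumed notation.
The explicit L2 regularization described above can be written as
\[
\tilde{L}[\phi]=L[\phi]+\lambda\sum_{j}\phi_{j}^{2},
\]
where a larger $\lambda$ pulls the weights more strongly towards zero.
\par\addvspace{1.3em}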
\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Bias Variance Tradeoff}} \tn
% Row 0
\SetRowColor{LightBackground}
Variance is the uncertainty in the fitted model due to the choice of training set & Bias is a systematic deviation from the true function we are modeling, due to limitations of our model \tn
% Row Count 6 (+ 6)
% Row 1
\SetRowColor{white}
Noise is the inherent uncertainty in the true mapping from input to output & Variance can be reduced by adding more data points \tn
% Row Count 10 (+ 4)
% Row 2
\SetRowColor{LightBackground}
Bias can be reduced by making the model more complex & But reducing one tends to increase the other, since a more complex model can overfit, which means more variance \tn
% Row Count 15 (+ 5)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Momentum \& Adam}} \tn
% Row 0
\SetRowColor{LightBackground}
Momentum is a weighted sum of the current gradient and the previous gradients & We can think of momentum as a prediction of where we are stepping \tn
% Row Count 4 (+ 4)
% Row 1
\SetRowColor{white}
Normalizing the gradients can leave us stuck if we do not land on the optimal point exactly & Adam prevents that by computing the mean and the pointwise squared gradients with momentum, and by moderating (bias-correcting) them near the start of the sequence \tn
% Row Count 11 (+ 7)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}
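% Sketch of the momentum and Adam updates; alpha (step size), beta, gamma (decay rates) and epsilon are assumed notation.
One standard way to write the updates above, with gradient $g_t$ at step $t$:
\[
m_{t+1}=\beta\,m_{t}+(1-\beta)\,g_{t},
\qquad
v_{t+1}=\gamma\,v_{t}+(1-\gamma)\,g_{t}^{2},
\]
\[
\phi_{t+1}=\phi_{t}-\alpha\,\frac{m_{t+1}/(1-\beta^{t+1})}{\sqrt{v_{t+1}/(1-\gamma^{t+1})}+\epsilon} .
\]
Momentum alone uses only the first term; Adam adds the squared-gradient term and the bias corrections (the divisions by $1-\beta^{t+1}$ and $1-\gamma^{t+1}$) that moderate the early steps.
\par\addvspace{1.3em}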
\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Stochastic Gradient Descent}} \tn
% Row 0
\SetRowColor{LightBackground}
Gradient descent might be slow & And not all gradients are needed to find the optimal point \tn
% Row Count 3 (+ 3)
% Row 1
\SetRowColor{white}
Compute the gradient based on only a subset of points – a mini-batch & Work through the dataset sampling without replacement \tn
% Row Count 7 (+ 4)
% Row 2
\SetRowColor{LightBackground}
One pass through the data is called an epoch & This can escape from local minima, but adds noise. It still uses all the data equally \tn
% Row Count 11 (+ 4)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Backpropagation}} \tn
% Row 0
\SetRowColor{LightBackground}
Two passes are done: a forward pass and a backward pass & The forward pass computes the activations at each layer and the intermediate values, and how they feed into the loss \tn
% Row Count 7 (+ 7)
% Row 1
\SetRowColor{white}
At that point we do not yet know the gradients, so the parameters cannot be updated (units form a dependency chain in the update) & The backward pass then calculates the gradients of the loss function, working through the network in reverse \tn
% Row Count 13 (+ 6)
% Row 2
\SetRowColor{LightBackground}
This is very efficient but memory-hungry & Another problem is splitting the computation process apart (i.e. when parts live on different computers) \tn
% Row Count 19 (+ 6)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Gradient Descent}} \tn
% Row 0
\SetRowColor{LightBackground}
Gradient descent finds the optimal point (for a convex function) & by stepping towards it, i.e., moving against the calculated gradient \tn
% Row Count 4 (+ 4)
% Row 1
\SetRowColor{white}
So the derivative of the loss function w.r.t. the parameters is calculated & and then the parameters are updated by subtracting it. A learning rate is applied to speed up or slow down the steps \tn
% Row Count 9 (+ 5)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}
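% Sketch of the gradient-descent and SGD steps; alpha is the learning rate and B_t a sampled mini-batch (assumed notation).
The update described above, in its full-batch and mini-batch (SGD) forms, where $\ell_i$ is the loss on example $i$:
\[
\phi_{t+1}=\phi_{t}-\alpha\,\frac{\partial L}{\partial\phi}\bigg|_{\phi_{t}},
\qquad
\phi_{t+1}=\phi_{t}-\alpha\sum_{i\in B_{t}}\frac{\partial\ell_{i}}{\partial\phi}\bigg|_{\phi_{t}} .
\]
\par\addvspace{1.3em}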
\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Deep neural networks}} \tn
% Row 0
\SetRowColor{LightBackground}
Simply neural networks with more than one hidden layer & Better than simply feeding the output of one shallow network into another (fewer parameters, more regions) \tn
% Row Count 6 (+ 6)
% Row 1
\SetRowColor{white}
Basically the outputs from the hidden units & go into another hidden layer as inputs \tn
% Row Count 8 (+ 2)
% Row 2
\SetRowColor{LightBackground}
Also obeys the universal approximation theorem & The difference from a shallow network is more regions per parameter \tn
% Row Count 12 (+ 4)
% Row 3
\SetRowColor{white}
The hyperparameters are K, the depth (number of hidden layers), and D\_i, the number of units at layer i & There exist problems where shallow networks would need far too many units to approximate well \tn
% Row Count 17 (+ 5)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Convolutional networks}} \tn
% Row 0
\SetRowColor{LightBackground}
Parameters only look at local image patches, and so parameters are shared across the image & The convolution operation takes weighted combinations of nearby inputs \tn
% Row Count 4 (+ 4)
% Row 1
\SetRowColor{white}
Stride = shift by k positions for each output, Kernel size = weight a different number of inputs for each output, Dilated or atrous convolutions = intersperse kernel values with zeros & A larger stride decreases the output size, a larger kernel combines information from a larger area, and dilated convolutions combine information over a larger area while using few parameters \tn
% Row Count 14 (+ 10)
% Row 2
\SetRowColor{LightBackground}
But we do not want to lose information: this is handled by applying several convolutions and stacking them in channels (feature maps) & The receptive field is the region in the input space that a particular CNN feature is affected by \tn
% Row Count 20 (+ 6)
% Row 3
\SetRowColor{white}
The benefit of a CNN is a better inductive bias: forcing the network to process each location similarly, share information, search within a smaller family of input/output mappings, etc. & Downsampling reduces the number of positions in the data (max pooling, i.e. taking the max, is the most common), while upsampling increases it \tn
% Row Count 28 (+ 8)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Reinforcement Learning}} \tn
% Row 0
\SetRowColor{LightBackground}
Create a set of states, actions, and rewards & The goal is to maximize reward by reaching the right states \tn
% Row Count 3 (+ 3)
% Row 1
\SetRowColor{white}
No dataset is involved & Experience is received from the world as it is built up and explored \tn
% Row Count 6 (+ 3)
% Row 2
\SetRowColor{LightBackground}
Examples are & Chess, video games, etc. \tn
% Row Count 8 (+ 2)
% Row 3
\SetRowColor{white}
Flaws are that it is & stochastic; has the temporal credit assignment problem, i.e., was the reward achieved by this move or by past moves; and the \seqsplit{exploration-exploitation} trade-off, i.e. when to explore and when not to \tn
% Row Count 17 (+ 9)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Shallow neural network}} \tn
% Row 0
\SetRowColor{LightBackground}
Use nonlinear activation functions & to mold the family of functions onto the dataset \tn
% Row Count 2 (+ 2)
% Row 1
\SetRowColor{white}
Common activation functions are & ReLU, sigmoid/softmax (as the final layer), tanh (similar in shape to sigmoid), etc. \tn
% Row Count 6 (+ 4)
% Row 2
\SetRowColor{LightBackground}
Normally a set of linear functions is computed and the activation function transforms them (this is the hidden layer) & so that a specific hidden unit is activated or not depending on that function \tn
% Row Count 11 (+ 5)
% Row 3
\SetRowColor{white}
Called shallow since there is only one hidden layer & The universal approximation theorem states that with enough hidden units it can approximate any continuous function on a compact subset \tn
% Row Count 18 (+ 7)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

\begin{tabularx}{3.833cm}{x{1.7165 cm} x{1.7165 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Maximum likelihood}} \tn
% Row 0
\SetRowColor{LightBackground}
Points in a dataset can be seen as coming from an underlying distribution & The main idea of using the likelihood function is to estimate this distribution \tn
% Row Count 4 (+ 4)
% Row 1
\SetRowColor{white}
The model predicts a conditional probability Pr(y|x)=Pr(y|θ)=Pr(y|f{[}x,ϕ{]}) & Here the loss function aims to give the correct outputs high probability \tn
% Row Count 8 (+ 4)
% Row 2
\SetRowColor{LightBackground}
So we find the argmax over ϕ (or the argmin if we negate the objective function) & The product can be a very small value, so the log is taken to turn it into a summation \tn
% Row Count 12 (+ 4)
% Row 3
\SetRowColor{white}
Softmax is used in the case of multiclass classification & It converts a vector of K real numbers into a probability distribution over K possible outcomes \tn
% Row Count 17 (+ 5)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}
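% Sketch of the softmax and the maximum-likelihood objective; z denotes the vector of K network outputs (assumed notation).
The softmax and the resulting training objective from the table can be written as
\[
\mathrm{softmax}_{k}(z)=\frac{e^{z_{k}}}{\sum_{j=1}^{K}e^{z_{j}}},
\qquad
\hat{\phi}=\operatorname*{argmin}_{\phi}\Bigl(-\sum_{i}\log \Pr\bigl(y_{i}\,|\,f[x_{i},\phi]\bigr)\Bigr) .
\]
\par\addvspace{1.3em}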
\begin{tabularx}{3.833cm}{x{1.64784 cm} x{1.78516 cm} }
\SetRowColor{DarkBackground}
\mymulticolumn{2}{x{3.833cm}}{\bf\textcolor{white}{Unsupervised Learning}} \tn
% Row 0
\SetRowColor{LightBackground}
Learning from a dataset without any labels & So the dataset is organized in an input-only fashion \tn
% Row Count 3 (+ 3)
% Row 1
\SetRowColor{white}
Examples are & Clustering, outlier finding, generating examples, filling in missing data \tn
% Row Count 7 (+ 4)
% Row 2
\SetRowColor{LightBackground}
There are generative models & like generative adversarial networks \tn
% Row Count 9 (+ 2)
% Row 3
\SetRowColor{white}
Also probabilistic generative models & which learn the distribution over the data. Examples are autoencoders, normalizing flows, and diffusion models \tn
% Row Count 14 (+ 5)
\hhline{>{\arrayrulecolor{DarkBackground}}--}
\end{tabularx}
\par\addvspace{1.3em}

% That's all folks
\end{multicols*}

\end{document}