Deep Learning Quiz 1 Cheat Sheet (DRAFT)

A cheat sheet for DL quiz tomorrow

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Supervised Learning

Mapping from inputs to outputs
Need paired examples (x_i,y_i) to learn from
Examples are
Regression, Text Classification, Image Classification, etc.
Normally in the form of
Input -> Model (a family of equations relating input to output) -> Output prediction

Double descent & COD

Double descent is the phenomenon where
the test error first falls, then rises as model capacity grows and the training error nears zero, and then falls again as capacity increases further
The tendency of high-dimensional space to overwhelm the number of data points is called the curse of dimensionality
Two data points sampled at random from a standard normal distribution are, with high probability, nearly at right angles to each other
The distance of random samples from the origin is roughly constant, and most of the volume of a high-dimensional orange is in the peel, not in the pulp
The volume of a hypersphere of diameter one tends to zero as the dimension grows, and for random points generated uniformly in a hypercube, the ratio of nearest to farthest distance approaches one
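
A quick numerical check of two of the claims above, as a rough sketch only. The function names, the sample sizes, and the choice of 2 versus 500 dimensions are assumptions made for illustration.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def mean_abs_cosine(dim, n_pairs=200):
    """Average |cos(angle)| between pairs of standard-normal vectors."""
    total = 0.0
    for _ in range(n_pairs):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dot = sum(x * y for x, y in zip(a, b))
        total += abs(dot) / (norm(a) * norm(b))
    return total / n_pairs


def nearest_to_farthest_ratio(dim, n_points=200):
    """Nearest / farthest distance from one point to the others, uniform in the unit hypercube."""
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [norm([p - q for p, q in zip(pts[0], other)]) for other in pts[1:]]
    return min(dists) / max(dists)


cosines = {dim: mean_abs_cosine(dim) for dim in (2, 500)}
ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 500)}
print(cosines)  # |cos| close to 0 means nearly at right angles
print(ratios)   # ratio close to 1 means all points are about equally far away
```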

Loss/Cost function and Train/Test

Measurement of
how badly a model performs
Train on paired data
Find the argmin of this loss function over the parameters
Test on a separate set of data
Measure the loss there to see how well the model generalizes
Different loss functions are
Squared loss, (negative) log-likelihood, ramp loss, etc.
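
A minimal sketch of the train/test idea with the squared loss. The data (a noisy line), the helper names, and the closed-form least-squares fit are all assumptions chosen to keep the example short.

```python
import random

rng = random.Random(0)


def make_data(n):
    """Hypothetical data: y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]


def squared_loss(params, data):
    """Mean squared difference between predictions and targets."""
    slope, intercept = params
    return sum((slope * x + intercept - y) ** 2 for x, y in data) / len(data)


def fit_line(data):
    """argmin of the squared loss for a line (closed-form least squares)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


train, test = make_data(50), make_data(50)  # separate sets of paired data
params = fit_line(train)
print("train loss:", squared_loss(params, train))
print("test loss:", squared_loss(params, test))  # how well the fit generalizes
```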

Initialization


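If the variance of the weights at initialization is too small or too large
Then the activations and gradients shrink or grow layer by layer and can run into floating point errors
So we want the variance to stay the same through the forward and backward passes
He initialization does this by setting the weight variance to 2/D_h (D_h = number of hidden units), so that the variance at layer k+1 matches layer k in the forward pass, and the variance at layer k matches layer k+1 in the backward pass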
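
A rough sketch of why 2/D_h matters, forward pass only. It assumes ReLU activations, zero biases, and one random input; the function name, width, and depth are made up for illustration.

```python
import math
import random

rng = random.Random(0)


def preactivation_variance(scale, width=100, n_layers=20):
    """Push one random input through ReLU layers with weights ~ N(0, scale / width).

    Returns the variance of the pre-activations at the last layer.
    """
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = math.sqrt(scale / width)
    for _ in range(n_layers):
        f = [sum(rng.gauss(0.0, std) * hj for hj in h) for _ in range(width)]
        h = [max(0.0, fi) for fi in f]  # ReLU
    mean = sum(f) / width
    return sum((fi - mean) ** 2 for fi in f) / width


# scale = 2 is He initialization; 0.5 is too small, 8 is too large.
results = {scale: preactivation_variance(scale) for scale in (0.5, 2.0, 8.0)}
for scale, variance in results.items():
    print(scale, variance)
```

With the smaller scale the variance should collapse towards zero, with the larger one it should blow up, and with scale 2 it should stay at roughly the same order of magnitude.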

Counting number of parameters


Regularization techniques

Explicit regularization adds a regularizing term to the loss function
This term corresponds to the prior in the probabilistic view. Normally L2 regularization is used, where the squared weights are summed, added to the loss, and scaled by a regularization coefficient
Implicit regularization is the natural tendency of optimization algorithms and other aspects of the training process
to improve the generalization performance of a model even without an explicit regularization technique, e.g. SGD, because of the randomness of its batches
Early stopping halts training before the model overfits; since the weights start small, they have less time to grow large
Ensembling combines several different models and averages their outputs (by mean or median). Training each model on a different resampled subset of the data is called bagging
Dropout randomly sets hidden units to zero during training. It can eliminate kinks in the function that are far from the data and don't contribute to the training loss
Adding noise can also improve generalization
Can also use Bayesian inference to provide more information (through priors)
Transfer learning, multi-task learning, self-supervised learning, and data augmentation can also be used to improve generalization
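
Two of the techniques above as a small sketch: the L2 penalty and dropout. The function names and the `lam` coefficient are illustrative, and rescaling the surviving units by 1/(1 - drop_prob) ("inverted dropout") is an assumed convention, not something stated above.

```python
import random


def l2_regularized_loss(data_loss, weights, lam):
    """Data loss plus lam times the sum of squared weights."""
    return data_loss + lam * sum(w * w for w in weights)


def dropout(activations, drop_prob, rng):
    """Zero each unit with probability drop_prob (training time only).

    Survivors are divided by the keep probability so the expected value is unchanged.
    """
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


print(l2_regularized_loss(1.0, [0.5, -2.0], lam=0.1))
print(dropout([1.0, 2.0, 3.0, 4.0], 0.5, random.Random(0)))
```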

Bias Variance Tradeoff

Variance is the uncertainty in the fitted model due to the choice of training set
Bias is the systematic deviation from the mean of the function we are modeling, due to limitations in our model
Noise is the inherent uncertainty in the true mapping from input to output
Can reduce variance by adding more data points
Can reduce bias by making the model more complex
But reducing bias this way tends to increase variance, since a more complex model overfits more easily
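
For the squared loss, the three terms above add up to the expected test error. The notation here is an assumption: \mathcal{D} is the training set, f[x,\phi[\mathcal{D}]] the model fitted on it, f_\mu[x] the fitted model averaged over training sets, \mu[x] the true mean of y given x, and \sigma^2 the noise variance.

```latex
\mathbb{E}_{\mathcal{D}}\Big[\mathbb{E}_{y}\big[(f[x,\phi[\mathcal{D}]] - y[x])^2\big]\Big]
  = \underbrace{\mathbb{E}_{\mathcal{D}}\big[(f[x,\phi[\mathcal{D}]] - f_\mu[x])^2\big]}_{\text{variance}}
  + \underbrace{(f_\mu[x] - \mu[x])^2}_{\text{bias (squared)}}
  + \underbrace{\sigma^2}_{\text{noise}}
```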

Momentum & Adam

Momentum is the weighted sum of the current gradient and previous gradients
We can think of momentum as a prediction on where we are stepping
Normalizing the gradients (moving a fixed distance in each direction) can leave the algorithm bouncing around the optimal point unless it lands on it exactly
Adam prevents that by applying momentum to both the gradient (mean) and the pointwise squared gradient, and by moderating both estimates near the start of the sequence
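
A sketch of the two update rules, under assumptions: the names `beta` and `gamma` for the two momentum coefficients, the default values, and the toy loss x^2 + 10y^2 are all chosen for illustration.

```python
import math


def momentum_step(params, grads, momentum, lr=0.1, beta=0.9):
    """Step along a weighted sum of the current and previous gradients."""
    momentum = [beta * m + (1 - beta) * g for m, g in zip(momentum, grads)]
    params = [p - lr * m for p, m in zip(params, momentum)]
    return params, momentum


def adam_step(params, grads, m, v, t, lr=0.1, beta=0.9, gamma=0.999, eps=1e-8):
    """Momentum on the gradient (m) and on the pointwise squared gradient (v)."""
    m = [beta * mi + (1 - beta) * g for mi, g in zip(m, grads)]
    v = [gamma * vi + (1 - gamma) * g * g for vi, g in zip(v, grads)]
    # Moderate both estimates near the start of the sequence (t = 1, 2, ...).
    m_hat = [mi / (1 - beta ** t) for mi in m]
    v_hat = [vi / (1 - gamma ** t) for vi in v]
    params = [p - lr * mh / (math.sqrt(vh) + eps)
              for p, mh, vh in zip(params, m_hat, v_hat)]
    return params, m, v


def loss(params):
    x, y = params
    return x * x + 10 * y * y


def grad(params):
    x, y = params
    return [2 * x, 20 * y]


params, m, v = [3.0, 2.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 201):
    params, m, v = adam_step(params, grad(params), m, v, t)
print(loss(params))
```

On the first Adam step every parameter moves by about `lr`, whatever the size of its gradient, which is the normalizing effect in action.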

Stochastic Gradient Descent

Gradient descent might be slow
And not all gradients are needed to find the optimal point
Compute the gradient based on only a subset of points: a mini-batch
Work through the dataset, sampling without replacement
One pass through the data is called an epoch
This can escape from local minima, but adds noise. Uses all data equally but
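
Mini-batches, sampling without replacement, and epochs as a minimal sketch. The one-parameter model y = w*x, the noiseless data, and all names are assumptions for illustration.

```python
import random


def minibatches(n_examples, batch_size, rng):
    """Yield index batches that cover the dataset exactly once (one epoch)."""
    order = list(range(n_examples))
    rng.shuffle(order)  # sampling without replacement
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]


def sgd_fit_slope(data, lr=0.1, batch_size=5, n_epochs=50, seed=0):
    """Fit y = w * x by SGD on the squared loss."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(n_epochs):
        for batch in minibatches(len(data), batch_size, rng):
            # Gradient of the mean squared loss over this mini-batch only.
            g = sum(2 * (w * data[i][0] - data[i][1]) * data[i][0] for i in batch) / len(batch)
            w -= lr * g
    return w


data = [(k / 10, 3.0 * k / 10) for k in range(1, 21)]  # hypothetical data, true slope 3
print(sgd_fit_slope(data))
```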


Backpropagation

Two passes are done: a forward pass and a backward pass
The forward pass computes and stores the activations at each layer and the other in-between values, up to the loss
The forward pass alone does not give the gradients, so the parameters cannot be updated yet (each parameter affects the loss through a chain of dependent units)
The backward pass then calculates the gradients of the loss function, working in reverse from the output back to the first layer
This is very efficient but memory hungry, since the in-between values must be kept
Another problem is splitting the computation apart (e.g. when parts of the network live on different computers)
Gradient Descent

Gradient descent finds the optimal point (guaranteed only for a convex function)
by stepping towards it, i.e., moving against the calculated gradient
So the derivative of the loss function with respect to the parameters is calculated
And then the parameters are updated by subtracting it, scaled by a learning rate that speeds up or slows down the steps
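
The update rule as a minimal sketch. The convex loss (phi - 3)^2, the learning rate, and the step count are assumptions for illustration.

```python
def gradient_descent(grad_fn, start, lr=0.1, n_steps=100):
    """Repeatedly step against the gradient: phi <- phi - lr * dL/dphi."""
    phi = start
    for _ in range(n_steps):
        phi = phi - lr * grad_fn(phi)
    return phi


# Hypothetical convex loss L(phi) = (phi - 3)^2, with derivative 2 * (phi - 3).
minimum = gradient_descent(lambda phi: 2 * (phi - 3), start=0.0)
print(minimum)
```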

Deep neural networks

Neural networks with more than one hidden layer
Better than simply passing the output of one shallow network into another (fewer parameters and regions)
Basically outputs from hidden units
Go into another hidden layer as inputs
Also obeys the universal approximation theorem
Difference from a shallow network is more linear regions per parameter
The hyperparameters are K for the depth of the network (number of hidden layers) and D_i for the width (number of units) of layer i
There exist problems where shallow networks would need far too many units to approximate the function
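
Given the layer widths, the parameter count of a fully connected network follows directly. This is a small sketch; the function name and the example sizes are assumptions.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes lists the units per layer: [inputs, D_1, ..., D_K, outputs].
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# One input, K = 2 hidden layers of 10 units, one output: 20 + 110 + 11 parameters.
print(count_parameters([1, 10, 10, 1]))  # 141
```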

Convolutional networks

Each unit only looks at a local image patch, and the same parameters are shared across the whole image
The convolution operation computes a weighted sum of nearby inputs
Stride = shift by k positions for each output, Kernel size = weight a different number of inputs for each output, Dilated or atrous convolutions = intersperse kernel values with zeros
Stride decreases the output size, a larger kernel size combines information from a larger area, and dilation combines information from a larger area while using few parameters
But we do not want to lose information: so several convolutions are applied and their outputs are stacked as channels (feature maps)
The receptive field is the region of the input space that affects a particular CNN feature
The benefit of a CNN is a better inductive bias: it forces the network to process each location similarly, shares information, searches a smaller family of input/output maps, etc.
Downsampling reduces the number of positions in the data (max pooling, i.e. taking the max, is most common), while upsampling increases it
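
Stride, kernel size, dilation, and max pooling in one dimension, as a sketch. It assumes no padding and no kernel flip (the usual CNN convention); the function names and example values are made up.

```python
def conv1d(signal, kernel, stride=1, dilation=1):
    """1D convolution without padding: each output is a weighted sum of nearby inputs."""
    span = (len(kernel) - 1) * dilation + 1  # input positions covered by one output
    return [
        sum(w * signal[start + k * dilation] for k, w in enumerate(kernel))
        for start in range(0, len(signal) - span + 1, stride)
    ]


def max_pool1d(signal, size=2):
    """Downsample by taking the max over non-overlapping blocks."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [1.0, 0.0, -1.0]
print(conv1d(x, kernel))              # 5 outputs
print(conv1d(x, kernel, stride=2))    # 3 outputs: stride shrinks the output
print(conv1d(x, kernel, dilation=2))  # same 3 weights, but each output spans 5 inputs
print(max_pool1d(x))                  # [2.0, 4.0, 6.0]
```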

Reinforcement Learning

Create a set of states, actions, and rewards
Goal is to maximize the reward by learning which actions to take in each state
No dataset is given up front
Data is received by exploring the world that has been built
Examples are
Chess, video games, etc.
Difficulties are
Stochasticity, temporal credit assignment (i.e. was the reward earned by this move or by past moves?), and the exploration-exploitation trade-off (i.e. when to explore and when not to)

Shallow neural network

Use a non-linear activation function
to mold a family of functions to the dataset
Common activation functions are
ReLU, sigmoid/softmax (as final layer), tanh function (kinda like sigmoid), etc.
The input is passed through a set of linear functions, and the activation function transforms each result (these units are known as the hidden layer)
So that a specific hidden unit is active or not depending on that function
Called shallow since there is only one hidden layer
The universal approximation theorem states that, with enough hidden units, a shallow network can approximate any continuous function on a compact subset
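
A shallow network with a scalar input, as a sketch. The three hidden units and their parameter values are hypothetical, picked to give an easy piecewise-linear function.

```python
def relu(z):
    return max(0.0, z)


def shallow_network(x, hidden_params, output_weights, output_bias):
    """y = bias + sum_d weight_d * relu(theta_d0 + theta_d1 * x)."""
    hidden = [relu(t0 + t1 * x) for t0, t1 in hidden_params]  # the hidden layer
    return output_bias + sum(w * h for w, h in zip(output_weights, hidden))


# Hypothetical parameters: (offset, slope) for each of three hidden units.
hidden_params = [(0.0, 1.0), (-1.0, 1.0), (-2.0, 1.0)]
output_weights = [1.0, -2.0, 2.0]
for x in (1.0, 2.0, 3.0):
    print(x, shallow_network(x, hidden_params, output_weights, output_bias=0.0))
```

Each hidden unit switches on past its own threshold (0, 1, and 2 here), which is what puts the kinks in the output.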

Maximum likelihood

Points in a dataset can be seen as samples from an underlying distribution
The main idea of using the likelihood function is to estimate this distribution
Model predicts a conditional probability Pr(y|x) = Pr(y|θ) = Pr(y|f[x,ϕ]), i.e. the network output f[x,ϕ] gives the distribution parameters θ
Here the loss function aims to have correct outputs have high probability
So find the argmax over ϕ (or the argmin if we negate the objective function)
The product of probabilities can be a very small value, so the log is taken to turn it into a sum
Softmax is used in the case of multiclass categorization
It converts a vector of K real numbers into a probability distribution over K possible outcomes
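
Softmax and the negative log-likelihood as a sketch. The function names and example scores are assumptions; subtracting the maximum score is a standard numerical trick, not something stated above.

```python
import math


def softmax(scores):
    """Turn K real numbers into a probability distribution over K outcomes."""
    shift = max(scores)  # avoids overflow in exp; the result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(score_rows, labels):
    """Sum over examples of -log Pr(y_i | f[x_i, phi]) for integer class labels."""
    return -sum(math.log(softmax(row)[y]) for row, y in zip(score_rows, labels))


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(negative_log_likelihood([[2.0, 1.0, 0.1], [0.0, 3.0, 0.0]], [0, 1]))
```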

Unsupervised Learning

Learning from a dataset without any labels
So the dataset is organized in an input-only fashion
Examples are
Clustering, outlier finding, generating examples, filling in missing data
There are generative models
like generative adversarial networks
Also probabilistic generative models
which learn the distribution over the data. Examples are variational autoencoders, normalizing flows, and diffusion models