Supervised Learning
Mapping from inputs to outputs 
Need paired examples (x_i,y_i) to learn from 
Examples are 
Regression, Text Classification, Image Classification, etc. 
Normally in the form of 
Input > apply a family of equations (the model) to the input > output prediction
Double descent & COD
Double descent is the phenomenon where
the test error first falls, then rises as the training error nears zero (the interpolation threshold), and then falls again as model capacity keeps growing
The tendency of high-dimensional space to overwhelm the number of data points is called the curse of dimensionality
Two data points randomly sampled from a normal distribution are at right angles to each other with high likelihood
But the distance of random samples from the origin is roughly constant, and most of the volume of a high-dimensional orange is in the peel, not in the pulp
The volume of a diameter-one hypersphere tends to zero as the dimension grows, and if random points are generated uniformly in a hypercube, the ratio of nearest to farthest distances approaches one
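A minimal sketch (assuming NumPy) that checks these facts numerically; the dimension and sample counts are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1000                                    # dimensionality (arbitrary choice)

# 1) Two random standard-normal vectors are nearly orthogonal
a, b = rng.standard_normal(D), rng.standard_normal(D)
cos_angle = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print("cosine of angle:", cos_angle)        # close to 0, i.e. roughly a right angle

# 2) Norms concentrate: distance from the origin is roughly constant
X = rng.standard_normal((500, D))
norms = np.linalg.norm(X, axis=1)
print("norm spread:", norms.min(), norms.max())   # both close to sqrt(D)

# 3) Uniform points in a hypercube: nearest/farthest distance ratio nears 1
P = rng.uniform(size=(500, D))
q = rng.uniform(size=D)
dists = np.linalg.norm(P - q, axis=1)
print("nearest/farthest ratio:", dists.min() / dists.max())
```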
Loss/Cost function and Train/Test
Measurement of 
how bad a model performs 
Train on pairs of data (x_i, y_i)
Find the argmin of this loss function over the parameters
Test on a separate set of data
Measure the loss there to see the model's generalizing power
Different loss functions are 
Squared loss, (negative) log-likelihood, ramp loss, etc
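A minimal sketch (assuming NumPy) of the train/test idea with squared loss: fit a line on training pairs, then measure the loss on held-out pairs; the data here is made up:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 0.5 + 0.1 * rng.standard_normal(200)    # made-up underlying mapping plus noise

x_train, y_train = x[:150], y[:150]                    # pairs used for training
x_test, y_test = x[150:], y[150:]                      # separate pairs held out for testing

# argmin of the squared loss for a line y = phi_0 + phi_1 * x (closed form via least squares)
A = np.stack([np.ones_like(x_train), x_train], axis=1)
phi, *_ = np.linalg.lstsq(A, y_train, rcond=None)

def squared_loss(phi, x, y):
    pred = phi[0] + phi[1] * x
    return np.mean((pred - y) ** 2)

print("train loss:", squared_loss(phi, x_train, y_train))
print("test loss :", squared_loss(phi, x_test, y_test))   # gauges generalizing power
```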
Initialization
If at initialization the variance of the activations is too small or too large
then values can shrink or grow exponentially through the layers, causing floating point underflow/overflow errors
So, we want to keep the variance the same across layers in both the forward and backward pass
He initialization does this by setting the weight variance to 2/D_h (D_h = number of units feeding the layer), so the variance at layer k+1 matches that at layer k, and likewise for gradients in the backward pass
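A minimal sketch (assuming NumPy) of He initialization and a quick check that activation magnitudes stay stable through many ReLU layers:

```python
import numpy as np

rng = np.random.default_rng(2)

def he_init(d_in, d_out):
    # weight variance 2/d_in, i.e. standard deviation sqrt(2/d_in)
    return rng.standard_normal((d_out, d_in)) * np.sqrt(2.0 / d_in)

# check: activation magnitudes stay stable through many ReLU layers
h = rng.standard_normal(1000)
for _ in range(50):
    W = he_init(h.size, h.size)
    h = np.maximum(0.0, W @ h)                  # linear layer followed by ReLU
print("RMS activation after 50 layers:", np.sqrt(np.mean(h ** 2)))   # stays O(1)
```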
Counting number of parameters
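A minimal sketch of the usual count for a fully connected network: a layer with D_in inputs and D_out outputs has D_out * D_in weights plus D_out biases; the widths below are hypothetical:

```python
def count_params(widths):
    # widths = [input dim, hidden widths..., output dim]; values here are hypothetical
    total = 0
    for d_in, d_out in zip(widths[:-1], widths[1:]):
        total += d_out * d_in + d_out       # weight matrix plus bias vector
    return total

print(count_params([784, 100, 100, 10]))    # 78,500 + 10,100 + 1,010 = 89,610
```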


Regularization techniques
Explicit regularization is the addition of a regularizing term to the loss function
This is also known as the prior in the probabilistic view. Normally L2 regularization is used, where the squared weights are added and controlled by a regularization coefficient (see the sketch after this list)
Implicit regularization is the natural tendency of optimization algorithms and other aspects of the training process
that, even without explicitly adding regularization terms, helps improve the generalization performance of a model, e.g. SGD due to minibatches (source of randomness)
Early stopping is halting training before the model overfits; since the weights start small, stopping early keeps them from growing too large
Ensembling is the combination of different models whose outputs are then averaged (by mean or median). Training each model on a different resampled subset of the data is called bagging
Dropout is the technique of randomly deactivating units during training. It can eliminate kinks in the function that are far from the data and don't contribute to the training loss
Adding noise can also improve generalization 
Can also use Bayesian inference to provide more information (via priors)
Transfer learning, multi-task learning, self-supervised learning, and data augmentation can be used too to improve generalization
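A minimal sketch (assuming NumPy) of explicit L2 regularization on the simple line model used above; lam is a hypothetical coefficient:

```python
import numpy as np

def l2_regularized_loss(phi, x, y, lam=1e-2):
    pred = phi[0] + phi[1] * x
    data_loss = np.mean((pred - y) ** 2)     # squared loss on the data
    penalty = lam * np.sum(phi ** 2)         # squared weights, scaled by the coefficient
    return data_loss + penalty
```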
Bias Variance Tradeoff
Variance is the uncertainty in fitted model due to choice of training set 
Bias is the systematic deviation of the fitted model from the function we are modeling, due to limitations of our model family
Noise is inherent uncertainty in the true mapping from input to output 
Can reduce variance by adding more datapoints 
Can reduce bias by making model more complex 
But reducing one tends to increase the other, since a more complex model can overfit, i.e. lower bias comes with more variance
Momentum & Adam
Momentum is the weighted sum of the current gradient and previous gradients 
We can think of momentum as a prediction on where we are stepping 
Normalizing the gradients can lead to being stuck if we don't land on the optimal point exactly
Adam prevents that by computing momentum-like moving averages of the mean and pointwise squared gradients, and moderating (bias-correcting) them near the start of the sequence
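A minimal sketch (assuming NumPy) of one Adam update; the hyperparameter defaults follow the common convention but are illustrative here:

```python
import numpy as np

def adam_step(phi, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad               # moving average of the gradient (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2          # moving average of pointwise squared gradients
    m_hat = m / (1 - beta1 ** t)                     # moderate the estimates near the start (t = 1, 2, ...)
    v_hat = v / (1 - beta2 ** t)
    phi = phi - alpha * m_hat / (np.sqrt(v_hat) + eps)   # normalized step
    return phi, m, v
```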


Stochastic Gradient Descent
Gradient descent might be slow 
And not all gradients needed to find optimal point 
Compute gradient based on only a subset of points – a minibatch 
Work through dataset sampling without replacement 
One pass though the data is called an epoch 
This can escape from local minima, but adds noise; it still uses all the data equally over each epoch
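A minimal sketch (assuming NumPy) of minibatch SGD on squared loss for a line fit; the data, learning rate, and batch size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 256)
y = 3.0 * x - 1.0 + 0.1 * rng.standard_normal(256)   # made-up data

phi = np.zeros(2)                                     # [intercept, slope]
lr, batch_size = 0.1, 32

for epoch in range(20):                               # one pass through the data = one epoch
    order = rng.permutation(x.size)                   # sample without replacement
    for start in range(0, x.size, batch_size):
        idx = order[start:start + batch_size]         # one minibatch
        pred = phi[0] + phi[1] * x[idx]
        err = pred - y[idx]
        grad = np.array([2 * err.mean(), 2 * (err * x[idx]).mean()])
        phi -= lr * grad                              # gradient step on the minibatch only
print(phi)                                            # approaches [-1, 3]
```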
Backpropagation
Two passes are done, forward pass, and backwards pass 
The forward pass computes the in-between values (pre-activations and activations) at each layer and how they lead to the loss
It does not give us gradients, though, so on its own it cannot tell us how to modify the parameters (since the units form a dependency chain)
Backward pass calculates the gradients then of the loss function but in reverse 
This is very efficient but is memory hungry
It is also hard to split the computation process apart (e.g. when parts live on different computers)
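A minimal sketch (assuming NumPy) of the two passes for a one-hidden-layer ReLU network with squared loss; the forward pass stores the in-between values, and the backward pass applies the chain rule in reverse:

```python
import numpy as np

def forward(x, y, W1, b1, W2, b2):
    z1 = W1 @ x + b1                 # pre-activations (stored for the backward pass)
    h1 = np.maximum(0.0, z1)         # hidden activations
    pred = W2 @ h1 + b2
    loss = np.sum((pred - y) ** 2)
    return loss, (x, z1, h1, pred)

def backward(y, W2, cache):
    x, z1, h1, pred = cache
    d_pred = 2 * (pred - y)                  # dLoss/dpred
    dW2, db2 = np.outer(d_pred, h1), d_pred
    d_h1 = W2.T @ d_pred                     # chain rule back through the second layer
    d_z1 = d_h1 * (z1 > 0)                   # back through the ReLU
    dW1, db1 = np.outer(d_z1, x), d_z1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(5)
x, y = rng.standard_normal(3), rng.standard_normal(2)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)
loss, cache = forward(x, y, W1, b1, W2, b2)          # forward pass: store intermediates
grads = backward(y, W2, cache)                       # backward pass: gradients of the loss
```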
Gradient Descent
Gradient descent finds optimal point (for convex function) 
by step walking towards it, i.e., goes against the gradient calculated 
So the derivative of the loss function with respect to the parameters is calculated
And then the parameters are updated by subtracting it. A learning rate is applied to speed up or slow down the steps
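A minimal sketch of gradient descent on a toy convex loss L(phi) = (phi - 3)^2; the learning rate value is arbitrary:

```python
phi = 0.0
alpha = 0.1                          # learning rate
for step in range(100):
    grad = 2 * (phi - 3)             # dL/dphi
    phi -= alpha * grad              # subtract: walk against the gradient
print(phi)                           # approaches the optimal point phi = 3
```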
Deep neural networks
Simply neural networks with more than one hidden layer 
Better than simply feeding the output of one shallow network into another as separate networks (a deep network gets more regions from fewer parameters)
Basically outputs from hidden units 
Go into another hidden layer as inputs 
Also obeys the universal approximation theorem 
Difference from a shallow network is more regions per parameter
The hyperparameters are K, the number of hidden layers (depth), and D_i, the number of hidden units at layer i (width)
There exists problems where shallow networks would need way too many units to approximate 
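A minimal sketch (assuming NumPy) of a deep network forward pass, where the hidden units of one layer feed into the next layer as inputs; the layer widths are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
D_i, D_1, D_2, D_o = 3, 5, 5, 1            # input, two hidden layers, output (hypothetical widths)

W1, b1 = rng.standard_normal((D_1, D_i)), np.zeros(D_1)
W2, b2 = rng.standard_normal((D_2, D_1)), np.zeros(D_2)
W3, b3 = rng.standard_normal((D_o, D_2)), np.zeros(D_o)

def deep_net(x):
    h1 = np.maximum(0.0, W1 @ x + b1)      # first hidden layer (ReLU)
    h2 = np.maximum(0.0, W2 @ h1 + b2)     # its outputs become the next layer's inputs
    return W3 @ h2 + b3                    # output prediction

print(deep_net(rng.standard_normal(D_i)))
```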
Convolutional networks
Parameters only look at local image patches and so share parameters across image 
The convolution operation computes a weighted sum of nearby inputs, using the same weights at every position
Stride = shift by k positions for each output, Kernel size = weight a different number of inputs for each output, Dilated or atrous convolutions = intersperse kernel values with zeros 
Stride decreases the output size, a larger kernel size combines info from a larger area, while dilated convolutions combine info from a larger area with few parameters
But we don't want to lose information: so several convolutions are applied in parallel and their outputs stacked in channels (feature maps)
The receptive field is the region in the input space that a particular CNN feature is affected by
A benefit of CNNs is a better inductive bias: forcing the network to process each location similarly, share information, search a smaller family of input/output mappings, etc
Downsampling reduces the number of positions in the data (max pooling is most common, i.e. take the max over each window), while upsampling increases it
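A minimal sketch (assuming NumPy) of a 1D convolution illustrating kernel size, stride, and dilation; note that the same small weight vector is shared across positions:

```python
import numpy as np

def conv1d(x, kernel, stride=1, dilation=1):
    k = kernel.size
    span = (k - 1) * dilation + 1                     # input span covered by one output
    out = []
    for start in range(0, x.size - span + 1, stride):
        window = x[start : start + span : dilation]   # dilation skips over inputs
        out.append(window @ kernel)                   # weighted sum with shared weights
    return np.array(out)

x = np.arange(10, dtype=float)
kernel = np.array([0.25, 0.5, 0.25])
print(conv1d(x, kernel))                     # kernel size 3, stride 1
print(conv1d(x, kernel, stride=2))           # stride 2: fewer output positions
print(conv1d(x, kernel, dilation=2))         # dilation 2: larger receptive field, same weights
```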


Reinforcement Learning
Create a set of states, actions, and rewards 
Goal is to maximize reward by finding correct states 
No dataset is given up front
Data is received by interacting with the world as it is explored
Examples are 
Chess, Video games, etc 
Challenges are that it is
stochastic; has temporal credit assignment, i.e., was the reward earned by this move or by past moves; and the exploration-exploitation tradeoff, i.e., when to explore and when not to
Shallow neural network
Use a nonlinear activation function
to mold the family of functions to fit the dataset
Common activation functions are 
ReLU, sigmoid/softmax (as final layer), tanh function (kinda like sigmoid), etc 
Pass the input through a set of linear functions and transform each with the activation function (these units form the hidden layer)
So that a specific hidden unit is active or not depending on that function
Called shallow since only one hidden layer 
The universal approximation theorem states that, with enough hidden units, a shallow network can approximate any continuous function on a compact subset to arbitrary accuracy
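A minimal sketch (assuming NumPy) of a shallow network: ReLU applied to a set of linear functions of the input, then a weighted sum; the parameter values are hypothetical:

```python
import numpy as np

def shallow_net(x, theta, phi):
    # theta: (D, 2) rows of [slope, offset] per hidden unit; phi: (D+1,) output weights
    h = np.maximum(0.0, theta[:, 0] * x + theta[:, 1])   # hidden units: ReLU of linear functions
    return phi[0] + phi[1:] @ h                          # weighted sum of the active units

theta = np.array([[1.0, 0.0], [1.0, -0.5], [-1.0, 0.5]])  # hypothetical parameters
phi = np.array([0.1, 1.0, -2.0, 0.5])
print(shallow_net(0.7, theta, phi))      # a piecewise linear function of x
```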
Maximum likelihood
Points in a dataset can be seen as samples from an underlying distribution
The main idea of the likelihood function is to estimate this distribution
Model predicts a conditional probability Pr(y|x) = Pr(y|θ) = Pr(y|f[x,ϕ])
Here the loss function aims to have correct outputs have high probability 
So find the argmax over ϕ (or the argmin if we negate the objective function)
The product of probabilities can be a very small value, so the log is taken to turn it into a summation
Softmax is used in the case of multiclass categorization 
It converts a vector of K real numbers into a probability distribution of K possible outcomes 
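A minimal sketch (assuming NumPy) of softmax and the negative log-likelihood loss for multiclass classification; the logits and labels are made up:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()                   # K real numbers -> probability distribution over K classes

def neg_log_likelihood(logits, labels):
    # sum over examples of -log Pr(correct class); its argmin is the argmax of the likelihood
    return sum(-np.log(softmax(z)[y]) for z, y in zip(logits, labels))

logits = [np.array([2.0, 0.5, -1.0]), np.array([0.1, 0.2, 3.0])]   # hypothetical model outputs
labels = [0, 2]
print(neg_log_likelihood(logits, labels))
```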
Unsupervised Learning
Learning a dataset without any labels 
So the dataset is organized in an input-only fashion
Examples are 
Clustering, Outlier Finding, Generating examples, fill missing data 
There are generative models 
like generative adversarial networks
Also probabilistic generative models
which learn the distribution over the data. Examples are variational autoencoders, normalizing flows, and diffusion models
