Supervised Learning
        
                        
                                                                                    
- Mapping from inputs to outputs: needs paired examples (x_i, y_i) to learn from.
- Examples: regression, text classification, image classification, etc.
- Normally of the form: input -> relate a family of equations to the input -> output prediction.
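A minimal sketch of this input-to-output setup, fitting a straight line to made-up (x_i, y_i) pairs by least squares (all numbers here are illustrative):

    import numpy as np

    # Paired examples (x_i, y_i): inputs and the outputs we want to predict
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=100)
    y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=100)   # unknown true mapping plus noise

    # Family of equations: straight lines y = a*x + b; relate them to the data by least squares
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

    # Output prediction for a new input
    print(a * 0.3 + b)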
                                                                                 
                                                                         
                             
    
    
            Double descent & COD
        
                        
                                                                                    
- Double descent is the phenomenon where the test error rises as the training error approaches zero, and then, as the model keeps growing, drops sharply again back to normal.
- The tendency of high-dimensional space to overwhelm the number of data points is called the curse of dimensionality (COD).
  - Two data points randomly sampled from a standard normal are, with high likelihood, nearly at right angles to each other.
  - The distance of random samples from the origin is roughly constant, and most of the volume of a high-dimensional orange is in the peel, not the pulp.
  - The volume of a diameter-one hypersphere tends to zero as the dimension grows, and for points generated uniformly in a hypercube the ratio of the nearest to the farthest distance becomes close to one.
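A quick numerical check of these high-dimensional effects (a sketch; the dimension and sample counts are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 1000                                    # high dimension
    a, b = rng.normal(size=d), rng.normal(size=d)

    # Two random Gaussian vectors are nearly orthogonal
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print("cosine of angle:", cos)              # near 0, i.e. close to 90 degrees

    # Distances from the origin concentrate around sqrt(d)
    samples = rng.normal(size=(5, d))
    print("norms:", np.linalg.norm(samples, axis=1))   # all close to sqrt(1000) ~ 31.6

    # Uniform points in a hypercube: nearest and farthest neighbours look alike
    cube = rng.uniform(size=(100, d))
    dists = np.linalg.norm(cube - cube[0], axis=1)[1:]
    print("nearest/farthest ratio:", dists.min() / dists.max())   # close to 1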
                                                                                 
                                                                         
                             
    
    
            Loss/Cost function and Train/Test
        
                        
                                                                                    
- The loss (cost) function is a measurement of how badly the model performs.
- Training: on the paired data, find the argmin of this loss function over the parameters.
- Testing: measure the loss on a separate set of data to assess the model's generalizing power.
- Examples of loss functions: squared loss, negative log-likelihood, ramp loss, etc.
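A minimal sketch of the train/test split with a squared loss on synthetic data (the cubic model is just an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=200)
    y = np.sin(3 * x) + rng.normal(scale=0.1, size=200)

    # Split into a training set and a held-out test set
    x_train, y_train = x[:150], y[:150]
    x_test,  y_test  = x[150:], y[150:]

    # Fit a cubic polynomial by minimizing the squared loss on the training pairs
    coeffs = np.polyfit(x_train, y_train, deg=3)

    def squared_loss(x_, y_):
        return np.mean((np.polyval(coeffs, x_) - y_) ** 2)

    print("train loss:", squared_loss(x_train, y_train))
    print("test loss :", squared_loss(x_test, y_test))   # gauges generalization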
                                                                                 
                                                                         
                             
    
    
            Counting number of parameters
        
    
    
            Initialization
        
                        
                                                                                    
- If, at initialization, the variance of the intermediate values shrinks or grows from layer to layer, the forward and backward passes can run into floating-point errors (underflow/overflow).
- So we want the variance to stay the same across layers in both the forward and the backward pass.
  - He initialization does this (for ReLU networks) by setting the weight variance to 2/D_h, where D_h is the number of units feeding into the layer, so the variance at layer k+1 matches the variance at layer k (sketched below).
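A sketch of He initialization for a stack of ReLU layers, checking that the activation variance stays roughly constant (layer sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [100, 200, 200, 10]                  # widths of the layers

    # He initialization: weights ~ N(0, 2 / D_h), where D_h is the fan-in
    weights = [rng.normal(scale=np.sqrt(2.0 / d_in), size=(d_out, d_in))
               for d_in, d_out in zip(sizes[:-1], sizes[1:])]

    # Forward pass on random inputs: the variance should not blow up or collapse
    h = rng.normal(size=(sizes[0], 1000))
    for W in weights:
        h = np.maximum(0.0, W @ h)               # ReLU hidden activations
        print("activation variance:", h.var())   # roughly constant across layers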
                                                                                 
                                                                         
                             
    
    
    
    
            Regularization techniques
        
                        
                                                                                    
- Explicit regularization adds a regularizing term to the loss function.
  - In the probabilistic view this term corresponds to a prior over the parameters. Most commonly L2 regularization is used: the squared weights are added to the loss, scaled by a regularization coefficient (sketched below).
- Implicit regularization is the natural tendency of the optimization algorithm and other aspects of the training process to improve generalization even without an explicit regularization term, e.g. SGD with finite batch sizes (a source of randomness).
- Early stopping halts training early so that the weights, which start small, do not overfit.
  - Ensembling combines several models and averages their predictions (by mean or median). Training each model on a different resampled subset of the data is called bagging.
- Dropout randomly deactivates (drops) hidden units during training. It can eliminate kinks in the function that are far from the data and do not contribute to the training loss.
  - Adding noise can also improve generalization.
- Bayesian inference can also be used to bring in more information via priors.
  - Transfer learning, multi-task learning, self-supervised learning, and data augmentation can be used to improve generalization too.
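A sketch of explicit L2 regularization added to a squared loss (the coefficient lam and the data are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))
    w_true = rng.normal(size=10)
    y = X @ w_true + rng.normal(scale=0.1, size=50)

    lam = 0.1                                    # regularization coefficient
    n = X.shape[0]

    def loss(w):
        data_term = np.mean((X @ w - y) ** 2)    # squared loss
        reg_term = lam * np.sum(w ** 2)          # L2 penalty on the weights
        return data_term + reg_term

    # Closed-form minimizer of this regularized loss (ridge regression)
    w = np.linalg.solve(X.T @ X + lam * n * np.eye(10), X.T @ y)
    print(loss(w))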
                                                                                 
                                                                         
                             
    
    
            Bias Variance Tradeoff
        
                        
                                                                                    
- Variance is the uncertainty in the fitted model due to the choice of training set.
- Bias is the systematic deviation of the (average) fitted model from the function we are modeling, due to limitations of our model.
- Noise is the inherent uncertainty in the true mapping from input to output.
- Variance can be reduced by adding more data points; bias can be reduced by making the model more complex.
  - But reducing one tends to increase the other: a more complex model can overfit, which means more variance.
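A small sketch of the trade-off: fitting many resampled training sets with a simple and a complex model and comparing the bias and the spread of their predictions at one test point (degrees and sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    true_f = lambda x: np.sin(3 * x)             # the function we are modeling

    def bias_and_variance(x0, degree, n_train=20, trials=200):
        preds = []
        for _ in range(trials):                  # each trial uses a different training set
            x = rng.uniform(-1, 1, n_train)
            y = true_f(x) + rng.normal(scale=0.2, size=n_train)   # noise
            coeffs = np.polyfit(x, y, degree)
            preds.append(np.polyval(coeffs, x0))
        preds = np.array(preds)
        bias = preds.mean() - true_f(x0)         # systematic deviation from the true function
        return bias, preds.var()                 # spread due to the choice of training set

    for degree in (1, 9):                        # simple model vs. more complex model
        print(degree, bias_and_variance(0.5, degree))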
                                                                                 
                                                                         
                             
    
    
            Momentum & Adam
        
                        
                                                                                    
- Momentum replaces the step with a weighted sum of the current gradient and the previous gradients.
  - Momentum can be thought of as a prediction of where the next step is heading.
- Normalizing the gradient magnitudes means taking fixed-size steps, so we can get stuck bouncing around the optimum if we never land on it exactly.
  - Adam avoids this by computing momentum-style moving averages of both the gradients and the pointwise squared gradients, and by moderating (bias-correcting) them near the start of the sequence.
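A sketch of momentum and Adam updates for a generic gradient function grad (the hyperparameter values are the usual illustrative defaults, not prescriptions):

    import numpy as np

    def sgd_momentum(grad, phi, alpha=0.1, beta=0.9, steps=100):
        m = np.zeros_like(phi)
        for _ in range(steps):
            m = beta * m + (1 - beta) * grad(phi)   # weighted sum of current and past gradients
            phi = phi - alpha * m
        return phi

    def adam(grad, phi, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
        m = np.zeros_like(phi)                      # moving average of the gradients
        v = np.zeros_like(phi)                      # moving average of the squared gradients
        for t in range(1, steps + 1):
            g = grad(phi)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g ** 2
            m_hat = m / (1 - beta1 ** t)            # moderation (bias correction) near the start
            v_hat = v / (1 - beta2 ** t)
            phi = phi - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return phi

    # Toy example: minimize (phi - 3)^2; both optimizers should approach 3
    grad = lambda phi: 2 * (phi - 3.0)
    print(sgd_momentum(grad, np.array([0.0])), adam(grad, np.array([0.0])))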
                                                                                 
                                                                         
                             
                                                        
                                
    
    
            Stochastic Gradient Descent
        
                        
                                                                                    
- Full gradient descent can be slow, and not all gradient contributions are needed to find a good point.
- Instead, compute the gradient from only a subset of points: a mini-batch.
  - Work through the dataset by sampling without replacement.
- One pass through the data is called an epoch.
  - The added noise can help escape local minima, and every data point is used equally often.
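A sketch of the mini-batch loop, sampling without replacement by shuffling once per epoch (the linear least-squares model is purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=1000)

    phi = np.zeros(5)
    alpha, batch_size = 0.05, 32

    for epoch in range(20):                        # one pass through the data = one epoch
        order = rng.permutation(len(X))            # sample without replacement
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]  # the mini-batch
            grad = 2 * X[idx].T @ (X[idx] @ phi - y[idx]) / len(idx)
            phi -= alpha * grad                    # noisy but cheap gradient step

    print(phi)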
                                                                                 
                                                                         
                             
    
    
            Backpropagation
        
                        
                                                                                    
- Two passes are made: a forward pass and a backward pass.
  - The forward pass computes the activations and intermediate values at each layer, and hence the loss.
- After the forward pass we still do not know the gradients, so the parameters cannot be updated yet (the units form a chain of dependencies).
  - The backward pass then computes the gradients of the loss with respect to the parameters by working through the network in reverse.
- This is very efficient but memory-hungry, since the intermediate values must be stored.
  - It is also awkward to split the computation apart (e.g. when parts of the model live on different machines).
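A sketch of the two passes for a tiny two-layer ReLU network with a squared loss (shapes and data are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=(4, 1)), rng.normal(size=(2, 1))
    W1, b1 = rng.normal(size=(3, 4)) * 0.5, np.zeros((3, 1))
    W2, b2 = rng.normal(size=(2, 3)) * 0.5, np.zeros((2, 1))

    # Forward pass: compute and store the intermediate values and the loss
    z1 = W1 @ x + b1
    h1 = np.maximum(0.0, z1)           # ReLU activation
    y_hat = W2 @ h1 + b2
    loss = np.sum((y_hat - y) ** 2)

    # Backward pass: chain rule, working from the loss back to the parameters
    d_yhat = 2 * (y_hat - y)
    dW2, db2 = d_yhat @ h1.T, d_yhat
    d_h1 = W2.T @ d_yhat
    d_z1 = d_h1 * (z1 > 0)             # gradient through the ReLU
    dW1, db1 = d_z1 @ x.T, d_z1

    print(loss, dW1.shape, dW2.shape)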
                                                                                 
                                                                         
                             
    
    
            Gradient Descent
        
                        
                                                                                    
- Gradient descent finds the optimal point (for a convex function) by stepping toward it, i.e. moving against the computed gradient.
- The derivative of the loss function with respect to the parameters is calculated, and the parameters are updated by subtracting it; a learning rate scales the step to speed it up or slow it down.
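A sketch of the basic update rule on a convex toy loss (the learning rate is an arbitrary choice):

    import numpy as np

    # Convex toy loss L(phi) = (phi - 2)^2 and its derivative
    loss = lambda phi: (phi - 2.0) ** 2
    grad = lambda phi: 2.0 * (phi - 2.0)

    phi, alpha = 0.0, 0.1                 # initial parameter and learning rate
    for step in range(50):
        phi = phi - alpha * grad(phi)     # step against the gradient

    print(phi, loss(phi))                 # phi approaches 2, loss approaches 0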
                                                                                 
                                                                         
                             
    
    
            Deep neural networks
        
                        
                                                                                    
- Simply neural networks with more than one hidden layer.
  - Better than just feeding the output of one shallow network into another: a deep network gets more regions from fewer parameters.
- The outputs of one hidden layer go into the next hidden layer as inputs.
- Deep networks also obey the universal approximation theorem.
  - The difference from shallow networks is that they produce more regions per parameter.
- The hyperparameters are the number of layers K (depth) and the number of units D_i at layer i (width).
  - There exist functions that a shallow network would need far too many units to approximate.
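A sketch of the forward pass of a deep network, where each hidden layer's output feeds the next (the widths D_i are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    widths = [2, 4, 4, 3, 1]                   # input D_0, hidden widths, output size
    params = [(rng.normal(size=(d_out, d_in)), np.zeros((d_out, 1)))
              for d_in, d_out in zip(widths[:-1], widths[1:])]

    def forward(x):
        h = x
        for i, (W, b) in enumerate(params):
            pre = W @ h + b
            # ReLU on the hidden layers; the final layer stays linear
            h = np.maximum(0.0, pre) if i < len(params) - 1 else pre
        return h

    print(forward(rng.normal(size=(2, 1))))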
                                                                                 
                                                                         
                             
    
    
            Convolutional networks
        
                        
                                                                                    
- Parameters only look at local image patches, and the same parameters are shared across the image.
  - The convolution operation takes a weighted combination of nearby inputs.
- Stride = shift by k positions between outputs: decreases the output size.
- Kernel size = weight a different number of inputs for each output: combines information from a larger area.
- Dilated (atrous) convolutions = intersperse the kernel weights with zeros: combine information over a larger area while using few parameters.
- Because we do not want to lose information, several convolutions are applied in parallel and stacked as channels (feature maps).
  - The receptive field of a unit is the region of the input space that can affect it.
- The benefit of CNNs is a better inductive bias: the network is forced to process each location similarly, shares information across locations, and searches a smaller family of input/output mappings.
  - Downsampling reduces the number of positions in the data (max pooling, i.e. taking the maximum, is the most common), while upsampling increases it.
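A sketch of a 1-D convolution with stride and dilation, showing the shared local weights (kernel values and sizes are illustrative):

    import numpy as np

    def conv1d(x, kernel, stride=1, dilation=1):
        # The same kernel (shared parameters) is applied at every position of the input
        span = dilation * (len(kernel) - 1) + 1        # extent the kernel covers
        out = []
        for start in range(0, len(x) - span + 1, stride):
            window = x[start:start + span:dilation]    # dilation skips positions
            out.append(np.dot(window, kernel))         # weighted combination of a local patch
        return np.array(out)

    x = np.arange(10, dtype=float)
    k = np.array([0.25, 0.5, 0.25])
    print(conv1d(x, k))                 # 8 outputs
    print(conv1d(x, k, stride=2))       # stride 2: output size roughly halved
    print(conv1d(x, k, dilation=2))     # dilation: larger receptive field, same 3 weights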
                                                                                 
                                                                         
                             
                                                        
                                
    
    
            Reinforcement Learning
        
                        
                                                                                    
- Built around a set of states, actions, and rewards.
  - The goal is to maximize reward by learning which actions to take in which states.
- No fixed dataset is involved.
  - Data is received from the world as it is explored.
- Examples: chess, video games, etc.
- Difficulties: it is stochastic; temporal credit assignment (was the reward earned by this move or by earlier moves?); and the exploration-exploitation trade-off (when to explore and when not to).
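A minimal sketch of the state/action/reward loop with an epsilon-greedy rule for the exploration-exploitation trade-off; the toy environment is made up and temporal credit assignment is ignored:

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 3, 2
    true_reward = rng.uniform(size=(n_states, n_actions))   # toy expected rewards

    value = np.zeros((n_states, n_actions))    # running average reward estimates
    counts = np.zeros((n_states, n_actions))
    epsilon = 0.1                              # how often to explore

    state = 0
    for step in range(5000):
        if rng.uniform() < epsilon:            # explore: try a random action
            action = int(rng.integers(n_actions))
        else:                                  # exploit: best action seen so far
            action = int(np.argmax(value[state]))
        reward = true_reward[state, action] + rng.normal(scale=0.1)   # stochastic reward
        counts[state, action] += 1
        value[state, action] += (reward - value[state, action]) / counts[state, action]
        state = int(rng.integers(n_states))    # move on to a (random) next state

    print(np.argmax(value, axis=1), np.argmax(true_reward, axis=1))   # learned vs. true best actions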
                                                                                 
                                                                         
                             
    
    
            Shallow neural network
        
                        
                                                                                    
- Use a nonlinear activation function to mold a family of functions onto the dataset.
- Common activation functions: ReLU, sigmoid/softmax (as a final layer), the tanh function (similar in shape to the sigmoid), etc.
- The input is normally passed through a set of linear functions, and the activation function transforms them; the transformed values form the hidden layer.
  - A given hidden unit is active or inactive depending on that function.
- Called shallow because there is only one hidden layer.
  - The universal approximation theorem states that, with enough hidden units, such a network can approximate any continuous function on a compact subset to arbitrary accuracy.
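A sketch of a shallow network with one ReLU hidden layer, y = phi_0 + sum_k phi_k * ReLU(theta_k0 + theta_k1 * x), with randomly chosen weights for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    D = 5                                    # number of hidden units
    theta = rng.normal(size=(D, 2))          # theta[k] = (offset, slope) of hidden unit k
    phi = rng.normal(size=D + 1)             # output weights, phi[0] is the bias

    def shallow_net(x):
        h = np.maximum(0.0, theta[:, 0] + theta[:, 1] * x)   # hidden layer (ReLU)
        return phi[0] + phi[1:] @ h                          # linear combination of the units

    print([round(float(shallow_net(x)), 3) for x in (-1.0, 0.0, 1.0)])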
                                                                                 
                                                                         
                             
    
    
            Maximum likelihood
        
                        
                                                                                    
- The points in a dataset can be seen as draws from an underlying distribution; the idea of the likelihood function is to estimate this distribution.
- The model predicts a conditional probability Pr(y|x) = Pr(y|θ) = Pr(y|f[x,ϕ]).
  - Here the loss function aims to give the correct outputs high probability.
- So find the argmax over ϕ (or the argmin if we negate the objective function).
  - The product of probabilities can be a very small value, so the log is taken, turning it into a summation.
- Softmax is used in the case of multiclass classification (sketched below).
  - It converts a vector of K real numbers into a probability distribution over K possible outcomes.
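A sketch of the softmax and the negative log-likelihood for multiclass classification (the logits and labels are made up):

    import numpy as np

    def softmax(z):
        # Convert K real numbers into a probability distribution over K outcomes
        e = np.exp(z - z.max(axis=-1, keepdims=True))   # subtract the max for stability
        return e / e.sum(axis=-1, keepdims=True)

    # Model outputs f[x, phi] (logits) for 4 examples and K = 3 classes, plus the true labels
    logits = np.array([[2.0, 0.5, -1.0],
                       [0.1, 1.2, 0.3],
                       [-0.5, 0.0, 2.5],
                       [1.0, 1.0, 1.0]])
    labels = np.array([0, 1, 2, 0])

    probs = softmax(logits)
    # Negative log-likelihood: the log turns the product over examples into a sum
    nll = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
    print(nll)        # minimizing this gives the correct outputs high probability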
                                                                                 
                                                                         
                             
    
    
            Unsupervised Learning
        
                        
                                                                                    
- Learning from a dataset without any labels.
  - The dataset is organized in an input-only fashion.
- Examples: clustering, outlier detection, generating new examples, filling in missing data.
- There are generative models, such as generative adversarial networks.
- There are also probabilistic generative models, which learn the distribution over the data. Examples are autoencoders, normalizing flows, and diffusion models.
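As a concrete sketch of the clustering case, a minimal k-means loop on unlabeled 2-D points (the data and k are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    # Inputs only, no labels: two blobs of 2-D points
    X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
                   rng.normal(loc=4.0, size=(50, 2))])

    k = 2
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers
    for _ in range(20):
        # Assign each point to its nearest center, then move the centers to the means
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    print(centers)   # cluster structure discovered without any labels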
                                                                                 
                                                                         
                             