Supervised Learning in R: Regression Cheat Sheet by

What is a Regression?

From a machine learning perspective, the term regression generally encompasses the prediction of continuous values. Statistically, each prediction is the expected value of the outcome: the average value one would observe for the given input values.

In R, use the lm() function to fit an OLS (ordinary least squares) model. You can simply print the model to observe the sign and magnitude of each coefficient.
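As a minimal sketch of the lm() workflow described above, the following fits an OLS model on R's built-in mtcars data. The dataset and the formula mpg ~ wt + hp are assumptions chosen for illustration; they do not come from the original cheat sheet.

```r
# Illustrative only: mtcars and the formula mpg ~ wt + hp are assumptions,
# not part of the original cheat sheet.
fmla <- mpg ~ wt + hp

# Fit an ordinary least squares model
model <- lm(fmla, data = mtcars)

# Printing the model shows the call and the estimated coefficients
print(model)

# Extract the coefficients to inspect their sign and magnitude directly
coefs <- coef(model)
print(sign(coefs))
```

Here both wt and hp should come out with negative coefficients, which matches the intuition that heavier, more powerful cars use more fuel.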

After estimating the coefficients, you can make predictions on your training data and compare the predicted values with the actual values. Use ggplot2 to plot one against the other, and add a reference line of slope 1 to see whether the points lie close to the line.
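The comparison of predicted and actual values can be sketched as follows. This assumes the ggplot2 package is installed; the mtcars data, the formula, and the column name pred are illustrative choices, not taken from the cheat sheet.

```r
library(ggplot2)

# Illustrative model (dataset and formula are assumptions)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Store predictions on the training data in a new column
dframe <- mtcars
dframe$pred <- predict(model)

# Predictions on the x-axis, actual outcome on the y-axis,
# plus the reference line y = x (slope 1, intercept 0)
p <- ggplot(dframe, aes(x = pred, y = mpg)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "blue")
print(p)
```

Points that fall systematically above or below the line in some range of predictions suggest the model is biased in that range.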

You can also plot a gain curve to measure how well your model sorts the outcome. This is useful when ranking instances correctly matters more than predicting the exact outcome.

An example of the Gain Curve.

The diagonal line represents the gain curve if the outcome were sorted randomly. A perfect model would trace out the green curve, and our model traces out the blue curve. In the example graph, the model's curve follows the perfect curve for the first 30% of homes, meaning the model correctly identified the 30% highest-valued homes and sorted them by price correctly.

We can also see that the top 30% of highest-priced homes account for 50% of total home sales ($).
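To make the gain curve idea concrete, here is a rough sketch in base R of the quantities such a plot displays: instances are sorted by predicted value, and the cumulative share of the actual outcome is tracked. The price and prediction values are made-up toy numbers, and this is not how any particular plotting package computes the curve internally.

```r
# Hypothetical toy data: actual prices and model predictions (made-up values)
price <- c(100, 300, 200, 500, 400)
prediction <- c(120, 280, 310, 450, 390)

# Sort the actual outcomes by predicted value, highest prediction first
ord <- order(prediction, decreasing = TRUE)
sorted_price <- price[ord]

# x-axis: fraction of instances examined
# y-axis: fraction of the total outcome captured so far
frac_items <- seq_along(sorted_price) / length(sorted_price)
frac_outcome <- cumsum(sorted_price) / sum(sorted_price)

gain <- data.frame(frac_items, frac_outcome)
print(gain)
```

In this toy data the top 40% of homes by predicted price account for 60% of the total price, so the curve rises well above the diagonal.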

Basic Regression Functions to get started

lm(formula = dependent variable ~ independent variable1 + independent variable2 + ..., data) - Fit a linear model.
summary() - To get more details about the model results.
broom::glance() - See the model details in a tidier form.
predict(model, newdata) - Make predictions on new data.
ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() + geom_abline() - Plot predictions against actual outcomes.
WVPlots::GainCurvePlot(data, "prediction", "price", "model") - Plot the gain curve.

Evaluating a regression model

RMSE = sqrt(mean(res^2)), where res = outcome - prediction. It is the typical prediction error of your model on the data; lower is better.
R^2 = 1 - (RSS / TSS). A measure of how well the model fits or explains the data; values closer to 1 indicate a better fit.
One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller.

RSS = sum(res^2)
TSS = sum((outcome - mean(outcome))^2)
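The formulas above can be computed directly in base R. This is a small sketch on the built-in mtcars data; the dataset, the formula, and the variable names are assumptions made for illustration.

```r
# Illustrative model (dataset and formula are assumptions)
model <- lm(mpg ~ wt + hp, data = mtcars)
outcome <- mtcars$mpg

# Residuals: actual outcome minus prediction
res <- outcome - predict(model)

# Root mean squared error
rmse <- sqrt(mean(res^2))

# R-squared from the residual and total sums of squares
rss <- sum(res^2)
tss <- sum((outcome - mean(outcome))^2)
r2 <- 1 - rss / tss

# Heuristic: a useful model has an RMSE below the outcome's standard deviation
sd_outcome <- sd(outcome)
print(c(rmse = rmse, sd = sd_outcome, r2 = r2))
```

For a model fit with lm(), the hand-computed r2 should agree with summary(model)$r.squared on the training data.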

Important things to remember about Regression

Pros of Regression
Easy to fit and apply
Less prone to overfit
Interpretable - One can easily observe the sign and magnitude of each coefficient.
Things to look out for
Cannot express complex, non-linear, non-additive relationships. Non-linear relationships can sometimes be made linear by applying transformations.
Collinearity is when input variables are partially correlated. When independent variables are correlated, the signs of the coefficients may not be what you expect. Moreover, it can be difficult to separate out the individual effects of collinear variables on the response. Try manually removing offending variables, dimension reduction, or regularization.
Variance among residuals should be constant. Plot the residuals on the y-axis and predicted values on the x-axis. The residuals should be evenly distributed between positive and negative, and have about the same magnitude above and below zero.
Autocorrelation - Errors should be independent and uncorrelated. Plot row_number() on the x-axis and residuals on the y-axis; there should be no visible pattern.
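The two residual checks described above can be sketched with base R graphics. The dataset and formula are illustrative assumptions, and base plot() is used here instead of ggplot2 only to keep the example dependency-free.

```r
# Illustrative model (dataset and formula are assumptions)
model <- lm(mpg ~ wt + hp, data = mtcars)
pred <- predict(model)
res <- residuals(model)

# Constant variance check: residuals against predicted values.
# Look for a band of roughly equal width around zero.
plot(pred, res, xlab = "Predicted value", ylab = "Residual")
abline(h = 0, lty = 2)

# Autocorrelation check: residuals against row order.
# Look for the absence of trends or runs of same-signed residuals.
plot(seq_along(res), res, xlab = "Row number", ylab = "Residual")
abline(h = 0, lty = 2)
```

A funnel shape in the first plot hints at non-constant variance, while a wave or drift in the second hints at correlated errors.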

Training a regression model

It is crucial to split your data into training and test sets to evaluate how your model does on unseen data.
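A simple random split might look like the sketch below. The 75/25 ratio, the seed, the dataset, and the helper name rmse are all assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

n <- nrow(mtcars)

# Randomly pick about 75% of the rows for training (ratio is an assumption)
train_idx <- sample(n, size = floor(0.75 * n))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# Fit on the training set only
model <- lm(mpg ~ wt + hp, data = train)

# Hypothetical helper: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Compare the error on seen and unseen data
rmse_train <- rmse(train$mpg, predict(model, newdata = train))
rmse_test <- rmse(test$mpg, predict(model, newdata = test))
print(c(train = rmse_train, test = rmse_test))
```

A test RMSE much larger than the training RMSE is a sign that the model does not generalize well.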

k-fold cv

k-fold cross-validation (aka k-fold CV) is a resampling method that randomly divides the training data into k groups (aka folds) of approximately equal size. The model is fit on k - 1 folds and then the remaining fold is used to compute model performance. This procedure is repeated k times; each time, a different fold is treated as the validation set. Thus, the k-fold CV estimate is computed by averaging the k test errors, providing us with an approximation of the error we might expect on unseen data.

Training and testing using k-fold cv.

# nRows - number of rows in the training data
# nSplits - number of folds (k)
# Returns a list of nSplits elements.
# Each element is a list with two index vectors: train and app.
splitPlan <- vtreat::kWayCrossValidation(nRows, nSplits, NULL, NULL)

# Initialize a column of the appropriate length.
# The column name is missing in the original; pred.cv is a placeholder.
dframe$pred.cv <- 0

for (i in seq_len(nSplits)) {
  # Get the ith split
  split <- splitPlan[[i]]

  # Build a model on the training data from this split (lm, in this case)
  model <- lm(fmla, data = dframe[split$train, ])

  # Make predictions on the application data from this split
  dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app, ])
}
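For readers without vtreat installed, the same procedure can be sketched in base R. Note that this swaps the vtreat split plan for a plain random fold assignment; the dataset, the formula, k = 3, and the names fold, pred.cv, and fold_rmse are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

k <- 3                   # number of folds (assumption)
dframe <- mtcars         # illustrative data
fmla <- mpg ~ wt + hp    # illustrative formula
n <- nrow(dframe)

# Randomly assign each row to one of k folds of near-equal size
fold <- sample(rep(seq_len(k), length.out = n))

dframe$pred.cv <- 0
fold_rmse <- numeric(k)

for (i in seq_len(k)) {
  app <- which(fold == i)     # rows held out in this round
  train <- which(fold != i)   # rows used to fit the model

  model <- lm(fmla, data = dframe[train, ])
  dframe$pred.cv[app] <- predict(model, newdata = dframe[app, ])

  # Test error on the held-out fold
  fold_rmse[i] <- sqrt(mean((dframe$mpg[app] - dframe$pred.cv[app])^2))
}

# k-fold CV estimate: the average of the k held-out errors
cv_rmse <- mean(fold_rmse)
print(cv_rmse)
```

Every row receives exactly one out-of-fold prediction, so pred.cv can also be plotted against the outcome to inspect cross-validated performance.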

