Supervised Learning in R: Regression Cheat Sheet

What is a Regression?

From a machine learning perspective, the term regression generally encompasses the prediction of continuous values. Statistically, these predictions are the expected value, or the average value one would observe for the given input values.

In R, use the lm() function to fit an OLS (ordinary least squares) model. Printing the model shows the sign and magnitude of each coefficient.

After estimating the coefficients, you can make predictions on your training data to compare actual and predicted values. Use ggplot() from the ggplot2 package and add a reference line of slope 1 to see whether the points fall close to the line.

You can also plot a gain curve to measure how well your model sorts the outcome. This is useful when sorting instances is more important than predicting the exact outcome.
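As a minimal sketch of this workflow, the following fits an OLS model, prints it, and plots actual against predicted values. The built-in mtcars data, the formula, and the column name pred are assumptions chosen for illustration; they are not from the original cheat sheet. The plot is only drawn if ggplot2 is installed.

```r
# Illustrative data: the built-in mtcars data set (an assumption for this sketch)
dframe <- mtcars

# Fit an OLS model: miles per gallon as a function of weight and horsepower
model <- lm(mpg ~ wt + hp, data = dframe)

# Printing the model shows the sign and magnitude of each coefficient
print(model)

# Predict on the training data
dframe$pred <- predict(model, newdata = dframe)

# Plot actual vs. predicted values with a reference line of slope 1
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(dframe, aes(x = pred, y = mpg)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0, color = "blue")
  print(p)
}
```

Points close to the reference line indicate predictions close to the actual values.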

An example of the Gain Curve.

The diagonal line represents the gain curve if the outcome were sorted randomly. A perfect model would trace out the green curve, and our model traces out the blue curve. Per the above graph, the model correctly identified the top 30% of homes by value and sorted them by price correctly.

We can also see that the top 30% of highest-priced homes account for 50% of total home sales ($).
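To make the construction of a gain curve concrete, here is a rough base-R sketch of how its points can be computed: sort the rows by the model's prediction (highest first), then accumulate the share of the total outcome captured. The helper gain_curve() and the toy prices are hypothetical and made up for illustration; they are not part of WVPlots or the original cheat sheet.

```r
# Hypothetical helper (not part of WVPlots): computes the points of a gain curve
gain_curve <- function(pred, outcome) {
  ord <- order(pred, decreasing = TRUE)  # sort rows by model score, highest first
  data.frame(
    frac_items   = seq_along(ord) / length(ord),       # x-axis: fraction of rows considered
    frac_outcome = cumsum(outcome[ord]) / sum(outcome) # y-axis: share of total outcome captured
  )
}

# Toy data (made up): five home prices and a model's predictions for them
price <- c(500, 300, 100, 60, 40)
pred  <- c(320, 450, 90, 80, 30)

model_curve   <- gain_curve(pred, price)   # curve traced out by the model
perfect_curve <- gain_curve(price, price)  # a perfect model sorts by the true outcome

print(model_curve)
print(perfect_curve)
```

The closer the model's curve sits to the perfect curve, the better the model sorts the outcome.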

Basic Regression Functions to get started

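lm(formula = dependent variable ~ independent variable1 + independent variable2 + ..., data) - Fit a linear model.

summary() - To get more details about the model results.

broom::glance() - See the model details in a tidier form.

predict(model, newdata) - Make predictions from a fitted model on new data.

ggplot(dframe, aes(x = pred, y = outcome)) + geom_point() + geom_abline() - Plot actual vs. predicted values with a reference line of slope 1.

WVPlots::GainCurvePlot(data, "prediction", "price", "model") - Plot the gain curve (arguments: data frame, prediction column, outcome column, plot title).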

Evaluating a regression model

RMSE = sqrt(mean(res²)). It is the typical prediction error of your model on the data; you want to minimize it.

R² = 1 - (RSS/TSS). A measure of how well the model fits or explains the data.

One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller than the standard deviation.

RSS = sum(res²), where res = outcome - prediction.
TSS = sum((outcome - mean(outcome))²).
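The formulas above can be computed directly in base R. This is a small sketch on the built-in mtcars data (an assumption for illustration; the dataset and formula are not from the original cheat sheet).

```r
# Illustrative data and model (assumptions for this sketch)
dframe <- mtcars
model  <- lm(mpg ~ wt + hp, data = dframe)

# Residuals: actual outcome minus predicted outcome
res <- dframe$mpg - predict(model, newdata = dframe)

rmse <- sqrt(mean(res^2))                       # typical prediction error
rss  <- sum(res^2)                              # residual sum of squares
tss  <- sum((dframe$mpg - mean(dframe$mpg))^2)  # total sum of squares
r2   <- 1 - rss / tss                           # R-squared

# Heuristic: compare the RMSE to the standard deviation of the outcome
sd_outcome <- sd(dframe$mpg)

cat("RMSE:", rmse, "| SD of outcome:", sd_outcome, "| R-squared:", r2, "\n")
```

On the training data, r2 computed this way should agree with summary(model)$r.squared.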

Important things to remember about Regression

Pros of Regression

Easy to fit and apply

Concise

Less prone to overfitting

Interpretable - One can easily observe the signs and magnitude of each coefficient.

Things to look out for

Cannot express complex, non-linear, non-additive relationships. Some non-linear relationships can be made linear by applying transformations (for example, taking logs).

Collinearity is when input variables are partially correlated. When independent variables are correlated, the signs of the coefficients may not be what you expect. Moreover, it can be difficult to separate out the individual effects of collinear variables on the response. Try manually removing offending variables, dimension reduction, or regularization.

Variance among residuals should be constant. Plot the residuals on the y-axis and predicted values on the x-axis. The errors should be evenly distributed between positive and negative, and have about the same magnitude above and below zero across the range of predictions.

Autocorrelation - Errors should be independent and uncorrelated. Plot row_number() on the x-axis and residuals on the y-axis; there should be no visible pattern.
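The checks in this list can be sketched in base R as follows. The mtcars data and the chosen inputs are assumptions for illustration, and base plot() is swapped in for ggplot2 to keep the sketch dependency-free. The variance inflation factor shown here is computed by hand as 1 / (1 - R²) from regressing one input on the others; it is an extra diagnostic not mentioned in the original cheat sheet.

```r
# Illustrative data and model (assumptions for this sketch)
dframe <- mtcars
model  <- lm(mpg ~ wt + hp + disp, data = dframe)

# Collinearity: pairwise correlations among the input variables
print(cor(dframe[, c("wt", "hp", "disp")]))

# Variance inflation factor for wt: regress it on the other inputs
r2_wt  <- summary(lm(wt ~ hp + disp, data = dframe))$r.squared
vif_wt <- 1 / (1 - r2_wt)
cat("VIF for wt:", vif_wt, "\n")

# Residuals for the diagnostic plots
dframe$pred <- predict(model, newdata = dframe)
dframe$res  <- dframe$mpg - dframe$pred

# Constant variance: residuals (y-axis) against predicted values (x-axis)
plot(dframe$pred, dframe$res, xlab = "Predicted", ylab = "Residual")
abline(h = 0, lty = 2)

# Independence: residuals (y-axis) against row number (x-axis)
plot(seq_len(nrow(dframe)), dframe$res, xlab = "Row number", ylab = "Residual")
abline(h = 0, lty = 2)
```

A funnel shape in the first plot suggests non-constant variance; runs or trends in the second suggest correlated errors.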

Training a regression model

It is crucial to split your data into training and test sets to evaluate how your model does on unseen data.
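A simple random train/test split might look like the sketch below. The 75/25 split ratio, the seed, the mtcars data, and the helper rmse() are all assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed, only to make the split reproducible

# Illustrative data (an assumption for this sketch)
dframe <- mtcars
n <- nrow(dframe)

# Randomly pick about 75% of the rows for training; the rest form the test set
train_idx <- sample(n, size = round(0.75 * n))
train <- dframe[train_idx, ]
test  <- dframe[-train_idx, ]

# Fit on the training set only
model <- lm(mpg ~ wt + hp, data = train)

# Hypothetical helper: root mean squared error
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

rmse_train <- rmse(train$mpg, predict(model, newdata = train))
rmse_test  <- rmse(test$mpg,  predict(model, newdata = test))

cat("Training RMSE:", rmse_train, "| Test RMSE:", rmse_test, "\n")
```

A test RMSE much larger than the training RMSE is a sign that the model does not generalize well to unseen data.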

k-fold cv

k-fold cross-validation (aka k-fold CV) is a resampling method that randomly divides the training data into k groups (aka folds) of approximately equal size. The model is fit on k − 1 folds and then the remaining fold is used to compute model performance. This procedure is repeated k times; each time, a different fold is treated as the validation set. Thus, the k-fold CV estimate is computed by averaging the k test errors, providing us with an approximation of the error we might expect on unseen data.

Training and testing using k-fold cv.

# kWayCrossValidation() returns a list of nSplits elements.
# Each element is a list with two index vectors: train and app.
splitPlan <- vtreat::kWayCrossValidation(nRows, nSplits, NULL, NULL)

# Initialize a column of the appropriate length
dframe$pred.cv <- 0

# nSplits is the number of folds
# splitPlan is the cross-validation plan
for (i in seq_len(nSplits)) {
  # Get the ith split
  split <- splitPlan[[i]]

  # Build a model on the training data from this split (lm, in this case)
  model <- lm(fmla, data = dframe[split$train, ])

  # Make predictions on the application data from this split
  dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app, ])
}

nRows - number of rows in training data
nSplits - number of folds.
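For readers without vtreat installed, the same procedure can be sketched in base R by assigning folds by hand. This is a swapped-in variant, not the vtreat approach: the fold assignment with sample(), the mtcars data, the formula, and the choice of 4 folds are assumptions for illustration.

```r
set.seed(123)  # arbitrary seed, only to make the fold assignment reproducible

# Illustrative data and formula (assumptions for this sketch)
dframe  <- mtcars
fmla    <- mpg ~ wt + hp
nRows   <- nrow(dframe)
nSplits <- 4

# Randomly assign each row to one of nSplits folds of roughly equal size
fold <- sample(rep(seq_len(nSplits), length.out = nRows))

# Column to hold the out-of-fold predictions
dframe$pred.cv <- NA_real_

for (i in seq_len(nSplits)) {
  app   <- which(fold == i)  # rows held out in this round
  train <- which(fold != i)  # rows used to fit the model

  model <- lm(fmla, data = dframe[train, ])
  dframe$pred.cv[app] <- predict(model, newdata = dframe[app, ])
}

# Cross-validated RMSE vs. RMSE of a model fit and scored on all the data
rmse_cv    <- sqrt(mean((dframe$mpg - dframe$pred.cv)^2))
full_model <- lm(fmla, data = dframe)
rmse_train <- sqrt(mean(residuals(full_model)^2))

cat("CV RMSE:", rmse_cv, "| Training RMSE:", rmse_train, "\n")
```

Every row receives exactly one prediction from a model that never saw it, so the cross-validated RMSE approximates the error on unseen data.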

Created By

patelivan

Metadata

Languages: English

Published: 28th May, 2021
