\documentclass[10pt,a4paper]{article} % Packages \usepackage{fancyhdr} % For header and footer \usepackage{multicol} % Allows multicols in tables \usepackage{tabularx} % Intelligent column widths \usepackage{tabulary} % Used in header and footer \usepackage{hhline} % Border under tables \usepackage{graphicx} % For images \usepackage{xcolor} % For hex colours %\usepackage[utf8x]{inputenc} % For unicode character support \usepackage[T1]{fontenc} % Without this we get weird character replacements \usepackage{colortbl} % For coloured tables \usepackage{setspace} % For line height \usepackage{lastpage} % Needed for total page number \usepackage{seqsplit} % Splits long words. %\usepackage{opensans} % Can't make this work so far. Shame. Would be lovely. \usepackage[normalem]{ulem} % For underlining links % Most of the following are not required for the majority % of cheat sheets but are needed for some symbol support. \usepackage{amsmath} % Symbols \usepackage{MnSymbol} % Symbols \usepackage{wasysym} % Symbols %\usepackage[english,german,french,spanish,italian]{babel} % Languages % Document Info \author{Ivan Patel (patelivan)} \pdfinfo{ /Title (introductory-statistics-in-r.pdf) /Creator (Cheatography) /Author (Ivan Patel (patelivan)) /Subject (Introductory Statistics in R Cheat Sheet) } % Lengths and widths \addtolength{\textwidth}{6cm} \addtolength{\textheight}{-1cm} \addtolength{\hoffset}{-3cm} \addtolength{\voffset}{-2cm} \setlength{\tabcolsep}{0.2cm} % Space between columns \setlength{\headsep}{-12pt} % Reduce space between header and content \setlength{\headheight}{85pt} % If less, LaTeX automatically increases it \renewcommand{\footrulewidth}{0pt} % Remove footer line \renewcommand{\headrulewidth}{0pt} % Remove header line \renewcommand{\seqinsert}{\ifmmode\allowbreak\else\-\fi} % Hyphens in seqsplit % This two commands together give roughly % the right line height in the tables \renewcommand{\arraystretch}{1.3} \onehalfspacing % Commands 
\newcommand{\SetRowColor}[1]{\noalign{\gdef\RowColorName{#1}}\rowcolor{\RowColorName}} % Shortcut for row colour \newcommand{\mymulticolumn}[3]{\multicolumn{#1}{>{\columncolor{\RowColorName}}#2}{#3}} % For coloured multi-cols \newcolumntype{x}[1]{>{\raggedright}p{#1}} % New column types for ragged-right paragraph columns \newcommand{\tn}{\tabularnewline} % Required as custom column type in use % Font and Colours \definecolor{HeadBackground}{HTML}{333333} \definecolor{FootBackground}{HTML}{666666} \definecolor{TextColor}{HTML}{333333} \definecolor{DarkBackground}{HTML}{6495ED} \definecolor{LightBackground}{HTML}{F5F8FD} \renewcommand{\familydefault}{\sfdefault} \color{TextColor} % Header and Footer \pagestyle{fancy} \fancyhead{} % Set header to blank \fancyfoot{} % Set footer to blank \fancyhead[L]{ \noindent \begin{multicols}{3} \begin{tabulary}{5.8cm}{C} \SetRowColor{DarkBackground} \vspace{-7pt} {\parbox{\dimexpr\textwidth-2\fboxsep\relax}{\noindent \hspace*{-6pt}\includegraphics[width=5.8cm]{/web/www.cheatography.com/public/images/cheatography_logo.pdf}} } \end{tabulary} \columnbreak \begin{tabulary}{11cm}{L} \vspace{-2pt}\large{\bf{\textcolor{DarkBackground}{\textrm{Introductory Statistics in R Cheat Sheet}}}} \\ \normalsize{by \textcolor{DarkBackground}{Ivan Patel (patelivan)} via \textcolor{DarkBackground}{\uline{cheatography.com/135316/cs/28534/}}} \end{tabulary} \end{multicols}} \fancyfoot[L]{ \footnotesize \noindent \begin{multicols}{3} \begin{tabulary}{5.8cm}{LL} \SetRowColor{FootBackground} \mymulticolumn{2}{p{5.377cm}}{\bf\textcolor{white}{Cheatographer}} \\ \vspace{-2pt}Ivan Patel (patelivan) \\ \uline{cheatography.com/patelivan} \\ \end{tabulary} \vfill \columnbreak \begin{tabulary}{5.8cm}{L} \SetRowColor{FootBackground} \mymulticolumn{1}{p{5.377cm}}{\bf\textcolor{white}{Cheat Sheet}} \\ \vspace{-2pt}Published 14th July, 2021.\\ Updated 14th July, 2021.\\ Page {\thepage} of \pageref{LastPage}. 
\end{tabulary} \vfill \columnbreak \begin{tabulary}{5.8cm}{L} \SetRowColor{FootBackground} \mymulticolumn{1}{p{5.377cm}}{\bf\textcolor{white}{Sponsor}} \\ \SetRowColor{white} \vspace{-5pt} %\includegraphics[width=48px,height=48px]{dave.jpeg} Measure your website readability!\\ www.readability-score.com \end{tabulary} \end{multicols}} \begin{document} \raggedright \raggedcolumns % Set font size to small. Switch to any value % from this page to resize cheat sheet text: % www.emerson.emory.edu/services/latex/latex_169.html \footnotesize % Small font. \begin{tabularx}{17.67cm}{x{8.635 cm} x{8.635 cm} } \SetRowColor{DarkBackground} \mymulticolumn{2}{x{17.67cm}}{\bf\textcolor{white}{Summary Statistics}} \tn % Row 0 \SetRowColor{LightBackground} Descriptive statistics summarize the data at hand. & Inferential statistics use sample data to make inferences or conclusions about a larger population. \tn % Row Count 5 (+ 5) % Row 1 \SetRowColor{white} Continuous numeric data can be measured. Discrete numeric data is usually count data, such as the number of pets. & Nominal categorical data, such as gender or marital status, has no inherent ordering. Ordinal categorical data does have an ordering. \tn % Row Count 12 (+ 7) % Row 2 \SetRowColor{LightBackground} Mean, median, and mode are the typical measures of center. The mean is sensitive to outliers, so use the median when the data is skewed. But always note the distribution and explain why you chose one measure over another. & Variance is the average squared distance of each data point from the data's mean. For sample variance, divide the sum of squared distances by the number of data points minus 1. \tn % Row Count 23 (+ 11) % Row 3 \SetRowColor{white} M.A.D. is the mean absolute deviation: the average absolute distance of each data point from the mean. & Standard deviation is the square root of variance. \tn % Row Count 27 (+ 4) % Row 4 \SetRowColor{LightBackground} Quartiles split the data into four equal parts at the 0th, 25th, 50th, 75th, and 100th percentiles. Thus, the second quartile is the median. 
You can use boxplots to visualize quartiles. & Quantiles generalize quartiles: they can split the data into n pieces. \tn % Row Count 35 (+ 8) \end{tabularx} \par\addvspace{1.3em} \begin{tabularx}{17.67cm}{x{8.635 cm} x{8.635 cm} } \SetRowColor{DarkBackground} \mymulticolumn{2}{x{17.67cm}}{\bf\textcolor{white}{Summary Statistics (cont)}} \tn % Row 5 \SetRowColor{LightBackground} The interquartile range (IQR) is the distance between the 25th and 75th percentiles (q3 - q1). & Outliers are data points that are "substantially" different from the others. A common rule of thumb: {\emph{data \textless{} q1-1.5*IQR}} or {\emph{data \textgreater{} q3+1.5*IQR.}} \tn % Row Count 6 (+ 6) \hhline{>{\arrayrulecolor{DarkBackground}}--} \end{tabularx} \par\addvspace{1.3em} \begin{tabularx}{17.67cm}{X} \SetRowColor{DarkBackground} \mymulticolumn{1}{x{17.67cm}}{\bf\textcolor{white}{Calculating summary stats in R}} \tn \SetRowColor{LightBackground} \mymulticolumn{1}{x{17.67cm}}{\# Using food consumption data to show how to use dplyr verbs and calculate a column's summary stats. \newline \newline \# Calculate Belgium's and USA's "typical" food consumption and its spread. \newline food\_consumption \%\textgreater{}\% \newline filter(country \%in\% c('Belgium', 'USA')) \%\textgreater{}\% \newline group\_by(country) \%\textgreater{}\% \newline \seqsplit{summarize(mean\_consumption} = mean(consumption), \newline median\_consumption = median(consumption), \newline sd\_consumption = sd(consumption)) \newline \newline \# Make a histogram to see the distribution of rice's carbon footprint. A great way to see how skewed the variable is. 
\newline \newline food\_consumption \%\textgreater{}\% \newline \# Filter for rice food category \newline filter(food\_category == "rice") \%\textgreater{}\% \newline \# Create histogram of co2\_emission \newline \seqsplit{ggplot(aes(co2\_emission))} + \newline geom\_histogram() \newline \newline \# Calculate the quartiles of co2 emission \newline quantile(food\_consumption\$co2\_emission) \newline \newline \# To split the data into n pieces, pass n+1 evenly spaced probabilities (n is the number of pieces you want). \newline quantile(food\_consumption\$co2\_emission, probs = seq(0, 1, 1/n)) \newline \newline \# Calculate variance and sd of co2\_emission for each food\_category \newline food\_consumption \%\textgreater{}\% \newline \seqsplit{group\_by(food\_category)} \%\textgreater{}\% \newline summarize(var\_co2 = var(co2\_emission), \newline sd\_co2 = sd(co2\_emission)) \newline \newline \# Plot food\_consumption with co2\_emission on x-axis \newline ggplot(data = food\_consumption, aes(co2\_emission)) + \newline \# Create a histogram \newline geom\_histogram() + \newline \# Create a separate sub-graph for each food\_category \newline facet\_wrap(\textasciitilde{} food\_category)} \tn \hhline{>{\arrayrulecolor{DarkBackground}}-} \end{tabularx} \par\addvspace{1.3em} \begin{tabularx}{17.67cm}{x{8.635 cm} x{8.635 cm} } \SetRowColor{DarkBackground} \mymulticolumn{2}{x{17.67cm}}{\bf\textcolor{white}{Random Numbers and probability}} \tn % Row 0 \SetRowColor{LightBackground} p(event) = \# ways event can happen / total \# of possible outcomes. & Sampling can be done with or without replacement. \tn % Row Count 4 (+ 4) % Row 1 \SetRowColor{white} Two events are independent if p(second event) is not affected by the outcome of the first event. & A probability distribution describes the probability of each outcome in a scenario. \tn % Row Count 9 (+ 5) % Row 2 \SetRowColor{LightBackground} The expected value is the mean of a probability distribution. 
& Discrete random variables can take on discrete outcomes. Thus, they have a discrete probability distribution. \tn % Row Count 15 (+ 6) % Row 3 \SetRowColor{white} A Bernoulli trial is an independent trial with only two possible outcomes, a success or a failure. & A binomial distribution is a probability distribution of the number of successes in a sequence of n Bernoulli trials. Described by two parameters: number of trials (n) and pr(success) (p). \tn % Row Count 25 (+ 10) % Row 4 \SetRowColor{LightBackground} The expected value of a binomial distribution is n * p. & Ensure that the trials are independent to use the binomial distribution. \tn % Row Count 29 (+ 4) \hhline{>{\arrayrulecolor{DarkBackground}}--} \SetRowColor{LightBackground} \mymulticolumn{2}{x{17.67cm}}{- When sampling with replacement, you are ensuring that p(event) stays the same in different trials. In other words, each pick is independent. \newline - Expected value is calculated by multiplying each value a random variable can take by its probability and summing those products. \newline - In a uniform distribution, all outcomes have the same probability.} \tn \hhline{>{\arrayrulecolor{DarkBackground}}--} \end{tabularx} \par\addvspace{1.3em} \begin{tabularx}{17.67cm}{X} \SetRowColor{DarkBackground} \mymulticolumn{1}{x{17.67cm}}{\bf\textcolor{white}{Sampling and Distributions in R}} \tn \SetRowColor{LightBackground} \mymulticolumn{1}{x{17.67cm}}{\# Randomly select n observations with or without replacement \newline df \%\textgreater{}\% sample\_n(n, replace = TRUE) \# n = \# of obs to sample; use replace = FALSE to sample without replacement \newline \newline \# Say you assume that the probability distribution of a random variable (wait time for ex.) is uniform between a min value and a max value. Then, the probability that this variable will take on a value less than x can be calculated as: \newline punif(x, min, max) \newline \newline \# To generate 1000 wait times between min and max. \newline runif(1000, min, max) 
\newline \newline \# Binomial distribution -{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}- \newline \# Usage: rbinom(\# of trials, \# of coins, pr(success)) \newline \newline rbinom(1, 1, 0.5) \# To simulate a single coin flip \newline rbinom(8, 1, 0.5) \# Eight flips of one coin. \newline rbinom(1, 8, 0.5) \# 1 flip of eight coins. Gives us the total \# of successes. \newline \newline \# Usage: dbinom(\# of successes, \# of trials, pr(success)) \newline dbinom(7, 10, 0.5) \# The chances of getting exactly 7 successes when you flip 10 coins. \newline pbinom(7, 10, 0.5) \# Chances of getting 7 successes or less when you flip 10 coins. \newline pbinom(7, 10, 0.5, lower.tail = FALSE) \# Chances of getting more than 7 successes when you flip 10 coins.} \tn \hhline{>{\arrayrulecolor{DarkBackground}}-} \end{tabularx} \par\addvspace{1.3em} \begin{tabularx}{17.67cm}{x{8.635 cm} x{8.635 cm} } \SetRowColor{DarkBackground} \mymulticolumn{2}{x{17.67cm}}{\bf\textcolor{white}{More distributions and the CLT}} \tn % Row 0 \SetRowColor{LightBackground} The Normal distribution is a continuous distribution that is symmetrical and has an area beneath the curve of 1. & It is described by its mean and standard deviation. The standard normal distribution has a mean of 0 and an sd of 1. \tn % Row Count 6 (+ 6) % Row 1 \SetRowColor{white} Regardless of the shape of the distribution you're taking sample means from, the central limit theorem will apply if the sampling distribution contains enough sample means. & The sampling distribution is the distribution of a sample statistic obtained by randomly sampling from a larger population. \tn % Row Count 15 (+ 9) % Row 2 \SetRowColor{LightBackground} To determine what kind of distribution a variable follows, plot its histogram. & The sampling distribution of a statistic becomes closer to the normal distribution as the number of trials increases. This is known as the CLT; the samples must be random and independent. 
\tn % Row Count 25 (+ 10) % Row 3 \SetRowColor{white} In a Poisson process, events happen at a known average rate but at completely random times. & For example, we may know that there are, on average, 2 earthquakes every month in a certain area, but the timing of each earthquake is completely random. \tn % Row Count 32 (+ 7) \end{tabularx} \par\addvspace{1.3em} \begin{tabularx}{17.67cm}{x{8.635 cm} x{8.635 cm} } \SetRowColor{DarkBackground} \mymulticolumn{2}{x{17.67cm}}{\bf\textcolor{white}{More distributions and the CLT (cont)}} \tn % Row 4 \SetRowColor{LightBackground} Thus, the Poisson distribution shows us the probability of some \# of events happening over a fixed period of time. & The Poisson distribution is described by lambda, which is the average number of events per time interval. \tn % Row Count 6 (+ 6) % Row 5 \SetRowColor{white} The exponential distribution gives the probability of a given amount of time passing between Poisson events, for example the probability of more than 1 day between pet adoptions. It is a continuous distribution and uses the same lambda value. & The expected value of an exponential distribution is 1/lambda, the average time between events; lambda itself is the rate. \tn % Row Count 17 (+ 11) % Row 6 \SetRowColor{LightBackground} (Student's) t-distribution has a similar shape to the normal distribution but has fatter tails. & Degrees of freedom (df) affect the t-distribution's tail thickness. \tn % Row Count 22 (+ 5) % Row 7 \SetRowColor{white} Variables that follow a log-normal distribution have a logarithm that is normally distributed. & There are many other distributions. \tn % Row Count 27 (+ 5) \hhline{>{\arrayrulecolor{DarkBackground}}--} \SetRowColor{LightBackground} \mymulticolumn{2}{x{17.67cm}}{- The peak of the Poisson distribution is at its lambda. \newline - Because we are counting the \# of events, the Poisson distribution is a discrete distribution. Thus, we can use dpois() and the matching ppois(), qpois(), and rpois() functions. 
\newline - Lower df = thicker tails and higher sd.} \tn \hhline{>{\arrayrulecolor{DarkBackground}}--} \end{tabularx} \par\addvspace{1.3em} \begin{tabularx}{17.67cm}{X} \SetRowColor{DarkBackground} \mymulticolumn{1}{x{17.67cm}}{\bf\textcolor{white}{More distributions and the CLT in R}} \tn \SetRowColor{LightBackground} \mymulticolumn{1}{x{17.67cm}}{\# Say you're a salesman and each deal you worked on was worth a different amount of money. You tracked every deal you worked on, and the amount column follows a normal distribution with a mean of \$5000 and sd of \$2000. \newline \newline \# Pr(deal \textless{} \$7500): \newline pnorm(7500, mean=5000, sd=2000) \newline \newline \# Pr(deal \textgreater{} \$1000) \newline pnorm(1000, mean=5000, sd=2000, lower.tail=FALSE) \newline \newline \# Pr(deal between \$3000 and \$7000) \newline pnorm(7000, mean=5000, sd=2000) - pnorm(3000, mean=5000, sd=2000) \newline \newline \# What amount will 75\% of your deals be worth more than? \newline qnorm(0.75, mean=5000, sd=2000, lower.tail=FALSE) \newline \newline \# Simulate 36 deals. \newline rnorm(36, mean=5000, sd=2000) \newline \newline \newline \# CLT in action -{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}-{}- \newline \# Say you also tracked how many users used the product you sold in a num\_users column. The CLT, in this case, says that the sampling distribution of the average number of users approaches the normal distribution as you take more samples. 
\newline \newline \# Set seed to 104 \newline set.seed(104) \newline \newline \# Sample 20 num\_users from amir\_deals and take the mean \newline sample(amir\_deals\$num\_users, size = 20, replace = TRUE) \%\textgreater{}\% \newline mean() \newline \newline \# Repeat the above 100 times \newline sample\_means \textless{}- replicate(100, \seqsplit{sample(amir\_deals\$num\_users}, size = 20, replace = TRUE) \%\textgreater{}\% mean()) \newline \newline \# Create data frame for plotting \newline samples \textless{}- data.frame(mean = sample\_means) \newline \newline \# Histogram of sample means \newline samples \%\textgreater{}\% ggplot(aes(x=mean)) + geom\_histogram(bins=10)} \tn \hhline{>{\arrayrulecolor{DarkBackground}}-} \end{tabularx} \par\addvspace{1.3em} \begin{tabularx}{17.67cm}{x{8.635 cm} x{8.635 cm} } \SetRowColor{DarkBackground} \mymulticolumn{2}{x{17.67cm}}{\bf\textcolor{white}{Correlation and Experimental Design}} \tn % Row 0 \SetRowColor{LightBackground} The correlation coefficient quantifies a linear relationship between two variables. Its magnitude corresponds to the strength of the relationship. & The number is between -1 and 1, and the sign corresponds to the relationship's direction. \tn % Row Count 7 (+ 7) % Row 1 \SetRowColor{white} The most common measure of correlation is the Pearson product-moment correlation (r). & Don't just calculate r blindly. Visualize the relationship first. \tn % Row Count 12 (+ 5) % Row 2 \SetRowColor{LightBackground} Sometimes, you must transform one or both variables to make a relationship linear and then calculate r. & The transformation choice will depend on your data. \tn % Row Count 18 (+ 6) % Row 3 \SetRowColor{white} And as always, correlation does not imply causation. Always consider confounding or hidden variables. & Experiments try to understand the effect of the treatment on the response. 
\tn % Row Count 24 (+ 6) % Row 4 \SetRowColor{LightBackground} In a randomized controlled trial, participants are randomly assigned by researchers to a treatment or control group. & In observational studies, participants are not randomly assigned to groups. Thus, they can establish association but not causation. \tn % Row Count 30 (+ 6) \end{tabularx} \par\addvspace{1.3em} \begin{tabularx}{17.67cm}{x{8.635 cm} x{8.635 cm} } \SetRowColor{DarkBackground} \mymulticolumn{2}{x{17.67cm}}{\bf\textcolor{white}{Correlation and Experimental Design (cont)}} \tn % Row 5 \SetRowColor{LightBackground} In a longitudinal study, participants are followed over a period of time to examine the treatment's effect. & In a cross-sectional study, data is collected from a single snapshot of time. \tn % Row Count 6 (+ 6) \hhline{>{\arrayrulecolor{DarkBackground}}--} \SetRowColor{LightBackground} \mymulticolumn{2}{x{17.67cm}}{- The correlation coefficient measures the strength of linear relationships only. \newline - Use a scatterplot and add a linear trend line to see a relationship between two variables. \newline - Other transformations include taking the square root, taking the reciprocal, the Box-Cox transformation, etc.} \tn \hhline{>{\arrayrulecolor{DarkBackground}}--} \end{tabularx} \par\addvspace{1.3em} \begin{tabularx}{17.67cm}{X} \SetRowColor{DarkBackground} \mymulticolumn{1}{x{17.67cm}}{\bf\textcolor{white}{Correlation and design in R}} \tn \SetRowColor{LightBackground} \mymulticolumn{1}{x{17.67cm}}{\# Make a scatter plot to view a bi-variate relationship \newline df \%\textgreater{}\% ggplot(aes(x=col\_1, y=col\_2)) + geom\_point() + \newline geom\_smooth(method='lm', se=FALSE) \# se=FALSE (usually) hides the confidence band \newline \newline \# Measure the correlation between two data frame columns \newline cor(df\$col\_1, df\$col\_2) \newline \newline \# Log-transform the x variable. 
\newline df \%\textgreater{}\% mutate(log\_x = log(col\_x)) \%\textgreater{}\% \# Natural log by default \newline ggplot(aes(x=log\_x, y=col\_y)) + geom\_point() + \newline geom\_smooth(method='lm', se=FALSE)} \tn \hhline{>{\arrayrulecolor{DarkBackground}}-} \end{tabularx} \par\addvspace{1.3em} \end{document}