
R Programming Cheat Sheet (DRAFT) by

Big Data for Marketing

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Data Structures

Vector
ordered array of elements of the same data type
 
a <- c(3, 1, 5)
Vector Naming
a <- c("desks" = 1, "tables" = 3, "chairs" = 4)
Vector Coercion
as.numeric(c(TRUE, FALSE, TRUE)) = 1 0 1
 
seq(1, 9, 2) and rep(c(2, 3, 4), 3)
Vector Subsetting
materials <- c(wood = 17, cloth = 36, silver = 24, gold = 3)
 
materials[1] returns wood = 17
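As a minimal sketch of the vector entries above, the following reuses the materials vector from the cheat sheet; the other variable names are only illustrative.

```r
# Named numeric vector, as in the cheat sheet
materials <- c(wood = 17, cloth = 36, silver = 24, gold = 3)

first_by_position <- materials[1]          # subset by position
by_name <- materials[c("wood", "gold")]    # subset by name
over_20 <- materials[materials > 20]       # subset by logical condition

# Coercion: logical values become 1/0 when numbers are needed
flags <- c(TRUE, FALSE, TRUE)
as_numbers <- as.numeric(flags)
n_true <- sum(flags)

# Generating vectors from a pattern
odd_numbers <- seq(1, 9, 2)
repeated <- rep(c(2, 3, 4), 3)
```

Subsetting a named vector keeps the names, so first_by_position still carries the label wood.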
Matrix
vector of elements arranged in two dimensions
 
m1 <- matrix(3:8, ncol = 3, nrow = 2)
 
m2 <- 3:8 and dim(m2) <- c(3, 2)
Factor
used to store categorical variables (numeric or character)
 
a <- c(0, 1, 0, 0, 1)
 
a.f <- factor(a, labels = c("Male", "Female"))
 
a.f = Male Female Male Male Female
gl() function
generate factors by specifying the pattern of their levels
 
gl(2, 8, labels = c("male", "female"))
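The factor() and gl() entries above can be tried out as follows. This is a small sketch that reuses the cheat sheet's own values; the variable name g is an assumption.

```r
# Numeric codes turned into a labelled factor: 0 -> Male, 1 -> Female
a <- c(0, 1, 0, 0, 1)
a.f <- factor(a, labels = c("Male", "Female"))

level_names <- levels(a.f)   # the categories
counts <- table(a.f)         # observations per category

# gl(): 2 levels, each repeated 8 times in a row
g <- gl(2, 8, labels = c("male", "female"))
```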
List
holds elements of multiple types; created with list()
 
Mike <- list(Name = "Mike", Salary = 10000, Age = 43, Children = c("Tom", "Lily", "Alice"))
$
a convenient way to retrieve an element by its name, e.g. Mike$Salary
str()
display the internal structure
c()
combine several lists into one
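A short sketch of the list operations above, using the Mike list from the cheat sheet. The extra list and its City element are hypothetical, added only to show c().

```r
Mike <- list(Name = "Mike", Salary = 10000, Age = 43,
             Children = c("Tom", "Lily", "Alice"))

salary <- Mike$Salary                     # retrieve by name with $
second_child <- Mike[["Children"]][2]     # [[ ]] also retrieves one element

str(Mike)                                 # prints the internal structure

extra <- list(City = "Singapore")         # hypothetical second list
combined <- c(Mike, extra)                # one list with all five elements
```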
Array
multi-dimensional arrangement of data in a vector.

Exploring Data

Missing Data
Causes
human error, system error, loopholes
Dealing
summary() - how much data is missing
missing categorical data
set a new category called “Unknown”
missing numerical data
assign the mean value, or assign a value based on its relationship to other related variables
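The two strategies above might look like this in base R. The customers table and its values are made up for illustration, and mean imputation is only one of the options the cheat sheet mentions.

```r
# Hypothetical customer data with missing values
customers <- data.frame(
  segment = c("Gold", NA, "Silver", "Gold"),
  spend   = c(120, 80, NA, 100),
  stringsAsFactors = FALSE
)

summary(customers)                          # shows NA's for numeric columns
na_counts <- colSums(is.na(customers))      # missing values per column

# Categorical: put missing values in a new "Unknown" category
customers$segment[is.na(customers$segment)] <- "Unknown"

# Numerical: replace missing values with the mean of the observed ones
customers$spend[is.na(customers$spend)] <- mean(customers$spend, na.rm = TRUE)
```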
Other Data Problems
data entry errors, logical errors, outdated data, inconsistent data

Data Visualization

Principles
Simplify, Compare, Attend (Details), Explore (Visual), View diversely, Ask why, Be skeptical, Respond
GGPlot2
(+) allows us to make complex and aesthetically pleasing plots quickly and intuitively
 
(-) works exclusively with data tables
Components
data
the data table that the plot summarizes
geometry
scatter plots, histograms, smooth densities, q-q plots, and box plots
aesthetic mapping
x and y axis
scale
by default, the ranges of the x-axis and y-axis are defined by the range of the data
labels, title, legend
Creating a New Plot
ggplot() function
specify the graph’s data component.
df %>% ggplot()
associates the dataset with the plotting object
geom_point()
add a layer, assigning population to x and total to y
aes()
recognizes variables from the data component
geom_label() and geom_text()
functions to add text to the plot.
Size and Color
geom_point(size = 3, color = "blue")
geom_histogram()
geom_density()
create smooth densities
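Putting the components together, a plot could be built up layer by layer as sketched below. It assumes the ggplot2 package is installed; the df table is a hypothetical stand-in with population and total columns, and the abb label column is also an assumption.

```r
library(ggplot2)

# Hypothetical stand-in for a table with population and total columns
df <- data.frame(
  abb        = c("A", "B", "C", "D"),
  population = c(1.2, 3.4, 5.6, 7.8),
  total      = c(10, 25, 40, 90)
)

# data + aesthetic mapping, then one layer per geometry
p <- ggplot(df, aes(x = population, y = total)) +
  geom_point(size = 3, color = "blue") +
  geom_text(aes(label = abb), nudge_x = 0.3) +
  labs(title = "Total versus population", x = "Population", y = "Total")

# A smooth density needs only an x mapping
d <- ggplot(df, aes(x = total)) + geom_density()
```

Printing p or d (for example by typing its name at the console) draws the plot.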

Programming Structure and Functions

Basic
if-else
use curly braces “{}”
if (boolean condition) { expressions } else { alternative expressions }
any() (similar to OR "|")
returns TRUE if any of the logicals are true
z <- c(TRUE, TRUE, FALSE); any(z)
TRUE
all() (similar to AND "&")
returns TRUE if all of the logicals are true
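A small sketch of if-else combined with any() and all(). The scores vector and the result messages are hypothetical.

```r
z <- c(TRUE, TRUE, FALSE)
any_true <- any(z)   # at least one element is TRUE
all_true <- all(z)   # not every element is TRUE

scores <- c(72, 85, 91)   # hypothetical values
if (all(scores > 70)) {
  result <- "everyone passed"
} else {
  result <- "someone failed"
}
```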
Basic Functions
my_function <- function(x) { operations that operate on x, which is supplied by the user of the function; the value of the final line is returned }
For Loops
for (i in range of values) { operations that use i, which changes across the range of values }
for (i in 1:5) { print(i) }
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
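The function and loop templates above can be filled in as follows. The helper avg is a hypothetical name chosen for illustration.

```r
# avg(): hypothetical helper; the value of the last expression is returned
avg <- function(x) {
  s <- sum(x)
  n <- length(x)
  s / n
}

# Running total built up with a for loop over 1..5
total <- 0
for (i in 1:5) {
  total <- total + i
}
```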
apply()
apply a function over the margins (rows or columns) of a matrix or data frame
apply(X, MARGIN, FUN, ...)
z <- cbind(A = 1:3, B = 4:6, C = 7:9, D = 10:12)
apply(z, 2, sum)
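For reference, the MARGIN argument selects the direction: 1 works row by row and 2 column by column. A sketch with the same matrix z:

```r
z <- cbind(A = 1:3, B = 4:6, C = 7:9, D = 10:12)

col_sums <- apply(z, 2, sum)   # one value per column: A, B, C, D
row_sums <- apply(z, 1, sum)   # one value per row
```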
lapply()
works on list or vector inputs instead of matrix/data frame input.
 
returns a list of the same length as the given list or vector.
x <- list(A = 1:4, B = seq(0.1, 1, by = 0.1))
lapply(x, mean)
sapply()
wrapper of the lapply() function. It also takes in a list or vector, but it simplifies the result to a vector (or matrix) where possible instead of returning a list
vapply()
behaves like sapply() except that we specify the type and length of each return value with FUN.VALUE
 
can be faster when we know the output is an atomic data type, which takes up less memory than a list.
rapply()
applies a specified function to all elements of a list recursively
x <- list(A = 2, B = list(-1, 3), C = list(-2, list(-5, 6)))
rapply(x, function(x) { x^2 }) # returns a vector
mapply()
takes multiple vectors as inputs and applies the function to their elements in parallel.
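The differences between these functions are easiest to see side by side. The sketch below reuses the cheat sheet's lists; the result names and the mapply() inputs are illustrative assumptions.

```r
x <- list(A = 1:4, B = seq(0.1, 1, by = 0.1))

as_list   <- lapply(x, mean)                          # always a list
as_vector <- sapply(x, mean)                          # simplified to a named vector
checked   <- vapply(x, mean, FUN.VALUE = numeric(1))  # errors if a result is not one number

# rapply() walks nested lists; by default the result is flattened to a vector
nested  <- list(A = 2, B = list(-1, 3), C = list(-2, list(-5, 6)))
squares <- rapply(nested, function(v) v^2)

# mapply() steps through several vectors in parallel
sums <- mapply(function(a, b) a + b, 1:3, 4:6)
```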
tapply()
applies the specified FUN to each group of a vector, grouped by the levels of one or more factors.
Pivot Table
grouping data by different fields
 
summarize the data with your own function for specific purposes
data(murders); tapply(murders$total, murders$region, sum)
tapply(murders$total / murders$population, murders$region, mean)
split()
split a data frame into a list of data frames based on a factor.
tapply()
group data by multiple factors
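A self-contained sketch of the pivot-table idea, using a made-up sales table rather than the murders data so that it runs without extra packages. All names and values here are hypothetical.

```r
# Hypothetical sales table
sales <- data.frame(
  region  = c("North", "South", "North", "South", "North"),
  channel = c("web", "web", "store", "store", "web"),
  amount  = c(10, 20, 30, 40, 50)
)

# One grouping factor: total amount per region
by_region <- tapply(sales$amount, sales$region, sum)

# Several grouping factors: pass them as a list to get a region x channel table
by_both <- tapply(sales$amount, list(sales$region, sales$channel), sum)

# split(): one data frame per region
pieces <- split(sales, sales$region)
```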
 

Basic Data Wrangling

Data Frame
use the data.frame() function. elements in the same column should be of the same data type.
 
name <- c("Anne"); age <- c(28); child <- c(FALSE)
 
df <- data.frame(name, age, child)
Data Frame Naming
names(df) <- c("Name", "Age", "Child")
Data Frame Structure
A Data Frame in R is implemented as a list of vectors, with the important restriction that all vectors have equal length.
 
Before R 4.0.0, data.frame() stored character data as factors by default
stringsAsFactors = FALSE
prevents R from converting character columns to factors
Data Frame Subsetting
“[]” and “[[]]” and “$”
 
df[3, 2] # row 3, column 2
c()
used to subset multiple portions of the Data Frame.
Data Frame Extension
adding new variables or observations to an existing Data Frame.
 
height <- c(163, 177, 163, 162, 157)
 
df$height <- height
Sorting
sort(df$age) # sorted ages; use df[order(df$age), ] to sort the whole DF by age
 
max(df$age) # getting the highest age
 
which.max(df$age) # index of the oldest person
Data Frame Indexing
find specific cases in DF
 
index <- df$height > 171
 
sum(index) #number of people taller than the male average
 
df$name[index] # person who is taller: pete
finding those older than 30 without children.
index <- df$age > 30 & df$child == FALSE
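The data frame steps above can be run end to end as sketched here. The heights come from the cheat sheet; the names, ages and child flags are hypothetical values chosen to fit them.

```r
# Hypothetical people; only the heights are taken from the cheat sheet
df <- data.frame(
  name  = c("Anne", "Pete", "Frank", "Julia", "Cath"),
  age   = c(28, 30, 21, 39, 35),
  child = c(FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
df$height <- c(163, 177, 163, 162, 157)    # extend with a new column

cell <- df[3, 2]                           # row 3, column 2 (Frank's age)

index <- df$height > 171                   # logical index
n_tall <- sum(index)                       # TRUE counts as 1
tall_names <- df$name[index]

older_no_child <- df$name[df$age > 30 & !df$child]
oldest <- df$name[which.max(df$age)]
by_age <- df[order(df$age), ]              # whole data frame sorted by age
```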
library(dplyr)
mutate() function
adds new columns to a DF (or modifies existing ones)
 
df <- mutate(df, bmi = weight / height^2 * 10000)
 
or df$bmi <- df$weight / df$height^2 * 10000
filter()
subset rows
 
filter(df, bmi > 18.5 & bmi < 24.9)
select()
health <- select(df, name, height, weight, bmi)
 
filter(health, bmi > 18.5 & bmi < 24.9)
%>%
chain these three functions together.
 
df %>% select(name, height, weight, bmi) %>% filter(bmi > 18.5 & bmi < 24.9)
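A runnable sketch of the mutate / select / filter chain, assuming dplyr is installed. The people table and its values are hypothetical.

```r
library(dplyr)

# Hypothetical table with height in cm and weight in kg
people <- data.frame(
  name   = c("Anne", "Pete", "Frank"),
  height = c(163, 177, 180),
  weight = c(55, 95, 70),
  stringsAsFactors = FALSE
)

healthy <- people %>%
  mutate(bmi = weight / height^2 * 10000) %>%   # add the bmi column
  select(name, height, weight, bmi) %>%         # keep chosen columns
  filter(bmi > 18.5 & bmi < 24.9)               # keep rows in the range
```

Each step receives the result of the previous one as its first argument, so no intermediate variables are needed.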
merge two DFs based on a key column
right_join & left_join
suffix
added to the column names from each data frame to make them unique in the result.
 
should be a vector with two elements
 
right_join(driver_q2, constructors, by = c("constructor" = "constructor"), suffix = c("_driver", "_constructor"))
inner_join
returns only the rows that have matching values in both data frames based on specified key columns
union
combine two data frames vertically, stacking them on top of each other and dropping duplicate rows.
anti_join
filtering rows from the first data frame based on values that do not have matching values in the second data frame.
commonly used for DFs
rbind & bind_rows
 

Advanced Data Wrangling

Importing Data
Via readr
read_csv: comma separated values
 
read_tsv: tab-separated values
 
read_delim: general delimited text file format
 
readr functions return a tibble; head() displays its first rows.
readxl
read_excel(), read_xls(), read_xlsx()
R-base
read.csv() and read.table() can be used without having to install any libraries
Before R 4.0.0, the R-base import functions automatically converted character strings to factors by default
CSV
in widespread use in the data science community due to its efficiency at storing large amounts of data and because it is platform agnostic. There is also no size limit with csv files.
Via URL
read_csv(url)
tempdir() & tempfile()
it is useful to have a temporary directory or file name generated automatically to manage these URL imports
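The temporary-file pattern can be sketched in base R as below. To stay runnable offline, the file is written locally instead of downloaded; with a real address you would fill it with download.file(url, tmp), where url is whatever source you are importing from.

```r
# Auto-generated file name inside the session's temporary directory
tmp <- tempfile(fileext = ".csv")

# Stand-in for download.file(url, tmp): write a small hypothetical file
write.csv(data.frame(id = 1:3, value = c(2.5, 3.5, 4.5)), tmp, row.names = FALSE)

imported <- read.csv(tmp)   # import from the temporary file
file.remove(tmp)            # clean up once the data is in memory
```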
Via JSON
provided via API, library(jsonlite), fromJSON(url)
Via XML
crawling a website, library(XML), xmlParse("books.xml")
xmlRoot()
access the root node of the tree.
xmlChildren()
access the child nodes of the tree
xmlToList(data), xmlToDataFrame(books)
convert the XML file to list or data frame format
Reshaping Data
Wide to Tidy: gather()
convert the above wide data into tidy data
country, year, fertility
new_tidy_data <- wide_data %>% gather(year, fertility, `1960`:`2015`)
Tidy to Wide: spread()
After the data, the first argument of the spread() function (key) declares which variable's values are to be used as column names, while the second argument (value) specifies the variable used to fill out the cells.
Separate and Unite
separate()
requires the target column, the names for the new columns and the separator character.
dat %>% separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right")
spread()
dat %>% separate(key, c("year", "variable_name"), extra = "merge") %>% spread(variable_name, value)
unite()
combines several columns into one, e.g. first name & last name
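The reshaping functions above can be tried on small tables as sketched here, assuming tidyr is installed. All tables and values below are made up for illustration, and only two year columns are used to keep the output short.

```r
library(tidyr)

# Hypothetical wide table: one column per year
wide_data <- data.frame(
  country = c("Germany", "South Korea"),
  `1960`  = c(2.4, 6.2),
  `1961`  = c(2.4, 6.0),
  check.names = FALSE
)

# Wide to tidy: year columns become rows
tidy_data <- gather(wide_data, year, fertility, `1960`:`1961`)

# Tidy back to wide: values of year become column names again
back_to_wide <- spread(tidy_data, year, fertility)

# separate(): split one column into two at the first "_"
dat <- data.frame(key = c("1960_fertility", "1960_life_expectancy"),
                  value = c(2.4, 69.3), stringsAsFactors = FALSE)
separated <- separate(dat, key, c("year", "variable_name"),
                      sep = "_", extra = "merge")

# unite(): paste two columns into one
people <- data.frame(first = "Ada", last = "Lovelace", stringsAsFactors = FALSE)
united <- unite(people, full_name, first, last, sep = " ")
```

Newer tidyr versions recommend pivot_longer() and pivot_wider() for the first two steps, but gather() and spread() remain available.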
Combining Data
join()
combined so that matching rows are together
Inner Join
returns only the rows that have matching values in both tables
Left Join
returns all the rows from the left table and the matching rows from the right table
Full Join
all the rows from both tables, with NA values in columns where there is no match in the other table
Semi Join
keeps the part of the first table for which we have information in the second table, but does not add the columns of the second.
Anti Join
opposite of the semi_join() function. It allows us to keep the part of the first table for which we have NO information in the second table, but does not add the columns of the second.
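The join types above differ only in which rows and columns survive. A minimal comparison, assuming dplyr is installed and using two hypothetical tables that share a customer key:

```r
library(dplyr)

# Hypothetical tables: C has no region, D has no order
orders  <- data.frame(customer = c("A", "B", "C"), amount = c(10, 20, 30),
                      stringsAsFactors = FALSE)
regions <- data.frame(customer = c("A", "B", "D"),
                      region = c("North", "South", "East"),
                      stringsAsFactors = FALSE)

inner <- inner_join(orders, regions, by = "customer")  # A, B
left  <- left_join(orders, regions, by = "customer")   # A, B, C (region NA for C)
full  <- full_join(orders, regions, by = "customer")   # A, B, C, D
semi  <- semi_join(orders, regions, by = "customer")   # A, B, orders columns only
anti  <- anti_join(orders, regions, by = "customer")   # C, orders columns only
```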
Set Operators
Intersect: finds common elements shared among sets.
intersect(1:10, 6:15) = 6 7 8 9 10
Union: combines sets into one, removing duplicates.
used the same way as intersect()
Setequal
helps us check if two sets are the same regardless of order.
Setdiff
find the elements that are in one set (or vector) but not in another set.
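All four set operators work on plain vectors in base R. A short sketch with the same ranges as above; the variable names are illustrative.

```r
a <- 1:10
b <- 6:15

common   <- intersect(a, b)            # elements in both
combined <- union(a, b)                # all elements, duplicates removed
only_a   <- setdiff(a, b)              # in a but not in b
only_b   <- setdiff(b, a)              # order of arguments matters
same_set <- setequal(1:5, c(5, 4, 3, 2, 1))   # order is ignored
```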