R Programming Cheat Sheet

Data Structures

Vector	ordered array of elements of the same data type a<-c(3,1,5)
Vector Naming	a<-c("desks" = 1, "tables" = 3, "chairs" = 4)
Vector Coercion	as.numeric(c(TRUE, FALSE, TRUE)) = 1 0 1 (logicals are coerced to numbers)
	seq(1,9,2) and rep(c(2,3,4), 3) generate sequences and repeated values
Vector Subsetting	materials <- c(wood = 17, cloth = 36, silver = 24, gold = 3)
	materials[1] returns the element named wood (17); materials["wood"] selects it by name
Matrix	vector of elements arranged in two dimensions
	m1<-matrix(3:8,ncol=3,nrow=2)
	m2<-3:8 and dim(m2)<-c(3,2)
Factor	used to store categorical variables (numeric or character)
	a<-c(0,1,0,0,1)
	a.f<-factor(a,labels = c("Male","Female"))
	a.f = Male Female Male Male Female
gl() function	generate factors by specifying the pattern of their levels
	gl(2,8,labels=c("male","female"))
List	holds elements of multiple types; created with list()
	Mike<-list(Name="Mike",Salary=10000,Age=43,Children=c("Tom","Lily","Alice"))
$	a convenient way to retrieve an element by its name, e.g. Mike$Salary
str()	display the internal structure
c()	combine several lists into one
Array	multi-dimensional arrangement of data in a vector.
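The structures above can be tried out in a few lines of base R. This is a minimal sketch that reuses the sheet's own values where it gives them; the variable names first_material, silver_count, flags, n_children and arr are introduced here only for illustration.

```r
# Vector: every element shares one data type
a <- c(3, 1, 5)

# Named vector, subset by position or by name
materials <- c(wood = 17, cloth = 36, silver = 24, gold = 3)
first_material <- materials[1]         # keeps the name "wood"
silver_count <- materials[["silver"]]  # plain value, name dropped

# Coercion: logicals become 1/0 when converted to numbers
flags <- as.numeric(c(TRUE, FALSE, TRUE))

# Matrix: 2 rows by 3 columns, filled column by column
m1 <- matrix(3:8, nrow = 2, ncol = 3)

# Factor: the codes 0/1 are mapped to the labels in sorted order
a.f <- factor(c(0, 1, 0, 0, 1), labels = c("Male", "Female"))

# List: mixed element types, retrieved by name with $
Mike <- list(Name = "Mike", Salary = 10000, Age = 43,
             Children = c("Tom", "Lily", "Alice"))
n_children <- length(Mike$Children)

# Array: the values 1..24 arranged in 2 x 3 x 4 dimensions
arr <- array(1:24, dim = c(2, 3, 4))
```

Calling str(Mike) afterwards prints the type and a preview of each list element.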

Exploring Data

Missing Data
Causes	human error, system error, loopholes
Dealing	summary() shows how much data is missing (NA counts)
missing categorical data	set a new category called “Unknown”
missing numerical data	assign mean value or assign a value based on its relationship to other related variables
Other Data Problems	data entry, logical errors, outdated, inconsistent
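The two strategies for missing data can be sketched in base R as follows. The data frame survey and its values are hypothetical and exist only to give the code something to work on; mean imputation is shown because the sheet names it, not because it is always the best choice.

```r
# Hypothetical data frame with gaps (illustrative values only)
survey <- data.frame(
  gender = c("Male", NA, "Female", "Female"),
  income = c(3000, 4200, NA, 5400),
  stringsAsFactors = FALSE
)

# Count the missing values in each column
missing_counts <- colSums(is.na(survey))

# Missing categorical data: introduce an "Unknown" category
survey$gender[is.na(survey$gender)] <- "Unknown"

# Missing numerical data: fill with the mean of the observed values
income_mean <- mean(survey$income, na.rm = TRUE)
survey$income[is.na(survey$income)] <- income_mean
```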

Data Visualization

Principles	Simplify, Compare, Attend (Details), Explore (Visual), View diversely, Ask why, Be skeptical, Respond
ggplot2	(+) allows us to make complex and aesthetically pleasing plots quickly and intuitively
	(-) works exclusively with data tables (data frames)
Components
data	the data table that the plot summarizes
geometry	scatter plots, histograms, smooth densities, q-q plots, and box plots
aesthetic mapping	x and y axis
scale	the ranges of the x-axis and y-axis are defined by the range of the data
labels, title, legend
Creating a New Plot
ggplot() function	specify the graph’s data component.
df %>% ggplot()	associates the dataset with the plotting object
geom_point()	adds a scatter plot layer, e.g. with population mapped to x and total mapped to y
aes()	recognizes variables from the data component
geom_label() and geom_text()	functions to add text to the plot.
Size, Color	geom_point(size = 3, color = "blue")
geom_histogram()	create histograms
geom_density()	create smooth densities
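A plot is built by adding layers to a ggplot() object. The sketch below assumes the ggplot2 package is installed and uses the built-in mtcars data as a stand-in for the sheet's example data set; the object names p, h and d are illustrative.

```r
library(ggplot2)

# Data component plus aesthetic mapping, then a point layer and labels
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3, color = "blue") +
  labs(title = "Fuel economy vs. weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon")

# One-variable geometries: histogram and smooth density
h <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
d <- ggplot(mtcars, aes(x = mpg)) + geom_density()
```

Printing p, h or d in an interactive session draws the corresponding plot.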

Programming Structure and Functions

Basic
if-else	use curly braces “{}”
if(boolean condition){ expressions } else{ alternative expressions }
any() (similar to OR "|")	returns TRUE if any of the logicals are true
z <- c(TRUE, TRUE, FALSE); any(z)	TRUE
all() (similar to AND "&")	returns TRUE if all of the logicals are true
Basic Functions
my_function <- function(x){ operations that operate on x, which is supplied by the user of the function; the value of the final line is returned }
For Loops
for (i in range of values){ operations that use i, which is changing across the range of values }
for (i in 1:5){ print(i) }	## [1] 1 ## [1] 2 ## [1] 3 ## [1] 4 ## [1] 5
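The templates above can be filled in as follows. This is a small base R sketch; describe_sign and squares are hypothetical names chosen for the example.

```r
# Function with if-else: the value of the final expression is returned
describe_sign <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}

z <- c(TRUE, TRUE, FALSE)
any_true <- any(z)  # TRUE when at least one element is TRUE
all_true <- all(z)  # TRUE only when every element is TRUE

# For loop: store the squares of 1..5
squares <- numeric(5)
for (i in 1:5) {
  squares[i] <- i^2
}
```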
apply()	apply a function to the margin of a matrix or a dataframe
apply(X, MARGIN, FUN, ...)
z <- cbind(A=1:3,B=4:6,C=7:9,D=10:12)
apply(z,2,sum)
lapply()	works on list or vector inputs instead of matrix/dataframe input.
	returns a list of the same length as the given list or array.
x <- list(A=1:4, B=seq(0.1,1,by=0.1))
lapply(x, mean)
sapply()	wrapper of the lapply() function. It also takes in a list or vector, but it simplifies the result to a vector (or matrix) when possible instead of returning a list
vapply()	works like sapply() except that we specify the return value type of FUN through the FUN.VALUE argument
	can be faster if we know that our output can use an atomic data type that takes up less memory space.
rapply()	applies a specified function to all elements of a list recursively
x <- list(A=2,B=list(-1,3),C=list(-2,list(-5,6)))
rapply(x, function(x){x^2}) #returns a vector
mapply()	takes multiple vectors as inputs and applies the function to their elements in parallel.
tapply()	applies the specified FUN to each group of an array, grouped based on levels of certain factors.
Pivot Table	grouping data by different fields
	summarize the data with your own function for specific purposes
library(dslabs); data(murders); tapply(murders$total, murders$region, sum)
tapply(murders$total/murders$population, murders$region, mean)
split()	split a dataframe into a list of data frames based on a factor array.
tapply()	group data by multiple factors
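The apply family is easiest to compare side by side. The sketch below uses only base R and the sheet's own inputs where it gives them; scores and group are hypothetical data added to show tapply() and split() without an external data set.

```r
z <- cbind(A = 1:3, B = 4:6, C = 7:9, D = 10:12)
col_sums <- apply(z, 2, sum)  # MARGIN = 2 works column by column
row_sums <- apply(z, 1, sum)  # MARGIN = 1 works row by row

x <- list(A = 1:4, B = seq(0.1, 1, by = 0.1))
means_list <- lapply(x, mean)                          # always a list
means_vec  <- sapply(x, mean)                          # simplified to a named vector
means_safe <- vapply(x, mean, FUN.VALUE = numeric(1))  # return type is checked

nested <- list(A = 2, B = list(-1, 3), C = list(-2, list(-5, 6)))
squared <- rapply(nested, function(v) v^2)  # recursive, flattened to a vector

# mapply: walk over several vectors in parallel
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)

# tapply and split: group values by a factor (hypothetical data)
scores <- c(10, 20, 30, 40)
group <- factor(c("a", "b", "a", "b"))
group_totals <- tapply(scores, group, sum)
group_split <- split(scores, group)
```

vapply() stops with an error if any result does not match FUN.VALUE, which makes it the safer choice inside functions.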

Basic Data Wrangling

Data Frame	use the data.frame() function. elements in the same column should be of the same data type.
	name <- c("Anne"); age <- c(28); child <- c(FALSE)
	df <- data.frame(name, age, child)
Data Frame Naming	names(df) <- c("Name", "Age", "Child")
Data Frame Structure	Data Frame in R is implemented as a list of vectors with an important restriction of equal length vectors.
	Before R 4.0.0, data.frame() stored character columns as factors by default
stringsAsFactors = FALSE	prevents R from converting the characters to factors
Data Frame Subsetting	“[]” and “[[]]” and “$”
	df[3,2] # row 3, column 2
c()	used to subset multiple portions of the Data Frame.
Data Frame Extension	adding new variables or observations to an existing Data Frame.
	height <- c(163, 177, 163, 162, 157)
	df$height <- height
Sorting	sort(df$age) #based on age
	max(df$age) #getting the highest age
	which.max(df$age) #index of the oldest person
Data Frame Indexing	find specific cases in DF
	index <- df$height > 171
	sum(index) #number of people taller than the male average
	df$name[index] #person who is taller: pete
finding those older than 30 without children.	index <- df$age > 30 & df$child == FALSE
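The subsetting, sorting and indexing rows above can be run end to end on a small table. The five records below are hypothetical (only the heights come from the sheet) and the result names such as by_age and tall_names are illustrative.

```r
# Hypothetical records, chosen only to illustrate the calls
df <- data.frame(
  name = c("Anne", "Pete", "Frank", "Julia", "Cath"),
  age = c(28, 30, 21, 39, 35),
  child = c(FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Extension: add a new variable as a column
df$height <- c(163, 177, 163, 162, 157)

cell <- df[3, 2]                       # row 3, column 2
oldest <- df$name[which.max(df$age)]   # name at the index of the highest age
by_age <- df[order(df$age), ]          # whole table sorted by age

# Logical indexing: build a TRUE/FALSE vector, then use it to subset
tall <- df$height > 171
tall_names <- df$name[tall]
older_no_child <- df$name[df$age > 30 & df$child == FALSE]
```

sort(df$age) returns only the sorted ages, whereas order() gives the row positions needed to reorder the whole data frame.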
library(dplyr)
mutate() function	add new columns to a data frame or modify existing ones
	df <- mutate(df, bmi = weight/height^2*10000)
	or df$bmi <- df$weight/df$height^2*10000
filter()	subset rows
	filter(df, bmi > 18.5 & bmi < 24.9)
select()	health <- select(df, name, height, weight, bmi)
	filter(health, bmi > 18.5 & bmi < 24.9)
%>%	chain these three functions together.
	df %>% select(name, height, weight, bmi) %>% filter(bmi > 18.5 & bmi < 24.9)
merge 2 df based on col	right_join & left_join
suffix	added to the column names from each data frame to make them unique in the result.
	should be a vector with two elements
	right_join(driver_q2, constructors, by = c("constructor" = "constructor"),suffix = c("_driver", "_constructor"))
inner_join	returns only the rows that have matching values in both data frames based on specified key columns
union	combine the rows of two data frames with the same columns, removing duplicate rows.
anti_join	filtering rows from the first data frame based on values that do not have matching values in the second data frame.
commonly used to stack data frames	rbind & bind_rows
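The dplyr verbs and joins above can be combined as in the sketch below. It assumes the dplyr package is installed; the people and cities tables and their values are hypothetical.

```r
library(dplyr)

# Hypothetical tables (height in cm, weight in kg)
people <- data.frame(
  name = c("Anne", "Pete", "Frank"),
  height = c(163, 177, 170),
  weight = c(52, 80, 95),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  name = c("Anne", "Pete", "Julia"),
  city = c("Oslo", "Lima", "Rome"),
  stringsAsFactors = FALSE
)

# Chain: add a column, keep some columns, then keep some rows
healthy <- people %>%
  mutate(bmi = weight / height^2 * 10000) %>%
  select(name, height, weight, bmi) %>%
  filter(bmi > 18.5 & bmi < 24.9)

left <- left_join(people, cities, by = "name")       # all rows of people
inner <- inner_join(people, cities, by = "name")     # matching rows only
unmatched <- anti_join(people, cities, by = "name")  # rows of people with no match
stacked <- bind_rows(people, people)                 # stack vertically
```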

Advanced Data Wrangling

Importing Data
Via readr	read_csv: comma separated values
	read_tsv: tab-separated values
	read_delim: general text file format
	readr functions return a tibble; head() displays its first rows.
readxl	read_excel(), read_xls(), read_xlsx()
R-base	read.csv() and read.table() can be used without having to install any libraries
Before R 4.0.0, the R-base import functions automatically converted character strings to factors
CSV	widespread use in the data science community due to its efficiency at storing large amounts of data and also as it is platform agnostic. There is also no size limit with csv files.
Via URL	read_csv(url)
tempdir() & tempfile()	it is useful to have a temporary directory or filename auto generated to manage these URL imports
Via JSON	provided via API, library(jsonlite), fromJSON(url)
Via XML	crawling a website, library(XML), xmlParse("books.xml")
xmlRoot()	access the root node of the tree.
xmlChildren()	use the children nodes of the tree
xmlToList(data), xmlToDataFrame(books)	convert the XML file to list or data frame format
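A round trip through a temporary file shows the base R import path without needing any package or network access. The file contents are hypothetical, and the commented-out lines show how the same pattern might be used for a URL import (the URL is a placeholder, not a real data source).

```r
# Write a small CSV to a temporary file so there is something to read
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Anne,28", "Pete,30"), csv_path)

# Base R import; stringsAsFactors = FALSE keeps text columns as character
people <- read.csv(csv_path, stringsAsFactors = FALSE)

# For a remote file, download into a temporary file first, then read it
# url <- "https://example.com/data.csv"  # hypothetical URL
# tmp <- tempfile(fileext = ".csv")
# download.file(url, tmp)
# remote <- read.csv(tmp, stringsAsFactors = FALSE)

# Remove the temporary file once it is no longer needed
unlink(csv_path)
```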
Reshaping Data
Wide to Tidy: gather()	convert the above wide data into tidy data
country,year,fertility	new_tidy_data <- wide_data %>% gather(year, fertility, `1960`:`2015`)
Tidy to Wide: spread()	After the data, the first argument of spread() declares which variable supplies the new column names, and the second argument specifies the variable used to fill the cells.
Separate and Unite
separate()	requires the target column, the names for the new columns and the separator character.
dat %>% separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right")
spread()
dat %>% separate(key, c("year", "variable_name"), extra = "merge") %>% spread(variable_name, value)
unite()	combine several columns into one, e.g. first name & last name
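The reshaping functions can be seen working together in the sketch below. It assumes the tidyr package is installed; all tables and values are hypothetical, and the functions are called without the pipe so that only tidyr is needed.

```r
library(tidyr)

# Hypothetical wide table: one column per year (illustrative values only)
wide_data <- data.frame(
  country = c("A", "B"),
  `1960` = c(2.4, 6.2),
  `1961` = c(2.5, 6.0),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Wide to tidy: the year columns become key/value pairs
tidy_data <- gather(wide_data, year, fertility, `1960`:`1961`)

# Tidy to wide: year supplies the column names, fertility fills the cells
wide_again <- spread(tidy_data, year, fertility)

# separate: split on the first "_" only and merge the rest into the last column
dat <- data.frame(key = c("1960_fertility", "1960_life_expectancy"),
                  value = c(2.4, 69.3), stringsAsFactors = FALSE)
sep <- separate(dat, key, c("year", "variable_name"), sep = "_", extra = "merge")

# unite: paste two columns into one
person <- data.frame(first = "Ada", last = "Lovelace", stringsAsFactors = FALSE)
full <- unite(person, full_name, first, last, sep = " ")
```

Newer tidyr code often uses pivot_longer() and pivot_wider() for the same two conversions; gather() and spread() remain available.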
Combining Data
join()	combined so that matching rows are together
Inner Join	returns only the rows that have matching values in both tables
Left Join	returns all the rows from the left table and the matching rows from the right table
Full Join	all the rows from both tables, with NA values in columns where there is no match in the other table
Semi Join	keeps the part of the first table for which we have information in the second table, but doesn't add the columns of the second.
Anti Join	opposite of the semi_join() function. It keeps the part of the first table for which we have NO information in the second table, but doesn't add the columns of the second.
Set Operators
Intersect: finds common elements shared among sets.	intersect(1:10, 6:15) = 6 7 8 9 10
Union: combines sets into one, removing duplicates.	used the same way as intersect()
Setequal	helps us check if two sets are the same regardless of order.
Setdiff	find the elements that are in one set (or vector) but not in another set.
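All four set operators are available in base R and work on plain vectors. A quick sketch, with result names chosen only for illustration:

```r
common <- intersect(1:10, 6:15)    # elements present in both vectors
combined <- union(1:10, 6:15)      # all elements, duplicates removed
same <- setequal(1:5, 5:1)         # TRUE: same elements, order ignored
only_first <- setdiff(1:10, 6:15)  # in the first vector but not the second
```

setdiff() is not symmetric: swapping the arguments returns the elements found only in the second vector.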

R Programming Cheat Sheet (DRAFT) by skydlins
