
Data Science Cheat Sheet (DRAFT)

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Pandas

import pandas as pd
df.iloc[:5,:]
return a slice of the data: all columns, first 5 rows
type(df)
DataFrame
df.shape
(# of rows, # of columns)
df.columns
names of the columns
df.index
return the index column
df.head(3)
return the first 3 rows
df.iloc[-5:,:]
return the last 5 rows
df.tail()
return the last 5 rows
df.info()
print the index, column dtypes, number of rows, and the non-null count per column
type(df['low'])
Series
type(df['low'].values)
numpy.ndarray
np.log10(df['low'])
return a Series
np.log10(df['low'].values)
return a NumPy array
Each column in pandas is a Series.
You can run NumPy functions on df or on a column of df (see the sketch below).
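
A minimal sketch tying these basics together; the column names 'low'/'high' and the sample values are assumptions for illustration, not from a real dataset:

import numpy as np
import pandas as pd

# hypothetical price data; the column names mirror the examples above
df = pd.DataFrame({'low': [10.1, 9.8, 10.4, 11.0, 10.7],
                   'high': [10.9, 10.5, 11.2, 11.8, 11.3]})

print(df.shape)                 # (5, 2)
print(df.head(3))               # first 3 rows
print(type(df['low']))          # pandas.core.series.Series
print(type(df['low'].values))   # numpy.ndarray
print(np.log10(df['low']))      # a NumPy ufunc on a Series returns a Series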

Statistical Data Analysis

df.describe()
count, mean, std, min, max, and quartiles for each numeric column (non-null rows)
df['low'].count()
return the # of non-null rows
df[cols].count()
return a Series of counts
df['low'].mean()
return the mean, ignoring nulls
df.std()
df.median()
df.quantile(q)
q=0.5: median; q=[0.25, 0.75]: interquartile range
df['low'].min()
alphabetical order for non-numeric columns
df['low'].max()
alphabetical order for non-numeric columns
df.mean(axis='columns')
mean across all columns, for each row
df.low
df['low']
equivalent ways to select the 'low' column
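
A short sketch of these summary statistics; the data (including the missing value) is invented to show how NaNs are handled:

import pandas as pd

# hypothetical numeric data with one missing value
df = pd.DataFrame({'low': [10.1, 9.8, None, 11.0],
                   'high': [10.9, 10.5, 11.2, 11.8]})

print(df.describe())              # count, mean, std, min, quartiles, max per column
print(df['low'].count())          # 3: non-null rows only
print(df['low'].mean())           # NaN is ignored
print(df.quantile([0.25, 0.75]))  # interquartile range for each numeric column
print(df.mean(axis='columns'))    # row-wise mean across the columns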

Time series

index_col='Date', parse_dates=True
read_csv options: use the 'Date' column as the index and parse it as dates
df.loc['2015-2']
return all rows in February 2015
df.loc['2015-2-20']
return all rows with this date
df.loc['2015-2-20':'2015-3']
range (partial-string slicing; both ends inclusive)
newD = pd.to_datetime(date_list)
convert a list of date strings to datetimes (y-m-d h:m:s)
df.reindex(newD)
reindex with matching dates; dates that don't match produce new rows with null values
df.reindex(newD, method='ffill')
forward fill: fill empty values with the value of the previous row
method='bfill'
backward fill: fill empty values with the value of a later row
df.resample('D').mean()
daily mean
'H', 'min', '2W'
hour, minute, 2 weeks
'Y', 'Q', 'M', 'B'
year, quarter, month, business day
df.resample('W').sum().max()
max of the weekly sums
df.resample('4h').ffill()
resample every 4 hours; fill NaN with previous values (forward fill)
df1+df2
element-wise addition, aligned on index and columns (NaN where labels don't match)
df['Temperature']['2010-august']
select the Temperature column for August 2010
df['Temperature']['2010-2']
select the Temperature column for February 2010
unsmooth.rolling(window=24).mean()
24-hour moving average
df['type'].str.upper()
return the column converted to uppercase
df['product'].str.contains('ware')
return a boolean Series: whether the substring 'ware' exists
True+True
2
False + False
0
df['product'].str.contains('ware').sum()
# of rows containing the substring 'ware'
df['date'].dt.hour
return the hour (0-23) of each row
df['date'].dt.tz_localize('US/Central')
set (localize) the timezone
df['date'].dt.tz_convert('US/Eastern')
convert to another timezone
df['date'].resample('A').first()
yearly, from the initial date in the data (1960-12-31)
df['date'].resample('A').first().interpolate('linear')
replace NaN with interpolation
df.columns.str.strip()
remove whitespace from the column names
df.set_index('Date', inplace=True)
set the 'Date' column as the index, in place
newD = pd.to_datetime(Date_list, format='%Y-%m-%d %H:%M')
pd.Series(Columns_list, index=newD)
construct a pandas Series with newD as its index
ts2_interp = ts2.reindex(ts1.index).interpolate(method='linear')
reset the index of ts2 to ts1's index, then use linear interpolation to fill in the NaNs: ts2_interp
timezone.dt.tz_localize('US/Central')
localize the local times to 'US/Central'
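
A hedged sketch of the indexing, resampling, and rolling patterns above; the 'Temperature' column, the random values, and the hourly date range are assumptions (newer pandas prefers the lowercase 'h' alias, but 'H' still works):

import numpy as np
import pandas as pd

# hypothetical hourly temperature readings starting in February 2015
idx = pd.date_range('2015-02-01', periods=24 * 60, freq='H')
df = pd.DataFrame({'Temperature': np.random.randn(len(idx)).cumsum()}, index=idx)

feb = df.loc['2015-2']                     # partial-string indexing: all of February 2015
daily = df.resample('D').mean()            # daily means
weekly_max = df.resample('W').sum().max()  # max of the weekly sums
smoothed = df['Temperature'].rolling(window=24).mean()  # 24-hour moving average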
 

Build DF

df = pd.read_csv("filepath", index_col=0)
use column 0 of the file as the index
index_col='nameofacolumn'
use the named column as the index
df.index = ['A', 'B', ...]
assign an index to df; len(index) == len(df)
pd.DataFrame({'id':[1,2,3], 'gen':'M'})
keys: column names; values: column data (a scalar is broadcast to every row)
pd.DataFrame(dict_of_lists)
zipped = list(zip(list_labels, list_values))
pd.DataFrame(dict(zipped))
list_labels: list of column names; list_values: list of column value lists
pd.read_csv("filepath", header=None)
no header row in the file
pd.read_csv("filepath", options)
options below; col_n = a list of column names
header=0, names=col_n
rename the header
header=None, names=col_n
no header in the file & use col_n as the header
na_values='-1'
convert a specific value (-1) to NaN
na_values={'colname':['-1', '']}
define a dict of NaN values for each column
parse_dates=[[0,1,2]]
combine three date columns (e.g. year, month, day) into one datetime column
parse_dates=True
parse date columns into datetime format
delimiter=' '
values are separated by spaces
header=3
the header row is at line index 3
comment='#'
ignore everything from '#' to the end of a line in the input
index_col='dates'
set a column as the index
df[cols]
select specific columns
df.to_csv('outputpath')
write df to a CSV file
df.to_excel('outputpath')
write df to an Excel file
pd.DataFrame({'smoothed': smoothed, 'unsmoothed': unsmoothed})
create a df from Series; if they have indexes, the columns are aligned on the index
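
A small sketch of building DataFrames without a file; the column names, city values, and output path are invented for illustration:

import pandas as pd

# dict of lists: keys become column names; the scalar 'M' is broadcast to every row
df = pd.DataFrame({'id': [1, 2, 3], 'gen': 'M'})

# the same structure built from two parallel lists
list_labels = ['city', 'population']
list_values = [['Austin', 'Denver'], [950000, 715000]]
zipped = list(zip(list_labels, list_values))
df2 = pd.DataFrame(dict(zipped))

df2.index = ['A', 'B']        # assign an index; len(index) must equal len(df2)
# df2.to_csv('outputpath')    # write the result out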

Categorical

df['type'].describe()
count of non-null rows, # of unique items, top item, frequency of the top item
df['type'].unique()
array of the unique items (use .nunique() for the count)
df.loc[df['type']==x,:]
df[df['type']==x]
equivalent ways to filter the rows where the 'type' column equals x
del df['type']
delete a column
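
A minimal sketch of these categorical operations; the 'type' and 'value' columns and their contents are assumptions:

import pandas as pd

df = pd.DataFrame({'type': ['A', 'B', 'A', 'A', None],
                   'value': [1, 2, 3, 4, 5]})

print(df['type'].describe())    # count=4, unique=2, top='A', freq=3
print(df['type'].unique())      # ['A' 'B' None]
only_a = df[df['type'] == 'A']  # rows where the 'type' column equals 'A'
del df['value']                 # delete a column in place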

NumPy + DataFrame

df.values
Create array of DataFrame values
df[colname]=0
create a new column filled with zeros
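
A short sketch of moving between a DataFrame and NumPy; the column names are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({'low': [1.0, 2.0, 3.0]})
arr = df.values                       # 2-D numpy.ndarray of the DataFrame's values
df['zeros'] = 0                       # new column, broadcast to zero in every row
df['log_low'] = np.log10(df['low'])   # NumPy ufuncs work element-wise on a column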

Cleaning

df_dropped = df.drop(list_to_drop, axis='columns')
remove the columns listed in list_to_drop
df.set_index(colname)
set colname as the index
pd.to_numeric()
converts a Series of values to floating-point values; with the keyword argument errors='coerce', strings like 'M' are forced to NaN
df.reset_index()[colname]
extract the colname column from df after .reset_index()
df.loc[df[colname]=='sth']
choose the rows in df where df[colname] == 'sth'
df.loc[df[colname].str.contains('sth')]
choose the rows in df where the column df[colname] contains 'sth'
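
A hedged sketch of this cleaning flow; the 'station', 'Temperature', and 'notes' columns and the 'M' missing-value marker are assumptions:

import pandas as pd

df = pd.DataFrame({'station': ['KAUS', 'KDEN', 'KAUS'],
                   'Temperature': ['71.2', 'M', '69.5'],   # 'M' marks a missing reading
                   'notes': ['x', 'y', 'z']})

df_dropped = df.drop(['notes'], axis='columns')             # remove unneeded columns
df_dropped['Temperature'] = pd.to_numeric(df_dropped['Temperature'],
                                          errors='coerce')  # 'M' becomes NaN
aus_rows = df_dropped.loc[df_dropped['station'].str.contains('AUS')]  # substring filter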
 

Plot

import matplotlib.pyplot as plt
plt.plot(df['low'].values)
x axis = index of the value
plt.show()
display the figure
plt.plot(df['low'])
x axis is the index of df (e.g. date)
df['low'].plot()
plot the Series directly; also labels the x axis
df.plot()
plot all columns of df, with a legend
plt.yscale('log')
log scale on the vertical axis
df['low'].plot(color='b', style='.-', legend=True)
plt.axis((minx, maxx, miny, maxy))
zoom: set the axis limits
plt.title('title')
plt.ylabel('ylabel')
plt.xlabel('xlabel')
plt.savefig('a.pdf')
plt.savefig('a.jpg')
df.plot(subplots=True)
draw each column in its own subplot
df.plot(x='colname', y='colname', kind='scatter')
plot one column against another
kind='box'
box plot
kind='hist'
histogram
kind='area'
bins=30
integer: # of bins
range=(4,8)
tuple (min, max)
normed=True
boolean: normalize the histogram to one (use density=True in newer matplotlib)
cumulative=True
boolean for a cumulative histogram
alpha=0.3
transparency, useful when overlaying several histograms
s=sizes
sizes = array of the size of each circle in a scatter plot
fig, axes = plt.subplots(nrows=2, ncols=1)
df['low'].plot(ax=axes[0], ...)
...: kind, bins, normed, cumulative
df.plot(y='colname', kind='box')
style='k.-'
color, marker, line type
plt.clf()
clears the entire current figure with all of its axes, but leaves the window open so it can be reused for other plots
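
A small sketch combining several of these plotting calls; the column names and random data are assumptions, and the histogram uses density=True, which replaces normed=True in newer matplotlib:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({'low': np.random.rand(100) * 4 + 4,
                   'high': np.random.rand(100) * 4 + 6})

fig, axes = plt.subplots(nrows=2, ncols=1)
df['low'].plot(ax=axes[0], color='b', style='.-', legend=True)   # line plot on the top axes
df['low'].plot(ax=axes[1], kind='hist', bins=30, range=(4, 8),
               density=True, cumulative=True)                    # cumulative histogram below
plt.savefig('a.pdf')   # example output path
plt.show()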

Indexing

df['colname']['rowname']
rowname is a value of the index column
df.colname['rowname']
df.loc['rowname','colname']
df.loc['rowstart':'rowend',:]
row-label slices are inclusive of both ends
df[['low']]
returns a single-column DataFrame
df['low']
returns a Series with the index of df
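
A final sketch of these indexing rules; the row labels and the 'low'/'high' columns are assumptions:

import pandas as pd

df = pd.DataFrame({'low': [1, 2, 3], 'high': [4, 5, 6]},
                  index=['a', 'b', 'c'])

print(df['low']['b'])       # column first, then row label -> 2
print(df.loc['b', 'low'])   # label-based: row, then column -> 2
print(df.loc['a':'b', :])   # label slices include both endpoints
print(type(df[['low']]))    # DataFrame (double brackets)
print(type(df['low']))      # Series (single brackets)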