Show Menu
Cheatography

Python - Pandas Cheat Sheet (DRAFT) by

Pandas library

This is a draft cheat sheet. It is a work in progress and is not finished yet.

TO START

import numpy as np
import pandas as pd

SERIES (similar to numpy array)

pd.Series(data = list)
create series from list
pd.Series(data=list,
index=­labels)
create series
with index
pd.Ser­ies­(np­_arr)
create series from
numpy array
pd.Series(np_arr,
labels)
create series
with index
pd.Ser­ies­(dict)
create series
from dictionary
pd.Ser­ies­[num]
indexing
ser1 + ser2
sum two Series
Pandas series differs from numpy arrays because series can have axis labels, instead of just a number location. It also doesn't need to hold numeric data, it can hold any arbitrary Python Object, also functions (although unlikely used).

Note: the terms "­dat­a=" and "­ind­ex=­" can be omitted.

DATAFRAMES and INDEXING

df = pd.DataFrame() *
create dataframe
df['col'] *
select col
df.loc­['row']
select row
df.ilo­c['­row']
select a row
by its index
df.col
select a column
(opt.2 - avoid)
df[['c­ol1­','­col2']] *
take two columns
type(d­f['­col'])
column type
df['ne­w_col'] = [1,2,3]
insert column
df.dro­p('­row­',a­xis=0)*
drop row
df.dro­p('­col­',a­xis=1)
drop column
df.drop('col',axis=1,
inplace=True)*
permanent drop
df.loc­['r­ow1­','­col1']
select a row
and a column
df.loc[['r1', 'r2'],['c1','c2']]*
select 2 rows
and 2 columns
df>­con­dition
return boolean
df[df>­cond]
return values
df[df[­'co­l']­>0]
return rows of col that satisfy condition
df[df[­'co­l1'­]>0­]['­col2']
return col2 that satisfy cond.
on col1
df[df[­'c1­']>­0][­['c­2',­'c3']]
return c2 & c3
that satisfy
cond. on col1
df[(cond1) & (cond2)]
return values that satisfy cond1 & cond2
df[(cond1) | (cond2)]
return values that satisfy cond1 | cond2
df.res­et_­index()
add num index
df['ne­w_col'] = 'NY LA'
.split()
add col quickly
df.set­_in­dex­('col')
set a column
as index
df.set_index('col',
inplac­e=True)
make it permanent
!!THERE IS ALSO MULTI-­IND­EXI­NG"
DataFrame function take a data (the values), index (the name of the index column), columns (the name of the column) parame­ters.

Columns are series.

take two columns: note the double brackets [[ ]]

axis=0 can be omitted, is the default value.

inplac­e=True will apply the result to the original dataframe. Without it, you are not changing the dataframe.

r = row.
c = column.
 

MISSING DATA

df.isn­ull()
check for na
df.dro­pna()
drop all rows with
at least 1 na
df.dropna(axis=1)
drop all cols with
at least 1 na
df.dropna(thresh=n)
keep with at least
n value/s
df.fillna(value='value')
replace na
df['col'].fillna(value=
df['col'].mean())
replace using funct

GROUPBY

df.gro­upb­y('­col')
group rows by a col
grouped_df.count()
use cnt function
grouped_df.mean()
use mean function
groupe­d_d­f.std()
use std function
groupe­d_d­f.min()
use min function
groupe­d_d­f.max()
use max function
grouped_df.descr­ibe()
df descri­ptives
grouped_df('col')
.count().loc['row]
apply function and
take a row
... .trans­pose()
rotate results
... .trans­pos­e()­['row']
rotate and
take a row

MERGING, JOINING, CONCAT­ENATING

pd.con­cat­([d­f1,­df2­,df3])
concat­enate dfs
pd.con­cat­([...]­,ax­is=1)
concat­enate by col
pd.mer­ge()*
merge two dfs
df1.jo­in(df2)
join two dfs
pd.merge() takes "df1", "df2", "how=", "on=" parame­ters. "­how­=" can be "­inn­er"/­"­out­er"/­"­lef­t"/"r­igh­t", "­on=­" has to be a column/s key.

join() is similar to merge but works on indexes that can be different. It also cn take the "­how­=" argument.
 

METHODS and FUNCTIONS

df['c'].unique()
return unique values
df['c'].nunique()
count unique val
df['c'].value_counts()
count how many
of same values
df['c1'].apply­(func)
apply func to df
df['c1­'].a­pp­ly(len)
apply len
df['c1­'].a­pp­ly(sum)
apply sum
... .apply(lambda x:x+2)
apply lambda
df.index
return idx names
df.info()
return df info
df.des­cribe()
return df stats
df.columns
return col names
del df['col']*
delete col from df
df.sor­t_v­alu­es(­by=­'col')
sort df values
df.piv­ot_­tab­le()*
create a pivot tbl
del differs from ".dr­op(­)" because it will perman­ently remove a column from the df.

pivot_­table() takes "values=", "index=", "columns=" parame­ters. It reads: "­Create a table from df, with values of colx, index of colx2, and divided by values in colx3"

INPUT and OUPUT code to start

# to import HTML tables
conda install lxml
conda install html5lib
conda install BeautifulSoup4

# to use SQL
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:') 
df.to_sql('data', engine)
sql_df = pd.read_sql('data',con=engine)

INPUT and OUPUT operations

pwd
ask nb route
df = pd.rea­d_c­sv(­'ex­ample')
read csv
df = pd.read_excel
('name­',s­hee­t_n­ame­='n­ame')
read excel
df = pd.rea­d_h­tml­('a­ddr­ess')
read html
df.to_csv('str',index=False)
save as csv
df.to_excel('name',
sheet_name='name',
index = False)
save as xlsx