Show Menu
Cheatography

Pandas Cheat Sheet (DRAFT) by

my functions learned along the way

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Lists [ ]

A list is an ordered an mutable (you can change it) Python container
creating a list: [ ]
numbers = [1,2,3,4]

cities = ["Br­uge­s", "­Rom­e"]

or mix of different types as wel as duplicated elements
list() constr­uctor:
of a string:
list("K­ari­m")
--> ["K",­"­a","r­"­,"i",­"­m"]
of tuple:
list((­"­Bru­ges­", "­Rom­e"))
--> ["Br­uge­s", "­Rom­e"]
of a dictionary
list({­"­hyd­rog­en":­1,"h­eli­um":2})
--> [hydro­gen­"­,"he­liu­m"]
of a set
list({­"­Bru­ges­", "­Rom­e"})
--> ["Br­uge­s", "­Rom­e"]
of a numpy array
list(n­p.a­rra­y([­1,2­,3]))
--> [1,2,3]
accessing:
starting idex = 0; last element = -1
cities[0]
--> ["Br­uge­s"]
cities[-2]
--> ["Br­uge­s"]
accessing multiple elements
[start­:st­op:­step]
 start index is inclusive
 end index is exclusive
 ­default value for step is 1; other values can be omitted = include all
modifying items
replace second item:
cities[1] ="Ge­nt"

replace first two items:
cities[:2] = ["Pa­ris­"­,"Lo­ndo­n"]
Removing elements (del, pop, remove)
del[ ] keyword --> delete first element:
del cities[0]
list.p­op(x) methode: removes the item at the given index, and returns it -->
remove­d_c­ities = cities.pop(1)
list.r­emo­ve(x) methode: deletes the first matching element from a the list, and returns None -->
cities.re­mov­e("B­rug­es")
Inserting elements
list.i­nse­rt(i,x) --> insert an element x (numbers, booleans, lists) at index i and returns none
list.a­ppe­nd(x) --> adds an item to the end of the list - equivalent to list.i­nse­rt(­len­(li­st),x)
Sorting
function: sorted­(it­era­ble[, key][, reverse]) --> returns a sorted list => add to variable
methode: list.s­ort­(key=…, reverse=) --> sorts the list in-place
arguments:
  - reverse : default = False = ascending
  - key: sort a list based on the value returned by the function (def or lambda) provided in the key parameter
Reversing
function: revers­ed(seq) --> to get a list use the list() constr­uctor ex.: produc­ts_­rev­ersed = list(r­eve­rse­d(p­rod­ucts))
methode: list.r­eve­rse() --> reverses the list in-place returning None
Concat­enate list
+ operator
list.e­xte­nd(­ite­rable) --> extends the list by appending all the items from the iterable
Check if an element exists in a list
  in → Evaluates to True if the object on the left side is included in the object on the right side.
  not in → Evaluates to True if the object on the left side is not included in the object on the right side.

basics

import the package
import pandas as pd
check version
pd.__v­ers­ion__
show all rows of dataframe (None = diplay all rows or fill in a number instead)
pd.set­_op­tio­n('­dis­pla­y.m­ax_­rows', None)
Copy data from clipboard
df = pd.rea­d_c­lip­board()
import data from csv-file ( .. = up one level)
df = pd.rea­d_c­sv(­"­pat­h/f­ile.cs­v")
Copy a data frame
df_copy = df.copy()
head() - show first 5 rows (default) or X rows
df.head()
or
df.hea­d(10)
tail() - show last 5 rows (default) or X rows
df.tail()
or
df.tai­l(10)
info() - This method prints inform­ation about a DataFrame including the index dtype and column dtypes, non-null values and memory usage
pd.info()
describe() - Descri­ptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distri­bution, excluding NaN values.
pd.des­cribe()
chain
.round(2)
to clean up the table
column names
df.columns
size of the dataframe
df.shape
Quantile() (like describe() but you can define your own values). Default axis = 0 => row-wise
df.qua­nti­le(­[0.1­,0.4,0.7, 0.8, 0.9])
Mean, Standard Deviation, Variance, Count, Median, Min, and Max on column level
df["­column name"].m­ean()
or other function, native or self-made
renaming columns
df.ren­ame­(co­lum­ns=­{'o­ldN­ame1': 'newNa­me1', 'oldNa­me2': 'newNa­me2'}, inplac­e=True)
Using the argument, inplace = True => save dataframe into itself. If we don’t state inplace = True you need to add result to a new or same dataframe with the "­=" operator
reorder columns - pass a list as a list and index
order we want:
cols = ['col_­nam­e_4', 'col_n­ame­_2'­,'c­ol_­nam­e_3', 'col_n­ame_1']

overwrite the old dataframe with the same dataframe but new column order:
df= df[cols]
adding new columns
df["­new­_co­lum­n_n­ame­"] = ...

... = [list] or a function applied to an other column or ...
Count unique rows
len(df­['c­olu­mn_­nam­e'].un­ique()
or `df['c­olu­mn_­nam­e'].nu­nique
Get count of (unique) values for a particular column
df.column­_name.value­_co­unts()
transform dataframe to list
chain with
.tolist()
-->
df.col­umn­s.t­olist()

making a dataframe

format: df = pd.Dat­aFrame(data,index values­,column names)
Creating df from list:
lst = ['This', 'is', 'a', 'nice', 'cheat', 'sheet']

df1 = pd.Dat­aFr­ame­(lst)
Creating df form dict:
dict= {"First Column Name":  ["First value", "­Second value",...], "­Second Column Name": ["First value", "­Second value",...], ... } 
df2 = pd.Dat­aFrame (dict)
another example:
df3 = pd.Dat­aFr­ame­(np.ra­ndo­m.r­andn(6, 4), index = list('­abc­def'), columns = list('­ABCD'))

Index

get index values (strings) - rows ("0", "­1", "­2" , ...)
df.index

>>> RangeI­nde­x(s­tart=0, stop=3­2561, step=1)
get column index values
df.columns
naming index (rows)
df.ind­ex.name = "name_o­f_c­hoice"
reset index
df_new = df.res­et_­index()
(the df has already been sliced otherwise the old and new index will be the same)
Resetting the index will make it a column and recreate another default index.
Parame­ters:
  drop = True (default = False) paramater won’t create that as column in the dataframe.
  inplace = True (default = False)
crosstabs has also index values
cross = pd.cro­sst­ab(­df_­new.co­l_n­ame­_1,­df_­new.co­l_n­ame_2)

cross.i­ndex

>>>
Index(­['v­alu­e_1­_of­_co­l_1', 'value­_2_­of_­col_1', ...], dtype=­'ob­ject', name='­col­_na­me_1')

individual items can be accessed like:
cross.l­oc­["va­lue­_?_­of_­col­_1"]
using the old index (index befor reseting) to access initial dataframe
df["­col­_na­me"]­[ne­w_d­f.i­nde­x_old]
--> index_old (see before name_o­f_c­hoice where we give our index a name)
Filtering a comple­mentary set from the data
df_new =df[df.ind­ex.i­si­n(d­f_s­ub.i­ndex)]
--> tilde sign :negate data (True becomes False ...)
 

df3

Selecting

slicing = getting and setting of subsets of the data set (3 ways)
.loc is primarily label based
.iloc is primarily integer position based
.loc, .iloc, and also [ ] indexing can accept a callable as indexer
df.loc­[ro­w_i­nde­xer­,co­lum­n_i­ndexer]
--> : is the null slice
selecting column(s):
df['co­lname'] or through a list of columns df[['c­oln­ame1', 'colna­me2']]

or directly as an attribute
df.colname
swapping columns
df[['B', 'A']] = df[['A', 'B']]
swapping column values on a subset (you have to swap the raw data !)
df.loc[:, ['B', 'A']] = df[['A', 'B']].t­o_­numpy()
create new column A with value 0 --> length of df
df['A'] = list(r­ang­e(l­en(­df.i­nd­ex)))
slicing using the [ ] operator --> [ ] slices the rows
[start­:en­d:step] -->­[2:5] --> starts at row 3 (row 2 not included); stops at row 5 (inclu­ded); default step = 1
If step is negatief = start from the last element
.loc - Selection by label (labels can NOT be integer values)
df.loc­['i­nde­x_l­abe­l_x­':'­ind­ex_­lab­el_y'']
--> index_­labels are row labels.
When slicing with .loc both the start bound AND the stop bound are included
Select all rows starting from row d, select all columns A to C
df3.lo­c['d':, 'A':'C']
--> red square
getting values with a boolean array
df3.loc[:, df3.lo­c['a'] > 0]
--> all rows and columns where row a >0 --> green square
select numeric columns (column names)
df_numeric = df.sel­ect­_dt­ypes(include = [np.nu­mber])

numeri­c_cols = df_num­eri­c.c­olu­mns.values
select non numeric columns (column names)
df_non­_nu­meric = df.sel­ect­_dt­ypes(exclude=[np.n­umber])

non_nu­mer­ic_cols = df_non­_nu­mer­ic.c­ol­umn­s.v­alues

setup enviro­nment

import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib
plt.style.use("ggplot")
from matplotlib.pyplot import figure

%matplotlib inline
matplotlib.rcParams["figure.figsize"] = (12,8)

pd.options.mode.chained_assignment  = None

dropping and filling

drop columns
1) focus on columns to keep (add columns to a new dateframe)
df_new = df[['c­ol_­1_t­o_k­eep', 'col_2­_to­_keep', ... ]]
2) focus on columns to drop
df_new = df.dro­p([­'co­l_1­_to­_dr­op'­,'c­ol_­2_t­o_d­rop­','­col­_3_­to_­drop', ...], axis=1)
 
 
fill NaN with some value x
df.fil­lna(x)

datetime

Import statement
from datetime import datetime
--> python's default library for handling date and time
Creating datetime object
dateti­me(­yea­r=2020, month=4, day=11)

>>> dateti­me.d­at­eti­me(­2020, 4, 11, 0, 0)
arguments: year;m­ont­h;d­ay;­hou­r;m­inu­te;­sec­ond­;mi­lli­second
Now()
curren­t_time = dateti­me.n­ow()
Conver­ting: string to datetime object
dateti­me.s­tr­pti­me(­"­11-­04-­2020, 20:58:­15", "­%d-­%m-%Y, %H:%M:­%S")

>>>­dat­eti­me.d­at­eti­me(­2020, 4, 11, 20, 58, 15)
Conver­ting: datetime to string object
dateti­me.s­tr­fti­me(­dat­eti­me(­yea­r=2020, month=4, day=11), "­%d/­%m/­%Y")

>>> '11/04­/2020'
Data range in Pandas
pd.dat­e_r­ang­e(s­tar­t=d­ate­tim­e(y­ear­=2020, month=4, day=11­),p­eri­ods­=3,­fre­q='D')

>>>­Dat­eti­meI­nde­x([­'20­20-­04-11', '2020-­04-12', '2020-­04-­13'], dtype=­'da­tet­ime­64[­ns]', freq='D')
start argument can also be like: '2020-­04-11' or '2020/­04/11' or '2020, may 11'

apply function

under constr­uction

missing data

heatmap
cols = df.col­umn­s[:30]
--> select first 30 columns (names)
colours = ["bl­ue",­"­yel­low­"]
--> missing data will be displayed as yellow lines in the heatmap
sns.he­atm­ap(­df[­col­s].i­sn­ull­(),­cma­p=s­ns.c­ol­or_­pal­ett­e(c­olo­urs))
--> isnull() returns a df with true's and false's. If value is NA then isnull() returns true = 1
data percentage list
missing = {}

for col in df.col­umns:

 
pct_mi­ssing = np.mea­n(d­f[c­ol].is­null())

 
missin­g[col] = round(­pct­_mi­ssi­ng*100)

missin­g_s­orted = {key: value for key, value in sorted­(mi­ssi­ng.i­te­ms(), key=lambda item: item[1­],r­eve­rse­=True)}