Show Menu
Cheatography

pandas groupby Cheat Sheet (DRAFT) by

dfasdfasdfasdfadfasdf

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Pandas objects

Series
a one-di­men­sional labeled array that can hold any data type
DataFrame
a two-di­men­sional labeled data structure with columns of potent­ially different types
Index
a sequence of axis labels that can be used to identify rows or columns in a DataFrame
MultiIndex
a hierar­chical index object that allows for more than one index level in a DataFrame.
Timestamp
a specific moment in time, repres­ented in Pandas as a datetime object.
Period
a specific interval of time, repres­ented in Pandas as a period object.
Dateti­meIndex
an index of datetime objects, used to index Pandas objects like Series and DataFrame.
Timedelta
a duration of time, repres­ented in Pandas as a timedelta object.
Catego­rical
a data type used to represent catego­rical variables,
Sparse
a data structure used to represent sparse data effici­ently,
Interval
a data type used to represent intervals
DatetimeTZ
a datetime object with a timezone.

Pandas functions called on pd

pd.rea­d_csv()
used to read data from a CSV file and create a DataFrame.
pd.Dat­aFr­ame()
a constr­uctor function that is used to create a new DataFrame from data in memory.
pd.Ser­ies():
used to create a new Series object.
pd.con­cat()
used to concat­enate two or more DataFrames or Series
pd.mer­ge():
used to merge two DataFrames based on a common column.
pd.gro­upby()
used to group data in a DataFrame based on one or more columns.
pd.piv­ot_­tab­le():
used to create a pivot table from a DataFrame.
pd.to_­dat­eti­me():
used to convert a column of strings to datetime objects.
pd.to_­num­eric()
used to convert a column of strings to numeric objects.
pd.cut()
This function is used to bin data into discrete intervals.
pd.qcut():
This function is used to bin data into quantiles.
pd.dat­e_r­ange()
used to create a range of dates or Dateti­meI­ndex.
pd.tim­ede­lta()
used to create a timedelta object repres­enting a duration of time.

String vector­ization

str.re­pla­ce():
used to replace a pattern in a string with another pattern
str.ex­tract()
used to extract a pattern from a string and return the first match.
str.sp­lit()
used to split a string into a list of strings based on a delimiter.
str.join()
used to concat­enate a list of strings with a separator string
str.st­art­swith()
used to check if a string starts with a particular substring
str.en­dsw­ith()
used to check if a string ends with a particular substring
str.lo­wer(), str.up­per():
used to convert the case of a string to lowercase or uppercase, respec­tively
str.st­rip():
used to remove leading and trailing whitespace from a string
str.co­nta­ins()
used to check if a pattern exists in a string

Date Vector­ization

pd.to_­dat­eti­me():
used to convert a column of strings or Unix timestamps to a datetime object.
dt.date
used to extract the date component of a datetime object or Dateti­meI­ndex.
dt.time
extract the time component of a datetime object or Dateti­meI­ndex.
dt.hour, dt.minute, dt.second
used to extract the hour, minute, and second components of a datetime object or Dateti­meI­ndex.
dt.day­ofweek:
used to extract the day of the week as an integer,
dt.day­_name()
used to extract the name of the day of the week
dt.mon­th_­name()
used to extract the name of the month
dt.tz_­loc­alize() and dt.tz_­con­vert():
used to set or convert the timezone of a datetime object or Dateti­meI­ndex.
dt.is_­mon­th_­start and dt.is_­mon­th_end
used to check if a datetime object or Dateti­meIndex is at the start or end of a month, respec­tively.

pandas groupby

Consider it as DataFrame; you can use all DataFrame attributes and functions on the group object.

Pandas pivot_­table

data
The DataFrame or Series to be used for the pivot table.
values
The column to aggregate. If not specified, all numerical columns will be used.
index
The column(s) to use as row labels.
columns
The column(s) to use as column labels.
aggfunc
The function to use for aggreg­ating the values. Defaults to 'mean'.
margins
Add all row/co­lumns (e.g. 'All') label and compute grand total for that row/co­lumn.
margin­s_name
The label for the row/column that contains the grand total. Defaults to 'All'.
dropna
Whether or not to exclude rows with missing values. Defaults to True.
sort
Whether or not to sort the rows by the values of the index.

Pandas multiIndex

# Create a MultiIndex from two separate indexes
index1 = pd.Index(['A', 'B', 'C'], name='Index1')
index2 = pd.Index(['X', 'Y', 'Z'], name='Index2')
multi_index = pd.MultiIndex.from_product([index1, index2], names=['Index1', 'Index2'])

# Create a MultiIndex from a list of tuples
tuples = [('A', 'X'), ('A', 'Y'), ('B', 'X'), ('B', 'Y')]
multi_index = pd.MultiIndex.from_tuples(tuples, names=['Index1', 'Index2'])

# Select data from a single level of the index
df.loc['A']  # selects all rows where Index1 = 'A'
df.loc[('A', 'X')]  # selects a single row where Index1 = 'A' and Index2 = 'X'

# Select data from multiple levels of the index
# selects rows where Index1 is 'A' or 'B' and Index2 is 'X', and returns the 'Column1' values
df.loc[(['A', 'B'], 'X'), 'Column1'] 

# Stack a DataFrame to move a level of the MultiIndex to the columns
df.stack()

# Unstack a DataFrame to move a level of the columns to the index
df.unstack()