1. Show installed versions
pd.__version__
Show the pandas version
pd.show_versions()
Show the versions of pandas and its dependencies (including Python)
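A minimal runnable sketch of both calls:

```python
import pandas as pd

print(pd.__version__)   # just the pandas version string, e.g. '1.5.3'
pd.show_versions()      # full report: Python, OS, and dependency versions
```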
2. Create an example DataFrame
df = pd.DataFrame({'col one':[100, 200], 'col two':[300, 400]})
Pass a dictionary to the DataFrame constructor: the keys become the column names and the values become the column data
pd.DataFrame(np.random.rand(4, 8))
Use NumPy's rand function to create a larger DataFrame of random values (requires import numpy as np)
pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))
Pass a list of characters if you want non-numeric column names
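Putting the section together with the imports it relies on:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col one': [100, 200], 'col two': [300, 400]})

# 4 rows x 8 columns of random floats, with lettered column names
rand_df = pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))
```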
3. Rename columns
df = df.rename({'col one':'col_one', 'col two':'col_two'}, axis='columns')
Pass a dictionary in which the keys are the old names and the values are the new names, and specify axis='columns'
df.columns = ['col_one', 'col_two']
If you're going to rename all of the columns at once, a simpler method is to overwrite the columns attribute of the DataFrame
df.columns = df.columns.str.replace(' ', '_')
To replace spaces with underscores in every column name, use the str.replace() method
df.add_prefix('X_')
Add a prefix to every column name
df.add_suffix('_Y')
Add a suffix to every column name
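A small end-to-end sketch of these options, reusing the example df from trick 2:

```python
df = pd.DataFrame({'col one': [100, 200], 'col two': [300, 400]})

df.columns = df.columns.str.replace(' ', '_')  # 'col one' -> 'col_one', etc.
prefixed = df.add_prefix('X_')                 # columns: X_col_one, X_col_two
suffixed = df.add_suffix('_Y')                 # columns: col_one_Y, col_two_Y
```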
4. Reverse row order
drinks.loc[::-1]
The most straightforward method is to use the loc accessor with a reversed slice
drinks.loc[::-1].reset_index(drop=True)
To reset the index, chain reset_index(); drop=True discards the old index entirely
5. Reverse column order
drinks.loc[:, ::-1]
Use loc to reverse the left-to-right order of the columns
6. Select columns by data type
drinks.select_dtypes(include='number')
To select only the numeric columns
drinks.select_dtypes(include=['number', 'object', 'category', 'datetime'])
To include multiple data types, pass a list
drinks.select_dtypes(exclude='number')
To exclude certain data types
7. Convert strings to numbers
df.astype({'col_one':'float', 'col_two':'float'}).dtypes
To convert the data types to numeric, you can use the astype() method
pd.to_numeric(df.col_three, errors='coerce').fillna(0)
Use the to_numeric() function on the third column and tell it to convert any invalid input into NaN values. If you know that the NaN values actually represent zeros, you can fill them with zeros using the fillna() method
df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
To apply this to the entire DataFrame at once, use the apply() method
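A self-contained sketch with made-up sample data; the '-' is an example of invalid input that errors='coerce' turns into NaN:

```python
df = pd.DataFrame({'col_one': ['1.1', '2.2', '3.3'],
                   'col_two': ['4.4', '5.5', '6.6'],
                   'col_three': ['7.7', '8.8', '-']})

df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
print(df.dtypes)  # all float64; the '-' became NaN and was then filled with 0.0
```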
8. Reduce DataFrame size
drinks.info(memory_usage='deep')
Review the memory usage of a DataFrame
cols = ['beer_servings', 'continent']
small_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols)
The first step is to read in only the columns that you actually need, which we specify with the usecols parameter
dtypes = {'continent':'category'}
smaller_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols, dtype=dtypes)
The second step is to convert any object columns containing categorical data to the category data type, which we specify with the dtype parameter
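To see the savings, compare the memory report before and after (uses the same dataset URL as above):

```python
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.info(memory_usage='deep')   # baseline memory usage

cols = ['beer_servings', 'continent']
dtypes = {'continent': 'category'}
smaller_drinks = pd.read_csv('http://bit.ly/drinksbycountry',
                             usecols=cols, dtype=dtypes)
smaller_drinks.info(memory_usage='deep')   # substantially smaller
```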
9. Build a DataFrame from multiple files (row-wise)
from glob import glob
Use the glob module
stock_files = sorted(glob('data/stocks*.csv'))
In this case, glob is looking in the "data" subdirectory for all CSV files that start with the word "stocks": ['data/stocks1.csv', 'data/stocks2.csv', 'data/stocks3.csv']
pd.concat((pd.read_csv(file) for file in stock_files))
Use a generator expression to read each of the files with read_csv() and pass the results to the concat() function, which concatenates the rows into a single DataFrame
pd.concat((pd.read_csv(file) for file in stock_files), ignore_index=True)
There are now duplicate values in the index. To avoid that, tell the concat() function to ignore the index and use the default integer index instead
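The whole pattern in one runnable sketch (assumes the data/stocks*.csv files exist, as above):

```python
from glob import glob

import pandas as pd

stock_files = sorted(glob('data/stocks*.csv'))
stocks = pd.concat((pd.read_csv(f) for f in stock_files), ignore_index=True)
```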
10. Build a DataFrame from multiple files (column-wise)
drink_files = sorted(glob('data/drinks*.csv'))
pd.concat((pd.read_csv(file) for file in drink_files), axis='columns')
Tell the concat() function to concatenate along the columns axis
11. Create a DataFrame from the clipboard
df = pd.read_clipboard()
Just select the data and copy it to the clipboard. Then use the read_clipboard() function to read it into a DataFrame
22. Create a pivot table
titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean')
If you often create reshaped groupby outputs (see trick 21 below), you might find it more convenient to use the pivot_table() method instead. With a pivot table, you directly specify the index, the columns, the values, and the aggregation function.
titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean', margins=True)
An added benefit of a pivot table is that you can easily add row and column totals by setting margins=True
titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='count', margins=True)
Create a cross-tabulation just by changing the aggregation function from 'mean' to 'count'
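The same count table can also be produced with the crosstab() function; a sketch assuming titanic has Sex and Pclass columns:

```python
# rows = Sex, columns = Pclass, cells = passenger counts, plus totals
pd.crosstab(titanic.Sex, titanic.Pclass, margins=True)
```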
23. Convert continuous data into categorical data
pd.cut(titanic.Age, bins=[0, 18, 25, 99], labels=['child', 'young adult', 'adult']).head(10)
Label the age ranges, such as "child", "young adult", and "adult". The best way to do this is with the cut() function, which assigns each value to a bin with a label: ages 0 to 18 get the label "child", 18 to 25 get "young adult", and 25 to 99 get "adult". Bin edges are right-inclusive by default, so an age of exactly 18 falls in "child".
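A self-contained sketch with made-up ages, showing the right-inclusive bin edges:

```python
ages = pd.Series([2, 15, 18, 22, 40, 67])
pd.cut(ages, bins=[0, 18, 25, 99], labels=['child', 'young adult', 'adult'])
# -> child, child, child, young adult, adult, adult
```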
24. Change display options
pd.set_option('display.float_format', '{:.2f}'.format)
To standardise the display of floats to 2 decimal places
pd.reset_option('display.float_format')
Reset any option back to its default
25. Style a DataFrame
format_dict = {'Date':'{:%m/%d/%y}', 'Close':'${:.2f}', 'Volume':'{:,}'}
Create a dictionary of format strings that specifies how each column should be formatted
stocks.style.format(format_dict)
Pass it to the DataFrame's style.format() method
(stocks.style.format(format_dict).hide_index().highlight_min('Close', color='red').highlight_max('Close', color='lightgreen'))
We've now hidden the index, highlighted the minimum Close value in red, and highlighted the maximum Close value in lightgreen
(stocks.style.format(format_dict).hide_index().background_gradient(subset='Volume', cmap='Blues'))
Apply a background gradient to the Volume column, so that higher volumes are shaded darker blue
(stocks.style.format(format_dict).hide_index().bar('Volume', color='lightblue', align='zero').set_caption('Stock Prices from October 2016'))
There's now a bar chart within the Volume column and a caption above the DataFrame
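One caveat: Styler.hide_index() was deprecated in pandas 1.4 and removed in 2.0; on newer versions, use hide() instead:

```python
(stocks.style.format(format_dict)
       .hide(axis='index')   # replaces the deprecated hide_index()
       .background_gradient(subset='Volume', cmap='Blues'))
```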
21. Reshape a MultiIndexed Series
titanic.groupby('Sex').Survived.mean()
If you wanted to calculate the survival rate by a single category such as "Sex", you would use a groupby()
titanic.groupby(['Sex', 'Pclass']).Survived.mean()
If you wanted to calculate the survival rate across two different categories at once, you would groupby() both of those categories
titanic.groupby(['Sex', 'Pclass']).Survived.mean().unstack()
It can be hard to read and interact with data in this format, so it's often more convenient to reshape a MultiIndexed Series into a DataFrame by using the unstack() method
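A tiny self-contained example of the reshape, with made-up data:

```python
df = pd.DataFrame({'Sex': ['f', 'f', 'm', 'm'],
                   'Pclass': [1, 2, 1, 2],
                   'Survived': [1, 1, 0, 1]})

ser = df.groupby(['Sex', 'Pclass']).Survived.mean()  # MultiIndexed Series
table = ser.unstack()  # DataFrame: Sex as rows, Pclass as columns
```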
20. Select a slice of rows and columns
titanic.describe()
If you wanted a numerical summary of the dataset, you would use the describe() method
titanic.describe().loc['min':'max']
If you wanted to filter it to only show the "five-number summary", you can use the loc accessor and pass it a slice of the "min" through the "max" row labels
titanic.describe().loc['min':'max', 'Pclass':'Parch']
And if you're not interested in all of the columns, you can also pass it a slice of column labels
18. Aggregate by multiple functions
orders[orders.order_id == 1].item_price.sum()
Each order has an order_id and consists of one or more rows. To figure out the total price of an order, you sum the item_price for that order_id. For example, here's the total price of order number 1
orders.groupby('order_id').item_price.sum()
If you wanted to calculate the total price of every order, you would groupby() order_id and then take the sum of item_price for each group
orders.groupby('order_id').item_price.agg(['sum', 'count'])
However, you're not actually limited to aggregating by a single function such as sum(). To aggregate by multiple functions, you use the agg() method and pass it a list of functions such as sum() and count()
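A self-contained sketch with made-up order data:

```python
orders = pd.DataFrame({'order_id': [1, 1, 2],
                       'item_price': [2.39, 3.39, 16.98]})

orders.groupby('order_id').item_price.agg(['sum', 'count'])
#            sum  count
# order_id
# 1         5.78      2
# 2        16.98      1
```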
19. Combine the output of an aggregation with a DataFrame
orders.groupby('order_id').item_price.sum()
What if we wanted to create a new column listing the total price of each order? Recall that we calculated the total price using the sum() method
len(orders.groupby('order_id').item_price.sum())
In other words, the output of the sum() function...
len(orders.item_price)
...is smaller than the input to the function
total_price = orders.groupby('order_id').item_price.transform('sum')
len(total_price)
The solution is to use the transform() method, which performs the same calculation but returns output data that is the same shape as the input data
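With transform(), adding the new column is a single assignment; the percent_of_total column here is a hypothetical follow-up that shows why the matching shape matters:

```python
orders['total_price'] = orders.groupby('order_id').item_price.transform('sum')
# total_price has the same length as orders, so row-wise arithmetic lines up
orders['percent_of_total'] = orders.item_price / orders.total_price
```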
16. Split a string into multiple columns
df.name.str.split(' ', expand=True)
What if we wanted to split the "name" column into three separate columns, for first, middle, and last name? We would use the str.split() method and tell it to split on a space character and expand the results into a DataFrame
df[['first', 'middle', 'last']] = df.name.str.split(' ', expand=True)
These three columns can actually be saved to the original DataFrame in a single assignment statement
df.location.str.split(', ', expand=True)
What if we wanted to split a string, but only keep one of the resulting columns? For example, let's split the location column on "comma space"
df['city'] = df.location.str.split(', ', expand=True)[0]
If we only cared about saving the city name, we can select column 0 and save it to the DataFrame
13. Filter a DataFrame by multiple categories
movies[(movies.genre == 'Action') | (movies.genre == 'Drama') | (movies.genre == 'Western')]
If we wanted to filter the DataFrame to only show movies with the genre Action or Drama or Western, we could use multiple conditions separated by the "or" operator
movies[movies.genre.isin(['Action', 'Drama', 'Western'])]
However, you can actually rewrite this code more clearly by using the isin() method and passing it a list of genres
movies[~movies.genre.isin(['Action', 'Drama', 'Western'])]
And if you want to reverse this filter, so that you are excluding (rather than including) those three genres, you can put a tilde in front of the condition
14. Filter a DataFrame by largest categories
counts.nlargest(3)
The Series method nlargest() makes it easy to select the 3 largest values in this Series
counts.nlargest(3).index
And all we actually need from this Series is the index
movies[movies.genre.isin(counts.nlargest(3).index)]
Finally, we can pass the index object to isin(), and it will be treated like a list of genres
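Putting it together; counts is assumed to be the value counts of the genre column:

```python
counts = movies.genre.value_counts()   # genre frequencies, largest first
top_genres = counts.nlargest(3).index  # index of the 3 most common genres
movies[movies.genre.isin(top_genres)]
```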
15. Handle missing values
ufo.isna().sum()
To find out how many values are missing in each column, use the isna() method and then take the sum()
ufo.isna().mean()
Similarly, find the percentage of values that are missing by taking the mean() of isna()
ufo.dropna(axis='columns')
To drop the columns that have any missing values, use the dropna() method
ufo.dropna(thresh=len(ufo)*0.9, axis='columns')
Or to drop columns in which more than 10% of the values are missing, set a threshold for dropna(): thresh is the minimum number of non-missing values a column needs in order to be kept
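A self-contained sketch with made-up data, showing both forms of dropna():

```python
ufo = pd.DataFrame({'City': ['Ithaca', 'Willingboro', 'Holyoke'],
                    'Shape': [None, None, 'oval'],
                    'State': ['NY', 'NJ', 'CO']})

ufo.dropna(axis='columns')                        # drops Shape (has missing values)
ufo.dropna(thresh=len(ufo) * 0.9, axis='columns') # keeps columns >= 90% populated
```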