
Python Pandas top 25 Cheat Sheet (DRAFT)

Based on https://github.com/justmarkham/pandas-videos/blob/master/top_25_pandas_tricks.ipynb and https://www.youtube.com/watch?v=RlIiVeig3hc

This is a draft cheat sheet. It is a work in progress and is not finished yet.

1. Show installed versions

pd.__version__
Show the installed pandas version
pd.show_versions()
Show the versions of Python, pandas, and its dependencies

2. Create an example DataFrame

df = pd.DataFrame({'col one':[100, 200], 'col two':[300, 400]})
Pass a dictionary to the DataFrame constructor: the keys are the column names and the values are the column values
pd.DataFrame(np.random.rand(4, 8))
Use NumPy's rand function to create a larger DataFrame of random values
pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))
Pass a list of letters as the columns parameter if you want non-numeric column names
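A minimal runnable sketch of the constructors above (assuming the usual pd/np import aliases):
import pandas as pd
import numpy as np
df = pd.DataFrame({'col one': [100, 200], 'col two': [300, 400]})
# 4 rows x 8 columns of random floats in [0, 1), with letter column names
df2 = pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))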

3. Rename columns

df = df.rename({'col one':'col_one', 'col two':'col_two'}, axis='columns')
Pass a dictionary in which the keys are the old names and the values are the new names, and also specify the axis
df.columns = ['col_one', 'col_two']
If you're going to rename all of the columns at once, a simpler method is to overwrite the columns attribute of the DataFrame
df.columns = df.columns.str.replace(' ', '_')
To replace spaces with underscores, use the str.replace() method
df.add_prefix('X_')
Add a prefix to every column name
df.add_suffix('_Y')
Add a suffix to every column name
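Putting the renaming methods together, a sketch reusing the example DataFrame from trick 2:
df = pd.DataFrame({'col one': [100, 200], 'col two': [300, 400]})
df = df.rename({'col one': 'col_one', 'col two': 'col_two'}, axis='columns')
# equivalent shortcut: replace the spaces in every column name at once
df.columns = df.columns.str.replace(' ', '_')
df = df.add_prefix('X_')   # columns are now X_col_one, X_col_two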

4. Reverse row order

drinks.loc[::-1]
The most straightforward method is to use the loc accessor with a reversed slice
drinks.loc[::-1].reset_index(drop=True)
To reset the index so it counts up from zero, use reset_index() with drop=True to discard the old index entirely

5. Reverse column order

drinks.loc[:, ::-1]
Use loc to reverse the left-to-right order of your columns
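Both reversals side by side; this assumes drinks is the drinks-by-country dataset that trick 8 loads:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.loc[::-1].reset_index(drop=True)   # rows reversed, index renumbered from 0
drinks.loc[:, ::-1]                       # columns reversed, rows untouched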

6. Select columns by data type

drinks.select_dtypes(include='number')
To select only the numeric columns
drinks.select_dtypes(include=['number', 'object', 'category', 'datetime'])
To include multiple data types, pass a list
drinks.select_dtypes(exclude='number')
To exclude certain data types
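For example, on the same drinks DataFrame (the exact columns returned depend on its dtypes, which you can inspect first):
drinks.dtypes                                  # check each column's data type
drinks.select_dtypes(include='number').head()  # numeric columns only
drinks.select_dtypes(exclude='number').head()  # everything else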

7. Convert strings to numbers

df.astype({'col_one':'float', 'col_two':'float'}).dtypes
To convert the data types to numeric, you can use the astype() method
pd.to_numeric(df.col_three, errors='coerce').fillna(0)
Use the to_numeric() function on the third column and tell it to convert any invalid input into NaN values. If you know that the NaN values actually represent zeros, you can fill them with zeros using the fillna() method
df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
To apply this conversion to the entire DataFrame at once, use the apply() method
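A self-contained sketch with a made-up DataFrame of strings; the '-' is invalid input that errors='coerce' turns into NaN:
df = pd.DataFrame({'col_one': ['1.1', '2.2'],
                   'col_two': ['4.4', '5.5'],
                   'col_three': ['7.7', '-']})
df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
df.dtypes   # all three columns are now float64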

8. Reduce DataFrame size

drinks.info(memory_usage='deep')
Review the memory usage of a DataFrame
cols = ['beer_servings', 'continent']
small_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols)
The first step is to only read in the columns that you actually need, which we specify with the "usecols" parameter
dtypes = {'continent':'category'}
smaller_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols, dtype=dtypes)
The second step is to convert any object columns containing categorical data to the category data type, which we specify with the "dtype" parameter
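To see the savings, compare the footprint before and after (same bit.ly URL as above):
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.info(memory_usage='deep')   # baseline memory usage
cols = ['beer_servings', 'continent']
dtypes = {'continent': 'category'}
smaller_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols, dtype=dtypes)
smaller_drinks.info(memory_usage='deep')   # substantially smaller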

9. Build a DataFrame from multiple files (row-wise)

from glob import glob
Use the glob module to find files matching a pattern
stock_files = sorted(glob('data/stocks*.csv'))
In this case, glob is looking in the "data" subdirectory for all CSV files that start with the word "stocks": ['data/stocks1.csv', 'data/stocks2.csv', 'data/stocks3.csv']
pd.concat((pd.read_csv(file) for file in stock_files))
Use a generator expression to read each of the files using read_csv() and pass the results to the concat() function, which will concatenate the rows into a single DataFrame
pd.concat((pd.read_csv(file) for file in stock_files), ignore_index=True)
There are now duplicate values in the index. To avoid that, tell the concat() function to ignore the index and use the default integer index instead
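End to end, assuming a data/ subdirectory containing stocks1.csv, stocks2.csv, and stocks3.csv with identical columns:
from glob import glob
import pandas as pd
stock_files = sorted(glob('data/stocks*.csv'))
df = pd.concat((pd.read_csv(f) for f in stock_files), ignore_index=True)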

10. Build a DataFrame from multiple files (column-wise)

drink_files = sorted(glob('data/drinks*.csv'))
pd.concat((pd.read_csv(file) for file in drink_files), axis='columns')
Tell the concat() function to concatenate along the columns axis

11. Create a DataFrame from the clipboard

df = pd.read_clipboard()
Just select the data and copy it to the clipboard. Then use the read_clipboard() function to read it into a DataFrame

22. Create a pivot table

titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean')
If you often create DataFrames like the unstacked one in trick 21, you might find it more convenient to use the pivot_table() method instead. With a pivot table, you directly specify the index, the columns, the values, and the aggregation function
titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean', margins=True)
An added benefit of a pivot table is that you can easily add row and column totals by setting margins=True
titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='count', margins=True)
Create a cross-tabulation just by changing the aggregation function from "mean" to "count"
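For example, assuming titanic is the Kaggle Titanic training set used in the source notebook:
titanic = pd.read_csv('http://bit.ly/kaggletrain')
# survival rate by sex and passenger class, with row/column totals
titanic.pivot_table(index='Sex', columns='Pclass', values='Survived', aggfunc='mean', margins=True)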

23. Convert continuous data into categorical data

pd.cut(titanic.Age, bins=[0, 18, 25, 99], labels=['child', 'young adult', 'adult']).head(10)
Label the age ranges, such as "child", "young adult", and "adult". The best way to do this is by using the cut() function, which assigns each value to a bin with a label: ages 0 to 18 get the label "child", ages 18 to 25 "young adult", and ages 25 to 99 "adult"
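To keep the result, assign it to a new column (age_group is a hypothetical name):
titanic['age_group'] = pd.cut(titanic.Age, bins=[0, 18, 25, 99], labels=['child', 'young adult', 'adult'])
titanic.age_group.value_counts()   # how many passengers fall in each bin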

24. Change display options

pd.set_option('display.float_format', '{:.2f}'.format)
To standardise the display to use 2 decimal places
pd.reset_option('display.float_format')
Reset any option back to its default
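For example, the effect on describe() output:
pd.set_option('display.float_format', '{:.2f}'.format)
titanic.describe()   # all floats now display with 2 decimal places
pd.reset_option('display.float_format')   # back to the default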

25. Style a DataFrame

format_dict = {'Date':'{:%m/%d/%y}', 'Close':'${:.2f}', 'Volume':'{:,}'}
Create a dictionary of format strings that specifies how each column should be formatted
stocks.style.format(format_dict)
Pass it to the DataFrame's style.format() method
(stocks.style.format(format_dict).hide_index().highlight_min('Close', color='red').highlight_max('Close', color='lightgreen'))
We've now hidden the index, highlighted the minimum Close value in red, and highlighted the maximum Close value in light green
(stocks.style.format(format_dict).hide_index().background_gradient(subset='Volume', cmap='Blues'))
Apply a background gradient to the Volume column using the Blues colormap, so that darker shades mark higher volumes
(stocks.style.format(format_dict).hide_index().bar('Volume', color='lightblue', align='zero').set_caption('Stock Prices from October 2016'))
There's now a bar chart within the Volume column and a caption above the DataFrame
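A full sketch; the data/stocks.csv path is an assumption based on the source notebook, and note that hide_index() is deprecated in recent pandas in favour of hide(axis='index'):
stocks = pd.read_csv('data/stocks.csv', parse_dates=['Date'])   # assumed path
format_dict = {'Date': '{:%m/%d/%y}', 'Close': '${:.2f}', 'Volume': '{:,}'}
(stocks.style.format(format_dict)
       .hide_index()                 # use .hide(axis='index') on pandas >= 1.4
       .bar('Volume', color='lightblue', align='zero')
       .set_caption('Stock Prices from October 2016'))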

21. Reshape a MultiIndexed Series

titanic.groupby('Sex').Survived.mean()
If you wanted to calculate the survival rate by a single category such as "Sex", you would use a groupby()
titanic.groupby(['Sex', 'Pclass']).Survived.mean()
If you wanted to calculate the survival rate across two different categories at once, you would groupby() both of those categories
titanic.groupby(['Sex', 'Pclass']).Survived.mean().unstack()
It can be hard to read and interact with data in this format, so it's often more convenient to reshape a MultiIndexed Series into a DataFrame by using the unstack() method
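Side by side, again on the Titanic training set:
titanic.groupby(['Sex', 'Pclass']).Survived.mean()            # MultiIndexed Series
titanic.groupby(['Sex', 'Pclass']).Survived.mean().unstack()  # Sex as rows, Pclass as columns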

20. Select a slice of rows and columns

titanic.describe()
If you wanted a numerical summary of the dataset, you would use the describe() method
titanic.describe().loc['min':'max']
If you wanted to filter it to only show the "five-number summary", you can use the loc accessor and pass it a slice of the "min" through the "max" row labels
titanic.describe().loc['min':'max', 'Pclass':'Parch']
And if you're not interested in all of the columns, you can also pass it a slice of column labels
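Note that label slices with loc are inclusive on both ends, unlike positional slices:
summary = titanic.describe()
summary.loc['min':'max']                    # rows min, 25%, 50%, 75%, max
summary.loc['min':'max', 'Pclass':'Parch']  # same rows, columns Pclass through Parch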

18. Aggregate by multiple functions

orders[orders.order_id == 1].item_price.sum()
Each order has an order_id and consists of one or more rows. To figure out the total price of an order, you sum the item_price for that order_id. For example, here's the total price of order number 1
orders.groupby('order_id').item_price.sum()
If you wanted to calculate the total price of every order, you would groupby() order_id and then take the sum of item_price for each group
orders.groupby('order_id').item_price.agg(['sum', 'count'])
However, you're not actually limited to aggregating by a single function such as sum(). To aggregate by multiple functions, you use the agg() method and pass it a list of functions such as sum() and count()
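A runnable sketch, assuming orders is the Chipotle orders dataset from the source videos (item_price arrives as a string like '$2.39', so it needs converting first):
orders = pd.read_table('http://bit.ly/chiporders')
orders['item_price'] = orders.item_price.str.replace('$', '', regex=False).astype(float)
orders.groupby('order_id').item_price.agg(['sum', 'count'])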

19. Combine the output of an aggreg­ation with a DF

orders.groupby('order_id').item_price.sum()
What if we wanted to create a new column listing the total price of each order? Recall that we calculated the total price using the sum() method
len(orders.groupby('order_id').item_price.sum())
In other words, the output of the sum() function...
len(orders.item_price)
...is smaller than the input to the function
total_price = orders.groupby('order_id').item_price.transform('sum')
len(total_price)
The solution is to use the transform() method, which performs the same calculation but returns output data that is the same shape as the input data
This needs more work!
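Continuing the sketch from trick 18, transform() makes the per-order total line up with the original rows:
total_price = orders.groupby('order_id').item_price.transform('sum')
orders['total_price'] = total_price   # same length as orders, so it aligns row by row
orders[['order_id', 'item_price', 'total_price']].head()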

16. Split a string into multiple columns

df.name.str.split(' ', expand=True)
What if we wanted to split the "name" column into three separate columns, for first, middle, and last name? We would use the str.split() method and tell it to split on a space character and expand the results into a DataFrame
df[['first', 'middle', 'last']] = df.name.str.split(' ', expand=True)
These three columns can actually be saved to the original DataFrame in a single assignment statement
df.location.str.split(', ', expand=True)
What if we wanted to split a string, but only keep one of the resulting columns? For example, let's split the location column on "comma space"
df['city'] = df.location.str.split(', ', expand=True)[0]
If we only cared about saving the city name in column 0, we can just select that column and save it to the DataFrame
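A self-contained sketch with a made-up two-row DataFrame:
df = pd.DataFrame({'name': ['John Arthur Doe', 'Jane Ann Smith'],
                   'location': ['Los Angeles, CA', 'Washington, DC']})
df[['first', 'middle', 'last']] = df.name.str.split(' ', expand=True)
df['city'] = df.location.str.split(', ', expand=True)[0]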

13. Filter a DataFrame by multiple categories

movies[(movies.genre == 'Action') | (movies.genre == 'Drama') | (movies.genre == 'Western')]
If we wanted to filter the DataFrame to only show movies with the genre Action or Drama or Western, we could use multiple conditions separated by the "or" operator
movies[movies.genre.isin(['Action', 'Drama', 'Western'])]
However, you can actually rewrite this code more clearly by using the isin() method and passing it a list of genres
movies[~movies.genre.isin(['Action', 'Drama', 'Western'])]
And if you want to reverse this filter, so that you are excluding (rather than including) those three genres, you can put a tilde in front of the condition
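For example, assuming movies is the IMDb ratings dataset from the source videos:
movies = pd.read_csv('http://bit.ly/imdbratings')
included = movies[movies.genre.isin(['Action', 'Drama', 'Western'])]
excluded = movies[~movies.genre.isin(['Action', 'Drama', 'Western'])]
len(included) + len(excluded) == len(movies)   # the two filters partition the rows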

14. Filter a DataFrame by largest categories

counts.nlargest(3)
The Series method nlargest() makes it easy to select the 3 largest values in this Series; here counts holds the number of movies in each genre (counts = movies.genre.value_counts())
counts.nlargest(3).index
And all we actually need from this Series is the index
movies[movies.genre.isin(counts.nlargest(3).index)]
Finally, we can pass the index object to isin(), and it will be treated like a list of genres
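The whole chain in one place, continuing with the same movies DataFrame:
counts = movies.genre.value_counts()   # genres, most common first
top_genres = counts.nlargest(3).index  # an Index of the 3 biggest genres
movies[movies.genre.isin(top_genres)]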

15. Handle missing values

ufo.isna().sum()
To find out how many values are missing in each column, you can use the isna() method and then take the sum()
ufo.isna().mean()
Similarly, you can find out the percentage of values that are missing by taking the mean() of isna()
ufo.dropna(axis='columns')
If you want to drop the columns that have any missing values, you can use the dropna() method
ufo.dropna(thresh=len(ufo)*0.9, axis='columns')
Or if you want to drop columns in which more than 10% of the values are missing, you can set a threshold for dropna()
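For example, assuming ufo is the UFO reports dataset from the source videos:
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.isna().mean()                                # fraction of missing values per column
ufo.dropna(thresh=len(ufo)*0.9, axis='columns')  # keep columns at least 90% populated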