Read/Write and Inspection
Read/Write
df = pd.read_excel('file.xlsx') | Read file (CSV, Excel, JSON, HTML, SQL)
df.to_excel('file.xlsx') | Write file
Inspection
df.head() / df.tail() | View first/last rows
df.shape | Gives dimensions (rows, columns)
df.dtypes | Shows the type of each column (int, str etc.)
df.info() | Lists: range of index, list of all columns, no. of non-nulls, data types, memory usage
df.describe() | Lists: count, mean, std deviation, min, 25%, 50%, 75%, max
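A minimal read-and-inspect sketch (the file name 'sales.xlsx' and its columns are hypothetical):

import pandas as pd

df = pd.read_excel('sales.xlsx')             # hypothetical file
print(df.head())                             # first 5 rows
print(df.shape)                              # (n_rows, n_cols)
df.info()                                    # index range, columns, non-null counts, dtypes, memory
print(df.describe())                         # count, mean, std, min, quartiles, max
df.to_excel('sales_copy.xlsx', index=False)  # write back out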
Filter
Functionality
df[df['column'] > value] vs df['column'] > value | 1st filters the DataFrame, 2nd returns a boolean Series
df[(df['x'] > y) & (df['a'] > v)] | & combines filters; wrap each condition in parentheses
df[df.apply(lambda row: row['col1'] * 2 > row['col2'], axis=1)] | Filter by function; axis=1 applies it to each row, axis=0 to each column
df.reset_index(drop=True) | Reset the index after filtering (drop=True discards the old index)
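A short sketch of these filters, using made-up 'price' and 'qty' columns:

import pandas as pd

df = pd.DataFrame({'price': [5, 20, 35], 'qty': [10, 3, 8]})
mask = df['price'] > 10                                    # boolean Series
expensive = df[mask]                                       # filtered DataFrame
both = df[(df['price'] > 10) & (df['qty'] > 5)]            # combined conditions
rowwise = df[df.apply(lambda r: r['price'] * r['qty'] > 100, axis=1)]
clean = both.reset_index(drop=True)                        # fresh 0..n-1 index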
Rows and Columns
df[['column1', 'column2']] | Filter by columns
df.iloc[1:6] | Filter by row (2nd - 6th)
df[boolean_mask] | Filters rows based on a boolean mask
df[df['column'].isin([value1, value2])] | Filter rows based on a list of values
df.query('age > @min_age') | Filter rows with a query string; @ references a Python variable, 'and' combines conditions
df.loc[df['column'] > value, 'column_name'] | Filters rows and selects columns
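A sketch of row/column selection, assuming hypothetical 'name' and 'age' columns:

import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cat'], 'age': [25, 32, 19]})
min_age = 20
adults = df.query('age > @min_age')             # @ picks up the Python variable
subset = df[df['name'].isin(['Ann', 'Cat'])]    # rows matching a list
names = df.loc[df['age'] > 20, 'name']          # filter rows, select one column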
Dropping and replacing
df.drop(columns=['column1', 'column2']) or df.drop(['unnecessary_column'], axis=1, inplace=True) | Drop columns
df.drop_duplicates(subset=['column']) | Drop rows duplicated in specific columns
df.dropna() | Removes rows with missing values (df[df['column'].notna()] keeps only non-missing rows)
df.where(df['column'] > value) / df.mask(df['column'] > value) | where replaces values not meeting the condition with NaN; mask replaces values meeting it
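A sketch of dropping and masking, with a made-up 'score' column:

import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [80, np.nan, 45, 90]})
no_missing = df.dropna()                # drops the NaN row
unique = df.drop_duplicates()           # drops exact duplicate rows
passing = df.where(df['score'] >= 50)   # failing scores become NaN
failing = df.mask(df['score'] >= 50)    # passing scores become NaN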
String
df[df['column'].str.contains('substring')] | Filter where a string column contains a substring
df[df['column'].str.match(r'abc$')] | Filter by regex (str.match anchors at the start of the string). ^abc: starts with 'abc'; abc$: ends with 'abc'; a|b: 'a' or 'b'; abc+: 'ab' followed by 1+ 'c's; abc*: 'ab' followed by 0+ 'c's; abc?: 'ab' followed by 0 or 1 'c'; [abc]: any one of 'a', 'b', or 'c'
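A sketch of the string filters on a made-up 'code' column:

import pandas as pd

df = pd.DataFrame({'code': ['abc', 'xabc', 'abcc', 'xyz']})
has_sub = df[df['code'].str.contains('abc')]   # 'abc' anywhere in the string
ends = df[df['code'].str.contains(r'abc$')]    # ends with 'abc'
starts = df[df['code'].str.match(r'abc')]      # match anchors at the start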
Data Manipulation
Editing data
df['new_column'] = new_values | Assign/create values for a column
df['column'] *= value | Multiply (also /=, -=, +=) each entry in a column by a value
df.fillna(value) | Fill NaN values (with a constant, or forward/backward fill via df.ffill()/df.bfill())
df['column'].replace(to_replace, value) | Replace values
df['column'].rolling(window=7).mean() | Rolling aggregate
df.update(other_df) | Updates values from another df
df['column'].astype(dtype) | Converts data type
np.where(df['column'] > value, 'Value=True', 'Value=False') | Create an array of values based on a condition
df['column'].apply(func) | Apply a function or lambda to each value in the selection
df.rename(columns={'old_name': 'new_name'}) | Rename columns
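A sketch tying the editing calls together (column names and values are invented):

import pandas as pd
import numpy as np

df = pd.DataFrame({'temp_f': [68.0, np.nan, 75.2]})
df['temp_f'] = df['temp_f'].fillna(70.0)                    # fill the missing reading
df['temp_c'] = (df['temp_f'] - 32) * 5 / 9                  # derived column
df['temp_c'] = df['temp_c'].astype(float)                   # explicit dtype
df['label'] = np.where(df['temp_c'] > 21, 'warm', 'cool')   # conditional array
df = df.rename(columns={'temp_f': 'fahrenheit'})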
Combining data
pd.merge(df1, df2, on='common_column', how='inner') or pd.merge(df1, df2, left_index=True, right_index=True, how='outer') | Merges two dfs on a common column. 'inner' keeps only keys present in both dfs, 'outer' keeps all keys, 'left'/'right' keep the keys of one df
pd.concat([df1, df2]) | Concatenate (stacks the rows of all dfs)
df['column'].map(mapping_dict) | Map each value in the column to another via a dictionary (or function)
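A sketch of the three combining tools on invented store data:

import pandas as pd

sales = pd.DataFrame({'store': ['A', 'B'], 'sales': [100, 200]})
sizes = pd.DataFrame({'store': ['A', 'B'], 'sqft': [50, 80]})
merged = pd.merge(sales, sizes, on='store', how='inner')  # join on 'store'
stacked = pd.concat([sales, sizes])                       # rows of both; unshared columns become NaN
sales['region'] = sales['store'].map({'A': 'north', 'B': 'south'})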
Group by
grouped = df.groupby(['Store', 'Product']).agg(Total_Sales=('Sales', 'sum'), Average_Cost=('Cost', 'mean'), Sales_Count=('Sales', 'count'))
filtered_grouped = grouped[grouped['Total_Sales'] > 500]
Aggregations: sum; mean; median; min; max; count; size; std (standard deviation); var (variance); first; last; prod or product; nunique (number of unique values)
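The same named-aggregation pattern on a tiny invented table:

import pandas as pd

df = pd.DataFrame({
    'Store': ['A', 'A', 'B'],
    'Product': ['pen', 'pen', 'pen'],
    'Sales': [300, 400, 100],
    'Cost': [1.0, 1.2, 0.9],
})
grouped = df.groupby(['Store', 'Product']).agg(
    Total_Sales=('Sales', 'sum'),
    Average_Cost=('Cost', 'mean'),
    Sales_Count=('Sales', 'count'),
)
big = grouped[grouped['Total_Sales'] > 500]   # keeps only Store A (700)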
Pivot
pivoted_df = df.pivot(index='Date', columns='Variable', values='Value')
pivot_table_df = df.pivot_table(index='Date', columns='Variable', values='Value', aggfunc='sum')
values (optional): Columns whose data will be aggregated.
index: Columns used as index.
columns (optional): Columns to pivot into new DataFrame's columns.
aggfunc: Aggregation function for values.
pivot raises an error on duplicate index/column pairs; pivot_table aggregates them.
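A sketch of both calls on invented long-format data:

import pandas as pd

df = pd.DataFrame({
    'Date': ['d1', 'd1', 'd2'],
    'Variable': ['x', 'y', 'x'],
    'Value': [1, 2, 3],
})
wide = df.pivot(index='Date', columns='Variable', values='Value')
# pivot_table tolerates duplicate (Date, Variable) pairs by aggregating them
wide_sum = df.pivot_table(index='Date', columns='Variable', values='Value', aggfunc='sum')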
Melt
melted_df = pd.melt(df, id_vars=['id_column'], var_name='variable_name', value_name='value_name')
id_vars (optional): Columns to keep unchanged; all other columns are melted by default.
value_vars (optional): Columns to melt.
var_name (optional): Melted column name. Default is 'variable'.
value_name (optional): Melted values column name. Default is 'value'.
col_level (optional): Multi-index level
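A melt sketch with invented month columns:

import pandas as pd

df = pd.DataFrame({'id_column': [1, 2], 'jan': [10, 20], 'feb': [30, 40]})
melted = pd.melt(df, id_vars=['id_column'], var_name='month', value_name='amount')
# four rows: one per (id_column, month) pair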
Stack/unstack
stacked_df = df.stack(level=-1, dropna=True)
level: The level(s) of the column labels to stack. Default is the last level.
dropna: Whether to drop rows with missing values from the result. Default is True.
df.unstack() is the inverse, pivoting an index level back into columns.
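A stack/unstack round trip on a small invented frame:

import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])
stacked = df.stack()           # Series with a (row, column) MultiIndex
restored = stacked.unstack()   # back to the original wide shape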
Time Operations
pd.to_datetime(df['column']) | Converts a column to datetime format
datetime.datetime.now() | Returns the current date and time
df['datetime_column'].dt.date / .dt.time | Extracts the date/time from a datetime column
datetime.datetime.strptime(date_string, format) | Useful for strings mixing words and dates, including hours/minutes, or several date formats in one column (requires try/except)
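A short datetime sketch (dates and formats are invented):

import datetime
import pandas as pd

df = pd.DataFrame({'when': ['2021-01-05', '2021-02-10']})
df['when'] = pd.to_datetime(df['when'])   # strings -> datetime64
df['day'] = df['when'].dt.date            # date part only
parsed = datetime.datetime.strptime('05 Jan 2021, 14:30', '%d %b %Y, %H:%M')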
Openpyxl
openpyxl.load_workbook(filename); workbook.save(filename) | Read/Write
workbook[sheetname] or workbook.active | Sheet selection
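A minimal open-edit-save loop (the file and sheet names are hypothetical):

import openpyxl

wb = openpyxl.load_workbook('report.xlsx')   # hypothetical file
ws = wb.active                               # or wb['Sheet1']
ws['A1'] = 'Title'
wb.save('report.xlsx')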
Cell Formats
cell.font = Font(size=12) | Font
cell.number_format = '0.00%' | Cell number formats
Alignment(horizontal="center"); Border(left=Side(border_style="thin", color="000000")); PatternFill("solid", fgColor="DDDDDD") | More cell options
Comment('Text', 'Author') | Cell comments; comment.width/height can be changed too
sheet.merge_cells('A1:D1') / sheet.unmerge_cells('A1:D1') | Merge/Unmerge
cell.hyperlink = 'http://www.example.com' | Hyperlinks
rule = ColorScaleRule(start_type="min", start_color="FFFFFF", end_type="max", end_color="FF0000"); ws.conditional_formatting.add("A1:A9", rule) | Conditional formatting
dv = DataValidation(type="list", formula1='"Item1,Item2,Item3"', showDropDown=True); ws.add_data_validation(dv) | Data validation
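A formatting sketch pulling several of these together (values are invented):

import openpyxl
from openpyxl.styles import Font, Alignment, PatternFill
from openpyxl.comments import Comment

wb = openpyxl.Workbook()
ws = wb.active
cell = ws['A1']
cell.value = 0.756
cell.font = Font(size=12, bold=True)
cell.number_format = '0.00%'                       # displays as 75.60%
cell.alignment = Alignment(horizontal='center')
cell.fill = PatternFill('solid', fgColor='DDDDDD')
cell.comment = Comment('Quarterly rate', 'Author')
ws.merge_cells('A1:D1')
wb.save('styled.xlsx')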
Charts
chart = BarChart(); chart.style = 13; sheet.add_chart(chart) | Create, style, then add a chart to a sheet
img = Image('path/to/image'); sheet.add_image(img, 'A1') | Add images; useful for matplotlib or other libraries
cell.value = '=SUM(A1:A10)' | Write formulas
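A chart-and-formula sketch on invented data:

import openpyxl
from openpyxl.chart import BarChart, Reference

wb = openpyxl.Workbook()
ws = wb.active
for row in [['Month', 'Sales'], ['Jan', 10], ['Feb', 15]]:
    ws.append(row)
chart = BarChart()
chart.style = 13
data = Reference(ws, min_col=2, min_row=1, max_row=3)
chart.add_data(data, titles_from_data=True)
ws.add_chart(chart, 'E2')
ws['B5'] = '=SUM(B2:B3)'   # a formula is just a string starting with '='
wb.save('chart.xlsx')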