Read/Write and Inspection
Read/Write
df = pd.read_excel('file.xlsx') | Read file (CSV, Excel, JSON, HTML, SQL)
df.to_excel('file.xlsx') | Write file
Inspection
df.head() / df.tail() | View first/last rows
df.shape | Gives dimensions (rows, columns)
df.dtypes | Shows the type of each column (int, str etc.)
df.info() | Lists: range of index, list of all columns, no. of non-nulls, data types, memory usage
df.describe() | Lists: count, mean, std deviation, min, 25%, 50%, 75%, max
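A minimal read-and-inspect sketch (the file name 'sales.xlsx' and its columns are hypothetical):

import pandas as pd

df = pd.read_excel('sales.xlsx')             # hypothetical file
print(df.head())                             # first 5 rows
print(df.shape)                              # (n_rows, n_cols)
df.info()                                    # index range, columns, non-null counts, dtypes, memory
print(df.describe())                         # count, mean, std, min, quartiles, max
df.to_excel('sales_copy.xlsx', index=False)  # write back out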
Filter
Functionality
df[df['column'] > value] vs df['column'] > value | 1st filters the DataFrame, 2nd returns a boolean Series
df[(df['x'] > y) & (df['a'] > v)] | & combines filters; wrap each condition in parentheses
df[df.apply(lambda row: row['col1'] * 2 > row['col2'], axis=1)] | Filter by function; axis=1 applies it to each row, axis=0 to each column
df.reset_index(drop=True) | Reset the index after filtering (drop=True discards the old index)
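A short sketch of these filters, using made-up 'price' and 'qty' columns:

import pandas as pd

df = pd.DataFrame({'price': [5, 20, 35], 'qty': [10, 3, 8]})
mask = df['price'] > 10                                    # boolean Series
expensive = df[mask]                                       # filtered DataFrame
both = df[(df['price'] > 10) & (df['qty'] > 5)]            # combined conditions
rowwise = df[df.apply(lambda r: r['price'] * r['qty'] > 100, axis=1)]
clean = both.reset_index(drop=True)                        # fresh 0..n-1 index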
Rows and Columns
df[['column1', 'column2']] | Filter by columns
df.iloc[1:6] | Filter by row (2nd - 6th)
df[boolean_mask] | Filters rows based on a boolean mask
df[df['column'].isin([value1, value2])] | Filter rows based on a list of values
df.query('age > @min_age') | Filter rows with a query string; @ references a Python variable, 'and' combines conditions
df.loc[df['column'] > value, 'column_name'] | Filters rows and selects columns
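A sketch of row/column selection, assuming hypothetical 'name' and 'age' columns:

import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cat'], 'age': [25, 32, 19]})
min_age = 20
adults = df.query('age > @min_age')             # @ picks up the Python variable
subset = df[df['name'].isin(['Ann', 'Cat'])]    # rows matching a list
names = df.loc[df['age'] > 20, 'name']          # filter rows, select one column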
Dropping and replacing
df.drop(columns=['column1', 'column2']) or df.drop(['unnecessary_column'], axis=1, inplace=True) | Drop columns
df.drop_duplicates(subset=['column']) | Drop rows duplicated in specific columns
df.dropna() | Removes rows with missing values (df[df['column'].notna()] keeps only non-missing rows)
df.where(df['column'] > value) / df.mask(df['column'] > value) | where replaces values not meeting the condition with NaN; mask replaces values meeting it
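A sketch of dropping and masking, with a made-up 'score' column:

import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [80, np.nan, 45, 90]})
no_missing = df.dropna()                # drops the NaN row
unique = df.drop_duplicates()           # drops exact duplicate rows
passing = df.where(df['score'] >= 50)   # failing scores become NaN
failing = df.mask(df['score'] >= 50)    # passing scores become NaN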
String
df[df['column'].str.contains('substring')] | Filter where a string column contains a substring
df[df['column'].str.match(r'abc$')] | Filter by regex (str.match anchors at the start of the string). ^abc: starts with 'abc'; abc$: ends with 'abc'; a|b: 'a' or 'b'; abc+: 'ab' followed by 1+ 'c's; abc*: 'ab' followed by 0+ 'c's; abc?: 'ab' followed by 0 or 1 'c'; [abc]: any one of 'a', 'b', or 'c'
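A sketch of the string filters on a made-up 'code' column:

import pandas as pd

df = pd.DataFrame({'code': ['abc', 'xabc', 'abcc', 'xyz']})
has_sub = df[df['code'].str.contains('abc')]   # 'abc' anywhere in the string
ends = df[df['code'].str.contains(r'abc$')]    # ends with 'abc'
starts = df[df['code'].str.match(r'abc')]      # match anchors at the start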
Data Manipulation
Editing data
df['new_column'] = new_values | Assign/create values for a column
df['column'] *= value | Multiply (also /=, -=, +=) each entry in a column by a value
df.fillna(value) | Fill NaN values (with a constant, or forward/backward fill via df.ffill()/df.bfill())
df['column'].replace(to_replace, value) | Replace values
df['column'].rolling(window=7).mean() | Rolling aggregate
df.update(other_df) | Updates values from another df
df['column'].astype(dtype) | Converts data type
np.where(df['column'] > value, 'Value=True', 'Value=False') | Create an array of values based on a condition
df['column'].apply(func) | Apply a function or lambda to each value in the selection
df.rename(columns={'old_name': 'new_name'}) | Rename columns
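A sketch tying the editing calls together (column names and values are invented):

import pandas as pd
import numpy as np

df = pd.DataFrame({'temp_f': [68.0, np.nan, 75.2]})
df['temp_f'] = df['temp_f'].fillna(70.0)                    # fill the missing reading
df['temp_c'] = (df['temp_f'] - 32) * 5 / 9                  # derived column
df['temp_c'] = df['temp_c'].astype(float)                   # explicit dtype
df['label'] = np.where(df['temp_c'] > 21, 'warm', 'cool')   # conditional array
df = df.rename(columns={'temp_f': 'fahrenheit'})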
Combining data
pd.merge(df1, df2, on='common_column', how='inner') or pd.merge(df1, df2, left_index=True, right_index=True, how='outer') | Merges two dfs on a common column. 'inner' keeps only keys present in both dfs, 'outer' keeps all keys, 'left'/'right' keep the keys of one df
pd.concat([df1, df2]) | Concatenate (stacks the rows of all dfs)
df['column'].map(mapping_dict) | Map each value in the column to another via a dictionary (or function)
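A sketch of the three combining tools on invented store data:

import pandas as pd

sales = pd.DataFrame({'store': ['A', 'B'], 'sales': [100, 200]})
sizes = pd.DataFrame({'store': ['A', 'B'], 'sqft': [50, 80]})
merged = pd.merge(sales, sizes, on='store', how='inner')  # join on 'store'
stacked = pd.concat([sales, sizes])                       # rows of both; unshared columns become NaN
sales['region'] = sales['store'].map({'A': 'north', 'B': 'south'})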
Group by
grouped = df.groupby(['Store', 'Product']).agg(Total_Sales=('Sales', 'sum'), Average_Cost=('Cost', 'mean'), Sales_Count=('Sales', 'count'))
filtered_grouped = grouped[grouped['Total_Sales'] > 500]
Aggregations: sum; mean; median; min; max; count; size; std (standard deviation); var (variance); first; last; prod or product; nunique (number of unique values)
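The same named-aggregation pattern on a tiny invented table:

import pandas as pd

df = pd.DataFrame({
    'Store': ['A', 'A', 'B'],
    'Product': ['pen', 'pen', 'pen'],
    'Sales': [300, 400, 100],
    'Cost': [1.0, 1.2, 0.9],
})
grouped = df.groupby(['Store', 'Product']).agg(
    Total_Sales=('Sales', 'sum'),
    Average_Cost=('Cost', 'mean'),
    Sales_Count=('Sales', 'count'),
)
big = grouped[grouped['Total_Sales'] > 500]   # keeps only Store A (700)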
Pivot
pivoted_df = df.pivot(index='Date', columns='Variable', values='Value')
pivot_table_df = df.pivot_table(index='Date', columns='Variable', values='Value', aggfunc='sum')
values (optional): Columns whose data will be aggregated.
index: Columns used as index.
columns (optional): Columns to pivot into new DataFrame's columns.
aggfunc: Aggregation function for values.
pivot raises an error on duplicate index/column pairs; pivot_table aggregates them.
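A sketch of both calls on invented long-format data:

import pandas as pd

df = pd.DataFrame({
    'Date': ['d1', 'd1', 'd2'],
    'Variable': ['x', 'y', 'x'],
    'Value': [1, 2, 3],
})
wide = df.pivot(index='Date', columns='Variable', values='Value')
# pivot_table tolerates duplicate (Date, Variable) pairs by aggregating them
wide_sum = df.pivot_table(index='Date', columns='Variable', values='Value', aggfunc='sum')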
Melt
melted_df = pd.melt(df, id_vars=['id_column'], var_name='variable_name', value_name='value_name')
id_vars (optional): Columns to keep unchanged; all other columns are melted by default.
value_vars (optional): Columns to melt.
var_name (optional): Melted column name. Default is 'variable'.
value_name (optional): Melted values column name. Default is 'value'.
col_level (optional): Multi-index level
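A melt sketch with invented month columns:

import pandas as pd

df = pd.DataFrame({'id_column': [1, 2], 'jan': [10, 20], 'feb': [30, 40]})
melted = pd.melt(df, id_vars=['id_column'], var_name='month', value_name='amount')
# four rows: one per (id_column, month) pair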
Stack/unstack
stacked_df = df.stack(level=-1, dropna=True)
level: The level(s) of the column labels to stack. Default is the last level.
dropna: Whether to drop rows with missing values from the result. Default is True.
df.unstack() is the inverse, pivoting an index level back into columns.
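A stack/unstack round trip on a small invented frame:

import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])
stacked = df.stack()           # Series with a (row, column) MultiIndex
restored = stacked.unstack()   # back to the original wide shape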
Time Operations
pd.to_datetime(df['column']) | Converts a column to datetime format
datetime.datetime.now() | Returns the current date and time
df['datetime_column'].dt.date / .dt.time | Extracts the date/time from a datetime column
datetime.datetime.strptime(date_string, format) | Useful for strings mixing words and dates, including hours/minutes, or several date formats in one column (requires try/except)
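A short datetime sketch (dates and formats are invented):

import datetime
import pandas as pd

df = pd.DataFrame({'when': ['2021-01-05', '2021-02-10']})
df['when'] = pd.to_datetime(df['when'])   # strings -> datetime64
df['day'] = df['when'].dt.date            # date part only
parsed = datetime.datetime.strptime('05 Jan 2021, 14:30', '%d %b %Y, %H:%M')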
Openpyxl
openpyxl.load_workbook(filename); workbook.save(filename) | Read/Write
workbook[sheetname] or workbook.active | Sheet selection
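A minimal open-edit-save loop (the file and sheet names are hypothetical):

import openpyxl

wb = openpyxl.load_workbook('report.xlsx')   # hypothetical file
ws = wb.active                               # or wb['Sheet1']
ws['A1'] = 'Title'
wb.save('report.xlsx')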
Cell Formats
cell.font = Font(size=12) | Font
cell.number_format = '0.00%' | Cell number formats
Alignment(horizontal="center"); Border(left=Side(border_style="thin", color="000000")); PatternFill("solid", fgColor="DDDDDD") | More cell options
Comment('Text', 'Author') | Cell comments; comment.width/height can be changed too
sheet.merge_cells('A1:D1') / sheet.unmerge_cells('A1:D1') | Merge/Unmerge
cell.hyperlink = 'http://www.example.com' | Hyperlinks
rule = ColorScaleRule(start_type="min", start_color="FFFFFF", end_type="max", end_color="FF0000"); ws.conditional_formatting.add("A1:A9", rule) | Conditional formatting
dv = DataValidation(type="list", formula1='"Item1,Item2,Item3"', showDropDown=True); ws.add_data_validation(dv) | Data validation
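A formatting sketch pulling several of these together (values are invented):

import openpyxl
from openpyxl.styles import Font, Alignment, PatternFill
from openpyxl.comments import Comment

wb = openpyxl.Workbook()
ws = wb.active
cell = ws['A1']
cell.value = 0.756
cell.font = Font(size=12, bold=True)
cell.number_format = '0.00%'                       # displays as 75.60%
cell.alignment = Alignment(horizontal='center')
cell.fill = PatternFill('solid', fgColor='DDDDDD')
cell.comment = Comment('Quarterly rate', 'Author')
ws.merge_cells('A1:D1')
wb.save('styled.xlsx')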
Charts
chart = BarChart(); chart.style = 13; sheet.add_chart(chart) | Create, style, then add a chart to a sheet
img = Image('path/to/image'); sheet.add_image(img, 'A1') | Add images; useful for matplotlib or other libraries
cell.value = '=SUM(A1:A10)' | Write formulas
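A chart-and-formula sketch on invented data:

import openpyxl
from openpyxl.chart import BarChart, Reference

wb = openpyxl.Workbook()
ws = wb.active
for row in [['Month', 'Sales'], ['Jan', 10], ['Feb', 15]]:
    ws.append(row)
chart = BarChart()
chart.style = 13
data = Reference(ws, min_col=2, min_row=1, max_row=3)
chart.add_data(data, titles_from_data=True)
ws.add_chart(chart, 'E2')
ws['B5'] = '=SUM(B2:B3)'   # a formula is just a string starting with '='
wb.save('chart.xlsx')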