Read/Write and Inspection
        
                        
                                                                                    
                                                                                             Read/Write  | 
                                                                                 
                                                                                            
                                                                                            df = pd.read_excel('file.xlsx')
  | 
                                                                                                                        Read file (CSV, Excel, JSON, HTML, SQL)  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        Write file  | 
                                                                                 
                                                                                            
                                                                                             Inspection  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        View first/last rows  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        Gives dimensions  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        Shows types in each column (int, str etc)  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        Lists: (range of index, list of all columns, no. non null, data types, memory usage)  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        Lists: (count, mean, std deviation, min, 25%, 50%, 75%, max)  | 
                                                                                 
                                                                         
                             
    
    
            Filter
        
                        
                                                                                    
                                                                                             Functionality  | 
                                                                                 
                                                                                            
                                                                                            df[df['column]
  vs  df['column']
  | 
                                                                                                                        1st filters dataframe, 2nd creates list of booleans  | 
                                                                                 
                                                                                            
                                                                                            df[df[x] > y & df[df[a] > v
  | 
                                                                                                                        & combines filters  | 
                                                                                 
                                                                                            
                                                                                            df[df.apply(lamba row: row['col1'] * 2 > row['col2'], axis=0)
  | 
                                                                                                                        Filter by function. Axis=0 for columns, 1 for rows.  | 
                                                                                 
                                                                                            
                                                                                            df.reset_index(drop=True)
  | 
                                                                                                                        Reset indexs for filter  | 
                                                                                 
                                                                                            
                                                                                             Rows and Columns  | 
                                                                                 
                                                                                            
                                                                                            df[['column1', 'column2']]
  | 
                                                                                                                        Filter by columns  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        Filter by row (2nd - 6th)  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        Filters rows based on boolean  | 
                                                                                 
                                                                                            
                                                                                            df[df['column'].isin([value1, value2])
  | 
                                                                                                                        Filter rows based on list of values  | 
                                                                                 
                                                                                            
                                                                                            df.query('age > @min_age')
  | 
                                                                                                                        Filter rows based on query string, @ points to variable 'and' to combine  | 
                                                                                 
                                                                                            
                                                                                            df.loc[df['column'] > value, 'column_name']
  | 
                                                                                                                        Filters rows and columns  | 
                                                                                 
                                                                                            
                                                                                             Dropping and replacing  | 
                                                                                 
                                                                                            
                                                                                            df.drop(columns=['column1', 'column2'])' or   '{{lang-python}}df.drop(['unnecessary_column'], axis=1, inplace=True)
  | 
                                                                                                                        Drop columns  | 
                                                                                 
                                                                                            
                                                                                            df.drop_duplicates(subset=['column'])
  | 
                                                                                                                        Drop duplicate rows from specific columns  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        Removes rows with missing values/non-missing values  | 
                                                                                 
                                                                                            
                                                                                            df.mask/.where(df['column'] > value)
  | 
                                                                                                                        Replaces rows not meeting/meeting condition with NaN  | 
                                                                                 
                                                                                            
                                                                                             String  | 
                                                                                 
                                                                                            
                                                                                            df[df['column'].str.contains('substring')
  | 
                                                                                                                        Fitler where string column contains substring  | 
                                                                                 
                                                                                            
                                                                                            df[df['column'].str.match(r'abc$')]
  | 
                                                                                                                        ^abc
 : Start with 'abc';  abc$
 : End with 'abc';  a|b:
  'a' or 'b';  abc+
 : 'abc' followed by 1+ 'c's;  abc*
 : 'ab' followed by 0+ 'c's;  abc?
 : 'ab' followed by 0 or 1 'c';  [abc]
 : Any one of 'a', 'b', or 'c'  | 
                                                                                 
                                                                         
                             
    
    
            Data Manipulation
        
                        
                                                                                    
                                                                                             Editing data  | 
                                                                                 
                                                                                            
                                                                                            df['new_column'] = new_values
  | 
                                                                                                                        Assign/create new values for column  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        Multiply (/=, -= , +=) each entry in a column by a value.  | 
                                                                                 
                                                                                            
                                                                                            df.fillna(value or method)
  | 
                                                                                                                        Fill Nan values  | 
                                                                                 
                                                                                            
                                                                                            df['column'].replace(to_replace, value)
  | 
                                                                                                                        Replace values  | 
                                                                                 
                                                                                            
                                                                                            df['column'].rolling(window=7).mean()
  | 
                                                                                                                        Rolling aggregate  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        Updates values from other df  | 
                                                                                 
                                                                                            
                                                                                            df['column'].astype(dtype)
  | 
                                                                                                                        Converts data type  | 
                                                                                 
                                                                                            
                                                                                            np.where(df['column']> value, 'Value=True', 'Value=False')
  | 
                                                                                                                        Create array based on new conditions  | 
                                                                                 
                                                                                            
                                                                                            df['column'].apply(lambda/function name))
  | 
                                                                                                                        Apply functions to selection  | 
                                                                                 
                                                                                            
                                                                                            df.rename(columns={'old_name': 'new_name'})
  | 
                                                                                                                        Rename columns  | 
                                                                                 
                                                                                            
                                                                                             Combining data  | 
                                                                                 
                                                                                            
                                                                                            pd.merge(df1,df2, on='common column', how='inner')
 or   pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
  | 
                                                                                                                        Merges two dfs based on a common column. 'inner' requires both df to have all columns filled vs 'outer, 'left' vs 'right' to filter for columns of either df  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        Concatenate (combines all values)  | 
                                                                                 
                                                                                            
                                                                                            df['column'].map(mapping_dict)
  | 
                                                                                                                        Create a dictionary from a list to map keys and values to each other  | 
                                                                                 
                                                                         
                             
    
    
            Group by
        
                        
                            grouped = df.groupby(['Store', 'Product']).agg(Total_Sales=('Sales','sum) Average_Cost=('Cost', 'mean'), Sales_Count=('Sales', 'count')) 
filtered_grouped = grouped[grouped['Total_Sales'] > 500]
 
Aggregations:  sum; mean; median; min; max; count; size; std
  (standard deviation) ; var
  (variance) ; first; last; prod
  or  product; nunique
 (number of unique values)  
                             
    
    
            Pivot
        
                        
                            pivoted_df = df.pivot(index='Date', columns='Variable', values='Value') 
pivot_table_df = df.pivot_table(index='Date', columns='Variable', values='Value', aggfunc='sum') 
    values (optional): Columns whose data will be aggregated. 
    index: Columns used as index. 
    columns (optional): Columns to pivot into new DataFrame's columns. 
    aggfunc: Aggregation function for values. 
pivot_table: aggregates duplicates  
                             
    
    
            Melt
        
                        
                             melted_df = pd.melt(df, id_vars=['id_column'], var_name='variable_name', value_name='value_name')
 
id_vars (optional): Columns to keep unchanged. Otherwise default melted 
value_vars (optional): Columns to melt.  
var_name (optional): Melted column name. Default is 'variable'. 
value_name (optional): Melted values column name. Default is 'value'. 
col_level (optional): Multi-index level  
                             
    
    
            Stack/unstack
        
                        
                            stacked_df = df.stack(level=-1, dropna=True)
 
    Level: The level(s) of the column labels you want to stack. The default is the last level. 
    Dropna: Whether to drop rows in the resulting DataFrame with missing values. Default is True.  
                             
    
    
            Time Operations
        
                        
                                                                                    
                                                                                            pd.to_datetime(df['column'])
  | 
                                                                                                                        Converts a column to datetime format  | 
                                                                                 
                                                                                            
                                                                                             | 
                                                                                                                        Returns the current date and time  | 
                                                                                 
                                                                                            
                                                                                            df['datetime_column'].dt.date/.time
  | 
                                                                                                                        Extracts the date/time from a datetime column  | 
                                                                                 
                                                                                            
                                                                                            datetime.datetime.strptime(date_string, format)
  | 
                                                                                                                        Useful for strings with words and dates, or including hours/minutes, or multiple dates types in one column (requires try except function)  | 
                                                                                 
                                                                         
                             
    
    
            Open pyxl
        
                        
                                                                                    
                                                                                            openpyxl.load_workbook(filename)   workbook.save(filename)
  | 
                                                                                                                        Read/Write  | 
                                                                                 
                                                                                            
                                                                                            workbook[sheetname] or workbook.active
  | 
                                                                                                                        Sheet selection  | 
                                                                                 
                                                                                            
                                                                                             Cell Formats  | 
                                                                                 
                                                                                            
                                                                                            cell.font = Font(size=12)
  | 
                                                                                                                        Font  | 
                                                                                 
                                                                                            
                                                                                            cell.number_format = '0.00%'
  | 
                                                                                                                        Cell formats  | 
                                                                                 
                                                                                            
                                                                                            Alignment(horizontal="center");   Border(left=Side(border_style="thin", color="000000");   PatternFill("solid", fgColor="DDDDDD")
  | 
                                                                                                                        More cell options  | 
                                                                                 
                                                                                            
                                                                                            Comment('Text', 'Author')
  | 
                                                                                                                        Cell Comments, can change comment.width/height too.  | 
                                                                                 
                                                                                            
                                                                                            sheet.unmerge/merge_cells('A1:D1')
  | 
                                                                                                                        Merge/Unmerge  | 
                                                                                 
                                                                                            
                                                                                            cell.hyperlink = 'http://www.example.com'
  | 
                                                                                                                        Hyperlinks  | 
                                                                                 
                                                                                            
                                                                                            ColorScaleRule(start_type="min", start_color="FFFFFF", end_type="max", end_color="FF0000")   ws.conditional_formatting.add("A1:A9", rule)
  | 
                                                                                                                        Conditional Formatting  | 
                                                                                 
                                                                                            
                                                                                            DataValidation(type="list", formula1='"Item1,Item2,Item3"', showDropDown=True)   ws.add_data_validation(dv)
  | 
                                                                                                                        Data validation  | 
                                                                                 
                                                                                            
                                                                                             Charts  | 
                                                                                 
                                                                                            
                                                                                            chart = BarChart(),   chart.style = 13,   sheet.add_chart(chart)
  | 
                                                                                                                        Create, style, then add chart to sheet  | 
                                                                                 
                                                                                            
                                                                                            Image('path/to/image'), sheet.add_image(img, 'A1')
  | 
                                                                                                                        Add images, better for mathplot or other libraries  | 
                                                                                 
                                                                                            
                                                                                            cell.value = '=SUM(A1:A10)'
  | 
                                                                                                                        Write formulas  | 
                                                                                 
                                                                         
                             
                             | 
                                                            
            
Created By
Metadata
Comments
No comments yet. Add yours below!
Add a Comment
Related Cheat Sheets