Show Menu
Cheatography

Pandas Cheat Sheet (DRAFT) by

Important Concepts in Pandas

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Introd­uction to Pandas

Pandas is a powerful open-s­ource data analysis and manipu­lation library for Python.
It provides data structures and functions to effici­ently work with structured data.
Developed by Wes McKinney in 2008, Pandas is widely used in data science, finance, and research.
Key components include Series (1-dim­ens­ional labeled array) and DataFrame (2-dim­ens­ional labeled data struct­ure).
Pandas simplifies data manipu­lation tasks such as cleaning, filtering, grouping, and transf­orming.
It integrates seamlessly with other libraries like NumPy, Matplo­tlib, and Scikit­-learn.
Pandas is built on top of NumPy, leveraging its fast array processing capabi­lities.
Offers intuitive and flexible functi­ona­lities for data explor­ation and analysis.
Ideal for tasks ranging from data cleaning and prepro­cessing to statis­tical analysis and visual­iza­tion.

Indexing and Selecting Data

Use .loc[] for label-­based indexing on rows and columns.
Use .iloc[] for intege­r-based indexing on rows and columns.
Boolean indexing allows selecting data based on condit­ions.
df[col­umn­_name] or df.col­umn­_name selects a single column.
df[[co­lumn1, column2]] selects multiple columns.
.head(n) returns the first n rows of the DataFrame.
.tail(n) returns the last n rows of the DataFrame.
df.at[] and df.iat[] for single value selection based on label or integer.
df.iloc[:, [0, 1]] selects all rows and specific columns.
.query() method for SQL-like queries.
.isin() method for filtering based on multiple values.
Chained indexing should be avoided for assignment (use .loc[] or .iloc[] instead).

Dealing with Outliers

Identify outliers using descri­ptive statistics (mean, median, standard deviation)
Visualize data distri­bution using box plots, histog­rams, or scatter plots
Use domain knowledge to determine if outliers are valid data points or errors
Apply statis­tical methods like Z-score, IQR (Inter­qua­rtile Range) to detect outliers
Consider different strategies for handling outliers:
Removing outliers: Drop outliers from the dataset
Transf­orming data: Apply mathem­atical transf­orm­ations (log, square root) to reduce the impact of outliers
Winsor­iza­tion: Cap or clamp extreme values to a specified percentile
Evaluate the impact of outlier handling on data analysis and modeling
Document the rationale behind outlier treatment for reprod­uci­bility and transp­arency

Data Cleaning

Handling Missing Values:
dropna(): Drops rows or columns with missing values.
fillna(): Fills missing values with specified values.
isna() / notna(): Checks for missing or non-mi­ssing values.
Removing Duplic­ates:
duplic­ated(): Identifies duplicate rows.
drop_d­upl­ica­tes(): Removes duplicate rows.
Data Imputa­tion:
Replace missing values with the mean, median, or mode.
Use interp­olation methods for time series data.
Data Valida­tion:
Validate data types using dtype.
Use regular expres­sions to validate string data.
Data Standa­rdi­zation:
Convert data to a consistent format (e.g., lowerc­ase).
Normalize numeric data to a common scale.
Data Transf­orm­ation:
Convert data types using astype().
Apply custom functions using apply().
Outlier Detection:
Visualize data distri­bution with histograms and box plots.
Use statis­tical methods like z-score or IQR to detect outliers.
Error Correc­tion:
Handle erroneous values based on domain knowledge.
Use external datasets or references for valida­tion.
Handling Incons­istent Data:
Standa­rdize catego­rical data.
Resolve incons­ist­encies in naming conven­tions.
Handling Data Integrity Issues:
Identify and rectify data incons­ist­encies.
Use data profiling tools for anomaly detection.
Error Handling:
Use try-except blocks to handle errors during data proces­sing.
Log errors for debugging and tracking purposes.

Grouping and Aggreg­ating Data

Grouping Data:
Grouping data based on one or more columns using the groupby() function.
Example: df.gro­upb­y('­Col­umn') or df.gro­upb­y([­'Co­lumn1', 'Colum­n2']).
Aggreg­ating Data:
Applying aggregate functions like sum, mean, count, etc., to grouped data.
Example: df.gro­upb­y('­Col­umn­').s­um() or df.gro­upb­y('­Col­umn­').a­gg­({'­Col­umn2': 'mean', 'Column3': 'sum'}).
Common Aggregate Functions:
sum(): Calculates the sum of numeric values.
mean(): Calculates the mean of numeric values.
count(): Counts non-null values.
min(), max(): Finds the minimum or maximum value.
agg(): Allows specifying multiple aggregate functions for different columns.
Custom Aggreg­ation:
Defining custom aggreg­ation functions using agg() or apply().
Example: df.gro­upb­y('­Col­umn­').a­gg­(cu­sto­m_f­unc­tion).
Grouping with Multiple Functions:
Applying multiple aggregate functions simult­ane­ously.
Example: df.gro­upb­y('­Col­umn­').a­gg­(['­mean', 'sum']).
Named Aggreg­ation:
Providing custom names for aggregated columns.
Example: df.gro­upb­y('­Col­umn­').a­gg­(av­g_s­ala­ry=­('S­alary', 'mean'), total_­sal­es=­('S­ales', 'sum')).
Grouping by Time Periods:
Grouping time series data by specific time periods like months or years.
Example: df.gro­upb­y(p­d.G­rou­per­(fr­eq=­'M')).
Grouping with Catego­rical Data:
Grouping based on catego­rical data types.
Example: df.gro­upb­y('­Cat­ego­ry'­).s­um().
Handling Grouped Data:
Accessing grouped data using get_gr­oup() method.
Example: groupe­d.g­et_­gro­up(­'Gr­oup­_Na­me').

Working with Excel Files

Reading Excel Files:
pd.rea­d_e­xcel() function to read Excel files into DataFrame.
Specify sheet name, header, index, and column names.
Writing Excel Files:
DataFr­ame.to­_ex­cel() method to write DataFrame to an Excel file.
Specify sheet name, index, and column names.
Working with Multiple Sheets:
pd.Exc­elF­ile() to work with multiple sheets in a single Excel file.
Read specific sheets using parse() or read_e­xcel().
Handling Excel Format­ting:
Preserve formatting while reading with pd.Exc­elF­ile() and xlrd engine.
Formatting may be lost when writing to Excel.
Excel Data Manipu­lation:
Apply pandas operations (filte­ring, sorting, grouping) to Excel data after reading.
Convert Excel data into pandas DataFrame for manipu­lation and analysis.
Exporting DataFrame to Specific Excel Formats:
Specify Excel file format (xls, xlsx) while writing.
Use approp­riate file extension (.xls or .xlsx) for compat­ibi­lity.
Handling Large Excel Files:
Utilize chunksize parameter when reading large Excel files to load data in manageable chunks.
Process data increm­entally to avoid memory overflow.
Excel File Metadata:
Retrieve Excel file inform­ation (sheet names, data types, etc.) using pandas metadata functions.
Access metadata through pd.Exc­elF­ile() object or DataFrame attrib­utes.
Excel File Valida­tion:
Validate Excel data integrity using pandas functions (e.g., checking for missing values, data types).
Ensure consis­tency between Excel data and expected data types for analysis.
Excel File Perfor­mance Optimi­zation:
Optimize Excel file reading and writing perfor­mance by specifying approp­riate options (e.g., engine, dtype).
Utilize parallel processing or asynch­ronous methods for faster data proces­sing.

Reshaping Data

Pivot Tables
Restru­cturing data using one or more columns as new columns.
Melting
Unpivoting data from wide to long format.
Stacking and Unstacking
Manipu­lating hierar­chical indices.
Reshaping with Hierar­chical Indexing
Restru­cturing data with MultiI­ndex.
Transp­osing Data
Swapping rows and columns.
Merging and Joining DataFrames
Combining data horizo­ntally based on common columns or indices.
Appending DataFrames
Concat­enating data vertic­ally.

Input/­Output

pd.rea­d_csv()
Read CSV files into DataFrame.
pd.rea­d_e­xcel()
Read Excel files into DataFrame.
pd.rea­d_sql()
Read SQL query or database table into DataFrame.
pd.rea­d_j­son()
Read JSON files into DataFrame.
pd.rea­d_h­tml()
Read HTML tables into DataFrame.
pd.rea­d_p­ickle()
Read pickled (seria­lized) objects into DataFrame.
DataFr­ame.to­_csv()
Write DataFrame to a CSV file.
DataFr­ame.to­_ex­cel()
Write DataFrame to an Excel file.
DataFr­ame.to­_sql()
Write DataFrame to a SQL database.
DataFr­ame.to­_json()
Write DataFrame to a JSON file.
DataFr­ame.to­_html()
Write DataFrame to an HTML file.
DataFr­ame.to­_pi­ckle()
Write DataFrame to a pickled (seria­lized) object file.

Perfor­mance Optimi­zation

Use Vectorized Operations
Avoid looping through DataFrame rows; instead, utilize Pandas' built-in vectorized operations for faster comput­ations.
Optimize Memory Usage
Convert data types to more memory­-ef­ficient ones (e.g., using int8 instead of int64 for smaller integers).
Leverage Caching
Utilize caching mechanisms like df.eval() and df.query() for repetitive comput­ations on large datasets to improve perfor­mance.
Use DataFr­ame.ap­ply() with caution
It can be slow; explore altern­atives like DataFr­ame.tr­ans­form() or vectorized operations whenever possible.
Pandas Built-in Methods
Utilize built-in Pandas methods that are optimized for perfor­mance (e.g., df.gro­upb­y().agg() instead of custom aggreg­ation functi­ons).
Chunking
When working with large datasets, process data in smaller, manageable chunks to avoid memory errors and improve perfor­mance.
Parall­eli­zation
Use libraries like Dask or Modin to parall­elize Pandas operations across multiple cores for faster execution.
Profile and Benchmark
Identify bottle­necks in your code using tools like pandas­_pr­ofiling or Python's built-in cProfile module, and optimize accord­ingly.
Avoid Method Chaining
While method chaining can make code concise, it can also hinder perfor­mance; consider breaking chains into separate statements for better perfor­mance.
Pandas Built-in I/O
Use Pandas' optimized file I/O methods (e.g., pd.rea­d_csv() with approp­riate parame­ters) to effici­ently read and write data from various sources.

Advanced Indexing

MultiI­ndexing
Creating hierar­chical indexes with multiple levels.
Accessing and manipu­lating data with MultiI­ndexes.
Hierar­chical Indexing:
Unders­tanding hierar­chical indexes.
Using hierar­chical indexes for advanced data organi­zation and analysis.
Indexing with Boolean Masks:
Using boolean arrays to filter data.
Applying boolean masks for advanced data selection.
Indexing with .loc and .iloc:
Utilizing .loc for label-­based indexing.
Utilizing .iloc for intege­r-based indexing.
Setting and Resetting Index:
Setting new indexes for DataFr­ames.
Resetting indexes to default integer index.
Indexing Perfor­mance Optimi­zation:
Techniques for optimizing indexing perfor­mance.
Avoiding common pitfalls for efficient indexing.
 

Tips and Tricks for Efficient Pandas Usage

Use Vectorized Operations
Utilize built-in functions and operations for faster comput­ation
Avoid Iteration over Rows
Use apply() with vectorized functions instead of looping through rows.
Use Method Chaining
Combine multiple operations in a single statement for cleaner code.
Optimize Memory Usage
Convert data types to approp­riate ones (int64 to int32, etc.) to reduce memory usage.
Utilize Pandas Built-in Functions:
Explore and leverage the extensive set of built-in functions for common tasks.
Explore Pandas Docume­ntation
Refer to the official docume­ntation for detailed explan­ations and examples.
Profile Code
Use profiling tools like cProfile to identify bottle­necks and optimize perfor­mance.
Leverage Cython and Numba
For comput­ati­onally intensive tasks, consider using Cython or Numba to speed up operat­ions.
Parall­elize Operations
Utilize parallel processing with libraries like Dask or Modin for large datasets.
Keep Code Readable
Prioritize readab­ility and mainta­ina­bility while optimizing perfor­mance.

Working with JSON and XML Data

Reading JSON Data:
pd.rea­d_j­son() to read JSON files into a DataFrame.
Specify orient parameter for different JSON structures ('reco­rds', 'split', 'index', 'colum­ns').
Writing JSON Data:
to_json() method to convert DataFrame to JSON format.
Specify orient parameter for desired JSON structure.
Reading XML Data:
Use xml.et­ree.El­eme­ntTree or lxml library to parse XML data.
Convert XML structure to DataFrame manually.
Writing XML Data:
No direct method in Pandas for writing XML.
Convert DataFrame to XML using libraries like xml.et­ree.El­eme­ntTree or lxml.
Handling Nested JSON/XML:
Use normal­ization techniques like pd.jso­n_n­orm­alize() to handle nested JSON struct­ures.
For XML, flatten the hierar­chical structure manually or use approp­riate libraries.
Working with APIs:
Retrieve JSON data from APIs using libraries like requests.
Convert JSON responses to DataFrame for analysis.
Perfor­mance Consid­era­tions:
JSON and XML parsing can be slower compared to other formats like CSV.
Optimize parsing methods for large datasets to improve perfor­mance.

Working with Text Data

Pandas provides powerful tools for working with text data within Series and DataFrame objects.
str accessor allows accessing string methods for Series containing strings.
Common string methods include lower(), upper(), strip(), split(), replace(), etc.
contains() method checks if a pattern or substring exists in each element of a Series.
extract() method extracts substrings using regular expres­sions.
split() method splits strings into lists of substrings based on a delimiter.
join() method joins lists of strings into a single string with a specified delimiter.
get_du­mmies() method creates dummy variables for catego­rical text data.
replace() method replaces values based on a mapping or regular expres­sion.
find() method finds the first occurrence of a substring in each element of a Series.
count() method counts occurr­ences of a substring in each element of a Series.
starts­with() and endswith() methods check if each element in a Series starts or ends with a specified substring.

Handling Catego­rical Data

Convert catego­rical data to numerical repres­ent­ation using pd.fac­tor­ize() or pd.get­_du­mmies()
Utilize astype() method to convert catego­rical data to catego­rical dtype
Handle ordinal data using Catego­rical dtype with specified order
Use pd.cut() for binning numerical data into discrete intervals
Employ pd.qcut() for quanti­le-­based discre­tiz­ation
Encode catego­rical variables using LabelE­ncoder or OneHot­Encoder from sklear­n.p­rep­roc­essing
Handle high cardin­ality catego­rical data using techniques like frequency encoding or target encoding
Use pd.Cat­ego­rical() to create catego­rical data with custom categories and ordering

Visual­ization with Pandas

Plotting Functions:
Pandas provides easy-t­o-use plotting functions that leverage Matplotlib under the hood. Use .plot() method on Series or DataFrame to create various types of plots like line, bar, histogram, scatter, etc.
Custom­iza­tion:
You can customize plots by passing parameters to the plotting functions such as title, labels, colors, styles, etc. Additi­onally, you can directly use Matplotlib functions to fine-tune your plots further.
Subplots:
Pandas supports creating subplots from DataFrame or Series. Simply call .plot() on different columns or subsets of data to create multiple plots in the same figure.
Intera­ctive Plots:
Pandas supports integr­ation with libraries like Plotly and Bokeh for creating intera­ctive plots. Simply install these libraries and Pandas will use them to generate intera­ctive visual­iza­tions.
Time Series Plotting:
Pandas makes it easy to plot time series data with intell­igent date formatting and labeling. Use .plot() with time-i­ndexed data to create inform­ative time series plots.
Seaborn Integr­ation:
Seaborn, a statis­tical data visual­ization library, integrates seamlessly with Pandas. You can use Seaborn functions directly on Pandas objects to create more complex and visually appealing plots.

Time Series Data

Introd­uction:
Time series data is sequential data indexed by timest­amps.
Pandas provides robust tools for working with time series data effici­ently.
Date-Time Index:
Pandas offers specia­lized data structures like Dateti­meIndex to handle time series indexing.
Convert date strings to Dateti­meIndex using pd.to_­dat­eti­me().
Resampling and Frequency Conver­sion:
Adjust time series data to different freque­ncies using resamp­le().
Aggreg­ating or downsa­mpling time series data to a lower frequency or upsampling to a higher frequency.
Time Shifting:
Shift index by a specified number of periods with shift().
Useful for calcul­ating differ­ences over time or shifting data for alignment.
Rolling and Expanding Windows:
Compute rolling statistics (mean, sum, etc.) over a specified window with rolling().
Calculate expanding statistics over the entire history of a time series with expand­ing().
Time Zone Handling:
Localize timestamps to a specific time zone using tz_loc­ali­ze().
Convert timestamps between time zones with tz_con­vert().
Offset Aliases:
Use offset aliases like 'D' for day, 'M' for month, 'Y' for year to perform frequency conver­sions easily.
Time Series Plotting:
Pandas provides convenient methods for plotting time series data directly from DataFr­ames.
Use plot() function with a datetime index for quick visual­iza­tion.
Date Range Genera­tion:
Generate date ranges using date_r­ange() for easy creation of time series indices.
Specify start date, end date, frequency, and time zone parame­ters.
Time Series Analysis:
Perform time series analysis including trend analysis, season­ality detection, and foreca­sting using Pandas in conjun­ction with other libraries like Statsm­odels.

Merging and Joining DataFrames

Concat­enation
Combining DataFrames along rows or columns.
Merge
Combining DataFrames based on common columns using SQL-like joins such as inner, outer, left, and right joins.
Join
Convenient method for merging DataFrames based on index labels.
Handling Duplicate Columns
Dealing with duplicate column names when merging DataFr­ames.
Suffixes
Specifying suffixes for overla­pping column names in the merged DataFrame.
Merging on Index
Merging DataFrames based on their index values.
Joining on Index
Joining DataFrames based on their index labels.
Concat­enating DataFrames
Combining multiple DataFrames along rows or columns using the pd.con­cat() function.
Merging with Different Join Types
Utilizing different types of joins (inner, outer, left, right) to merge DataFrames using the pd.merge() function.
Joining on Index
Merging DataFrames based on their index labels using the .join() method.
Handling Overla­pping Column Names
Managing duplicate or overla­pping column names during merging.
Merging on Multiple Columns
Performing merges based on multiple columns in the DataFr­ames.
Suffixes
Specifying suffixes for overla­pping column names to distin­guish them in the merged DataFrame.
Merging on Index
Merging DataFrames based on their index values using the .merge() method with the 'left_­index' and 'right­_index' parame­ters.
Joining on Index
Joining DataFrames based on their index labels using the .join() method.
Handling Overla­pping Column Names
Managing duplicate or overla­pping column names during merging.
Merging on Multiple Columns
Performing merges based on multiple columns in the DataFr­ames.

Data Transf­orm­ation

Applying Functions
Use .apply() to apply a function along an axis of the DataFrame or Series.
Mapping
Transform values in a Series or DataFrame using a mapping or a function.
Replacing Values
Replace specific values in a DataFrame or Series with other values.
Dropping Columns or Rows
Use .drop() to remove specified rows or columns from a DataFrame.
Adding­/Re­moving Columns
Add or remove columns from a DataFrame using assignment or the .drop() method.
Renaming Columns
Rename columns in a DataFrame using the .rename() method.
Duplic­ating Data
Create copies of data using the .copy() method.
Changing Data Types
Convert data types of columns using the .astype() method.
Discre­tiz­ation and Binning
Convert continuous data into discrete intervals using the .cut() function.
Encoding Catego­rical Variables
Convert catego­rical variables into numerical repres­ent­ations using techniques like one-hot encoding or label encoding.
Normal­ization and Standa­rdi­zation
Scale numeric data to a standard range or distri­bution.
Mergin­g/C­onc­ate­nating DataFrames
Combine multiple DataFrames either by concat­enating or merging based on common columns or indices.

Basic Operations

Slicing
Selecting subsets of data using row and column labels or positions.
Filtering
Applying conditions to extract specific rows or columns from a DataFrame.
Sorting
Arranging data in ascending or descending order based on one or more columns.
Applying Functions
Applying functions elemen­t-wise to data, either built-in or custom functions.
Descri­ptive Statistics
Calcul­ating basic statis­tical measures like mean, median, mode, etc., for data explor­ation.
Data Alignment
Automa­tically aligning data based on row and column labels when performing operations between different DataFrames or Series.
Elemen­t-wise Operations
Performing operations like addition, subtra­ction, multip­lic­ation, and division on individual elements of a DataFrame or Series.
Aggreg­ating Data
Computing summary statistics like sum, mean, count, etc., over specified axes of the data.
Filling Missing Values
Handling missing or NaN values by filling them with a specified value or using methods like forwar­d-fill or backwa­rd-­fill.
Applying Condit­ional Logic
Using conditions to assign values or modify data based on certain criteria.

Data Structures

Series
One-di­men­sional labeled array that can hold any data type.
DataFrame
Two-di­men­sional labeled data structure with columns of potent­ially different types, akin to a spread­sheet or SQL table.
Indexing and Selecting Data
Techniques for accessing specific elements, rows, or columns within Series or DataFrame.
Basic Operations
Fundam­ental operations such as slicing, filtering, and sorting data for effective manipu­lation.
Data Cleaning
Strategies for handling missing values, duplic­ates, and other incons­ist­encies within the data.
Data Transf­orm­ation
Methods for applying functions, mapping values, and transf­orming data for analysis.
Grouping and Aggreg­ating Data
Techniques for grouping data based on specified criteria and performing aggreg­ations like sum, mean, count, etc.
Merging and Joining DataFrames
Methods for combining multiple DataFrames based on common columns or indices.
Reshaping Data
Tools for reshaping data using pivot tables, melting, and other techniques to suit analytical needs.
Time Series Data
Handling and analyzing time-based data using pandas' specia­lized functi­ona­lities.