Pandas Cheat Sheet

Introduction to Pandas

Pandas is a powerful open-source data analysis and manipulation library for Python.

It provides data structures and functions to efficiently work with structured data.

Developed by Wes McKinney in 2008, Pandas is widely used in data science, finance, and research.

Key components include Series (1-dimensional labeled array) and DataFrame (2-dimensional labeled data structure).

Pandas simplifies data manipulation tasks such as cleaning, filtering, grouping, and transforming.

It integrates seamlessly with other libraries like NumPy, Matplotlib, and Scikit-learn.

Pandas is built on top of NumPy, leveraging its fast array processing capabilities.

Offers intuitive and flexible functionalities for data exploration and analysis.

Ideal for tasks ranging from data cleaning and preprocessing to statistical analysis and visualization.

Indexing and Selecting Data

Use .loc[] for label-based indexing on rows and columns.

Use .iloc[] for integer-based indexing on rows and columns.

Boolean indexing allows selecting data based on conditions.

df[column_name] or df.column_name selects a single column.

df[[column1, column2]] selects multiple columns.

.head(n) returns the first n rows of the DataFrame.

.tail(n) returns the last n rows of the DataFrame.

df.at[] and df.iat[] for single value selection based on label or integer.

df.iloc[:, [0, 1]] selects all rows and specific columns.

.query() method for SQL-like queries.

.isin() method for filtering based on multiple values.

Chained indexing should be avoided for assignment (use .loc[] or .iloc[] instead).

Dealing with Outliers

Identify outliers using descriptive statistics (mean, median, standard deviation)

Visualize data distribution using box plots, histograms, or scatter plots

Use domain knowledge to determine if outliers are valid data points or errors

Apply statistical methods like Z-score, IQR (Interquartile Range) to detect outliers

Consider different strategies for handling outliers:

Removing outliers: Drop outliers from the dataset

Transforming data: Apply mathematical transformations (log, square root) to reduce the impact of outliers

Winsorization: Cap or clamp extreme values to a specified percentile

Evaluate the impact of outlier handling on data analysis and modeling

Document the rationale behind outlier treatment for reproducibility and transparency

Data Cleaning

Handling Missing Values:

dropna(): Drops rows or columns with missing values.

fillna(): Fills missing values with specified values.

isna() / notna(): Checks for missing or non-missing values.

Removing Duplicates:

duplicated(): Identifies duplicate rows.

drop_duplicates(): Removes duplicate rows.

Data Imputation:

Replace missing values with the mean, median, or mode.

Use interpolation methods for time series data.

Data Validation:

Validate data types using dtype.

Use regular expressions to validate string data.

Data Standardization:

Convert data to a consistent format (e.g., lowercase).

Normalize numeric data to a common scale.

Data Transformation:

Convert data types using astype().

Apply custom functions using apply().

Outlier Detection:

Visualize data distribution with histograms and box plots.

Use statistical methods like z-score or IQR to detect outliers.

Error Correction:

Handle erroneous values based on domain knowledge.

Use external datasets or references for validation.

Handling Inconsistent Data:

Standardize categorical data.

Resolve inconsistencies in naming conventions.

Handling Data Integrity Issues:

Identify and rectify data inconsistencies.

Use data profiling tools for anomaly detection.

Error Handling:

Use try-except blocks to handle errors during data processing.

Log errors for debugging and tracking purposes.

Grouping and Aggregating Data

Grouping Data:

Grouping data based on one or more columns using the groupby() function.

Example: df.groupby('Column') or df.groupby(['Column1', 'Column2']).

Aggregating Data:

Applying aggregate functions like sum, mean, count, etc., to grouped data.

Example: df.groupby('Column').sum() or df.groupby('Column').agg({'Column2': 'mean', 'Column3': 'sum'}).

Common Aggregate Functions:

sum(): Calculates the sum of numeric values.

mean(): Calculates the mean of numeric values.

count(): Counts non-null values.

min(), max(): Finds the minimum or maximum value.

agg(): Allows specifying multiple aggregate functions for different columns.

Custom Aggregation:

Defining custom aggregation functions using agg() or apply().

Example: df.groupby('Column').agg(custom_function).

Grouping with Multiple Functions:

Applying multiple aggregate functions simultaneously.

Example: df.groupby('Column').agg(['mean', 'sum']).

Named Aggregation:

Providing custom names for aggregated columns.

Example: df.groupby('Column').agg(avg_salary=('Salary', 'mean'), total_sales=('Sales', 'sum')).

Grouping by Time Periods:

Grouping time series data by specific time periods like months or years.

Example: df.groupby(pd.Grouper(freq='M')).

Grouping with Categorical Data:

Grouping based on categorical data types.

Example: df.groupby('Category').sum().

Handling Grouped Data:

Accessing grouped data using get_group() method.

Example: grouped.get_group('Group_Name').

Working with Excel Files

Reading Excel Files:

pd.read_excel() function to read Excel files into DataFrame.

Specify sheet name, header, index, and column names.

Writing Excel Files:

DataFrame.to_excel() method to write DataFrame to an Excel file.

Specify sheet name, index, and column names.

Working with Multiple Sheets:

pd.ExcelFile() to work with multiple sheets in a single Excel file.

Read specific sheets using parse() or read_excel().

Handling Excel Formatting:

Preserve formatting while reading with pd.ExcelFile() and xlrd engine.

Formatting may be lost when writing to Excel.

Excel Data Manipulation:

Apply pandas operations (filtering, sorting, grouping) to Excel data after reading.

Convert Excel data into pandas DataFrame for manipulation and analysis.

Exporting DataFrame to Specific Excel Formats:

Specify Excel file format (xls, xlsx) while writing.

Use appropriate file extension (.xls or .xlsx) for compatibility.

Handling Large Excel Files:

Utilize chunksize parameter when reading large Excel files to load data in manageable chunks.

Process data incrementally to avoid memory overflow.

Excel File Metadata:

Retrieve Excel file information (sheet names, data types, etc.) using pandas metadata functions.

Access metadata through pd.ExcelFile() object or DataFrame attributes.

Excel File Validation:

Validate Excel data integrity using pandas functions (e.g., checking for missing values, data types).

Ensure consistency between Excel data and expected data types for analysis.

Excel File Performance Optimization:

Optimize Excel file reading and writing performance by specifying appropriate options (e.g., engine, dtype).

Utilize parallel processing or asynchronous methods for faster data processing.

Reshaping Data

Pivot Tables	Restructuring data using one or more columns as new columns.
Melting	Unpivoting data from wide to long format.
Stacking and Unstacking	Manipulating hierarchical indices.
Reshaping with Hierarchical Indexing	Restructuring data with MultiIndex.
Transposing Data	Swapping rows and columns.
Merging and Joining DataFrames	Combining data horizontally based on common columns or indices.
Appending DataFrames	Concatenating data vertically.

Input/Output

pd.read_csv()	Read CSV files into DataFrame.
pd.read_excel()	Read Excel files into DataFrame.
pd.read_sql()	Read SQL query or database table into DataFrame.
pd.read_json()	Read JSON files into DataFrame.
pd.read_html()	Read HTML tables into DataFrame.
pd.read_pickle()	Read pickled (serialized) objects into DataFrame.
DataFrame.to_csv()	Write DataFrame to a CSV file.
DataFrame.to_excel()	Write DataFrame to an Excel file.
DataFrame.to_sql()	Write DataFrame to a SQL database.
DataFrame.to_json()	Write DataFrame to a JSON file.
DataFrame.to_html()	Write DataFrame to an HTML file.
DataFrame.to_pickle()	Write DataFrame to a pickled (serialized) object file.

Performance Optimization

Use Vectorized Operations	Avoid looping through DataFrame rows; instead, utilize Pandas' built-in vectorized operations for faster computations.
Optimize Memory Usage	Convert data types to more memory-efficient ones (e.g., using int8 instead of int64 for smaller integers).
Leverage Caching	Utilize caching mechanisms like df.eval() and df.query() for repetitive computations on large datasets to improve performance.
Use DataFrame.apply() with caution	It can be slow; explore alternatives like DataFrame.transform() or vectorized operations whenever possible.
Pandas Built-in Methods	Utilize built-in Pandas methods that are optimized for performance (e.g., df.groupby().agg() instead of custom aggregation functions).
Chunking	When working with large datasets, process data in smaller, manageable chunks to avoid memory errors and improve performance.
Parallelization	Use libraries like Dask or Modin to parallelize Pandas operations across multiple cores for faster execution.
Profile and Benchmark	Identify bottlenecks in your code using tools like pandas_profiling or Python's built-in cProfile module, and optimize accordingly.
Avoid Method Chaining	While method chaining can make code concise, it can also hinder performance; consider breaking chains into separate statements for better performance.
Pandas Built-in I/O	Use Pandas' optimized file I/O methods (e.g., pd.read_csv() with appropriate parameters) to efficiently read and write data from various sources.

Advanced Indexing

MultiIndexing

Creating hierarchical indexes with multiple levels.

Accessing and manipulating data with MultiIndexes.

Hierarchical Indexing:

Understanding hierarchical indexes.

Using hierarchical indexes for advanced data organization and analysis.

Indexing with Boolean Masks:

Using boolean arrays to filter data.

Applying boolean masks for advanced data selection.

Indexing with .loc and .iloc:

Utilizing .loc for label-based indexing.

Utilizing .iloc for integer-based indexing.

Setting and Resetting Index:

Setting new indexes for DataFrames.

Resetting indexes to default integer index.

Indexing Performance Optimization:

Techniques for optimizing indexing performance.

Avoiding common pitfalls for efficient indexing.

Tips and Tricks for Efficient Pandas Usage

Use Vectorized Operations	Utilize built-in functions and operations for faster computation
Avoid Iteration over Rows	Use apply() with vectorized functions instead of looping through rows.
Use Method Chaining	Combine multiple operations in a single statement for cleaner code.
Optimize Memory Usage	Convert data types to appropriate ones (int64 to int32, etc.) to reduce memory usage.
Utilize Pandas Built-in Functions:	Explore and leverage the extensive set of built-in functions for common tasks.
Explore Pandas Documentation	Refer to the official documentation for detailed explanations and examples.
Profile Code	Use profiling tools like cProfile to identify bottlenecks and optimize performance.
Leverage Cython and Numba	For computationally intensive tasks, consider using Cython or Numba to speed up operations.
Parallelize Operations	Utilize parallel processing with libraries like Dask or Modin for large datasets.
Keep Code Readable	Prioritize readability and maintainability while optimizing performance.

Working with JSON and XML Data

Reading JSON Data:

pd.read_json() to read JSON files into a DataFrame.

Specify orient parameter for different JSON structures ('records', 'split', 'index', 'columns').

Writing JSON Data:

to_json() method to convert DataFrame to JSON format.

Specify orient parameter for desired JSON structure.

Reading XML Data:

Use xml.etree.ElementTree or lxml library to parse XML data.

Convert XML structure to DataFrame manually.

Writing XML Data:

No direct method in Pandas for writing XML.

Convert DataFrame to XML using libraries like xml.etree.ElementTree or lxml.

Handling Nested JSON/XML:

Use normalization techniques like pd.json_normalize() to handle nested JSON structures.

For XML, flatten the hierarchical structure manually or use appropriate libraries.

Working with APIs:

Retrieve JSON data from APIs using libraries like requests.

Convert JSON responses to DataFrame for analysis.

Performance Considerations:

JSON and XML parsing can be slower compared to other formats like CSV.

Optimize parsing methods for large datasets to improve performance.

Working with Text Data

Pandas provides powerful tools for working with text data within Series and DataFrame objects.

str accessor allows accessing string methods for Series containing strings.

Common string methods include lower(), upper(), strip(), split(), replace(), etc.

contains() method checks if a pattern or substring exists in each element of a Series.

extract() method extracts substrings using regular expressions.

split() method splits strings into lists of substrings based on a delimiter.

join() method joins lists of strings into a single string with a specified delimiter.

get_dummies() method creates dummy variables for categorical text data.

replace() method replaces values based on a mapping or regular expression.

find() method finds the first occurrence of a substring in each element of a Series.

count() method counts occurrences of a substring in each element of a Series.

startswith() and endswith() methods check if each element in a Series starts or ends with a specified substring.

Handling Categorical Data

Convert categorical data to numerical representation using pd.factorize() or pd.get_dummies()

Utilize astype() method to convert categorical data to categorical dtype

Handle ordinal data using Categorical dtype with specified order

Use pd.cut() for binning numerical data into discrete intervals

Employ pd.qcut() for quantile-based discretization

Encode categorical variables using LabelEncoder or OneHotEncoder from sklearn.preprocessing

Handle high cardinality categorical data using techniques like frequency encoding or target encoding

Use pd.Categorical() to create categorical data with custom categories and ordering

Visualization with Pandas

Plotting Functions:	Pandas provides easy-to-use plotting functions that leverage Matplotlib under the hood. Use .plot() method on Series or DataFrame to create various types of plots like line, bar, histogram, scatter, etc.
Customization:	You can customize plots by passing parameters to the plotting functions such as title, labels, colors, styles, etc. Additionally, you can directly use Matplotlib functions to fine-tune your plots further.
Subplots:	Pandas supports creating subplots from DataFrame or Series. Simply call .plot() on different columns or subsets of data to create multiple plots in the same figure.
Interactive Plots:	Pandas supports integration with libraries like Plotly and Bokeh for creating interactive plots. Simply install these libraries and Pandas will use them to generate interactive visualizations.
Time Series Plotting:	Pandas makes it easy to plot time series data with intelligent date formatting and labeling. Use .plot() with time-indexed data to create informative time series plots.
Seaborn Integration:	Seaborn, a statistical data visualization library, integrates seamlessly with Pandas. You can use Seaborn functions directly on Pandas objects to create more complex and visually appealing plots.

Time Series Data

Introduction:

Time series data is sequential data indexed by timestamps.

Pandas provides robust tools for working with time series data efficiently.

Date-Time Index:

Pandas offers specialized data structures like DatetimeIndex to handle time series indexing.

Convert date strings to DatetimeIndex using pd.to_datetime().

Resampling and Frequency Conversion:

Adjust time series data to different frequencies using resample().

Aggregating or downsampling time series data to a lower frequency or upsampling to a higher frequency.

Time Shifting:

Shift index by a specified number of periods with shift().

Useful for calculating differences over time or shifting data for alignment.

Rolling and Expanding Windows:

Compute rolling statistics (mean, sum, etc.) over a specified window with rolling().

Calculate expanding statistics over the entire history of a time series with expanding().

Time Zone Handling:

Localize timestamps to a specific time zone using tz_localize().

Convert timestamps between time zones with tz_convert().

Offset Aliases:

Use offset aliases like 'D' for day, 'M' for month, 'Y' for year to perform frequency conversions easily.

Time Series Plotting:

Pandas provides convenient methods for plotting time series data directly from DataFrames.

Use plot() function with a datetime index for quick visualization.

Date Range Generation:

Generate date ranges using date_range() for easy creation of time series indices.

Specify start date, end date, frequency, and time zone parameters.

Time Series Analysis:

Perform time series analysis including trend analysis, seasonality detection, and forecasting using Pandas in conjunction with other libraries like Statsmodels.

Merging and Joining DataFrames

Concatenation	Combining DataFrames along rows or columns.
Merge	Combining DataFrames based on common columns using SQL-like joins such as inner, outer, left, and right joins.
Join	Convenient method for merging DataFrames based on index labels.
Handling Duplicate Columns	Dealing with duplicate column names when merging DataFrames.
Suffixes	Specifying suffixes for overlapping column names in the merged DataFrame.
Merging on Index	Merging DataFrames based on their index values.
Joining on Index	Joining DataFrames based on their index labels.
Concatenating DataFrames	Combining multiple DataFrames along rows or columns using the pd.concat() function.
Merging with Different Join Types	Utilizing different types of joins (inner, outer, left, right) to merge DataFrames using the pd.merge() function.
Joining on Index	Merging DataFrames based on their index labels using the .join() method.
Handling Overlapping Column Names	Managing duplicate or overlapping column names during merging.
Merging on Multiple Columns	Performing merges based on multiple columns in the DataFrames.
Suffixes	Specifying suffixes for overlapping column names to distinguish them in the merged DataFrame.
Merging on Index	Merging DataFrames based on their index values using the .merge() method with the 'left_index' and 'right_index' parameters.
Joining on Index	Joining DataFrames based on their index labels using the .join() method.
Handling Overlapping Column Names	Managing duplicate or overlapping column names during merging.
Merging on Multiple Columns	Performing merges based on multiple columns in the DataFrames.

Data Transformation

Applying Functions	Use .apply() to apply a function along an axis of the DataFrame or Series.
Mapping	Transform values in a Series or DataFrame using a mapping or a function.
Replacing Values	Replace specific values in a DataFrame or Series with other values.
Dropping Columns or Rows	Use .drop() to remove specified rows or columns from a DataFrame.
Adding/Removing Columns	Add or remove columns from a DataFrame using assignment or the .drop() method.
Renaming Columns	Rename columns in a DataFrame using the .rename() method.
Duplicating Data	Create copies of data using the .copy() method.
Changing Data Types	Convert data types of columns using the .astype() method.
Discretization and Binning	Convert continuous data into discrete intervals using the .cut() function.
Encoding Categorical Variables	Convert categorical variables into numerical representations using techniques like one-hot encoding or label encoding.
Normalization and Standardization	Scale numeric data to a standard range or distribution.
Merging/Concatenating DataFrames	Combine multiple DataFrames either by concatenating or merging based on common columns or indices.

Basic Operations

Slicing	Selecting subsets of data using row and column labels or positions.
Filtering	Applying conditions to extract specific rows or columns from a DataFrame.
Sorting	Arranging data in ascending or descending order based on one or more columns.
Applying Functions	Applying functions element-wise to data, either built-in or custom functions.
Descriptive Statistics	Calculating basic statistical measures like mean, median, mode, etc., for data exploration.
Data Alignment	Automatically aligning data based on row and column labels when performing operations between different DataFrames or Series.
Element-wise Operations	Performing operations like addition, subtraction, multiplication, and division on individual elements of a DataFrame or Series.
Aggregating Data	Computing summary statistics like sum, mean, count, etc., over specified axes of the data.
Filling Missing Values	Handling missing or NaN values by filling them with a specified value or using methods like forward-fill or backward-fill.
Applying Conditional Logic	Using conditions to assign values or modify data based on certain criteria.

Data Structures

Series	One-dimensional labeled array that can hold any data type.
DataFrame	Two-dimensional labeled data structure with columns of potentially different types, akin to a spreadsheet or SQL table.
Indexing and Selecting Data	Techniques for accessing specific elements, rows, or columns within Series or DataFrame.
Basic Operations	Fundamental operations such as slicing, filtering, and sorting data for effective manipulation.
Data Cleaning	Strategies for handling missing values, duplicates, and other inconsistencies within the data.
Data Transformation	Methods for applying functions, mapping values, and transforming data for analysis.
Grouping and Aggregating Data	Techniques for grouping data based on specified criteria and performing aggregations like sum, mean, count, etc.
Merging and Joining DataFrames	Methods for combining multiple DataFrames based on common columns or indices.
Reshaping Data	Tools for reshaping data using pivot tables, melting, and other techniques to suit analytical needs.
Time Series Data	Handling and analyzing time-based data using pandas' specialized functionalities.

Pandas Cheat Sheet (DRAFT) by Arshdeep

Introduction to Pandas

Indexing and Selecting Data

Dealing with Outliers

Data Cleaning

Grouping and Aggregating Data

Working with Excel Files

Reshaping Data

Input/Output

Performance Optimization

Advanced Indexing

Tips and Tricks for Efficient Pandas Usage

Working with JSON and XML Data

Working with Text Data

Handling Categorical Data

Visualization with Pandas

Time Series Data

Merging and Joining DataFrames

Data Transformation

Basic Operations

Data Structures

Latest Cheat Sheet

Random Cheat Sheet

About Cheatography

Behind the Scenes

Recent Cheat Sheet Activity

Please Disable Your Ad Blocker

Pandas Cheat Sheet (DRAFT) by Arshdeep

Introd­uction to Pandas

Indexing and Selecting Data

Dealing with Outliers

Data Cleaning

Grouping and Aggreg­ating Data

Working with Excel Files

Reshaping Data

Input/­Output

Perfor­mance Optimi­zation

Advanced Indexing

Tips and Tricks for Efficient Pandas Usage

Working with JSON and XML Data

Working with Text Data

Handling Catego­rical Data

Visual­ization with Pandas

Time Series Data

Merging and Joining DataFrames

Data Transf­orm­ation

Basic Operations

Data Structures

Latest Cheat Sheet

Random Cheat Sheet

About Cheatography

Behind the Scenes

Recent Cheat Sheet Activity

Please Disable Your Ad Blocker

Introduction to Pandas

Grouping and Aggregating Data

Input/Output

Performance Optimization

Handling Categorical Data

Visualization with Pandas

Data Transformation