Show Menu

CodingJinxx Pandas Facts Cheat Sheet by

Pandas Cheat Sheet containing most important functions covered in my AI Class

Pandas Datast­ruc­tures

The primary two components of pandas are the Series and DataFrame.

Dataframes and Series

A Series is One Dimens­ional Data
A Dataframe is N-Dime­nsional Data

Creating a Dataframe from Scratch

data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
purchases = pd.DataFrame(data)
Turns a Dictionary into a Dataframe

Adding an Index to a Dataframe

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
A Dataframe is a (key, value) based Data Structure

Dataframe with Index Set

Unders­tanding your Data


# Describing individual columns



Generates a matrix describing correl­ation between data

Reading in Data

It’s quite simple to load data from various file formats into a DataFrame. In the following examples we’ll keep using our apples and oranges data, but this time it’s coming from various files.

Reading CSV

df = pd.read_csv('purchases.csv')

# Or with setting the index column, needs to be set with CSV

df = pd.read_csv('purchases.csv', index_col=0)

Reading JSON

# The index is automatically set with json
df = pd.read_json('purchases.json')

Reading in SQLite

import sqlite3
con = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM purchases", con)

Setting index in post

df = df.set_index('index')

Writing Data

df.to_sql('new_purchases', con)
con being a sql connection

When we save JSON and CSV files, all we have to input into those functions is our desired filename with the approp­riate file extension. With SQL, we’re not creating a new file but instead inserting a new table into the database using our con variable from before.

Dataframe Slicing Extracting Selecting

Up until now we’ve focused on some basic summaries of our data. We’ve learned about simple column extraction using single brackets, and we imputed null values in a column using fillna(). Below are the other methods of slicing, selecting, and extracting you’ll need to use consta­ntly.
It’s important to note that, although many methods are the same, DataFrames and Series have different attrib­utes, so you’ll need be sure to know which type you are working with or else you will receive attribute errors.
Let’s look at working with columns first.

By Columns

genre_col = movies_df['genre']
genre_col = movies_df[['genre']]
subset = movies_df[['genre', 'rating']]
To extract data as Dataframes from a dataframe pass a list key

By Rows

prom = movies_df.loc["Prometheus"]
prom = movies_df.iloc[1]

movie_subset = movies_df.loc['Prometheus':'Sing']
movie_subset = movies_df.iloc[1:4]
loc and iloc can be thought of as similar to Python list slicing. To show this even further, let’s select multiple rows.

Condit­ional Selections

condition = (movies_df['director'] == "Ridley Scott")

movies_df[movies_df['director'] == "Ridley Scott"].head() # Returns a Dataframe where Director is Ridley Scott

movies_df[movies_df['rating'] >= 8.6].head(3)

movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()

# OR Simply

movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()

    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
Similar to isnull(), this returns a Series of True and False values: True for films directed by Ridley Scott and False for ones not directed by him.

Applying Functions to Data

def rating_function(x):
    if x >= 8.0:
        return "good"
        return "bad"

movies_df["rating_category"] = movies_df["rating"].apply(rating_function)

Viewing Data

movies_df.head() # Print first 5 rows
movies_df.head(10) # Print first 10 rows

movies_df.tail() # Print last 5 Rows
movies_df.tail(2) # Print last 2 rows
Previews start or ends of dataframes

Getting metadata from a dataframe # Output info regarding datatypes
movies_df.shape # Outputs number of rows and columns

Dataframe Info

Cleaning Data

### Duplicate Data Handling
temp_df = temp_df.drop_duplicates() # Drops duplicate data
# Or
# Keeping duplicate data
temp_df.drop_duplicates(inplace=True, keep=True)

### Renaming Columns
        'Runtime (Minutes)': 'Runtime', 
        'Revenue (Millions)': 'Revenue_millions'
    }, inplace=True)
# Can also be set directly
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 
                     'rating', 'votes', 'revenue_millions', 'metascore']
# Setting column names lowercase 
movies_df.columns = [col.lower() for col in movies_df]

### Handling Null Values
movies_df.isnull() # Returns a column with either True or False for null or not
movies_df.isnull().sum() # Returns a count of all null entries
# Removing Nulls
# Specify the axis to drop against

Plotting with Matplotlib

Another great thing about pandas is that it integrates with Matplo­tlib, so you get the ability to plot directly off DataFrames and Series. To get started we need to import Matplotlib (pip install matplo­tlib):

Scatter plot

import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)}) # set font and plot size to be larger

movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue (millions) vs Rating');

Histogram plot

movies_df['rating'].plot(kind='hist', title='Rating');






No comments yet. Add yours below!

Add a Comment

Your Comment

Please enter your name.

    Please enter your email address

      Please enter your Comment.

          Related Cheat Sheets

          More Cheat Sheets by CodingJinxx

          CodingJinxx Pandas FAQ Cheat Sheet