CodingJinxx Pandas Facts Cheat Sheet

Pandas Datastructures

The primary two components of pandas are the Series and DataFrame.

Dataframes and Series

A Series is One Dimensional Data
A Dataframe is N-Dimensional Data

Creating a Dataframe from Scratch

data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}
purchases = pd.DataFrame(data)

Turns a Dictionary into a Dataframe

Adding an Index to a Dataframe

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

A Dataframe is a (key, value) based Data Structure

Dataframe with Index Set

Understanding your Data

movies_df.describe()

# Describing individual columns
movies_df['genre'].describe()

movies_df['genre'].value_counts().head(10)

movies_df.corr()

Generates a matrix describing correlation between data

Reading in Data

It’s quite simple to load data from various file formats into a DataFrame. In the following examples we’ll keep using our apples and oranges data, but this time it’s coming from various files.

Reading CSV

df = pd.read_csv('purchases.csv')

# Or with setting the index column, needs to be set with CSV

df = pd.read_csv('purchases.csv', index_col=0)

Reading JSON

# The index is automatically set with json
df = pd.read_json('purchases.json')

Reading in SQLite

import sqlite3
con = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM purchases", con)

Setting index in post

df = df.set_index('index')

Writing Data

df.to_csv('new_purchases.csv')
df.to_json('new_purchases.json')
df.to_sql('new_purchases', con)

con being a sql connection

When we save JSON and CSV files, all we have to input into those functions is our desired filename with the appropriate file extension. With SQL, we’re not creating a new file but instead inserting a new table into the database using our con variable from before.

Dataframe Slicing Extracting Selecting

Up until now we’ve focused on some basic summaries of our data. We’ve learned about simple column extraction using single brackets, and we imputed null values in a column using fillna(). Below are the other methods of slicing, selecting, and extracting you’ll need to use constantly.
It’s important to note that, although many methods are the same, DataFrames and Series have different attributes, so you’ll need be sure to know which type you are working with or else you will receive attribute errors.
Let’s look at working with columns first.

By Columns

genre_col = movies_df['genre']
genre_col = movies_df[['genre']]
subset = movies_df[['genre', 'rating']]
subset.head()

To extract data as Dataframes from a dataframe pass a list key

By Rows

prom = movies_df.loc["Prometheus"]
prom = movies_df.iloc[1]


movie_subset = movies_df.loc['Prometheus':'Sing']
movie_subset = movies_df.iloc[1:4]

loc and iloc can be thought of as similar to Python list slicing. To show this even further, let’s select multiple rows.

Conditional Selections

condition = (movies_df['director'] == "Ridley Scott")

movies_df[movies_df['director'] == "Ridley Scott"].head() # Returns a Dataframe where Director is Ridley Scott

movies_df[movies_df['rating'] >= 8.6].head(3)


movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()

# OR Simply

movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()

movies_df[
    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
]

Similar to isnull(), this returns a Series of True and False values: True for films directed by Ridley Scott and False for ones not directed by him.

Applying Functions to Data

def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"

movies_df["rating_category"] = movies_df["rating"].apply(rating_function)

Viewing Data

movies_df.head() # Print first 5 rows
movies_df.head(10) # Print first 10 rows

movies_df.tail() # Print last 5 Rows
movies_df.tail(2) # Print last 2 rows

Previews start or ends of dataframes

Getting metadata from a dataframe

movies_df.info() # Output info regarding datatypes
movies_df.shape # Outputs number of rows and columns

Dataframe Info

df.info()

Cleaning Data

### Duplicate Data Handling
temp_df = temp_df.drop_duplicates() # Drops duplicate data
# Or
temp_df.drop_duplicates(inplace=True)
# Keeping duplicate data
temp_df.drop_duplicates(inplace=True, keep=True)

### Renaming Columns
movies_df.rename(columns={
        'Runtime (Minutes)': 'Runtime', 
        'Revenue (Millions)': 'Revenue_millions'
    }, inplace=True)
# Can also be set directly
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 
                     'rating', 'votes', 'revenue_millions', 'metascore']
# Setting column names lowercase 
movies_df.columns = [col.lower() for col in movies_df]

### Handling Null Values
movies_df.isnull() # Returns a column with either True or False for null or not
movies_df.isnull().sum() # Returns a count of all null entries
# Removing Nulls
movies_df.dropna()
# Specify the axis to drop against
movies_df.dropna(axis=1)

Plotting with Matplotlib

Another great thing about pandas is that it integrates with Matplotlib, so you get the ability to plot directly off DataFrames and Series. To get started we need to import Matplotlib (pip install matplotlib):

Scatter plot

import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)}) # set font and plot size to be larger

movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue (millions) vs Rating');

Histogram plot

movies_df['rating'].plot(kind='hist', title='Rating');

Boxplot

movies_df['rating'].plot(kind="box");

Boxplot

Created By

CodingJinxx

Metadata

Languages: English

Published: 2nd December, 2021
Last Updated: 9th December, 2021
Rated: 0 out of 5 stars based on 2 ratings

Favourited By

Comments

No comments yet. Add yours below!

Add a Comment

Related Cheat Sheets

Pandas Cheat Sheet

PYTHON PANDAS Cheat Sheet

More Cheat Sheets by CodingJinxx

CodingJinxx Pandas FAQ Cheat Sheet

CodingJinxx Pandas Facts Cheat Sheet by CodingJinxx

Pandas Datast­ruc­tures

Dataframes and Series

Creating a Dataframe from Scratch

Adding an Index to a Dataframe

Dataframe with Index Set

Unders­tanding your Data

movies­_df­['g­enr­e'].va­lue­_co­unt­s().he­ad(10)

movies­_df.corr()

Reading in Data

Reading CSV

Reading JSON

Reading in SQLite

Setting index in post

Writing Data

Dataframe Slicing Extracting Selecting

By Columns

By Rows

Condit­ional Selections

Applying Functions to Data

Viewing Data

Getting metadata from a dataframe

Dataframe Info

Cleaning Data

Plotting with Matplotlib

Scatter plot

Histogram plot

Boxplot

Boxplot

Created By

Metadata

Favourited By

Comments

Add a Comment

Related Cheat Sheets

More Cheat Sheets by CodingJinxx

Latest Cheat Sheet

Random Cheat Sheet

About Cheatography

Behind the Scenes

Recent Cheat Sheet Activity

Please Disable Your Ad Blocker

Pandas Datastructures

Understanding your Data

movies_df['genre'].value_counts().head(10)

movies_df.corr()

Conditional Selections