Pandas Datastructures
The primary two components of pandas are the Series and DataFrame. |
Dataframes and Series
A Series is One Dimensional Data
A Dataframe is N-Dimensional Data
Creating a Dataframe from Scratch
data = {
'apples': [3, 2, 0, 1],
'oranges': [0, 3, 7, 2]
}
purchases = pd.DataFrame(data)
|
Turns a Dictionary into a Dataframe
Adding an Index to a Dataframe
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
|
A Dataframe is a (key, value) based Data Structure
Understanding your Data
movies_df.describe()
# Describing individual columns
movies_df['genre'].describe()
movies_df['genre'].value_counts().head(10)
movies_df.corr()
Generates a matrix describing correlation between data
|
|
Reading in Data
It’s quite simple to load data from various file formats into a DataFrame. In the following examples we’ll keep using our apples and oranges data, but this time it’s coming from various files. |
Reading CSV
df = pd.read_csv('purchases.csv')
# Or with setting the index column, needs to be set with CSV
df = pd.read_csv('purchases.csv', index_col=0)
|
Reading JSON
# The index is automatically set with json
df = pd.read_json('purchases.json')
|
Reading in SQLite
import sqlite3
con = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM purchases", con)
|
Setting index in post
df = df.set_index('index')
|
Writing Data
df.to_csv('new_purchases.csv')
df.to_json('new_purchases.json')
df.to_sql('new_purchases', con)
|
con being a sql connection
When we save JSON and CSV files, all we have to input into those functions is our desired filename with the appropriate file extension. With SQL, we’re not creating a new file but instead inserting a new table into the database using our con variable from before.
Dataframe Slicing Extracting Selecting
Up until now we’ve focused on some basic summaries of our data. We’ve learned about simple column extraction using single brackets, and we imputed null values in a column using fillna(). Below are the other methods of slicing, selecting, and extracting you’ll need to use constantly.
It’s important to note that, although many methods are the same, DataFrames and Series have different attributes, so you’ll need be sure to know which type you are working with or else you will receive attribute errors.
Let’s look at working with columns first. |
By Columns
genre_col = movies_df['genre']
genre_col = movies_df[['genre']]
subset = movies_df[['genre', 'rating']]
subset.head()
|
To extract data as Dataframes from a dataframe pass a list key
By Rows
prom = movies_df.loc["Prometheus"]
prom = movies_df.iloc[1]
movie_subset = movies_df.loc['Prometheus':'Sing']
movie_subset = movies_df.iloc[1:4]
|
loc and iloc can be thought of as similar to Python list slicing. To show this even further, let’s select multiple rows.
Conditional Selections
condition = (movies_df['director'] == "Ridley Scott")
movies_df[movies_df['director'] == "Ridley Scott"].head() # Returns a Dataframe where Director is Ridley Scott
movies_df[movies_df['rating'] >= 8.6].head(3)
movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()
# OR Simply
movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()
movies_df[
((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
& (movies_df['rating'] > 8.0)
& (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
]
|
Similar to isnull(), this returns a Series of True and False values: True for films directed by Ridley Scott and False for ones not directed by him.
Applying Functions to Data
def rating_function(x):
if x >= 8.0:
return "good"
else:
return "bad"
movies_df["rating_category"] = movies_df["rating"].apply(rating_function)
|
|
|
Viewing Data
movies_df.head() # Print first 5 rows
movies_df.head(10) # Print first 10 rows
movies_df.tail() # Print last 5 Rows
movies_df.tail(2) # Print last 2 rows
|
Previews start or ends of dataframes
Getting metadata from a dataframe
movies_df.info() # Output info regarding datatypes
movies_df.shape # Outputs number of rows and columns
|
Cleaning Data
### Duplicate Data Handling
temp_df = temp_df.drop_duplicates() # Drops duplicate data
# Or
temp_df.drop_duplicates(inplace=True)
# Keeping duplicate data
temp_df.drop_duplicates(inplace=True, keep=True)
### Renaming Columns
movies_df.rename(columns={
'Runtime (Minutes)': 'Runtime',
'Revenue (Millions)': 'Revenue_millions'
}, inplace=True)
# Can also be set directly
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
'rating', 'votes', 'revenue_millions', 'metascore']
# Setting column names lowercase
movies_df.columns = [col.lower() for col in movies_df]
### Handling Null Values
movies_df.isnull() # Returns a column with either True or False for null or not
movies_df.isnull().sum() # Returns a count of all null entries
# Removing Nulls
movies_df.dropna()
# Specify the axis to drop against
movies_df.dropna(axis=1)
|
Plotting with Matplotlib
Another great thing about pandas is that it integrates with Matplotlib, so you get the ability to plot directly off DataFrames and Series. To get started we need to import Matplotlib (pip install matplotlib): |
Scatter plot
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)}) # set font and plot size to be larger
movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue (millions) vs Rating');
|
Histogram plot
movies_df['rating'].plot(kind='hist', title='Rating');
|
Boxplot
movies_df['rating'].plot(kind="box");
|
|
Created By
Metadata
Favourited By
Comments
No comments yet. Add yours below!
Add a Comment
Related Cheat Sheets
More Cheat Sheets by CodingJinxx