Import the Pandas Module

import pandas as pd

Create a DataFrame

# Method 1
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe'],
    'address': ['13 Main St.', '46 Maple Ave.'],
    'age': [34, 28]
})

# Method 2
df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '9 Broadway', 51]
    ],
    columns=['name', 'address', 'age'])

Loading and Saving CSVs

# Load a CSV file into a DataFrame
df = pd.read_csv('my-csv-file.csv')

# Save a DataFrame to a CSV file
df.to_csv('new-csv-file.csv')

# Load a DataFrame in chunks (for large datasets)
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)

# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)
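
A minimal sketch of processing every chunk instead of calling next() once (the row count is just an illustrative aggregation):

# Iterate over all chunks of the reader object
total_rows = 0
for chunk in pd.read_csv('ind_pop_data.csv', chunksize=1000):
    total_rows += len(chunk)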

Inspect a DataFrame

df.head(5)
First 5 rows
df.info()
Summary of each column (row count, non-null values, datatype)
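
A minimal sketch on df1 from above; note that df.describe() (not listed here) is the call that returns per-column summary statistics:

df1.head()      # first 5 rows (n=5 is the default)
df1.info()      # column names, non-null counts, dtypes
df1.describe()  # summary statistics for numeric columns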

Reshape (for Scikit)

import numpy as np

nums = np.array(range(1, 11))
-> [ 1 2 3 4 5 6 7 8 9 10]

nums = nums.reshape(-1, 1)
-> [[ 1],
    [ 2],
    [ 3],
    [ 4],
    [ 5],
    [ 6],
    [ 7],
    [ 8],
    [ 9],
    [10]]
You can think of reshape() as rotating this array. Rather than one big row of numbers, nums is now a big column of numbers - there's one number in each row.
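
A minimal sketch of why this matters for scikit-learn (assuming scikit-learn is installed; the data is purely illustrative): estimators expect a 2-D feature array, so a single feature must be reshaped into one column.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array(range(1, 11)).reshape(-1, 1)  # shape (10, 1): one feature column
y = 2 * X.ravel() + 1                      # toy target values
model = LinearRegression().fit(X, y)       # fit() requires the 2-D X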

Converting Datatypes

# Convert argument to numeric type
pandas.to_numeric(arg, errors="raise")
errors:
"raise" -> raise an exception
"coerce" -> invalid parsing will be set as NaN

DataFrame for Select Columns / Rows

df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],

  columns=['month', 'east', 'north', 'south', 'west']
)

Select Columns

# Select one Column
clinic_north = df.north
-> Reshape values for scikit-learn: clinic_north.values.reshape(-1, 1)

# Select multiple Columns
clinic_north_south = df[['north', 'south']]
Make sure that you have a double set of brackets [[ ]], or this command won’t work!

Select Rows

# Select one Row
march = df.iloc[2]


# Select multiple Rows
jan_feb_march = df.iloc[:3]
feb_march_april = df.iloc[1:4]
may_june = df.iloc[-2:]


# Select Rows with Logic
january = df[df.month == 'January']
-> <, >, <=, >=, !=, ==

march_april = df[(df.month == 'March') | (df.month == 'April')]
-> &, |

january_february_march = df[df.month.isin(['January', 'February', 'March'])]
-> column_name.isin([" ", " "])
Selecting a subset of a DataFrame often results in non-consecutive indices.

Using .reset_index() will create a new DataFrame and move the old indices into a new column called 'index'.

Use .reset_index(drop=True) if you don't need the index column.
Use .reset_index(inplace=True) to prevent a new DataFrame from being created.
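
A minimal sketch using the monthly df defined above:

march_april = df[(df.month == 'March') | (df.month == 'April')]  # keeps indices 2 and 3
march_april = march_april.reset_index(drop=True)  # clean 0-based index, no 'index' column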

Adding a Column

df = pd.DataFrame([
  [1, '3 inch screw', 0.5, 0.75],
  [2, '2 inch nail', 0.10, 0.25],
  [3, 'hammer', 3.00, 5.50],
  [4, 'screwdriver', 2.50, 3.00]
],
  columns=['Product ID', 'Description', 'Cost to Manufacture', 'Price']
)

# Add a Column with specified row-values
df['Sold in Bulk?'] = ['Yes', 'Yes', 'No', 'No']

# Add a Column with same value in every row
df['Is taxed?'] = 'Yes'

# Add a Column with calculation
df['Revenue'] = df['Price'] - df['Cost to Manufacture']

Performing Column Operations

df = pd.DataFrame([
  ['JOHN SMITH', 'john.smith@gmail.com'],
  ['Jane Doe', 'jdoe@yahoo.com'],
  ['joe schmo', 'joeschmo@hotmail.com']
],
columns=['Name', 'Email'])

# Changing a column with an Operation
df['Name'] = df.Name.apply(str.lower)
-> str.lower, str.upper

# Perform a lambda Operation on a Column
get_last_name = lambda x: x.split(" ")[-1]
df['last_name'] = df.Name.apply(get_last_name)

Performing an Operation on Multiple Columns

df = pd.DataFrame([
  ["Apple", 1.00, "No"],
  ["Milk", 4.20, "No"],
  ["Paper Towels", 5.00, "Yes"],
  ["Light Bulbs", 3.75, "Yes"],
],
  columns=["Item", "Price", "Is taxed?"])


# Lambda Function
df['Price with Tax'] = df.apply(lambda row:
     row['Price'] * 1.075
     if row['Is taxed?'] == 'Yes'
     else row['Price'],
     axis=1
)
We apply a lambda to rows, as opposed to columns, when we want to perform functionality that needs to access more than one column at a time.

Rename Columns

# Method 1
df.columns = ['NewName_1', 'NewName_2', 'NewName_3', '...']

# Method 2
df.rename(columns={
    'OldName_1': 'NewName_1',
    'OldName_2': 'NewName_2'
}, inplace=True)
Using inplace=True lets us edit the original DataFrame.

Series vs. Dataframes

# Dataframe and Series
print(type(clinic_north))
# <class 'pandas.core.series.Series'>

print(type(df))
# <class 'pandas.core.frame.DataFrame'>

print(type(clinic_north_south))
# <class 'pandas.core.frame.DataFrame'>
In pandas:
- a Series is a one-dimensional object that can contain any type of data.

- a DataFrame is a two-dimensional object that can hold multiple columns of different types of data.

A single column of a DataFrame is a Series, and a DataFrame is a container of one or more Series objects.
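
A minimal sketch of the same distinction via bracket style: single brackets yield a Series, double brackets a one-column DataFrame.

print(type(df['north']))    # <class 'pandas.core.series.Series'>
print(type(df[['north']]))  # <class 'pandas.core.frame.DataFrame'>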

Column Statistics

Mean (average)
df.column.mean()
Median
df.column.median()
Minimum value
df.column.min()
Maximum value
df.column.max()
Number of values
df.column.count()
Number of unique values
df.column.nunique()
Standard deviation
df.column.std()
List of unique values
df.column.unique()
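
A minimal sketch applying a few of these calls to the 'north' column of the monthly df from earlier:

df.north.mean()     # average of the column
df.north.nunique()  # number of distinct values
df.north.unique()   # array of the distinct values themselves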
Don't forget reset_index() at the end of a groupby operation.

Calculating Aggregate Functions

# Group By
grouped = (df.groupby(['col1', 'col2'])
             .col3.measurement()
             .reset_index())
# -> group by col1 and col2, calculate values of col3
# (.measurement() is a placeholder for an aggregate such as mean() or count())
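
A hedged sketch with .measurement() swapped for a real aggregate ('category' and 'wage' are the illustrative column names used in the percentile example below):

# Average wage per category
avg_wages = df.groupby('category').wage.mean().reset_index()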

# Percentile
high_earners = (df.groupby('category').wage
                  .apply(lambda x: np.percentile(x, 75))
                  .reset_index())
# np.percentile can calculate any percentile over an array of values
Don't forget reset_index()!

Pivot Tables

orders = pd.read_csv('orders.csv')

shoe_counts = (orders
    .groupby(['shoe_type', 'shoe_color'])
    .id.count()
    .reset_index())

shoe_counts_pivot = shoe_counts.pivot(
    index='shoe_type',
    columns='shoe_color',
    values='id').reset_index()
We have to build a temporary table where we group by the columns we want to include in the pivot table.

Merge (Same Column Name)

sales = pd.read_csv('sales.csv')
targets = pd.read_csv('targets.csv')
men_women = pd.read_csv('men_women_sales.csv')

# Method 1
sales_targets = pd.merge(sales, targets, how=" ")
# how: "inner"(default), "outer", "left", "right"

# Method 2 (Method Chaining)
all_data = sales.merge(targets).merge(men_women)

Inner Merge (Different Column Name)

orders = pd.read_csv('orders.csv')
products = pd.read_csv('products.csv')

# Method 1: Rename Columns
orders_products = pd.merge(orders,
    products.rename(columns={'id': 'product_id'}),
    how=" ").reset_index()
# how: "inner"(default), "outer", "left", "right"

# Method 2:
orders_products = pd.merge(
    orders, products,
    left_on="product_id",
    right_on="id",
    suffixes=["_orders", "_products"])
Method 2:
If we use this syntax, we’ll end up with two columns called id.
Pandas won’t let you have two columns with the same name, so it will change them to id_x and id_y.
We can make them more useful by using the keyword suffixes.

Concatenate

bakery = pd.read_csv('bakery.csv')
ice_cream = pd.read_csv('ice_cream.csv')

menu = pd.concat([bakery, ice_cream])

Melt

pandas.melt(DataFrame, id_vars, value_vars, var_name, value_name='value')

id_vars: Column(s) to use as identifier variables.

value_vars: Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

var_name: Name to use for the 'variable' column.

value_name: Name to use for the 'value' column.
Unpivots a DataFrame from wide to long format, optionally leaving identifiers set.
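
A minimal sketch unpivoting the monthly clinic df from earlier (the 'location' and 'visits' names are illustrative):

df_long = pd.melt(df,
                  id_vars=['month'],
                  value_vars=['east', 'north', 'south', 'west'],
                  var_name='location',
                  value_name='visits')
# -> one row per (month, location) pair, value in 'visits'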

Assert Statements

# Test if country is of type object
assert gapminder.country.dtypes == object

# Test if year is of type int64
assert gapminder.year.dtypes == np.int64

# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64

# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()

# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()