
Beautiful Soup Cheat Sheet (DRAFT)

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Import Resources

import requests

from bs4 import BeautifulSoup

Make a soup object out of a website

# 1. The HTTP request
webpage = requests.get('URL')

# 2. Turn the website into a soup object
soup = BeautifulSoup(webpage.content, 'html.parser')
"html.parser" is one option for parsers we could use. There are other options, like "lxml" and "html5lib", that have different advantages and disadvantages.
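A minimal runnable sketch of picking a parser, assuming Beautiful Soup 4 is installed; the sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

html = '<ul><li>flour</li><li>sugar</li></ul>'

# "html.parser" ships with Python, so no extra install is needed.
soup = BeautifulSoup(html, 'html.parser')
print([li.get_text() for li in soup.find_all('li')])  # prints ['flour', 'sugar']

# "lxml" (fast) and "html5lib" (closest to browser behavior) are used the
# same way, but each must be installed first, e.g. `pip install lxml`:
# soup = BeautifulSoup(html, 'lxml')
```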

Object Types

# 1. Tags correspond to HTML tags
Example Code:
soup = BeautifulSoup('<div id="example">An example div</div><p>An example p tag</p>', 'html.parser')

soup.div
--> <div id="example">An example div</div>
--> gets the first tag of that type on the page

soup.div.name
--> div

soup.div.attrs
--> {'id': 'example'}

# 2. Navigable Strings: the piece of text inside an HTML tag
soup.div.string
--> An example div
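The two object types above can be checked at runtime; a minimal sketch, reusing the example div:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString

soup = BeautifulSoup('<div id="example">An example div</div>', 'html.parser')

tag = soup.div          # a Tag object
text = soup.div.string  # a NavigableString object

print(isinstance(tag, Tag))               # True
print(isinstance(text, NavigableString))  # True
print(tag.name, tag.attrs)                # div {'id': 'example'}
```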

Navigating by Tags

Example Code:
<h1>World's Best Chocolate Chip Cookies</h1>
<div class="banner">
  <ul>
    <li> 1 cup flour </li>
    <li> 1/2 cup sugar </li>
    <li> 2 tbsp oil </li>
    <li> 1/2 tsp baking soda </li>
    <li> 1/2 cup chocolate chips </li>
    <li> 1/2 tsp vanilla </li>
    <li> 2 tbsp milk </li>
  </ul>
</div>

# 1. Get the children of a tag:
for child in soup.ul.children:
    print(child)
--> <li> 1 cup flour </li>
--> <li> 1/2 cup sugar </li>

# 2. Get the parents of a tag:
for parent in soup.li.parents:
    print(parent.name)
--> ul
--> div
--> [document]
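Both navigation moves can be tried on a trimmed-down version of the recipe snippet above; a runnable sketch:

```python
from bs4 import BeautifulSoup

html = """
<h1>World's Best Chocolate Chip Cookies</h1>
<div class="banner">
  <ul>
    <li>1 cup flour</li>
    <li>1/2 cup sugar</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Children: everything directly inside the <ul>
ingredients = [li.get_text() for li in soup.ul.find_all('li')]
print(ingredients)  # ['1 cup flour', '1/2 cup sugar']

# Parents: walk upward from the first <li> to the document root
print([parent.name for parent in soup.li.parents])  # ['ul', 'div', '[document]']
```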

Find All

# 1. find_all()
soup.find_all('h1')
--> Outputs all <h1>...</h1> on the website

# 1.1. find_all() with regex
import re
soup.find_all(re.compile('[ou]l'))
--> Outputs all <ul>...</ul> and <ol>...</ol>
soup.find_all(re.compile('h[1-9]'))
--> Outputs all headings
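A runnable sketch of the regex variant, on made-up HTML; note that tag names are matched with re.search, so '[ou]l' hits both 'ul' and 'ol':

```python
import re
from bs4 import BeautifulSoup

html = '<h1>Title</h1><h2>Subtitle</h2><ul><li>a</li></ul><ol><li>b</li></ol>'
soup = BeautifulSoup(html, 'html.parser')

# Matches every tag whose name contains 'ul' or 'ol'
lists = soup.find_all(re.compile('[ou]l'))
print([tag.name for tag in lists])  # ['ul', 'ol']

# Matches the heading tags h1 through h6
headings = soup.find_all(re.compile('h[1-9]'))
print([tag.name for tag in headings])  # ['h1', 'h2']
```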

# 1.2. find_all() with lists
soup.find_all(['h1', 'a', 'p'])

# 1.3. find_all() with attributes
soup.find_all(attrs={'class': 'banner', 'id': 'jumbotron'})

# 1.4. find_all() with functions
def has_banner_class_and_hello_world(tag):
    # Beautiful Soup stores class as a list, so compare against ['banner']
    return tag.get('class') == ['banner'] and tag.string == 'Hello world'

soup.find_all(has_banner_class_and_hello_world)
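A runnable sketch tying the attribute and function variants together; the HTML snippet and its class/id values are made up for illustration:

```python
from bs4 import BeautifulSoup

html = ('<div class="banner" id="jumbotron">Hello world</div>'
        '<div class="banner">Other banner</div>'
        '<p>Hello world</p>')
soup = BeautifulSoup(html, 'html.parser')

# By attributes: every tag with class="banner"
banners = soup.find_all(attrs={'class': 'banner'})
print(len(banners))  # 2

# By function: class AND string must both match
def has_banner_class_and_hello_world(tag):
    return tag.get('class') == ['banner'] and tag.string == 'Hello world'

matches = soup.find_all(has_banner_class_and_hello_world)
print([m.get('id') for m in matches])  # ['jumbotron']
```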


CSS Selectors

# 1. grab CSS classes with .select(".class_name")
soup.select(".recipeLink")

# 2. grab CSS IDs with .select("#id_name")
soup.select("#selected")

# 3. using a loop
for link in soup.select(".recipeLink > a"):
    webpage = requests.get(link.get('href'))
    new_soup = BeautifulSoup(webpage.content, 'html.parser')
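The selectors above can be exercised without any network requests; a runnable sketch on made-up HTML (the class and id names echo the examples above):

```python
from bs4 import BeautifulSoup

html = ('<div class="recipeLink"><a href="/cookies">Cookies</a></div>'
        '<div class="recipeLink"><a href="/brownies">Brownies</a></div>'
        '<p id="selected">Featured</p>')
soup = BeautifulSoup(html, 'html.parser')

# Class selector: every element with class="recipeLink"
print(len(soup.select('.recipeLink')))  # 2

# ID selector: the element with id="selected"
print(soup.select('#selected')[0].get_text())  # Featured

# Child combinator: <a> tags directly inside a .recipeLink element
hrefs = [a.get('href') for a in soup.select('.recipeLink > a')]
print(hrefs)  # ['/cookies', '/brownies']
```

Each href collected this way could then be fed to requests.get(), as in the loop above.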