This is a draft cheat sheet. It is a work in progress and is not finished yet.
Installing
$ pip install beautifulsoup4 |
Installing BeautifulSoup |
$ pip install lxml |
Installing the lxml parser (optional) |
$ pip install html5lib |
Installing the html5lib parser (optional) |
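After installing, it can help to check which parsers are actually importable before choosing one. A minimal sketch, assuming the helper name available_parsers (hypothetical, not part of Beautiful Soup) and relying on the fact that html.parser ships with Python:

```python
from importlib.util import find_spec


def available_parsers():
    """Return parser names that can be passed as BeautifulSoup's second argument.

    Hypothetical helper for illustration; it only checks whether the
    optional parser packages can be found, it does not import them.
    """
    parsers = ["html.parser"]  # always present, part of the standard library
    for module_name in ("lxml", "html5lib"):
        if find_spec(module_name) is not None:
            parsers.append(module_name)
    return parsers


print(available_parsers())
```

The list always starts with html.parser, so code that takes the first entry still works on a machine where neither optional parser is installed.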
Kinds of objects
tag = soup.b; type(tag) |
Type of a tag object |
# <class 'bs4.element.Tag'> |
tag.name |
Name of the tag |
# 'b' |
tag.name = "blockquote" |
Change the tag's name |
tag['class'] |
Value of one attribute (class is multi-valued, so a list) |
# ['boldest'] |
tag.attrs |
All attributes as a dict |
# {'class': ['boldest']} |
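The entries above can be tried together in one short session. This is a sketch that assumes the same one-tag markup used in this sheet and the built-in html.parser; the printed values in the comments are what Beautiful Soup 4 under Python 3 should show.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

print(type(tag).__name__)  # Tag
print(tag.name)            # b
print(tag["class"])        # ['boldest']  (class is multi-valued, so a list)
print(tag.attrs)           # {'class': ['boldest']}

# Renaming the tag changes the markup the soup generates
tag.name = "blockquote"
print(soup)                # <blockquote class="boldest">Extremely bold</blockquote>
```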
Basic Operation
from bs4 import BeautifulSoup |
Import the module |
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser') |
Making a soup (naming the parser avoids a warning) |
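A soup can be made from a string or from an open file handle. The sketch below assumes the built-in html.parser and uses io.StringIO as a stand-in for a real file, so the variable names markup and fake_file are illustrative only.

```python
import io

from bs4 import BeautifulSoup

markup = '<b class="boldest">Extremely bold</b>'

# From a string; naming the parser keeps results the same across machines
soup = BeautifulSoup(markup, "html.parser")
print(soup.b.get_text())  # Extremely bold

# From a file-like object (StringIO stands in for open("page.html"))
fake_file = io.StringIO(markup)
soup_from_file = BeautifulSoup(fake_file, "html.parser")
print(soup_from_file.b.string)  # Extremely bold
```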
Navigating the tree
soup.body.b |
Navigating using tag names |
# <b>text</b> |
soup.a |
Get the first <a> tag |
soup.find_all('a') |
Get all the <a> tags, as a list |
len(soup.contents) |
The <html> tag is a child of the BeautifulSoup object |
soup.contents[0].name |
# 'html' |
WRONG: test_text.contents[0] |
A NavigableString has no .contents, so this raises AttributeError |
title_tag.string |
If a tag has only one child and that child is a NavigableString, .string returns it |
# "The Dormouse's story" |
head_tag.contents |
Direct children of the tag, as a list |
# [<title>The Dormouse's story</title>] |
soup.html.string |
If a tag contains more than one thing, .string is None |
# None |
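The navigation entries above can be run end to end against a small document. This is a sketch under stated assumptions: html_doc is sample markup made up for illustration (the example.com links are placeholders), and head_tag and title_tag are built here because the sheet does not show how they were created.

```python
from bs4 import BeautifulSoup

# Assumed sample document; not taken from the sheet itself
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p>Once upon a time there were three little sisters:
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.body.b)              # <b>The Dormouse's story</b>
print(soup.a["id"])             # link1  (soup.a is the first <a> tag)
print(len(soup.find_all("a")))  # 3

print(soup.contents[0].name)    # html

head_tag = soup.head
print(head_tag.contents)        # [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
print(title_tag.string)         # The Dormouse's story

print(soup.html.string)         # None, because <html> has several children

# A string is a leaf: asking it for .contents fails
text = title_tag.string
try:
    text.contents
except AttributeError as err:
    print("no .contents on a string:", err)
```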
|
|
|