Cheatography
https://cheatography.com
Useful stuff for web scraping/data mining in Python
This is a draft cheat sheet. It is a work in progress and is not finished yet.
Web scraping with urllib
Import basic functions |
from urllib.request import urlopen, urlretrieve, Request
|
Request webpage 'example.com' |
raw_request = Request('https://example.com')
|
Add headers (I) |
raw_request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0')
|
Add headers (II) |
raw_request.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
|
Get HTML code as a string |
html = urlopen(raw_request).read().decode("utf-8")
|
Download file |
urlretrieve(fileURL, 'file_name_in_destination')
|
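Putting the calls above together, a minimal fetch sketch (the URL and header values are the ones used in this sheet; the empty-string fallback is just so the example degrades gracefully without a network connection):

```python
from urllib.request import urlopen, Request
from urllib.error import URLError

# Build the request and attach browser-like headers
req = Request('https://example.com')
req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0')
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')

try:
    # Fetch the page and decode the response bytes into a string
    html = urlopen(req).read().decode('utf-8')
except URLError:
    html = ''  # no network available
```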
|
Python libraries
urllib: URL handling |
|
BeautifulSoup: HTML/XML parser |
from bs4 import BeautifulSoup
|
Regular expressions: pattern matching |
|
NetworkX: network analysis |
|
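A minimal NetworkX sketch (assumes the networkx package is installed; graph data is made up for illustration):

```python
import networkx as nx

# Build a small undirected graph from an edge list
G = nx.Graph()
G.add_edges_from([('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')])

print(G.number_of_nodes())   # 4
print(G.number_of_edges())   # 4
print(nx.degree_centrality(G)['c'])  # 1.0 ('c' touches every other node)
```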
Websites of interest
Stanford Network Analysis Project |
|
Koblenz Network Collection |
|
Network Repository |
|
Graph dataset formats
GML (.gml) |
Custom structure |
G = nx.read_gml(path)
|
Pajek (.net) |
Custom structure |
G = nx.read_pajek(path)
|
JSON (.json) |
Custom structure |
import json
from networkx.readwrite import json_graph
G = json_graph.node_link_graph(json.load(open(path)))
|
Plain text |
Manual
|
Multilayer v1 (.) |
Manual
|
Multilayer v2 (.) |
layer1 layer2 node1 node2 weight
Manual
|
Those formats without a specific function are usually easy to parse manually.
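For instance, a multilayer edge list in the `layer1 layer2 node1 node2 weight` layout can be parsed with plain string splitting (the sample data below is hypothetical, and storing layers as edge attributes on a MultiGraph is just one possible choice):

```python
import networkx as nx

# Hypothetical multilayer edge list: layer1 layer2 node1 node2 weight
raw = """1 1 a b 0.5
1 2 a c 1.0
2 2 c d 2.5"""

# Keep every edge, recording its layer pair as an edge attribute
G = nx.MultiGraph()
for line in raw.splitlines():
    l1, l2, u, v, w = line.split()
    G.add_edge(u, v, layer=(int(l1), int(l2)), weight=float(w))

print(G.number_of_edges())  # 3
```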
Beautiful Soup HTML parsing
|
|
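This section is still empty in the draft; a minimal parsing sketch, assuming bs4 is installed and using the stdlib `html.parser` backend (the HTML snippet is made up):

```python
from bs4 import BeautifulSoup

html = """<html><body>
  <h1 id="title">Datasets</h1>
  <ul>
    <li><a href="/snap">SNAP</a></li>
    <li><a href="/konect">KONECT</a></li>
  </ul>
</body></html>"""

soup = BeautifulSoup(html, 'html.parser')

# Find the first matching tag, or all of them
print(soup.find('h1').get_text())               # Datasets
print([a['href'] for a in soup.find_all('a')])  # ['/snap', '/konect']
```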
String slicing tricks
Split by character |
'a_string'.split('_')
|
Join with a character |
'_'.join(['a', 'string'])
|
Capitalize, lower, upper |
foo.capitalize(), foo.lower(), foo.upper()
|
Find, count |
foo.find('e'), foo.count('e')
|
Replace |
'a_string'.replace('string', 'banana')
|
|
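The methods above chain naturally; a quick demo (the example strings are made up):

```python
# split on '_', capitalize each part, join with spaces
parts = 'data_mining_cheat_sheet'.split('_')
title = ' '.join(p.capitalize() for p in parts)
print(title)  # Data Mining Cheat Sheet

s = 'a_string'
print(s.find('s'), s.count('s'))      # 2 1
print(s.replace('string', 'banana'))  # a_banana
```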