
Network data mining Cheat Sheet (DRAFT)

Useful stuff for web scraping/data mining in Python

This is a draft cheat sheet. It is a work in progress and is not finished yet.

Web scraping with urllib

Import basic functions
from urllib.request import urlopen, urlretrieve, Request
Request webpage 'example.com'
raw_request = Request('https://example.com')
Add headers (I)
raw_request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0')
Add headers (II)
raw_request.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
Get HTML code as a string
html = urlopen(raw_request).read().decode("utf-8")
Download file
urlretrieve(fileURL, 'file_name_in_destination')
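
A minimal end-to-end sketch combining the calls above (the file URL and output name are placeholders):

from urllib.request import urlopen, urlretrieve, Request

# Build the request and add a browser-like User-Agent so simple bot filters pass
raw_request = Request('https://example.com')
raw_request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0')

# Fetch the page and decode the response bytes into a string
html = urlopen(raw_request).read().decode("utf-8")

# Download a file straight to disk (placeholder URL and filename)
urlretrieve('https://example.com/data.csv', 'data.csv')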

Python libraries

urllib: URL handling
import urllib
BeautifulSoup: HTML/XML parser
from bs4 import BeautifulSoup
Regular expressions: pattern matching
import re
NetworkX: network analysis
import networkx as nx

Websites of interest

Stanford Network Analysis Project
Koblenz Network Collection
Network Repository

Graph dataset formats

GML (.gml)
Custom structure
G = nx.read_gml(path)
Pajek (.net)
Custom structure
G = nx.read_pajek(path)
JSON (.json)
Custom structure
import json; from networkx.readwrite import json_graph
G = json_graph.node_link_graph(json.load(open(path)))  # NetworkX has no read_json; assumes node-link JSON
Plain text
node1 node2 weight
Manual
Multilayer v1 (.)
layer node1 node2 weight
Manual
Multilayer v2 (.)
layer1 layer2 node1 node2 weight
Manual
Formats without a dedicated reader function are usually easy to parse manually (see the sketch below).
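
A minimal sketch of such a manual parse for the plain-text "node1 node2 weight" format (the filename is a placeholder; for this particular case nx.read_weighted_edgelist also works):

import networkx as nx

G = nx.Graph()
with open('edges.txt') as f:
    for line in f:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        u, v, w = parts
        G.add_edge(u, v, weight=float(w))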

Beautiful Soup HTML parsing
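
A minimal sketch of the usual parsing pattern, assuming html is the string fetched with urlopen above:

from bs4 import BeautifulSoup

# Parse the HTML string with the built-in parser
soup = BeautifulSoup(html, 'html.parser')

title = soup.title.string                                  # text of the <title> tag
links = [a.get('href') for a in soup.find_all('a')]        # all link targets
paragraphs = [p.get_text() for p in soup.find_all('p')]    # visible paragraph text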


String manipulation tricks

Split by character
'a_string'.split('_')
Join with a character
'_'.join(['a', 'string'])
Capitalize, lower, upper
foo.capitalize(), foo.lower(), foo.upper()
Find, count
foo.find('e'), foo.count('e')
Replace
'a_string'.replace('string', 'banana')

Regular Expressions
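
A minimal sketch of the most common re calls (the pattern and example string are placeholders):

import re

text = 'Contact: alice@example.com, bob@example.com'

# findall: every non-overlapping match, returned as a list of strings
emails = re.findall(r'[\w.+-]+@[\w.-]+', text)

# search: first match object, with groups for the parenthesised parts
m = re.search(r'(\w+)@([\w.-]+)', text)
if m:
    user, domain = m.group(1), m.group(2)

# sub: replace every match with a fixed string
redacted = re.sub(r'[\w.+-]+@[\w.-]+', '[email]', text)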