
Spidering Cheat Sheet (DRAFT)

This is a draft cheat sheet. It is a work in progress and is not finished yet.


Also known as "crawling"
Involves following web links to download a copy of an entire site.
Analyze the copy offline to discover potential security weaknesses in code, keywords for password-guessing, confidential data, and names, emails, addresses, and phone numbers (see the example below).
A site may be spidered many times with various tools; scanning the site first primes many of those tools.
Manual spidering is done by browsing a site and saving each page.
Automated scans are common, but they may fail if the site is too complex.
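For the offline analysis step, plain grep is often enough to pull candidate data out of a mirrored copy. A minimal sketch, assuming the site was saved under /tmp/mirror (placeholder path):
# Extract email addresses from every downloaded file (rough pattern, not RFC-exact)
grep -rEoh "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}" /tmp/mirror | sort -u
# List files containing obviously sensitive keywords
grep -rliE "password|passwd|secret|api[_-]?key" /tmp/mirror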

Robot Control

Automated tools are often referred to as robots or bots.
Developers control bots with a robots.txt file placed in the root directory of the web app, or with meta tags on individual pages.
The Robots Exclusion Protocol is an unofficial but commonly implemented protocol that uses robots.txt to specify which user-agent types should be disallowed access to certain directories and individual pages.
Meta tags can prevent page caching (examples below)
Meta tags can control robot indexing and link-following (examples below)
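As an illustration only (the directory and values are placeholders), a robots.txt entry and the corresponding meta tags might look like this:
# robots.txt in the web root: disallow a directory for all user agents
User-agent: *
Disallow: /private/

<!-- Meta tag that asks robots not to index the page or follow its links -->
<meta name="robots" content="noindex, nofollow">
<!-- Meta tags that discourage caching of the page -->
<meta http-equiv="cache-control" content="no-cache">
<meta http-equiv="pragma" content="no-cache">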

Automated Spidering with ZAP

The ZAP interception proxy includes spidering capabilities
Primed by using the interception proxy to navigate the site
Does well, but dynamically generated links on the client can cause it to miss pages
Has a separate AJAX spider for dynamic sites
Will show out-of-scope targets
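ZAP is normally driven from the GUI, but its spider can also be kicked off headlessly. A rough sketch using the packaged baseline scan script (image name and flags are from the ZAP Docker docs; verify against your version):
# Spider and passively scan a target with the ZAP baseline script
docker run -t owasp/zap2docker-stable zap-baseline.py -t https://example.com
# Add -j to also run the AJAX spider against heavily dynamic sites
docker run -t owasp/zap2docker-stable zap-baseline.py -t https://example.com -j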


Wappalyzer

Provides a detailed understanding of the technologies running on a web app, including the OS, web servers, packaged web apps, languages, frameworks, and APIs

ZAP + Wappalyzer

Leverages Wappalyzer functionality through the Technology Detection extension in the ZAP marketplace
The extension is passive
Only implements some of Wappalyzer's functionality
Not release quality

ZAP Forced Browse

Based on the inactive DirBuster project; performs a dictionary attack
Seeks to find unlinked content using a number of default wordlists or one provided by the user (see the example below)
Forced Browse = entire site
Forced Browse Directory = focuses on one directory
Forced Browse Directory + Children = recursive forced browse against a directory and any discovered sub-directories
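ZAP drives forced browsing from the GUI; the same dictionary-attack idea can be sketched from the command line with a different tool such as gobuster (target URL is a placeholder; the old DirBuster wordlists ship under /usr/share/wordlists/dirbuster/ on Kali):
# Dictionary attack for unlinked content using a DirBuster wordlist
gobuster dir -u https://example.com -w /usr/share/wordlists/dirbuster/directory-list-2.3-medium.txt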

Automated Spidering with Burp

Similar to ZAP and Paros

Automated Spidering with wget

Console-based retrieval tool (non-interactive downloader) that runs on most platforms; has basic spidering capabilities and saves retrieved items in a directory
Obeys robots.txt unless invoked with the -e robots=off option
-r invokes recursion of discovered links
-l [N] specifies the maximum recursion depth, where N is a number; the default is 5
It can retrieve via HTTP, HTTPS, and FTP
Popular because it is easy to include in scripts
Syntax: wget -r [domain] -l 3 -P /tmp
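Putting the options together (target domain and output directory are placeholders):
# Mirror example.com three levels deep, ignoring robots.txt, into /tmp/mirror
wget -r -l 3 -e robots=off -P /tmp/mirror https://example.com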


CeWL

Custom word list generator.
Spiders a website and generates a word list from the site contents and the EXIF data of any images
Syntax: ./cewl.rb [domain]
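A slightly fuller invocation might look like the following (flag meanings per CeWL's help output; depth, minimum word length, and output file are placeholders):
# Spider 2 links deep, keep words of 6+ characters, also collect email addresses, write to a file
cewl -d 2 -m 6 -e -w wordlist.txt https://example.com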

Analyzing Results: What to Look For

(HTML) Comments that reveal sensitive or useful information
Commented-out code and links
Disabled functionality
Linked servers such as content and app servers

HTML Comments

Comments in the HTML that are included in the server response to the client, such as dev notes, explanations of functionality and variables, or even usernames and passwords. Such comments should be moved server-side.
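A quick way to surface comments in a mirrored copy (path is a placeholder):
# List HTML comments across the downloaded site, with filenames and line numbers
grep -rn "<!--" /tmp/mirror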

Disabled Functi­onality

Reveals previous or future sections of the site.
"Disabled" functionality may still be possible to invoke, or it may indicate security weaknesses.
Links that have been commented out
Client-side code that has been commented out, which may now be server-side code
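A made-up fragment of the kind of thing to look for (all names hypothetical):
<!-- Hypothetical page fragment -->
<!-- <a href="/admin/legacy-report.jsp">Admin reports (old)</a> -->
<input type="submit" name="export" value="Export all users" disabled>
Both the commented-out link and the disabled button hint at functionality that may still respond if requested directly.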