Domain Metadata Analysis
Root Domain Crawl
- JavaScript / Cookie Tracking
- JavaScript Libraries
- SSL Availability
- Page Speed
- Domain Whois Data
- Security Issues
- HTTP Server
- HTTP Protocol
- Structured Data (schema.org)
- Used HTML Tags ("iframe", "svg", ...)
- Content Management Systems
- PHP Versions
- RSS/Atom feeds
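Two of the checks listed above, used HTML tags and RSS/Atom feeds, can be sketched with the Python standard library alone. This is a minimal illustration, not the project's implementation: the names `TagAndFeedCollector` and `analyze_html` are hypothetical, and the feed detection assumes feeds are advertised through `<link rel="alternate">` elements.

```python
from collections import Counter
from html.parser import HTMLParser


class TagAndFeedCollector(HTMLParser):
    """Counts start tags and collects advertised RSS/Atom feed URLs.

    Hypothetical helper for illustration only.
    """

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        self.tag_counts[tag] += 1
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime in self.FEED_TYPES and attr.get("href"):
            self.feeds.append(attr["href"])


def analyze_html(html):
    """Return (tag counts, feed URLs) for one HTML document."""
    collector = TagAndFeedCollector()
    collector.feed(html)
    collector.close()
    return collector.tag_counts, collector.feeds
```

Calling `analyze_html(page_source)` on the root page of a domain yields a `Counter` that can be queried for tags of interest such as `iframe` or `svg`.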
Full Domain Crawl
- Match Tracking Data against the Privacy Policy
- Referrer
- Redirects
- Broken Links
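The redirect and broken-link checks can be reduced to a classification of HTTP status codes. The sketch below assumes the crawler has already recorded one status code per URL (or `None` when the request failed entirely); how those codes are collected is not specified here, and the function names are hypothetical.

```python
def classify_status(status):
    """Map an HTTP status code (or None for a failed request) to a category."""
    if status is None:
        return "broken"
    if 300 <= status < 400:
        return "redirect"
    if status >= 400:
        return "broken"
    return "ok"


def summarize_links(statuses):
    """Group URLs by category; `statuses` maps URL -> status code or None."""
    report = {"ok": [], "redirect": [], "broken": []}
    for url, status in statuses.items():
        report[classify_status(status)].append(url)
    return report
```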
Time-Consuming Crawl
- SSL Implementation / Rating
- HTML Validation (w3.org)
- Open Ports (MySQL, MongoDB, ...)
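A port check of the kind listed above can be sketched as a plain TCP connect attempt. This is an assumption about how such a check might work, not the project's method. The port numbers are the well-known defaults for MySQL (3306) and MongoDB (27017); `is_port_open` and `scan` are hypothetical names. The timeout per port is one reason this check belongs in the time-consuming crawl.

```python
import socket

# Default service ports; extend as needed.
DEFAULT_PORTS = {"MySQL": 3306, "MongoDB": 27017}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def scan(host, ports=DEFAULT_PORTS, timeout=2.0):
    """Return a mapping of service name -> whether its default port is reachable."""
    return {name: is_port_open(host, port, timeout) for name, port in ports.items()}
```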
Other Similar Projects
- http://httparchive.org
- https://trends.builtwith.com/
- https://www.wappalyzer.com/
- https://www.whatruns.com/
- https://sitestacks.com/
- https://www.similartech.com
- https://nerdydata.com
- https://allora.io
- https://www.netcraft.com/internet-data-mining/
- https://yanalyzer.com
- https://whatcms.org/
- https://web.archive.org
- https://scotthelme.co.uk/alexa-top-1-million-crawl-aug-2016/
- https://securityheaders.io
- https://www.ssllabs.com/ssltest
Domain Lists
- https://toolbar.netcraft.com/stats/topsites
- https://www.alexa.com/topsites
- https://www.quantcast.com/top-sites/
- https://blog.majestic.com/development/majestic-million-csv-daily/
- https://moz.com/top500
- https://gtmetrix.com/top1000.html
Used Libraries and Formats
Splash - Lightweight, scriptable browser as a service with an HTTP API
adblockparser - Parser for Adblock Plus rules
HTTP Archive format (HAR)
- https://groups.google.com/forum/?hl=en#!forum/http-archive-specification
- http://www.softwareishard.com/blog/har-12-spec/
HTTP Archive format (HAR) Viewer
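As an illustration of how the HAR format listed above could feed the tracking analysis, the sketch below reads the `log.entries[].request.url` fields defined by HAR 1.2 and lists hosts that differ from the crawled domain. The function name `third_party_hosts` and the simple suffix rule for deciding what counts as first party are assumptions made for this example.

```python
import json
from urllib.parse import urlsplit


def third_party_hosts(har, first_party_host):
    """Return the sorted hosts requested in a HAR log that are not first party.

    `har` is a parsed HAR document (a dict), e.g. from json.load().
    A host counts as first party if it equals first_party_host or is a
    subdomain of it (a simplification for this sketch).
    """
    hosts = set()
    for entry in har["log"]["entries"]:
        host = urlsplit(entry["request"]["url"]).hostname
        if not host:
            continue
        if host == first_party_host or host.endswith("." + first_party_host):
            continue
        hosts.add(host)
    return sorted(hosts)
```

A HAR file saved to disk can be loaded with `json.load(open(path))` and passed in directly.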
Publish
Keywords
"Webometrie", "Webometrics", "Cybermetrics", "Web Mining", "Internet Data Mining", "Internet Research", "Internet Technology Trends"
Crawler Performance without Threads
- duration sec. = avg sec. per domain * domain count
- duration days = duration sec. / 86400
- Example: 5 * 1000000 = 5000000 sec.; 5000000 / 86400 = 57.9 days
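The estimate above can be sketched as a small helper. The `workers` parameter is an addition for illustration: it assumes an idealized, perfectly parallel crawl in which the total time divides evenly across workers, which real crawls will not fully achieve.

```python
SECONDS_PER_DAY = 86400


def estimate_crawl_days(avg_seconds_per_domain, domain_count, workers=1):
    """Estimate the crawl duration in days.

    workers=1 matches the sequential (no threads) case; larger values
    assume ideal linear speed-up, which is an optimistic simplification.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    total_seconds = avg_seconds_per_domain * domain_count
    return total_seconds / SECONDS_PER_DAY / workers


if __name__ == "__main__":
    # 5 seconds per domain, one million domains, no threads
    print(round(estimate_crawl_days(5, 1_000_000), 1))  # prints 57.9
```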