spidy Web Crawler Release 1.4

@rivermont released this 04 Oct 03:20 · 150 commits to master since this release

Much update!

  • Confirmed and added support for OS X and Linux, thanks to michellemorales and j-setiawan.
  • Updated documentation to the current state of things. Still work to be done there.
  • Removed 'bad file' functionality as it wasn't working as intended and wasn't important anyway. That's what error logs are for.
  • Now resolves <base> tags, so links that wouldn't have been recognized before get picked up. Thanks lxml!
  • Added an optional (on by default) file size check. Files larger than 500 MB won't be downloaded, assuming the site returns a Content-Length header.
  • Added Firefox (on Ubuntu) as an option for browser spoofing.
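
For the curious, here is a rough sketch of how the <base> tag resolution and the file size check could work. This is not the code in crawler.py: spidy uses lxml for the <base> handling, while this sketch swaps in the standard library's html.parser so it runs without extra installs. The names LinkParser, extract_links, is_too_large and MAX_FILE_SIZE are made up for illustration, and treating 500 MB as 500 * 1024 * 1024 bytes is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: "500 MB" is taken to mean 500 * 1024 * 1024 bytes.
MAX_FILE_SIZE = 500 * 1024 * 1024


class LinkParser(HTMLParser):
    """Collects <a href> values and the first <base href> in a page."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href is None:
            return
        if tag == "base":
            # Only the first <base> counts; later ones are ignored.
            if self.base_href is None:
                self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)


def extract_links(page_url, html_text):
    """Return absolute links, resolving relative ones against <base> if present."""
    parser = LinkParser()
    parser.feed(html_text)
    if parser.base_href:
        # The <base href> may itself be relative to the page URL.
        base = urljoin(page_url, parser.base_href)
    else:
        base = page_url
    return [urljoin(base, href) for href in parser.hrefs]


def is_too_large(headers, limit=MAX_FILE_SIZE):
    """True only if a usable Content-Length header exceeds the limit.

    `headers` is a plain dict here, so the key must match exactly;
    HTTP client libraries often provide case-insensitive lookups.
    """
    value = headers.get("Content-Length")
    if value is None:
        return False  # No header, so the size cannot be checked up front.
    try:
        return int(value) > limit
    except ValueError:
        return False  # Malformed header; treat as unknown size.
```

A page at https://example.com/index.html with <base href="https://cdn.example.com/docs/"> and a link to page.html would resolve to https://cdn.example.com/docs/page.html rather than https://example.com/page.html. The size check falls through to "download anyway" when the header is missing, which matches the caveat in the note above.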

spidy.zip contains just crawler.py and config/, while the source code archives contain all files.