spidy v1.4.0
rivermont committed Oct 4, 2017
1 parent 60871b5 commit 31663d3
Showing 12 changed files with 198 additions and 115 deletions.
17 changes: 13 additions & 4 deletions CONTRIBUTING.md
@@ -4,19 +4,28 @@ Right now neither of us has access to a Linux or OS/X machine, so we don't have
If you find a bug, raise an issue; if you have a suggestion, go ahead and fork it.<br>
We will happily look at anything that you build off of spidy; we're not very creative people and we know that there are more/better ideas out there!

--------------------
***

## Notable TODOs

- Linux and OS/X support
- Better documentation. Both `docs.md` and the README are rather outdated
- Better documentation:
  - In `docs.md`, many functions and variables simply have "TODO" as their description. These need filling out.
  - More inline comments wouldn't hurt either.
- Multiple HTTP threads at once, using [mutexes](https://stackoverflow.com/questions/3310049/proper-use-of-mutexes-in-python) to coordinate lists (see the sketch below this list).
- Working GUI - the remnants of our efforts can be found in `gui.py`
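
By way of illustration for the mutex item above — a minimal sketch, not spidy's actual code; the `worker`/`find_links` names, list names, and the use of `requests` are all assumptions:

```python
import threading

import requests  # assumption: the crawler fetches pages with requests

TODO = ['http://example.com/']  # links waiting to be crawled
DONE = []                       # links already crawled
list_lock = threading.Lock()    # the mutex coordinating both lists

def find_links(response):
    # Placeholder: a real crawler would parse anchor tags out of response.text
    return []

def worker():
    while True:
        with list_lock:            # only one thread pops/appends at a time
            if not TODO:
                return
            url = TODO.pop(0)
            DONE.append(url)
        response = requests.get(url)  # network I/O happens outside the lock
        with list_lock:
            TODO.extend(link for link in find_links(response) if link not in DONE)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```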

### Less Important

- Automatic bug handling with Travis CI and/or Sentry would be nice
- Respect for `robots.txt`, with a disable option (see the sketch after this list)
- PyPI/pip/apt
- PyPI/pip/apt?
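
For the `robots.txt` item above, the standard library does most of the work. A minimal sketch, assuming a hypothetical `RESPECT_ROBOTS` flag as the disable option:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

RESPECT_ROBOTS = True  # hypothetical config flag; False would be the disable option

def allowed(url, user_agent='spidy'):
    """Check a link against the target site's robots.txt before crawling it."""
    if not RESPECT_ROBOTS:
        return True
    parts = urlparse(url)
    parser = RobotFileParser('{}://{}/robots.txt'.format(parts.scheme, parts.netloc))
    parser.read()  # fetches and parses robots.txt
    return parser.can_fetch(user_agent, url)

print(allowed('https://en.wikipedia.org/wiki/Web_crawler'))
```

A real version would cache one parser per domain rather than re-fetching `robots.txt` for every link.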

--------------------
Very trivial edits may be ignored, but things like spelling and grammar corrections are fine.

**If you make changes to `crawler.py`, please adjust the line values in `docs.md`. That way links won't break.**<br>
Less importantly, if you make any changes, please update the version as well as the badges on README lines [18 and 19](https://github.com/rivermont/spidy/blob/master/README.md#L18).<br>
Thanks!


***
9 changes: 5 additions & 4 deletions README.md
@@ -8,15 +8,15 @@ Developed by [rivermont](https://github.com/rivermont) (/rɪvɜːrmɒnt/) and [F
Looking for technical documentation? Check out [docs.md](https://github.com/rivermont/spidy/blob/master/docs.md)<br>
Looking to contribute to this project? Have a look at [`CONTRIBUTING.md`](https://github.com/rivermont/spidy/blob/master/CONTRIBUTING.md), then check out the docs.

![Version: 1.3.1](https://img.shields.io/badge/version-1.3.1-brightgreen.svg)
![Version: 1.4.0](https://img.shields.io/badge/version-1.4.0-brightgreen.svg)
[![Release: 1.3.0](https://img.shields.io/badge/release-1.3.0-brightgreen.svg)](https://github.com/rivermont/spidy/releases)
[![License: GPL v3](https://img.shields.io/badge/license-GPLv3.0-blue.svg)](http://www.gnu.org/licenses/gpl-3.0)
[![Python: 3.5](https://img.shields.io/badge/python-3.5-brightgreen.svg)](https://docs.python.org/3/)
[![Python: 3](https://img.shields.io/badge/python-3-lightgrey.svg)](https://docs.python.org/3/)
![Windows](https://img.shields.io/badge/Windows,%20OS/X,%20Linux-%20%20brightgreen.svg)
![All Platforms!](https://img.shields.io/badge/Windows,%20OS/X,%20Linux-%20%20-brightgreen.svg)
<br>
![Lines of Code: 1168](https://img.shields.io/badge/lines%20of%20code-1168-green.svg)
![Lines of Docs: 460](https://img.shields.io/badge/lines%20of%20docs-460-orange.svg)
![Lines of Code: 1178](https://img.shields.io/badge/lines%20of%20code-1178-green.svg)
![Lines of Docs: 544](https://img.shields.io/badge/lines%20of%20docs-544-orange.svg)

***

@@ -37,6 +37,7 @@ See `config/wsj.cfg` for an example.
Now uses the `Content-Type` header to determine how to save files.<br>
Also cut the number of requests to sites in half, effectively killing HTTP 429 Errors.
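
A sketch of what a `Content-Type`-based save might look like — illustrative, assuming a requests-style response object rather than quoting `crawler.py`; getting both headers and body from a single GET is presumably also what halved the request count:

```python
import mimetypes

def save_page(url, response):
    # Keep only the MIME type, dropping parameters like '; charset=utf-8'
    mime = response.headers.get('Content-Type', 'text/html').split(';')[0].strip()
    extension = mimetypes.guess_extension(mime) or '.html'  # fallback for unknown types
    name = url.rstrip('/').split('/')[-1].split('?')[0] or 'index'
    # Assumes a saved/ directory already exists
    with open('saved/' + name + extension, 'wb') as f:
        f.write(response.content)  # bytes, so binary files survive intact
```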


# Contents

- [spidy Web Crawler](#spidy-web-crawler)
6 changes: 3 additions & 3 deletions config/blank.txt
@@ -13,6 +13,9 @@ SAVE_WORDS = <True/False>
# Whether to zip saved pages when autosaving.
ZIP_FILES = <True/False>

# Whether to get documents larger than 500 MB
OVERRIDE_SIZE = <True/False>

# Whether to restrict crawling to a single domain or not.
RESTRICT = <True/False>

@@ -28,9 +31,6 @@ DONE_FILE = ''
# Location of the word save file.
WORD_FILE = ''

# Location of the bad link save file.
BAD_FILE = ''

# Number of queried links after which to autosave.
SAVE_COUNT = #

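The new `OVERRIDE_SIZE` option gates documents larger than 500 MB. A minimal sketch of that guard, assuming the size is read from the `Content-Length` header (names illustrative):

```python
MAX_SIZE = 500 * 1024 * 1024  # the 500 MB threshold from the comment above
OVERRIDE_SIZE = False         # as set in the config file

def small_enough(response):
    """Return True if the document may be downloaded under the size policy."""
    if OVERRIDE_SIZE:
        return True  # the user opted in to arbitrarily large documents
    length = response.headers.get('Content-Length')
    # Servers that omit Content-Length get the benefit of the doubt
    return length is None or int(length) <= MAX_SIZE
```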
2 changes: 1 addition & 1 deletion config/default.cfg
@@ -3,12 +3,12 @@ RAISE_ERRORS = False
SAVE_PAGES = True
SAVE_WORDS = False
ZIP_FILES = True
OVERRIDE_SIZE = False
RESTRICT = False
DOMAIN = ''
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
BAD_FILE = 'crawler_bad.txt'
SAVE_COUNT = 100
HEADER = HEADERS['spidy']
MAX_NEW_ERRORS = 5
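Since every line in these `.cfg` files is a valid Python assignment (note `HEADER = HEADERS['spidy']`), one plausible loader simply `exec`s each line into a settings namespace — a sketch under that assumption, not necessarily what `crawler.py` does; the `HEADERS` value here is illustrative:

```python
HEADERS = {
    'spidy': {'User-Agent': 'spidy Web Crawler'}  # illustrative; not the real header
}

def load_config(path):
    settings = {'HEADERS': HEADERS}  # names the cfg lines are allowed to reference
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith('#'):  # skip blanks and comments
                exec(line, {}, settings)           # each line is an assignment
    return settings

config = load_config('config/default.cfg')
print(config['SAVE_COUNT'])  # -> 100 for the default config above
```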
2 changes: 1 addition & 1 deletion config/heavy.cfg
@@ -2,13 +2,13 @@ OVERWRITE = False
RAISE_ERRORS = False
SAVE_PAGES = True
ZIP_FILES = True
OVERRIDE_SIZE = True
SAVE_WORDS = True
RESTRICT = False
DOMAIN = ''
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
BAD_FILE = 'crawler_bad.txt'
SAVE_COUNT = 100
HEADER = HEADERS['spidy']
MAX_NEW_ERRORS = 5
2 changes: 1 addition & 1 deletion config/infinite.cfg
@@ -3,12 +3,12 @@ RAISE_ERRORS = False
SAVE_PAGES = True
SAVE_WORDS = False
ZIP_FILES = True
OVERRIDE_SIZE = False
RESTRICT = False
DOMAIN = ''
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
BAD_FILE = 'crawler_bad.txt'
SAVE_COUNT = 250
HEADER = HEADERS['spidy']
MAX_NEW_ERRORS = 1000000
2 changes: 1 addition & 1 deletion config/light.cfg
@@ -2,13 +2,13 @@ OVERWRITE = False
RAISE_ERRORS = False
SAVE_PAGES = False
ZIP_FILES = False
OVERRIDE_SIZE = False
SAVE_WORDS = False
RESTRICT = False
DOMAIN = ''
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
BAD_FILE = 'crawler_bad.txt'
SAVE_COUNT = 150
HEADER = HEADERS['spidy']
MAX_NEW_ERRORS = 5
4 changes: 2 additions & 2 deletions config/rivermont-infinite.cfg
@@ -2,17 +2,17 @@ OVERWRITE = False
RAISE_ERRORS = False
SAVE_PAGES = True
ZIP_FILES = False
OVERRIDE_SIZE = False
SAVE_WORDS = False
RESTRICT = False
DOMAIN = ''
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
BAD_FILE = 'crawler_bad.txt'
SAVE_COUNT = 100
HEADER = HEADERS['spidy']
MAX_NEW_ERRORS = 1000000
MAX_KNOWN_ERRORS = 1000000
MAX_HTTP_ERRORS = 1000000
MAX_NEW_MIMES = 1000000
START = ['https://rivermont.github.io/crawler-home']
START = ['http://24.40.136.85/']
4 changes: 2 additions & 2 deletions config/rivermont.cfg
@@ -2,17 +2,17 @@ OVERWRITE = False
RAISE_ERRORS = False
SAVE_PAGES = True
ZIP_FILES = False
OVERRIDE_SIZE = False
SAVE_WORDS = False
RESTRICT = False
DOMAIN = ''
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
BAD_FILE = 'crawler_bad.txt'
SAVE_COUNT = 100
HEADER = HEADERS['spidy']
MAX_NEW_ERRORS = 5
MAX_KNOWN_ERRORS = 20
MAX_HTTP_ERRORS = 20
MAX_NEW_MIMES = 10
START = ['https://rivermont.github.io/crawler-home']
START = ['http://24.40.136.85/']
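
The `MAX_*` settings read as kill switches: counters that abort the crawl once too many failures accumulate, which the "infinite" configs neutralize by setting them to 1000000. A sketch of that pattern (the counter names and `note_error` helper are assumptions):

```python
MAX_NEW_ERRORS = 5    # unfamiliar exceptions tolerated before stopping
MAX_HTTP_ERRORS = 20  # HTTP failures tolerated before stopping

new_errors = 0
http_errors = 0

def note_error(kind):
    """Bump the matching counter; stop the crawl when a cap is exceeded."""
    global new_errors, http_errors
    if kind == 'new':
        new_errors += 1
        if new_errors > MAX_NEW_ERRORS:
            raise SystemExit('Too many new errors; stopping crawl.')
    elif kind == 'http':
        http_errors += 1
        if http_errors > MAX_HTTP_ERRORS:
            raise SystemExit('Too many HTTP errors; stopping crawl.')
```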
2 changes: 1 addition & 1 deletion config/wsj.cfg
@@ -3,6 +3,7 @@ RAISE_ERRORS = False
SAVE_PAGES = True
SAVE_WORDS = False
ZIP_FILES = False
OVERRIDE_SIZE = False

# Whether to restrict crawling to a single domain or not.
RESTRICT = True
@@ -13,7 +14,6 @@ DOMAIN = 'wsj.com/'
TODO_FILE = 'wsj_todo.txt'
DONE_FILE = 'wsj_done.txt'
WORD_FILE = 'wsj_words.txt'
BAD_FILE = 'wsj_bad.txt'
SAVE_COUNT = 60
HEADER = HEADERS['spidy']
MAX_NEW_ERRORS = 100
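
The `RESTRICT`/`DOMAIN` pair above confines a crawl to one site; the trailing slash in `'wsj.com/'` hints at a plain substring match. A minimal sketch under that assumption (the `in_scope` name is hypothetical):

```python
RESTRICT = True
DOMAIN = 'wsj.com/'

def in_scope(url):
    # With RESTRICT off everything is fair game; otherwise the link
    # must contain the configured domain string.
    return (not RESTRICT) or (DOMAIN in url)

print(in_scope('https://www.wsj.com/articles/example'))  # True
print(in_scope('https://example.com/'))                  # False
```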