fix: fix dockerfile
fix Travis errors as on main branch

Fix Autosave errors

This commit fixes errors while autosaving by a single thread. Specifically,
it resolves discrepancies in the saved contents.

Fixes rivermont#56

docs: update docker instructions

to specify how users can pass custom config to spidy in docker
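The autosave fix described above hinges on a simple pattern: snapshot shared state under a lock before writing it out, so concurrent workers cannot mutate the lists mid-save and leave the files disagreeing with each other. A minimal sketch (hypothetical names, not the project's actual code):

```python
import threading

lock = threading.Lock()
todo, done = ["a", "b"], ["c"]  # shared lists mutated by crawler threads

def save_snapshot():
    # Copy shared state while holding the lock, then write the copies;
    # other threads can keep mutating the originals without
    # desynchronizing the saved files.
    with lock:
        return list(todo), list(done)

t, d = save_snapshot()
todo.append("x")  # a later mutation no longer affects the snapshot
```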
pbnj committed Oct 9, 2023
1 parent 15d4e8c commit fd50dff
Showing 3 changed files with 26 additions and 27 deletions.
22 changes: 9 additions & 13 deletions Dockerfile
@@ -1,19 +1,15 @@
-# To run spidy in a container and write all files back to the host filesystem:
-# docker run --rm -it -v $PWD:/data spidy
+# Usage:
+# 1. Build this image:
+#    docker build -t spidy:latest .
+# 2. Run spidy container:
+#    docker run --rm -it -v $PWD:/data spidy
 
 FROM python:3.6
 LABEL maintainer "Peter Benjamin <[email protected]>"
+WORKDIR /src/app/
 COPY . .
-VOLUME [ "/data" ]
-
-RUN apt-get update \
-    && apt-get install -y \
-        --no-install-recommends \
-        python3 \
-        python3-lxml \
-        python3-requests \
-    && rm -rf /var/cache/apt/* \
-    && pip install -r requirements.txt
+RUN pip install -r requirements.txt
+VOLUME [ "/data" ]
 WORKDIR /data
 
-ENTRYPOINT [ "python", "spidy/crawler.py" ]
+ENTRYPOINT [ "python", "/src/app/spidy/crawler.py" ]
5 changes: 3 additions & 2 deletions README.md
@@ -114,10 +114,11 @@ Spidy can be easily run in a Docker container.<br>
 
 - First, build the [`Dockerfile`](dockerfile): `docker build -t spidy .`
 - Verify that the Docker image has been created: `docker images`
-- Then, run it: `docker run --rm -it -v $PWD:/data spidy`
+- Then, run it with a data path mount (wherever you wish files to be written on your local disk). For example, to have results written to your local `/tmp` directory, run spidy with this command: `docker run --rm -it -v /tmp:/data spidy`
   - `--rm` tells Docker to clean up after itself by removing stopped containers.
   - `-it` tells Docker to run the container interactively and allocate a pseudo-TTY.
-  - `-v $PWD:/data` tells Docker to mount the current working directory as the `/data` directory inside the container. This is needed if you want Spidy's files (e.g. `crawler_done.txt`, `crawler_words.txt`, `crawler_todo.txt`) written back to your host filesystem.
+  - `-v /tmp:/data` tells Docker to mount a host path/directory as the `/data` directory inside the container. This is needed if you want Spidy's files (e.g. `crawler_done.txt`, `crawler_words.txt`, `crawler_todo.txt`) written back to your host filesystem.
+- To use custom spidy configurations, mount the configuration files from your host into any path inside the container. For example, assuming your config files are in `$HOME/.spidy`, mount them into `/config` inside the container with `-v ~/.spidy:/config/`. Then, when prompted for the custom config, provide the container path, like `/config/test.cfg`.
 
 ### Spidy Docker Demo
 
26 changes: 14 additions & 12 deletions spidy/crawler.py
@@ -180,6 +180,10 @@ def clear(self):
         with self.lock:
             self._set.clear()
 
+    def remove(self, o):
+        with self.lock:
+            self._set -= o
+
 
 class RobotsIndex(object):
     """
@@ -319,13 +323,13 @@ def crawl_worker(thread_id, robots_index):
                 write_log('CRAWL', f'Queried {str(COUNTER.val)} links.', worker=thread_id)
                 info_log()
                 write_log('SAVE', 'Saving files...')
-                save_files()
+                save_files(todo, done, words)
                 if ZIP_FILES:
                     zip_saved_files(time.time(), 'saved')
             finally:
                 # Reset variables
-                COUNTER = Counter(0)
-                WORDS.clear()
+                COUNTER = Counter(COUNTER.val - counter)
+                WORDS.remove(words)
         # Crawl the page
         else:
             try:
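The change in the `finally` block matters for multi-threaded counting: `Counter(0)` discarded links tallied by every worker, whereas `Counter(COUNTER.val - counter)` subtracts only the batch this worker just saved (assuming `counter` holds the worker's own tally, per the diff). A toy illustration with a simplified `Counter`:

```python
class Counter:
    """Simplified stand-in for the crawler's shared counter."""
    def __init__(self, val=0):
        self.val = val

COUNTER = Counter(25)  # 25 links queried by all workers combined
my_batch = 10          # links this worker just saved

# Old behavior: reset to zero, losing the other workers' 15 links.
old = Counter(0)

# New behavior: subtract only this worker's saved batch.
COUNTER = Counter(COUNTER.val - my_batch)
```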
@@ -481,32 +485,30 @@ def make_words(site):
     return word_list
 
 
-def save_files():
+def save_files(todo, done, words):
     """
     Saves the TODO, done, and word lists into their respective files.
     Also logs the action to the console.
     """
 
-    global TODO, DONE
-
-    with open(TODO_FILE, 'w', encoding='utf-8', errors='ignore') as todoList:
-        for site in copy(TODO.queue):
+    with open(TODO_FILE, 'w', encoding='utf-8', errors='ignore') as todo_list:
+        for site in todo:
             try:
-                todoList.write(site + '\n')  # Save TODO list
+                todo_list.write(site + '\n')  # Save TODO list
             except UnicodeError:
                 continue
     write_log('SAVE', f'Saved TODO list to {TODO_FILE}')
 
     with open(DONE_FILE, 'w', encoding='utf-8', errors='ignore') as done_list:
-        for site in copy(DONE.queue):
+        for site in done:
             try:
                 done_list.write(site + '\n')  # Save done list
             except UnicodeError:
                 continue
     write_log('SAVE', f'Saved DONE list to {TODO_FILE}')
 
     if SAVE_WORDS:
-        update_file(WORD_FILE, WORDS.get_all(), 'words')
+        update_file(WORD_FILE, words, 'words')
 
 
 def make_file_path(url, ext):
@@ -1218,7 +1220,7 @@ def done_crawling(keyboard_interrupt=False):
     else:
         write_log('CRAWL', 'I think you\'ve managed to download the entire internet. '
                            'I guess you\'ll want to save your files...')
-        save_files()
+        save_files(TODO.queue, DONE.queue, WORDS.get_all())
     LOG_FILE.close()
 
 
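Running through these hunks is one signature change: `save_files` now receives the TODO list, done list, and word set explicitly instead of reading module globals, which also makes it easy to exercise in isolation. A hedged sketch of the calling pattern (simplified stand-in types; in the real code `TODO`/`DONE` are queue objects and `WORDS` is a locked set):

```python
from copy import copy

def save_files(todo, done, words):
    # Simplified stand-in: "writes" the given snapshots and returns them.
    return list(todo), list(done), sorted(words)

TODO_SNAPSHOT = ["http://example.com/next"]
DONE_SNAPSHOT = ["http://example.com/seen"]
WORD_SNAPSHOT = {"spider", "web"}

# The caller passes explicit snapshots, so save_files never touches
# shared globals that another thread might be mutating.
saved_todo, saved_done, saved_words = save_files(
    copy(TODO_SNAPSHOT), copy(DONE_SNAPSHOT), set(WORD_SNAPSHOT))
```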
