Add Dockerfile to simplify installation #93
I haven't used Docker, so bear with me...
Thanks for working on this!
No prob - you can think of a Docker container as a lightweight VM (like VirtualBox, but with better tooling and less overhead). The Dockerfile automates building/configuring the container and the
This creates a fully isolated, reproducible installation of grab-site in a 200-300MB image. This image can be run on any host OS, including CoreOS where Python isn't even installed. Using Alpine as a base we could get this image down to 20-50MB, but that requires some modifications to py-lmdb.
As for testing, we can have https://hub.docker.com automatically rebuild the image whenever new code is pushed (see: https://docs.docker.com/docker-hub/builds/) and run Docker-based tests in Travis if you want: https://docs.travis-ci.com/user/docker/
Is there an issue filed somewhere for py-lmdb's failure to compile on Alpine Linux's gcc?
Oh, that explains it :-)
There isn't an issue filed on https://github.com/dw/py-lmdb/issues yet.
I haven't tried running grab-site, but it seems like installing py-lmdb works on
Dockerfile
Outdated
RUN pip3 install ./
RUN apt-get purge -y build-essential
RUN apt-get autoremove -y
RUN apt-get clean
Someone tells me that each `RUN` creates a new layer, so the purge/autoremove/clean would not reduce the size of the final Docker image. What do you think about combining the `RUN`s on lines 6-12 into one `RUN` command?
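The suggestion above could look roughly like the sketch below, written as a shell snippet that generates a candidate Dockerfile. The `python:3.4-slim` base and the pip/purge/autoremove/clean steps come from this PR; the `apt-get update`/`install` step, the `COPY` and `WORKDIR` lines, the `/app` path, and the `Dockerfile.combined` file name are assumptions for illustration, not the PR's actual lines 6-12.

```shell
# Sketch only: write a Dockerfile whose build-time packages are installed,
# used, and removed inside ONE RUN instruction, so they are never committed
# to a layer. The install/COPY steps and the file name are assumptions.
workdir=$(mktemp -d)
cat > "$workdir/Dockerfile.combined" <<'EOF'
FROM python:3.4-slim
COPY . /app
WORKDIR /app
RUN apt-get update \
 && apt-get install -y build-essential \
 && pip3 install ./ \
 && apt-get purge -y build-essential \
 && apt-get autoremove -y \
 && apt-get clean
EOF

# Count RUN instructions; a single one means the cleanup shares a layer
# with the install it is cleaning up after.
run_count=$(grep -c '^RUN ' "$workdir/Dockerfile.combined")
echo "RUN instructions: $run_count"
```

Building the generated file with `docker build -f` and comparing the result in `docker images` against the multi-`RUN` version would show whether the cleanup actually shrinks the image.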
"Someone" is me, in case additional clarification of this comment is needed :)
Another possibility is to use `FROM python:3.4` instead of `FROM python:3.4-slim`; the non-slim variant is based off of `buildpack-deps`, which has a lot of compilers / tools / libraries installed. The resulting total image size would be bigger, but the advantage is that the `buildpack-deps` portion would be shared with every other image based off of that, so in the usual case where you have several images, the total space usage would be lower.
Good point! I combined the commands & got the image size down to 235.4MB. I'm kinda surprised that images don't get flattened, but moby/moby#332 offers a lengthy discussion on it.
As for basing it on `python:3.4`, that would reduce the build time & total size of images on the system, but only if a significant number of the other images on the system are based on it too, which I don't think we can assume. It's probably better to just optimize for the smallest resulting image size.
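To make the trade-off concrete, here is a back-of-the-envelope sketch. Only the 235 MB figure comes from this thread (rounded from 235.4MB); the `python:3.4` base size and app-layer size are made-up placeholders, and `marginal_mb` is a hypothetical helper, not part of any tool.

```shell
# Hypothetical sizes in MB; only the 235 MB slim-based total is from this thread.
full_base=600      # assumed size of the python:3.4 (buildpack-deps) layers
full_app=40        # assumed grab-site layers on top of python:3.4
slim_total=235     # slim-based image, base and app layers together

# Extra disk space an image costs on a host: its own layers, plus the base
# layers unless another image on the host already provides them.
marginal_mb() {    # usage: marginal_mb BASE APP BASE_ALREADY_PRESENT(0|1)
    if [ "$3" = "1" ]; then
        echo "$2"
    else
        echo $(( $1 + $2 ))
    fi
}

echo "python:3.4 base, already on host: $(marginal_mb "$full_base" "$full_app" 1) MB"
echo "python:3.4 base, not on host:     $(marginal_mb "$full_base" "$full_app" 0) MB"
echo "python:3.4-slim image:            $slim_total MB"
```

Under these assumed numbers the larger base only wins when its layers are already present on the host, which is the condition both comments above are weighing.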
fb56712 to ce4d178
@@ -34,6 +34,7 @@ Note: grab-site currently **does not work with Python 3.5**; please use Python 3
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
This comment is a lie, sorry. I've been updating this TOC manually and probably don't want the `Tips for specific websites` expanded.
ok, removed
You're right about it working on Alpine - I was just missing
ee5a78e to 3393d82
README.md
Outdated
Start the grab-site server. You can set the port, volume, and name to whatever you want:

```bash
docker run --detach -p 29000:29000 -v /home/ludios/download/grab-site-data:/data --name warcfactory slang800/grab-site
```
How about just `~/grabs` instead of `/home/ludios/download/grab-site-data`?
docker requires an absolute path for mounts... I suppose I could do `$(pwd)/grabs`, if that's obvious to most users.
`~` will be made absolute by the shell, no?
$ echo ~/
/home/at/
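A quick way to check this with the actual mount string: the snippet below uses `echo` as a stand-in for `docker run`, so it shows the argument Docker would receive without needing Docker installed. The `~/grabs:/data` value follows the suggestion in this thread; the variable names are made up for illustration.

```shell
# The shell performs tilde expansion before the command runs, so the command
# receives an absolute path. `echo` stands in for `docker run -v` here.
mount_arg=$(echo ~/grabs:/data)
echo "-v $mount_arg"

# Quoting suppresses tilde expansion, so this form would NOT be absolute
# and would reach the command as a literal "~".
quoted_arg=$(echo "~/grabs:/data")
echo "-v $quoted_arg"
```

So an unquoted `-v ~/grabs:/data` should satisfy the absolute-path requirement, while a quoted one would not.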
Thanks for the fixes. I am currently somewhat busy and under-dockered, can a grab-site user please give the Docker instructions a try and see if they work? (And let me know if you had to perform any other steps to make this a useful setup?)
I had docker already, but it's worth linking to the https://docs.docker.com/engine/installation/ instructions.
I tried it with:
Ran (sudo for docker commands because I skipped this step https://docs.docker.com/engine/installation/linux/ubuntulinux/#/create-a-docker-group):
Crawl finished successfully!
I tried this out, but couldn't find a way to attach a terminal to a
Would adding tmux to the container and using tmux work? (Note, tmux 2.1 is broken; 1.8 is a known-good version.) I just hope that
Also, running
Splitting up the server and client would make sense, especially since you could then run them on different machines, but I should probably do that as a separate PR, since I'll need to look into how they communicate.
Would using dumb-init as PID 1 allow the orphaned grab-site processes to keep running in the case where
You could run
hey people! what is the status of this PR? I could give a hand.
For now, I would like someone else to be the Dockerized grab-site upstream. I don't use Docker and I don't have the resources to 1) figure out if a PR is taking the right approach with Dockerization (which base? which init? one container per grab-site? how to integrate tmux, if needed?) 2) double my manual testing matrix. So, please, have at it and promote your fork/Dockerfile here. If you (or someone else) stays interested in maintaining and testing it, I might take a PR in the future.
@semente I've been using it pretty often for my own projects, and it works fine, but I haven't rebased it since 2016. I'll try rebasing and pushing a new image to the Docker hub.
Ok, I'll keep an image updated over here: https://cloud.docker.com/u/slang800/repository/docker/slang800/grab-site
@notslang Thank you for all this work. Can you confirm that your fork still works fine? I am curious if you ran into any issues or discovered anything of note.
It says updated 3 years ago, any plans to update it? Or any plans to officially ship a Dockerfile for this?
FYI this third party grab-site Dockerfile currently works as of this comment being posted: https://github.com/Nold360/docker-grab-site.
I'm deploying a couple instances of grab-site to a CoreOS cluster, so I made a Dockerfile... Hopefully this is a bit easier to use than pip/virtualenv. The reason why this uses the larger `python:3.4-slim` image (rather than `python:3.4-alpine`) is because Alpine had some issues compiling https://github.com/dw/py-lmdb with its version of gcc.
This PR still needs docs, so it's a work-in-progress right now.
After starting the container you can use the regular `grab-site` command via `docker exec <container-name> grab-site <args and site url>`.
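As a rough illustration of that `docker exec` pattern, the sketch below wraps it in a small shell function. `grab_site_in` and the `DRY_RUN` switch are hypothetical helpers, not part of grab-site or Docker; `warcfactory` is the container name from the README example in this PR, and `https://example.com/` is a placeholder URL.

```shell
# Hypothetical helper: run grab-site inside an already-running container.
# With DRY_RUN=1 it only prints the command, which is handy for checking
# the arguments on a machine without Docker.
grab_site_in() {
    container=$1
    shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo docker exec "$container" grab-site "$@"
    else
        docker exec "$container" grab-site "$@"
    fi
}

# Print the command for the "warcfactory" container from the README example.
DRY_RUN=1
grab_site_in warcfactory https://example.com/
```

Unsetting `DRY_RUN` would make the same call execute `docker exec` for real, assuming a container with that name is running.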