
Allow easier installation of extra packages #25

Open · stsievert opened this issue Jul 19, 2020 · 9 comments
Labels: docker, documentation, enhancement

@stsievert (Contributor):

Let's say I want to create a custom Docker image for CPU workers. The docs say that I should follow this process:

  1. Create and test a Docker image
  2. Push that image to Docker Hub (creating an account if necessary).

That's a fair amount of work, especially if the user is unfamiliar with Docker. It'd be nice to avoid that work.

@stsievert (Contributor, Author) commented Jul 19, 2020:

I suspect there is a way to avoid this work, especially because it appears Dask-CHTC uses dask-docker. The daskdev/dask image's entrypoint checks the `EXTRA_APT_PACKAGES`, `EXTRA_CONDA_PACKAGES`, and `EXTRA_PIP_PACKAGES` environment variables to install extra packages, and also looks for an `environment.yml` file to see if the environment should be updated. The relevant code is at base/prepare.sh#L12.

If there is an `environment.yml` file, daskdev/dask runs `conda env update -f environment.yml`. Relying on that behavior would simplify the installation docs. Instead of verbose documentation on the two steps (install extra packages and install Dask-CHTC), they could become something like this:

Install packages

Using a conda environment

Create a conda environment file. The `environment.yml` file looks like this:

```yaml
name: chtc
channels:
  - defaults
  - conda-forge
dependencies:
  - python>=3.8   # Dask-CHTC requires Python 3.8 or newer.
  - numpy
  - scipy
  - pandas
  - scikit-learn
  - dask
  - distributed
  - dask-ml
  - spacy  # for NLP
  - pip:
    - requests
    - fastapi  # a web server similar to Flask
```

After this file is created, run `conda env create -f environment.yml`.

Using pip/conda

...

@JoshKarpel added the documentation and enhancement labels Jul 20, 2020
@JoshKarpel (Contributor) commented Jul 20, 2020:

Unfortunately, I don't think we can assume the user is inheriting from daskdev/dask. In particular, to use CHTC GPU nodes, they must not, because it doesn't have the right ancestors. Doubly unfortunate: it looks like they only look for the `environment.yml` file in `/opt/app`, I assume because they expect you to bind-mount your code into that directory. HTCondor doesn't let us control the target of the bind-mount it uses, so that won't work for us.

However, I do like the idea of the entrypoint being able to install extra packages, and there's no reason we can't implement something similar in our own entrypoint script. This is pretty foreign to us in CHTC-land because we usually want people to bake everything into their image up-front, but I like the added flexibility this provides.
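For concreteness, a bare-bones version of such an entrypoint might look like the sketch below. The `EXTRA_*_PACKAGES` names are borrowed from daskdev/dask's convention; none of this is existing Dask-CHTC code.

```python
# Sketch of an entrypoint that installs extra packages before starting the
# worker. Hypothetical: borrows daskdev/dask's EXTRA_*_PACKAGES convention.
import os
import subprocess
import sys

def install_extras() -> None:
    conda_pkgs = os.environ.get("EXTRA_CONDA_PACKAGES")
    pip_pkgs = os.environ.get("EXTRA_PIP_PACKAGES")
    if conda_pkgs:
        subprocess.run(["conda", "install", "--yes", *conda_pkgs.split()], check=True)
    if pip_pkgs:
        subprocess.run([sys.executable, "-m", "pip", "install", *pip_pkgs.split()], check=True)

if __name__ == "__main__":
    install_extras()
    if len(sys.argv) > 1:
        # Hand control to the real command (e.g. dask-worker ...).
        os.execvp(sys.argv[1], sys.argv[1:])
```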

My instinct is to provide a `conda_environment` argument for `CHTCCluster` that takes a path to a conda environment file and does the same thing as `prepare.sh` does. I guess the expectation would be that you have that file sitting around on the submit node with the rest of your code, and point your `CHTCCluster` to it as well, thus (hopefully) producing identical environments on both sides. We would still need the Docker guide, but you wouldn't need it at all if all you want to do is take an existing conda-based image and add packages to it (which is what the two existing examples do!). @stsievert, thoughts?
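Usage might then look something like this (a hypothetical sketch; the `conda_environment` parameter doesn't exist yet):

```python
from dask_chtc import CHTCCluster

# conda_environment is the proposed (not yet implemented) argument; the same
# environment.yml would also be used to build the environment on the submit node.
cluster = CHTCCluster(
    conda_environment="environment.yml",
)
```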

@JoshKarpel self-assigned this Jul 20, 2020
@stsievert changed the title from "Provide easier Docker image creation" to "Allow easier installation of extra packages" Jul 21, 2020
@stsievert (Contributor, Author) commented:

> to use CHTC GPU nodes, they must not [inherit from daskdev/dask] ... I do like the idea of the entrypoint being able to install extra packages
> ...
> We would still need the Docker guide, but you wouldn't need it at all if all you want to do is take an existing conda-based image and add packages to it

That's the main motivation for this issue, and the reason I specified a CPU node in #25 (comment). I've re-titled this issue to more accurately reflect my concern.

> I guess the expectation would be that you have that file sitting around on the submit node with the rest of your code, and point your `CHTCCluster` to it as well

Conda environments are just YAML files; if we're doing this from Python, we should support dictionaries (i.e., parsed YAML files) too.

> provide a `conda_environment` argument for `CHTCCluster` that takes a path to a conda environment file and does the same thing as `prepare.sh` does

Maybe add arguments for the other install options daskdev/dask supports?

```python
CHTCCluster(
    ...,
    conda_env: Optional[Union[str, Path, dict]] = None,
    pip_packages: Optional[List[str]] = None,
    apt_packages: Optional[List[str]] = None,
    conda_packages: Optional[List[str]] = None,
)
```
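Normalizing that union type could be as simple as the following sketch (`normalize_conda_env` is a hypothetical helper, not part of Dask-CHTC):

```python
from pathlib import Path
from typing import Union

import yaml  # PyYAML

def normalize_conda_env(conda_env: Union[str, Path, dict]) -> str:
    """Return the environment as YAML text, whether given a path or a parsed dict."""
    if isinstance(conda_env, dict):
        return yaml.safe_dump(conda_env)
    return Path(conda_env).read_text()
```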

@JoshKarpel (Contributor) commented:

> Conda environments are just YAML files; if we're doing this from Python, we should support dictionaries (i.e., parsed YAML files) too.

Agreed!

I'm not sure we can do `apt_packages`, again because we don't know what platform the worker image will be (`conda_packages` and `pip_packages` are fine; to state it explicitly, I am OK with strongly encouraging a conda-first workflow). Do you think there's any value in forcing people to write conda environments by not providing the `pip_packages`/`conda_packages` options?

I suppose that since we're doing this in our own wrapper script, we could do `system_packages` and try to discover the system package manager ourselves. That's a little insane, because packages tend to be named differently across package managers, but it is a possibility. My sense is that installing system packages is probably the least useful of these options, especially if we get `conda_*` working, so I'm going to forge ahead with the others first.
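(For reference, the discovery half could be as simple as this sketch; purely illustrative, not a proposed implementation:)

```python
import shutil
from typing import Optional

def find_system_package_manager() -> Optional[str]:
    """Return the first known package manager found on PATH, if any."""
    for pm in ("apt-get", "dnf", "yum", "apk", "zypper"):
        if shutil.which(pm) is not None:
            return pm
    return None
```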

@stsievert (Contributor, Author) commented Jul 21, 2020:

> again because we don't know what platform the worker image will be

I was operating under the assumption that Dask-CHTC would still use the daskdev/dask Docker images and install extra packages on top of that image. Is that not what you were thinking? daskdev/dask supports extra pip/conda/apt packages and an `environment.yml` file (prepare.sh). Can't Dask-CHTC do some preprocessing before worker launch to set the correct environment variables and copy/write `environment.yml` to the correct location?
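To illustrate, that preprocessing could be a small mapping from keyword arguments to the environment variables that `prepare.sh` reads (a sketch; `extra_package_env` is a made-up helper):

```python
from typing import Dict, List, Optional

def extra_package_env(
    pip: Optional[List[str]] = None,
    conda: Optional[List[str]] = None,
    apt: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Translate package lists into the EXTRA_*_PACKAGES variables prepare.sh reads."""
    env = {}
    if apt:
        env["EXTRA_APT_PACKAGES"] = " ".join(apt)
    if conda:
        env["EXTRA_CONDA_PACKAGES"] = " ".join(conda)
    if pip:
        env["EXTRA_PIP_PACKAGES"] = " ".join(pip)
    return env
```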

If that's not possible, I'm fine with an environment file or extra pip/conda packages.

@JoshKarpel (Contributor) commented:

daskdev/dask is just the default; I want to leave it flexible enough to allow people to use any image they'd like as long as it can start a Dask worker.

@stsievert (Contributor, Author) commented:

> I want to leave it flexible enough to allow people to use any image they'd like

So if I want to use GPUs, would it be possible to do this?

```python
CHTCCluster(
    image="pytorch/pytorch:1.5.1-cuda10.1-cudnn7-runtime",
    conda_env={"dependencies": ["pytorch", "torchvision"], "channels": ["conda-forge"]},
)
```

That's pretty much what the GPU example in the docs does: it inherits from the PyTorch image, installs the requirements, and then launches tini (which can wrap an image's existing entrypoint; see tini/README.md#existing-entrypoints).

@JoshKarpel (Contributor) commented:

I think it would be very close... I'll need to investigate a little. We might be able to get away with installing tini ourselves and then using it in the entrypoint script.

@JoshKarpel (Contributor) commented Aug 28, 2020:

Per discussion in #31, this is much harder than expected. The current plan is to solve this problem by transparently building Docker images on our build nodes, using podman/buildah. So something like this:

```python
CHTCCluster(
    image="pytorch/pytorch:1.5.1-cuda10.1-cudnn7-runtime",
    conda_env={"dependencies": ["pytorch", "torchvision"], "channels": ["conda-forge"]},
)
```

would implicitly submit a build job that builds a Docker image with `image` as the base and installs the `conda_env`, then uploads the result to the CHTC-local Docker registry. We should tag/cache these images based on the inputs so that we don't need to rebuild as often.

Since building the image will take some time, something will need to block. The best option is probably (unfortunately) to block during `__init__`, since we need to know the image tag to construct the submit description. I'm imagining a method `_ensure_image() -> str` that builds the image if it doesn't exist (blocking) and returns the right thing to put in `docker_image` in the submit description.
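To sketch what I mean (the cache key, registry name, and helper methods here are all assumptions, not decided behavior):

```python
import hashlib

def _ensure_image(self) -> str:
    """Build the customized image if it doesn't exist (blocking), return its tag."""
    # Cache key derived from the build inputs, so identical requests reuse an image.
    key = hashlib.sha256((self.image + repr(self.conda_env)).encode()).hexdigest()[:12]
    tag = f"chtc-registry.example/dask-chtc/custom:{key}"
    if not self._image_exists(tag):      # hypothetical registry lookup
        self._submit_build_job(tag)      # hypothetical blocking podman/buildah build job
    return tag
```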
