
Allow easier installation of extra packages #25

Open · stsievert opened this issue Jul 19, 2020 · 9 comments
Labels: docker, documentation, enhancement

@stsievert (Contributor):

Let's say I want to create a custom Docker image for CPU workers. The docs say that I should follow this process:

  1. Create and test a Docker image
  2. Push that image to Docker Hub (creating an account if necessary).

That's a fair amount of work, especially if the user is unfamiliar with Docker. It'd be nice to avoid that work.

@stsievert (Contributor, Author) commented Jul 19, 2020:

I suspect there is a way to avoid this work, especially because it appears Dask-CHTC uses dask-docker. The daskdev/dask image's entrypoint checks the `EXTRA_APT_PACKAGES`, `EXTRA_CONDA_PACKAGES`, and `EXTRA_PIP_PACKAGES` environment variables to install extra packages, and also looks for an `environment.yml` file to see if the environment should be updated. The relevant code is at base/prepare.sh#L12.

If there is an `environment.yml` file, daskdev/dask runs `conda env update -f environment.yml`. Relying on that behavior would simplify the installation docs. Instead of verbose documentation on the two steps (install extra packages and install Dask-CHTC), they could become something like this:

Install packages

Using a conda environment

Create a conda environment file. The `environment.yml` file looks like this:

```yaml
name: chtc
channels:
  - defaults
  - conda-forge
dependencies:
  - python>=3.8   # Dask-CHTC requires Python 3.8 or newer.
  - numpy
  - scipy
  - pandas
  - scikit-learn
  - dask
  - distributed
  - dask-ml
  - spacy  # for NLP
  - pip:
    - requests
    - fastapi  # a web server similar to Flask
```

After this file is created, run `conda env create -f environment.yml`.

Using pip/conda

...

@JoshKarpel added the documentation and enhancement labels Jul 20, 2020
@JoshKarpel (Contributor) commented Jul 20, 2020:

Unfortunately, I don't think we can assume the user is inheriting from daskdev/dask. In particular, to use CHTC GPU nodes, they must not, because it doesn't have the right ancestors. Doubly unfortunate: it looks like they only look for the `environment.yml` file in `/opt/app`, I assume because they expect you to bind-mount your code into that directory. HTCondor doesn't let us control the target of the bind-mount it uses, so that won't work for us.

However, I do like the idea of the entrypoint being able to install extra packages, and there's no reason we can't implement something similar in our own entrypoint script. This is pretty foreign to us in CHTC-land because we usually want people to bake everything into their image up-front, but I like the added flexibility this provides.
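For concreteness, a bare-bones version of such an entrypoint might look like the sketch below. The `EXTRA_*_PACKAGES` names are borrowed from daskdev/dask's convention; none of this is existing Dask-CHTC code.

```python
# Sketch of an entrypoint that installs extra packages before starting the
# worker. Hypothetical: borrows daskdev/dask's EXTRA_*_PACKAGES convention.
import os
import subprocess
import sys

def install_extras() -> None:
    conda_pkgs = os.environ.get("EXTRA_CONDA_PACKAGES")
    pip_pkgs = os.environ.get("EXTRA_PIP_PACKAGES")
    if conda_pkgs:
        subprocess.run(["conda", "install", "--yes", *conda_pkgs.split()], check=True)
    if pip_pkgs:
        subprocess.run([sys.executable, "-m", "pip", "install", *pip_pkgs.split()], check=True)

if __name__ == "__main__":
    install_extras()
    if len(sys.argv) > 1:
        # Hand control to the real command (e.g. dask-worker ...).
        os.execvp(sys.argv[1], sys.argv[1:])
```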

My instinct is to provide a `conda_environment` argument for `CHTCCluster` that takes a path to a conda environment file and does the same thing as `prepare.sh` does. I guess the expectation would be that you have that file sitting around on the submit node with the rest of your code, and point your `CHTCCluster` to it as well, thus (hopefully) producing identical environments on both sides. We would still need the Docker guide, but you wouldn't need it at all if all you want to do is take an existing conda-based image and add packages to it (which is what the two existing examples do!). @stsievert, thoughts?
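Usage might then look something like this (a hypothetical sketch; the `conda_environment` parameter doesn't exist yet):

```python
from dask_chtc import CHTCCluster

# conda_environment is the proposed (not yet implemented) argument; the same
# environment.yml would also be used to build the environment on the submit node.
cluster = CHTCCluster(
    conda_environment="environment.yml",
)
```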

@JoshKarpel self-assigned this Jul 20, 2020
@stsievert changed the title from "Provide easier Docker image creation" to "Allow easier installation of extra packages" Jul 21, 2020
@stsievert (Contributor, Author) commented:

> to use CHTC GPU nodes, they must not [inherit from daskdev/dask] ... I do like the idea of the entrypoint being able to install extra packages
> ...
> We would still need the Docker guide, but you wouldn't need it at all if all you want to do is take an existing conda-based image and add packages to it

That's the main motivation for this issue, and the reason I specified a CPU node in #25 (comment). I've re-titled this issue to more accurately reflect my concern.

> I guess the expectation would be that you have that file sitting around on the submit node with the rest of your code, and point your `CHTCCluster` to it as well

Conda environments are just YAML files; if we're doing this from Python, we should support dictionaries (i.e., parsed YAML files) too.

> provide a `conda_environment` argument for `CHTCCluster` that takes a path to a conda environment file and does the same thing as `prepare.sh` does

Maybe add arguments for the other install options daskdev/dask supports?

```python
CHTCCluster(
    ...,
    conda_env: Optional[Union[str, Path, dict]] = None,
    pip_packages: Optional[List[str]] = None,
    apt_packages: Optional[List[str]] = None,
    conda_packages: Optional[List[str]] = None,
)
```
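Normalizing that union type could be as simple as the following sketch (`normalize_conda_env` is a hypothetical helper, not part of Dask-CHTC):

```python
from pathlib import Path
from typing import Union

import yaml  # PyYAML

def normalize_conda_env(conda_env: Union[str, Path, dict]) -> str:
    """Return the environment as YAML text, whether given a path or a parsed dict."""
    if isinstance(conda_env, dict):
        return yaml.safe_dump(conda_env)
    return Path(conda_env).read_text()
```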

@JoshKarpel (Contributor) commented:

> Conda environments are just YAML files; if we're doing this from Python, we should support dictionaries (i.e., parsed YAML files) too.

Agreed!

I'm not sure we can do `apt_packages`, again because we don't know what platform the worker image will be (`conda_packages` and `pip_packages` are fine; to state it explicitly, I am OK with strongly encouraging a conda-first workflow). Do you think there's any value in forcing people to write conda environments by not providing the `pip_packages`/`conda_packages` options?

I suppose that since we're doing this in our own wrapper script, we could do `system_packages` and try to discover the system package manager ourselves. That's a little insane, because packages tend to be named differently across package managers, but it is a possibility. My sense is that installing system packages is probably the least useful of these options, especially if we get `conda_*` working, so I'm going to forge ahead with the others first.
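(For reference, the discovery half could be as simple as this sketch; purely illustrative, not a proposed implementation:)

```python
import shutil
from typing import Optional

def find_system_package_manager() -> Optional[str]:
    """Return the first known package manager found on PATH, if any."""
    for pm in ("apt-get", "dnf", "yum", "apk", "zypper"):
        if shutil.which(pm) is not None:
            return pm
    return None
```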

@stsievert (Contributor, Author) commented Jul 21, 2020:

> again because we don't know what platform the worker image will be

I was operating under the assumption that Dask-CHTC would still use the daskdev/dask Docker images and install extra packages on top of that image. Is that not what you were thinking? daskdev/dask supports extra pip/conda/apt packages and an `environment.yml` file (prepare.sh). Can't Dask-CHTC do some preprocessing before worker launch to set the correct environment variables and copy/write `environment.yml` to the correct location?
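To illustrate, that preprocessing could be a small mapping from keyword arguments to the environment variables that `prepare.sh` reads (a sketch; `extra_package_env` is a made-up helper):

```python
from typing import Dict, List, Optional

def extra_package_env(
    pip: Optional[List[str]] = None,
    conda: Optional[List[str]] = None,
    apt: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Translate package lists into the EXTRA_*_PACKAGES variables prepare.sh reads."""
    env = {}
    if apt:
        env["EXTRA_APT_PACKAGES"] = " ".join(apt)
    if conda:
        env["EXTRA_CONDA_PACKAGES"] = " ".join(conda)
    if pip:
        env["EXTRA_PIP_PACKAGES"] = " ".join(pip)
    return env
```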

If that's not possible, I'm fine with an environment file or extra pip/conda packages.

@JoshKarpel (Contributor) commented:

daskdev/dask is just the default; I want to leave it flexible enough to allow people to use any image they'd like as long as it can start a Dask worker.

@stsievert (Contributor, Author) commented:

> I want to leave it flexible enough to allow people to use any image they'd like

So if I want to use GPUs, would it be possible to do this?

```python
CHTCCluster(
    image="pytorch/pytorch:1.5.1-cuda10.1-cudnn7-runtime",
    conda_env={"dependencies": ["pytorch", "torchvision"], "channels": ["conda-forge"]},
)
```

That's pretty much what the GPU example in the docs does: it inherits from the PyTorch image, installs the requirements, and then launches tini (which can wrap an image's existing entrypoint; see tini/README.md#existing-entrypoints).

@JoshKarpel (Contributor) commented:

I think it would be very close... I'll need to investigate a little. We might be able to get away with installing tini ourselves and then using it in the entrypoint script.

@JoshKarpel (Contributor) commented Aug 28, 2020:

Per discussion in #31, this is much harder than expected. The current plan is to solve this problem by transparently building Docker images on our build nodes, using podman/buildah. So something like this:

```python
CHTCCluster(
    image="pytorch/pytorch:1.5.1-cuda10.1-cudnn7-runtime",
    conda_env={"dependencies": ["pytorch", "torchvision"], "channels": ["conda-forge"]},
)
```

would implicitly submit a build job that builds a Docker image with `image` as the base and installs the `conda_env`, then uploads the result to the CHTC-local Docker registry. We should tag/cache these images based on the inputs so that we don't need to rebuild as often.

Since building the image will take some time, something will need to block. The best option is probably (unfortunately) to block during `__init__`, since we need to know the image tag to construct the submit description. I'm imagining a method `_ensure_image() -> str` that builds the image if it doesn't exist (blocking) and returns the right thing to put in `docker_image` in the submit description.
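To sketch what I mean (the cache key, registry name, and helper methods here are all assumptions, not decided behavior):

```python
import hashlib

def _ensure_image(self) -> str:
    """Build the customized image if it doesn't exist (blocking), return its tag."""
    # Cache key derived from the build inputs, so identical requests reuse an image.
    key = hashlib.sha256((self.image + repr(self.conda_env)).encode()).hexdigest()[:12]
    tag = f"chtc-registry.example/dask-chtc/custom:{key}"
    if not self._image_exists(tag):      # hypothetical registry lookup
        self._submit_build_job(tag)      # hypothetical blocking podman/buildah build job
    return tag
```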
