
Use installed packages to solve dependency graph #1596

Closed

sfarina opened this issue Mar 7, 2022 · 12 comments
Labels
cache Related to dependency cache

Comments

@sfarina commented Mar 7, 2022

What's the problem this feature will solve?

I'm trying to build stable requirements.txt files for docker containers built on top of existing containers (specifically pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime). To save on image size, these containers don't keep the pip cache intact, so pip-compile takes a few minutes to download a large (~1 GB) wheel, defeating part of the purpose of using a base docker image. I run pip-compile inside a docker build -f pipcompile.dockerfile.

Describe the solution you'd like

Any of the following (see the hypothetical flag sketch below):

  1. an option to tell pip-compile not to solve for package X, or anything it depends on (the docker container already manages package X)
  2. an option to tell pip-compile not to solve for any package that is already installed
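For illustration, the two options might look something like this (both flags are hypothetical, sketched only to make the request concrete; neither exists in pip-compile today):

$ pip-compile --assume-installed torch   # option 1 (hypothetical): don't resolve torch or its dependencies
$ pip-compile --assume-installed-all     # option 2 (hypothetical): trust every already-installed package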

Alternative Solutions

  1. wait for the wheel to be downloaded every time
  2. cache the large wheel in the docker build before the pip-compile step

Additional context

(user@host)$ docker run --rm -it pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime bash
(root@docker)# pip install pip-tools
(root@docker)# pip install torch==1.10.0
    Requirement already satisfied: torch==1.10.0 in /opt/conda/lib/python3.7/site-packages (1.10.0)
(root@docker)# echo "torch==1.10.0" > requirements.in
(root@docker)# python3 -m piptools compile -v
#...
  torch==1.10.0 not in cache, need to check index
# takes 3 minutes to download a large wheel that's already installed
@AndydeCleyre (Contributor)

OK, I don't have a great handle on every aspect of the caching, and I don't know if this is 100% satisfactory, but this can be worked around somewhat by copying/mounting/sharing a very small JSON cache file (not the wheel itself).

I ran pip-compile in the container, which indeed takes too long and does too much work. This generated /root/.cache/pip-tools/depcache-cp3.7.json (formatted for this comment):

{
  "__format__": 1,
  "dependencies": {
    "torch": {
      "1.10.0": [
        "typing-extensions"
      ]
    },
    "typing-extensions": {
      "4.1.1": []
    }
  }
}

And now:

$ podman run --rm -it -v $PWD/depcache-cp3.7.json:/root/.cache/pip-tools/depcache-cp3.7.json:rw docker://pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime bash
# pip install pip-tools
# echo "torch==1.10.0" >requirements.in
# time pip-compile
#
# This file is autogenerated by pip-compile with python 3.7
# To update, run:
#
#    pip-compile
#
torch==1.10.0
    # via -r requirements.in
typing-extensions==4.1.1
    # via torch

real    0m1.662s
user    0m0.427s
sys     0m0.048s

@sfarina (Author) commented Mar 8, 2022

Thanks for looking into this so quickly!

I'll try this out. It's not ideal, since I'll have to explain the magic config file, but it's a better workaround than mine, which downloads and caches the wheel in the docker build before installing pip-tools.

@sfarina (Author) commented Mar 8, 2022

Maybe this is a problem with how pip/PyPI works. The dependency graph should be solvable without downloading ANY packages, if PyPI exposed some small metadata file(s) or an API.
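In fact, PyPI's JSON API already exposes a release's declared dependencies without the wheel, for example (note that requires_dist can be null when a package's uploaded metadata is incomplete, so this isn't universally reliable):

$ curl -s https://pypi.org/pypi/torch/1.10.0/json | python3 -c "import json, sys; print(json.load(sys.stdin)['info']['requires_dist'])"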

I think it would still be nice to have requirements.in parse something like

torch==installed # or ignored, or existing, ...
transformers

to ignore/trust an existing package, but maybe I'm alone in that.

Update: it would also do away with the need for the magic config-file mount (depcache-cp3.7.json:/root/.cache/pip-tools/depcache-cp3.7.json).

@AndydeCleyre (Contributor)

I'm still interested in this issue, and I still don't have all the answers. But I'll now add another workaround, oriented toward your last suggestion.

Be warned: it sacrifices the total locking guarantees, but will "probably" (😓) be fine:

# echo "transformers" >>locked.in
# pip-compile locked.in
# echo "-r locked.txt" >>requirements.txt

# echo "torch" >>installed.txt
# echo "-r installed.txt" >>requirements.txt

# pip install -r requirements.txt

@sfarina (Author) commented Mar 8, 2022

I'm OK with sacrificing "total locking" guarantees, as some of the locking is done by pinning a docker base image.

Thanks for the new workaround, but I don't think it would fix the situation, since transformers depends on torch (which you wouldn't have known a priori). I'm in a meeting but can test in a bit.

@AndydeCleyre (Contributor)

Oh yeah sorry, this method won't help in this case.

@AndydeCleyre (Contributor)

> Maybe this is a problem with how pip/PyPI works. The dependency graph should be solvable without downloading ANY packages, if PyPI exposed some small metadata file(s) or an API.

I think this is it, really. Unfortunately, especially with setup.py packages, arbitrary code can be run on the installing system to determine the requirements, so we can't rely on simple static dependency declarations universally.

That said, in the container you're using, the relevant info does seem to be available in /opt/conda/pkgs/*/info/*.json.

And in normally installed packages, we may find the details needed in e.g. /usr/lib/python3/dist-packages/*.egg-info/{PKG-INFO,requires.txt}.

So maybe we can update our cache file with data from those sources. If we do, I don't know if it should be done by default, as it has different security implications than using the PyPI data/packages.
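As a quick illustration of reading that metadata from an installed package (a sketch; importlib.metadata is standard library on Python 3.8+, while the python 3.7 in this container would need the importlib_metadata backport or pkg_resources instead):

# python3 -c "import importlib.metadata as m; print(m.requires('torch'))"

That prints the raw Requires-Dist strings, environment markers included, so anything seeding a cache from them would still have to evaluate the markers.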

I'll also link some related issues:

@sfarina
Copy link
Author

sfarina commented Mar 21, 2022

> So maybe we can update our cache file with data from those sources. If we do, I don't know if it should be done by default, as it has different security implications than using the PyPI data/packages.

Having an option to look through /path/to/python/{dist,site}-packages/*.egg-info/<files> to update the cache would be nice, but the future is probably the .whl.METADATA issue you linked, whenever that is finished.
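For what it's worth, here's a rough sketch of what such an option could do, assuming the depcache-cp3.7.json format shown earlier in this thread (that format is a pip-tools internal and may change; the packaging library must be installed, and the name lowercasing is only an approximation of real name canonicalization):

# python3 - <<'EOF'
import json
import importlib.metadata as md  # stdlib on Python 3.8+; use the importlib_metadata backport on 3.7
from packaging.requirements import Requirement

# Build a pip-tools-style dependency cache from the installed distributions.
cache = {"__format__": 1, "dependencies": {}}
for dist in md.distributions():
    name = dist.metadata["Name"].lower()  # approximate canonicalization
    deps = []
    for req_str in dist.requires or []:
        req = Requirement(req_str)
        # Keep only requirements whose environment markers apply here;
        # passing extra="" makes 'extra == "..."' markers evaluate false.
        if req.marker is None or req.marker.evaluate({"extra": ""}):
            deps.append(req.name.lower())
    cache["dependencies"].setdefault(name, {})[dist.version] = sorted(deps)

with open("depcache-seed.json", "w") as f:
    json.dump(cache, f, indent=2)
EOF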

@atugushev added the cache (Related to dependency cache) label on Apr 6, 2022
@EpicWink
I have a similar problem, but in my case the dependency[^1] I have installed in my Docker image is not available on PyPI or our internal index (though it may be in our internal index in the future[^2]), due to it requiring system libraries. This means I need one of the first two options in the OP (so I can lock all dependencies): consider installed (preferred) or ignore specific packages.

Footnotes

[^1]: If you want to know, the dependency is (a fork of) OpenSfM, requiring OpenCV (and its Python bindings, which we manage manually) and Ceres-solver.

[^2]: We'll likely turn the opensfm wheels into manylinux wheels, then distribute them in our internal index, but I'd like to find a way to use the OpenCV distributed with opencv-python.

@sfarina (Author) commented Aug 17, 2022

Since this is an edge case relevant to my docker workflow, I'll post my docker workaround: use a BuildKit cache mount for the pip cache:

RUN --mount=type=cache,target=/root/.cache/pip python3 -m piptools compile -v

That should avoid repeatedly downloading big wheels (until the cache is cleared).

It might be better to use a real mount instead of a cache mount, but I'm no docker expert, so by all means experiment.
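In context, the whole pipcompile.dockerfile might look roughly like this (a sketch only; the base image is the one from the original post, and everything else is illustrative):

# syntax=docker/dockerfile:1
FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
RUN pip install pip-tools
COPY requirements.in .
# BuildKit persists /root/.cache/pip in a named cache across builds,
# so the big torch wheel is downloaded at most once.
RUN --mount=type=cache,target=/root/.cache/pip python3 -m piptools compile -v

Building it requires BuildKit, e.g. DOCKER_BUILDKIT=1 docker build -f pipcompile.dockerfile .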

@sfarina closed this as completed on May 16, 2024
@EpicWink commented May 16, 2024

@sfarina you closed this as completed: could you please link the pull request which completed this issue? Or are you thinking that your Docker cache mount solves your use case, in which case could you please instead close as won't fix?

@sfarina reopened this on May 16, 2024
@sfarina (Author) commented May 16, 2024

won't fix / stale

@sfarina closed this as not planned (won't fix / stale) on May 16, 2024