Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve speed of Poetry on CI #332

Closed
amotl opened this issue Jan 24, 2021 · 22 comments
Closed

Improve speed of Poetry on CI #332

amotl opened this issue Jan 24, 2021 · 22 comments

Comments

@amotl
Copy link
Member

amotl commented Jan 24, 2021

Hi there,

@gutzbenj tried to improve speed on GHA CI through #328 and I later chimed in through #331.

We observed different problems, specifically when caching whole Python virtualenvs using GHA's actions/cache and are discussing it with @sondrelg at snok/install-poetry#18.

However, the recommendation is actually not to do it at all, see actions/cache#175. Otherwise, this will probably bring in more problems than not. We should really trust @webknjaz and @pradyunsg here:

Usually only pip's wheel cache should be cached in GHA. Caching the whole interpreter installation is going to create problems every time they upgrade it, see actions/cache#175 (comment).

Please don't cache site-packages or entire interpreter trees, see actions/cache#175 (comment).

The reason why it takes twice the time to install packages from the download cache into the environment on Windows is that Windows is just slow when handling thousands of small files, as @maphew suggested at actions/cache#175 (comment).

With kind regards,
Andreas.

@amotl
Copy link
Member Author

amotl commented Jan 24, 2021

My observation currently is that the process might be slower with caching than without!?

image
-- https://github.com/earthobservations/wetterdienst/actions/runs/507624752

image
-- https://github.com/earthobservations/wetterdienst/actions/runs/507642040

@amotl
Copy link
Member Author

amotl commented Jan 24, 2021

Regarding performance improvements wrt. observations by others, python-poetry/poetry#2184 (comment) by @TheFriendlyCoder might be important to watch. Also, some comments at python-poetry/poetry#2094 might yield more insights.

cc @sondrelg

@sondrelg
Copy link

Don't have much to add here, but I guess whether adding caching for Windows runs is faster or not, should boil down to whether it would be faster to retrieve a cache or download packages from PyPi - everything that happens after that should be the same.

Looking at this run, it seems your total cache size is only 50mb, and is retrieved in 1s, so generally I would expect that to beat PyPi, but I guess this isn't really a bottleneck either way.

@amotl
Copy link
Member Author

amotl commented Jan 25, 2021

It really feels like Poetry is doing some kind of network roundtrips even if all dependencies can be satisfied from the cache. This is kind of reflected by observations from others, see #332 (comment).

@webknjaz
Copy link

Any correct resolver would need network anyway because if you request ==1.0.0, it'll also match ==1.0.0.post0, for example. And only the index would know if that exists.

@amotl
Copy link
Member Author

amotl commented Jan 25, 2021

@webknjaz: I see, thanks for outlining this scenario. So, with respect to eventually sacrificing correctness on this aspect, any outcome from discussions like python-poetry/poetry#1992, python-poetry/poetry#2184 or python-poetry/poetry#3322 might bring some improvements on this?

@webknjaz
Copy link

I guess... I gave up on poetry years ago. But with pip, you can do something like this by making a wheelhouse directory, and doing something like pip install --no-index -f ./wheelhouse-dir <whatever requirements you have>, and it'll only use that dir for resolution so if you cache it (maybe with pip download), you could achieve what you want.

@amotl
Copy link
Member Author

amotl commented Jan 25, 2021

Thanks Sviatoslav.

We will look into that and how it might be combined with Poetry. python-poetry/poetry#2873 also discusses that option. I also remembered pip --no-index and how that might be used in our case.

So, as we will need the synthesized requirements.txt file for installation anyway and we also need it to include development dependencies, maybe something along the lines of

poetry export --dev --extras=http --extras=sql --extras=excel --without-hashes --format=requirements.txt --output=requirements.txt
pip download --dest=~/.cache/wheelhouse --requirement=requirements.txt
pip download --dest=~/.cache/wheelhouse setuptools-scm wheel

When restoring the wheelhouse content into a virtualenv using

time pip install --no-index --find-links=~/.cache/wheelhouse --requirement=requirements.txt

we see it will still take 1m13s on my macOS machine. So, this might also not gain any improvements on Windows.

However, I also see that not all packages have been able to be downloaded as wheels:

l ~/.cache/wheelhouse/*.tar.gz | wc -l
      19

So, a significant amount of time was spent on building wheels from sdist packages beforehand when running pip install .... This will become obvious when running pip install a second time on a newly created virtualenv: It would only take 38s on my macOS machine now - I believe this is the best thing we can achieve here.

So, instead of only caching ~/.cache/wheelhouse, one should also cache ~/.cache/pip.

@TheFriendlyCoder
Copy link

Any correct resolver would need network anyway because if you request ==1.0.0, it'll also match ==1.0.0.post0, for example. And only the index would know if that exists.

Would you mind clarifying what you mean here? If I have a pegged revision as a dependency (ie: "==1.0.0") I would expect that exact pegged revision to be installed (ie: "v1.0.0") and not some alternative compatible version (ie: "v1.0.0.post0"). Perhaps you meant to use "~=1.0.0" or "^1.0.0" in your example, which would be referring to "compatible" versions and not explicitly pegged revisions.

That being said, regardless of the dependency specification that may appear in the pyproject.toml file, the resolver should not need to reach out to the remote repository except when initially generating the poetry.lock file. Once that file is generated it should contain all relevant metadata needed in order to reproduce the dependency tree without additional external resources. So then, assuming that all of the requisite dependencies are stored in the local cache, one should be able to install those dependencies mentioned in the lock file without requiring additional network IO.

Maybe I'm missing some nuance here so feel free to correct me if that is the case.

@webknjaz
Copy link

webknjaz commented Jan 26, 2021

Would you mind clarifying what you mean here? If I have a pegged revision as a dependency (ie: "==1.0.0") I would expect that exact pegged revision to be installed (ie: "v1.0.0") and not some alternative compatible version (ie: "v1.0.0.post0"). Perhaps you meant to use "~=1.0.0" or "^1.0.0" in your example, which would be referring to "compatible" versions and not explicitly pegged revisions.

No, I described exactly how the resolvers in pip and pipenv (IIRC) work. And based on https://www.python.org/dev/peps/pep-0440/#post-releases this is by design. Maybe @pradyunsg could explain better. Another case would be having a universal wheel locally while there's a platform-specific wheel for the same index on PyPI, the resolver has no way of knowing that there's a more precise match unless it talks to the server. Besides, if a version got marked as bad and removed or yanked, the installer should know about that too.

@TheFriendlyCoder
Copy link

No, I described exactly how the resolvers in pip and pipenv (IIRC) work. And based on https://www.python.org/dev/peps/pep-0440/#post-releases this is by design.

The section you are referring to simply describes the format and interpretation of the post-release build identifier. That section does not imply a specific dependency resolution behaviour for a given tool.

Conversely, there is a section in PEP-440 that explains the expected behaviour for an exact / pegged version matcher that any compliant tool is expected to implement: https://www.python.org/dev/peps/pep-0440/#version-matching. This section does indicate that tools should allow for "strict" vs "non-strict" comparisons but that tools should enable strict enforcement by default. This means that if my pegged revision is "==1.0.0" and there is a 1.0.0.post1 release in the index, that it should (by default) be ignored.

However, all this being said, after the initial dependency tree has been generated (ie: while generating the poetry.lock file) the package manager should be able to precisely and unambiguously locate the exact matching package to be deployed on subsequent runs.

@TheFriendlyCoder
Copy link

Another case would be having a universal wheel locally while there's a platform-specific wheel for the same index on PyPI, the resolver has no way of knowing that there's a more precise match unless it talks to the server.

Similar to my previous argument, I can't see how this would lead to any change in behaviour wrt the dependency resolver. At the time the poetry.lock file is generated, the content of the lock file will be based on the current state of the pypi index. If the state of the index changes in any way then I would expect subsequent runs to deploy the exact version that is specified in the lock file, not a comparable / compatible version that may have since been added to the index.

I guess the point I'm trying to make is, any potential subtleties wrt cache management and dependency resolution should be avoidable by relying on the package metadata stored in the lock file.

@TheFriendlyCoder
Copy link

Besides, if a version got marked as bad and removed or yanked, the installer should know about that too.

I am pretty sure that pypi.org no longer allows maintainers to delete / yank / pull bad packages from the public repository. However, even if that were possible there are many other ways for that to break dependency resolution tools (ie: consider the case of repository mirrors) which is probably why it is prevented or strongly discouraged.

Conversely, in the case where a system has a cached dependency that has been removed from the remote repository, I would expect subsequent Poetry executions to still resolve and deploy the dependency correctly so long as that exact dependency / version is still referenced in the lock file and it can be discovered in the local cache.

In general, whether the tool should purge a package from the local cache if it has been removed from the remote repository when installing dependencies without a lock file could be debatable (I can see rationale for and against it), this need not dictate how the cache should be managed when installing dependencies from a lock file.

@TheFriendlyCoder
Copy link

I would suggest basing the mechanics of cache management on comparable tools from other languages like Gradle, NuGet, etc. In those cases the tools mostly allow deploying of packages from local caches, either on demand or when they are unable to connect to the remote repository.

@webknjaz
Copy link

webknjaz commented Feb 8, 2021

This section does indicate that tools should allow for "strict" vs "non-strict" comparisons but that tools should enable strict enforcement by default. This means that if my pegged revision is "==1.0.0" and there is a 1.0.0.post1 release in the index, that it should (by default) be ignored.

I simply pointed out how things work right now without speculating about how they should work. I recommend raising issues on respective trackers of the upstream projects if you're looking for a change.

I am pretty sure that pypi.org no longer allows maintainers to delete / yank / pull bad packages from the public repository.

This is 100% incorrect. PyPI has been allowing deleting packages from the index for many years and there are no plans to disallow this (as long as a given user's role is an "owner", not just a "maintainer" — https://pypi.org/help/#collaborator-roles). One remark tho: deleting a version removes it from the available version lists but not from the CDN servers that host actual sdists and wheels. Another remark: "yank" is not the same as "delete", it's basically a special flag that tells clients like pip that is strongly discouraged to use a yanked package version but doesn't prevent them from doing so. Old pip versions ignore this flag, while the new ones try to respect it when the user requests an unpinned package version but if it's pinned, it'll go ahead and consider installing a yanked version. Also, PyPI has red badges for yanked versions, here's an example of how this looks:
https://pypi.org/project/octomachinery/#history.

And here's additional info on yanking:

Another reason for removing packages is if they contain malware/spam/namesquatting/etc per https://www.python.org/dev/peps/pep-0541/#invalid-projects.

Is using the malware from cache more important than taking into account such security issues?

The index must be used for resolution for as long as it's enabled. If you want local cache instead, there just should be a mechanism for disabling the unwanted indexes (FWIW pip already knows how to treat local dirs as indexes).

@amotl
Copy link
Member Author

amotl commented Feb 8, 2021

Dear all,

thank you very much for your feedback on this topic. Very much appreciated.

The index must be used for resolution for as long as it's enabled. If you want local cache instead, there just should be a mechanism for disabling the unwanted indexes (FWIW pip already knows how to treat local dirs as indexes).

Exactly. So, I conclude that we should investigate whether Poetry also offers such an option and whether it will be able to deliver respective performance gains. Otherwise, going through pip as outlined at #332 (comment) is another option which worked well.

With kind regards,
Andreas.

@TheFriendlyCoder
Copy link

I can see how malware being introduced into the dependency tree would be one of those extenuating circumstances where you may want to fail a deployment even when a revision is pegged. Thanks for pointing that out.

If there was at least an option or flag to disable the index lookup, that would put the user / developer back in control, which would satisfy my use case.

I hope that this will be implemented in Poetry. Juggling multiple package management tools, using pip for some things and Poetry for others, would just introduce unnecessary complexity into the tool chain.

@amotl
Copy link
Member Author

amotl commented Feb 9, 2021

Hi again,

it looks like discussions around this topic are going on on the Poetry issue tracker since quite some time already, see python-poetry/poetry#331 and python-poetry/poetry#559, eventually leading to python-poetry/poetry#908. Then, it goes on with python-poetry/poetry#1554 and python-poetry/poetry#1070.

Finally, the main issue tracking respective improvements seems to be python-poetry/poetry#1632, eventually leading to python-poetry/poetry#2074 by @Zyava. However, progress on this apparently has stalled and @BrandonLWhite and @mcouthon haven't been able to take over.

Now, @ZeroAurora started another attempt with python-poetry/poetry#3624 which also relates to python-poetry/poetry#3479 by @zillionare.

I wish you the best of luck to be able to finally tackle this, @ZeroAurora! It would be really nice if this patch can also bring in a feature to not use any index at all by adding a --no-index commandline option as outlined by @webknjaz at #332 (comment) like

pip install --no-index -f ./wheelhouse-dir <whatever requirements you have>

With kind regards,
Andreas.

P.S.: I've just added a comment about our use case at python-poetry/poetry#3624 (comment).

@ZeroAurora
Copy link

ZeroAurora commented Feb 10, 2021

Hey everyone. Thank you for your encouragement, but this kind of issue is somewhat out of the scope of my PR. My PR focuses on those who can't connect to PyPI (or can but with a slow bandwidth). I've mentioned a probable solution in python-poetry/poetry#3624 (comment) but I'm sorry that I can't work on this since I don't have extra time and energy. I hope someone will work on this (or an offline mode).
btw, I couldn't understand why caching poetry's cache won't work 🤔

Good luck everyone! Hope this issue get resolved!

@amotl
Copy link
Member Author

amotl commented Feb 10, 2021

Hi @ZeroAurora,

thanks for your kind response.

This kind of issue is somewhat out of the scope of my PR.

I see. Thanks for your explanation.

Btw, I couldn't understand why caching poetry's cache won't work 🤔

It actually does, but Poetry will still reach out to the package index and this apparently adds some slowness compared to running with pip --no-index.

I hope someone will work on this (or an offline mode).

This! 👍

With kind regards,
Andreas.

@amotl
Copy link
Member Author

amotl commented Feb 10, 2021

@sinoroc just suggested at python-poetry/poetry#3624 (comment):

I have not exactly followed the discussion in details, but looks like you might be interested in simpleindex.

Basically, it allows you to easily start a local index server based on simple rules, and the files can be served from the local filesystem. So I guess maybe if configured right, it could do what you are after.

I answered:

simpleindex looks nice, indeed! But within a CI job like Travis CI oder GitHub Actions, even that would be additional hassle to bring up a sidecar service and wrap that around the wheelhouse directory just for that purpose. However, it might be doable. Thank you very much for your suggestion.

@amotl amotl changed the title Improve speed on CI Improve speed of Poetry on CI Feb 10, 2021
@webknjaz
Copy link

Well you could look into dumb-pypi or bandersnatch then. But it still feels like an overkill.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants