Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache dependencies in CI runner #108

Closed
N-Wouda opened this issue Nov 11, 2022 · 16 comments · Fixed by #119
Closed

Cache dependencies in CI runner #108

N-Wouda opened this issue Nov 11, 2022 · 16 comments · Fixed by #119
Assignees
Labels
maintenance Maintenance chore

Comments

@N-Wouda
Copy link
Owner

N-Wouda commented Nov 11, 2022

The CI runner takes an increasing amount of time to resolve package dependencies. This has resulted in build times running up to 10min for the Python 3.10 build. We should cache the dependencies for a while on the CI runner, so as to avoid these very long runtimes.

EDIT: I'm pretty sure scipy/statsmodels solved this in some way, so their CI scipts would be a good place to start.

@N-Wouda N-Wouda added the maintenance Maintenance chore label Nov 11, 2022
@N-Wouda N-Wouda self-assigned this Nov 11, 2022
@N-Wouda
Copy link
Owner Author

N-Wouda commented Nov 11, 2022

It's mostly poetry that cannot resolve the dependencies very quickly. That seems to be a common issue: python-poetry/poetry#2094.

@N-Wouda
Copy link
Owner Author

N-Wouda commented Nov 13, 2022

Relevant pre-commit stuff: https://tobiasmcnulty.com/posts/caching-pre-commit/

@markkvdb
Copy link

markkvdb commented Nov 15, 2022

The setup-python action for GitHub Actions supports caching poetry dependencies out of the box. Minimal example to make things run:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9"]
      fail-fast: false

    steps:
      - uses: actions/checkout@v3
      - name: Install Poetry
        run: pipx install poetry==1.2.0
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          cache: "poetry"
      - name: Install Dependencies
        run: poetry install
      - name: Test
        run: poetry run bash scripts/test.sh

I don't know why but pipx is installed by default for the GitHub runners.

@markkvdb
Copy link

Btw, the code (structure) of this project is really nice. Once, every few months I have a peek at how it's evolving!

If you have to get some more inspiration for good coding practices (in Python) I suggest to have a look at the work of Tiangolo. He's the author of the excellent FastAPI and Typer packages. Next to the code, his packages also have very clear and extensive documentation from which you could get some inspiration.

@N-Wouda
Copy link
Owner Author

N-Wouda commented Nov 15, 2022

@markkvdb thanks! I tried this a few days ago in a commit in #107, but the issue is that we do not ship a lockfile (ALNS is a library, not an application, so we do not require fixed versions of our dependencies). I have been thinking of testing with fixed dependency versions, but haven't gotten around to anything yet.

@markkvdb
Copy link

You don't actually need a lock file (poetry.lock). You can let the CI run the poetry install (with dependency resolving) run once and reuse this across runs.

@markkvdb
Copy link

The caching will be done based on the hash of the python version and the pyproject.toml file, so as long as nothing changes there you don't need to resolve for each run.

@markkvdb
Copy link

markkvdb commented Nov 15, 2022

If I'm wrong about that, you can use some version of the CI file here:

https://github.com/tiangolo/fastapi/blob/4638b2c64e259b90bef6a44748e00e405825a111/.github/workflows/test.yml#L30

Let me know if none of my suggestions work because then I'll have a look as well to understand what I'm not understanding 😃

@N-Wouda
Copy link
Owner Author

N-Wouda commented Nov 15, 2022

The caching will be done based on the hash of the python version and the pyproject.toml file, so as long as nothing changes there you don't need to resolve for each run.

Are you sure? Based on my understanding of the action, it'll also hash in the lock file contents. And that will be resolved anew by poetry every time (which is very slow), and likely not match an existing cached set of dependencies (since our dependencies have their own dependencies, and if any one of those has recently been updated, our cache is invalidated).

But I could be misunderstanding this. I'll have a look around at how other projects solve this (soon-ish).

@N-Wouda
Copy link
Owner Author

N-Wouda commented Nov 15, 2022

If I'm wrong about that, you can use some version of the CI file here:

https://github.com/tiangolo/fastapi/blob/4638b2c64e259b90bef6a44748e00e405825a111/.github/workflows/test.yml#L30

This looks promising, thanks!

@markkvdb
Copy link

I'm quite surprised that setup-python does not save the environment (with temporary lock file) across runs. I never tested it but if it's really like that, then adding a manual caching step should definitely work. In my opinion, this is cleaner than expecting developers to consistently run the pre-commit hooks.

If you do want to use the pre-commit workflow, then it might be wise to add a step in the CI to check for consistency between your lock file and the configuration file. I know it exists for pyproject.toml <-> poetry.lock but I suspect that it should also work for pyproject.toml <-> requirements.txt.

@N-Wouda
Copy link
Owner Author

N-Wouda commented Nov 15, 2022

@markkvdb I don't think I understand the link between the dependencies in the pyproject.toml file (and, by extension, in poetry.lock and the like), and the pre-commit workflow. As far as I know, pre-commit gets its steps/dependencies directly from the repositories linked in .pre-commit-config.yaml, at the revs fixed there. Can you explain?

@markkvdb
Copy link

Never mind, I misunderstood how you are using pre-commit. I was thinking that pre-commit was responsible for making sure that the requirements.txt file was up-to-date but it's something completely different...

It seems that manually handling the caching might the best if setup-python doesn't handle this. It does require you to wait for the dependency resolver once but that's what it is in that case.

One note regarding long resolving times is that it might make sense to have a look which package poetry has difficulty resolving. You can get an idea by running poetry lock in verbose mode with poetry lock -vv. Whenever I had this issue it usually came down to one dependency struggling to find a match. You can then manually set the right version (for the time being) to not have these long resolve times.

@N-Wouda
Copy link
Owner Author

N-Wouda commented Nov 15, 2022

@markkvdb thanks! I'm not entirely sure how we should approach dependency pinning yet. We have some minimum versions we could pin, to at least make sure those work. At the same time, we also want to make sure recent(-ish) versions of our dependencies work well too. I'm not sure if we should pin those, or leave it to pypi to get us whatever's most recent.

I'll have a look at how scipy, statsmodels, and the like handle this soon!

@fahaddd-git
Copy link

fahaddd-git commented Nov 18, 2022

Just brainstorming ideas in this thread here to make things faster in the CI. Without the lockfile this seems to be a little more challenging.

  1. Cache the install of Poetry itself
  2. Run a cronjob at some interval to create cache using deps that are downloaded. Hash the pyproject.toml as the key.
  3. User triggers CI- check if cache hit on the pyproject.toml file. If so deps are gotten from cache. If not, deps are installed then cached with hashed pyproject.toml as key.

I'm not sure if the cache from the previous cron run in step 2 would need to be invalidated or if the generated poetry.lock needs to be stored as an artifact.

Per the first item I'm happy to open a PR for this if it's a useful addition.

@N-Wouda
Copy link
Owner Author

N-Wouda commented Nov 19, 2022

@markkvdb @fahaddd-git thanks for your inputs! I have added some caching in #119 that reduces CI runtimes around 10x to one minute or less. That is more than sufficient for now, so I closed this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Maintenance chore
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants