
More granularity in the CI, separating code and docs changes? #8705

Closed
etienneschalk opened this issue Feb 4, 2024 · 7 comments · Fixed by #9006
Labels
Automation Github bots, testing workflows, release automation

Comments

@etienneschalk
Contributor

etienneschalk commented Feb 4, 2024

What is your issue?

Hi,

TLDR: Is there a way to only run relevant CI checks (eg documentation) when a new commit is pushed on a PR's branch?

The following issue is written from a naive user's point of view; I do not know how the CI works on this project. I noticed that when an existing Pull Request is updated, the whole test battery is re-executed. However, it is a common scenario that someone only wants to update the documentation, for instance. In that case, it might make sense to retrigger only the documentation checks, a little bit like pre-commit, which only runs on the updated files. That level of per-file granularity is not desirable, as even a small code change could make seemingly unrelated tests elsewhere in the codebase fail; however, a high-level separation, between code and docs for instance, might relieve the pipelines a little. This assumes the docs do not depend at all on the code. Other separations may exist, but the first I can think of is code vs docs.

Another separation would be an "order" / "dependency system" in the pipeline. E.g., A -> B -> C; if A fails, there is no point in spending resources to compute B, as we know for sure the rest will fail. Such a hierarchy might be difficult to apply to the test matrix, which is unordered (e.g. Python version x OS; on this project it seems to be more or less (3.9, 3.10, 3.11, 3.12) x (Ubuntu, macOS, Windows)).

There is also a notion of frequency and execution time: the pipeline stages that are empirically most likely to fail and the shortest to run should be run first, to avoid having them fail out of flakiness and bad luck after all the other checks have already passed. Such a stage exists: CI / ubuntu-latest py3.10 flaky (it is in the name). Taking that into account, the CI Additional / Mypy stage qualifies on both criteria and should run before everything else, for instance. Indeed, it is static code checking, is very likely to fail, is something a developer might also run locally before committing / pushing, and takes only one minute to run (compared to several minutes for each stage of the Python Version x OS matrix). The goal here is to save resources (at the cost of losing the "completeness" of the CI run).
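As a sketch of what such an ordering could look like in a GitHub Actions workflow (the job names, commands, and matrix below are illustrative assumptions, not the project's actual configuration), the `needs` keyword can gate the expensive test matrix behind a fast static check:

```yaml
jobs:
  mypy:
    # Fast, failure-prone static check: runs first, takes about a minute
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python -m pip install mypy
      - run: python -m mypy xarray

  test:
    # Expensive matrix: only starts once the cheap check has passed
    needs: mypy
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]
        os: [ubuntu-latest, macos-latest, windows-latest]
    steps:
      - uses: actions/checkout@v4
      - run: python -m pytest xarray
```

With `needs`, a mypy failure would skip the whole matrix, which is exactly the resource/completeness trade-off described above.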

@etienneschalk etienneschalk added the needs triage Issue that has not been reviewed by xarray team member label Feb 4, 2024

welcome bot commented Feb 4, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@dcherian
Contributor

dcherian commented Feb 4, 2024

Thanks @etienneschalk. These are good ideas.

Some things to consider:

  1. The docs include docstrings that live in the code, so we do want to build them on every change. It'd be nice if RTD were smarter about caching previous builds, though.
  2. In my experience, it is common to have the shortest runs, e.g. mypy or the minimal-deps env, fail in the early stages of iteration, and to fix them up when the PR gets close to complete. So it wouldn't be good to cancel the rest when these fail.
  3. We used to have mypy as a pre-commit stage, but it places a high bar on new contributors.

@etienneschalk
Contributor Author

Hi @dcherian,

Thanks for your answer!

  1. OK, a code change should trigger a doc build because of the docstrings, but what about the opposite? Here is how I see it (the doc is always built):

     |             | code change | doc change |
     |-------------|-------------|------------|
     | code checks | ✅          | ❌         |
     | doc build   | ✅          | ✅         |

  2. OK, I see the point: the shortest runs can be more cosmetic, and even if they fail, one might want to see whether the battery of tests passes before fixing them at the end of the PR.

  3. Indeed, satisfying MyPy is not an easy task... I saw there is also a Pyright stage that seems to be always skipped. Has switching from MyPy to Pyright been considered? I recently tested Pyright (also including it in pre-commit) and found the error messages rather insightful, reminiscent of the TypeScript development experience.

Flaky Tests

I define a flaky test as a test that satisfies these conditions:

  • The test seems unrelated to the content of the pull request. This condition alone is not enough, as some small code changes can have impacts far away in the codebase (hence the need for the second condition);
  • it seems to fail randomly in one stage of the Python Version x OS matrix, but not the others;
  • it may have succeeded in a previous similar run for the same PR.

Example

Example of flaky tests from my recent experience, on the same PR, failing on two different runs:

  • FAILED xarray/tests/test_backends.py::TestDask::test_dask_roundtrip only for stage windows-latest py3.12
  • Multiple tests failing due to Timeout >180.0s only for stage macos-latest py3.11 (note: TestDask.test_dask_roundtrip is mentioned)

Examples from other PRs:

  • Only the ubuntu-latest py3.12 stage fails because of a Timeout >180.0s and TestDask.test_dask_roundtrip mentioned

-> Conclusion: test_dask_roundtrip seems to be pretty flaky at the moment.

@benbovy
Member

benbovy commented Feb 5, 2024

I saw there is also Pyright stage that seems to be always skipped. Has it been considered switching from MyPy to Pyright?

Before that, we'll need to do something about the 2000+ errors pyright reports after parsing the whole codebase :).

(With some configuration tweaks, and by excluding some folders that can be safely ignored, it comes down to ~1400 errors, which is still a lot.)

@dcherian
Contributor

dcherian commented Feb 6, 2024

@etienneschalk Yes you're right! We could run the code checks only if files in xarray/* or ci/* are changed (I think).

A PR would be welcome, here are some docs I found.
https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#onpushpull_requestpull_request_targetpathspaths-ignore
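For reference, a minimal sketch of what such a `paths` filter could look like (the branch name and globs below are assumptions for illustration; the actual list would need tuning):

```yaml
name: CI
on:
  push:
    branches: [main]
    paths:
      - "xarray/**"
      - "ci/**"
  pull_request:
    branches: [main]
    paths:
      - "xarray/**"
      - "ci/**"
```

A docs-building workflow would keep a broader filter, since docstrings live under `xarray/`, as noted earlier in the thread.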

@max-sixty
Collaborator

Example of flaky tests from my recent experience, on the same PR, failing on two different runs:

* `FAILED xarray/tests/test_backends.py::TestDask::test_dask_roundtrip` only for stage [windows-latest py3.12](https://github.com/pydata/xarray/actions/runs/7776497338/job/21203724620?pr=8698#logs)

* Multiple tests failing due to `Timeout >180.0s` only for stage [macos-latest py3.11](https://github.com/pydata/xarray/actions/runs/7768517575/job/21186540956#logs) (note: `TestDask.test_dask_roundtrip` is mentioned)

I have been marking flaky tests as xfailed, and think we should continue to do that.

If someone has a plan to fix them, that's even better, but in lieu of that I would support more xfailing...

@etienneschalk
Contributor Author

@benbovy this seems like a colossal task indeed! It may actually be practically impossible. Even for small codebases, typing a posteriori can rapidly get out of hand, and it can even lead to "forced" typed code that is less elegant or even capable of introducing bugs...

@max-sixty

I have been marking flaky tests as xfailed, and think we should continue to do that.

From my recent experience, I would be in favor of marking test_dask_roundtrip as flaky.

I see there is only one instance of @pytest.mark.flaky, from seven years ago.

Otherwise, to use xfail, this sentence seems to summarize it best:

@pytest.mark.xfail(reason="Flaky test. Very open to contributions on fixing this")
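As a sketch of how that marker would be applied (the test name is reused from the discussion; the real test lives in the xarray test suite, and this placeholder body is purely illustrative):

```python
import pytest

# Illustrative sketch: mark a known-flaky test as xfail so CI is not
# blocked by its sporadic failures. strict=False (pytest's default) means
# an unexpected pass (XPASS) is reported but does not fail the run.
@pytest.mark.xfail(
    reason="Flaky test. Very open to contributions on fixing this",
    strict=False,
)
def test_dask_roundtrip():
    ...  # placeholder body for illustration only
```

Keeping `strict=False` matters for flaky tests: with `strict=True`, the runs where the test happens to succeed would themselves fail the suite.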

@dcherian dcherian added Automation Github bots, testing workflows, release automation and removed needs triage Issue that has not been reviewed by xarray team member labels Feb 15, 2024
dcherian added a commit to dcherian/xarray that referenced this issue May 6, 2024
dcherian added a commit that referenced this issue May 6, 2024
* Trigger CI only if code files are modified.

Fixes #8705

* Apply suggestions from code review

Co-authored-by: Maximilian Roos <[email protected]>

---------

Co-authored-by: Maximilian Roos <[email protected]>
andersy005 pushed a commit that referenced this issue May 10, 2024
* Trigger CI only if code files are modified.

Fixes #8705

* Apply suggestions from code review

Co-authored-by: Maximilian Roos <[email protected]>

---------

Co-authored-by: Maximilian Roos <[email protected]>