Parallelize Helm tests with multiple job runners #30672
Conversation
Once this is complete, we should get back to ~20 minutes of elapsed feedback time EVEN if Helm unit tests are involved. The Helm tests themselves should go down to some 8 minutes of elapsed time vs. 22 minutes (at the expense of using more machines).
Force-pushed from c63d11e to f35513c
It should also help with running Helm tests locally, per package. I tried to group the Helm tests into logical packages, each with a roughly similar number of tests, so you can easily run the tests of a single package by selecting it, for example along the lines of the sketch below.
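For illustration only - the package name and directory layout in this sketch are assumptions, not necessarily the exact layout introduced by this PR - running a single helm-test package locally could look roughly like this:

```python
# Minimal sketch: run just one helm-test package locally.
# "tests/charts/webserver" is an assumed path; substitute the real package directory.
import sys

import pytest

# "-n auto" spreads the tests over the local CPUs via pytest-xdist.
sys.exit(pytest.main(["tests/charts/webserver", "-n", "auto"]))
```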
Force-pushed from f35513c to 8fccbd1
Looks like we have 4-6 minutes per "test-package" instead of 21 minutes for "all" -> not bad.
Looks good to me!
I have a concern regarding the use of a lot of parallelism in our workflows. While it can greatly enhance our development speed, there is a possibility that a small number of PRs could occupy all the runners, potentially delaying the progress of other PRs. It would be beneficial to have some kind of configuration in place to prevent this from happening, such as limiting the number of runners a single PR can use. This would help ensure that all PRs have access to the necessary resources and can be processed efficiently.
I am not sure if we have any metrics or analytics available to measure the average duration in queue and the overall duration of a workflow. It would be helpful to have access to this information to evaluate the impact of any changes we make to our CI workflow. By tracking the average queue duration and the total duration of a workflow, we can better understand how changes affect our development process and identify areas for improvement.
For this one, LGTM.
Force-pushed from d5e8408 to 8c9a64a
Ok. Not sure if you know what you've asked for, but well, you did, @hussein-awala :).
Looking at the available metrics, I don't think we need to do anything about it. I regularly check the metrics we have (see below), and every time I see an optimisation opportunity I implement it. However, I would love someone else to take a deep dive, understand our system and improve it. So if you would like to join me, I am all ears and happy to share everything about it. I think it would be best if you first try to understand how our system works in detail, and then we can discuss ways in which things could be improved. If you have ideas (after getting a grasp of it) on how to improve it and what that might take, I would love to see them.

The "noisy-neighbour" issue has been an ongoing debate in the Apache Software Foundation and Airflow for years, and it used to be very bad at times. But with a lot of effort on our side, helping others optimise their jobs, GitHub providing more public runners, and us having sponsored self-hosted runners, the issue is currently non-existent, IMHO. Just for context: all non-committers use public runners (except for image building); all public runners come from a shared pool of 900 jobs that the ASF shares between all of its projects (there are ~700 ASF projects using GitHub Actions); and for Airflow committer jobs and image building we have self-hosted runners in an auto-scaling group of up to 35 runners (8 CPU, 64 GB) that can be scaled on demand. We run those on AWS-provided credits and, when the credits run out, they are sponsored by Astronomer if needed. One of the charts we track shows how many ASF projects use GitHub Actions.

For the ASF, metrics and solving the "noisy-neighbour" problem have been on the table for about 3 years now, and I doubt we can solve it quickly, though if you have some ideas, I am all ears. We have been discussing this subject at length (I will post some resources), and if you can add something to the discussion after reading that and getting the context, it would be great. Adding metrics that would give us more value would also be great.

We also have a big change to do: converting our CI infra (developed initially by @ashb based on AWS components - DynamoDB, Auto Scaling groups and a few others) to Kubernetes. Since the time we had to implement our own custom solution, a few ways (some even supported by GitHub) of deploying it on Kubernetes have appeared, and we also have a promise from Google to get free credits on their cloud if we manage to do it as a standard Kubernetes deployment, which would increase our CI capacity greatly. Our auto-scaling infrastructure lives in https://github.com/apache/airflow-ci-infra (I am happy to guide you - it's not super-documented, but it is pretty up-to-date and the code describes what we do quite well).

Note: I am not trying to discourage you - quite the opposite. I'd love to get a helping hand from someone who starts understanding and contributing to our system, so I will add some pointers so that you can take a look.
I am so glad someone else would like to look at that :). I would love to get some PRs improving the test harness we have in CI. Getting some metrics and analysing them would also be good, but I am afraid the metrics for GitHub Actions are very poor - if you could improve them, that would be great. You can read some history of that: there are metrics that a friend of mine and contributor, @TobKed, developed for the whole Apache Software Foundation in the absence of proper metrics for GitHub Actions. We have been gathering these metrics for ~2 years with this "duct-tape" solution of ours, while GitHub keeps promising better built-in metrics that are not yet available. If you send me your gmail address, I can share access to the report (ping me on Slack).

One example is the "workflow queued average" chart for all ASF projects on public runners; it shows that Airflow is not even on the radar of queued workflows because of the heavy optimisations we've implemented. It used to be much worse, and we had a lot of trouble ~3 years ago, but then I implemented plenty of optimisations and checks, and @ashb implemented our self-hosted infrastructure (and we got credits to run it). Since then the ASF also went from 50 to 900 jobs sponsored by GitHub, which (at least temporarily) solved the noisy-neighbour problems we had, but we still try to use the runners in as optimized a way as possible (see below).

The docs describing the status of GitHub Actions in the ASF (I try to keep them updated, though recently there have not been many updates) are here: https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status

Regarding the optimizations we have, I think you should look at the "selective checks" part. We go to GREAT lengths to make sure that each PR only runs whatever it needs to run. Basically, PRs usually run only a very small selection of tests - only those that are very likely to be affected. For example, unit tests for the Helm chart are only run when the chart is modified. There are many more rules (for example, provider tests are only run for a subset of providers: the ones that were changed plus the ones that have cross-dependencies with the changed providers), all of them described in https://github.com/apache/airflow/blob/main/dev/breeze/SELECTIVE_CHECKS.md (which also reminds me that I should review and update it, as there are a few more such selective-check rules that are not documented there). Also, the vast majority of the selective checks (those with a bit more complex logic) are nicely tested in breeze via rather comprehensive unit tests: https://github.com/apache/airflow/blob/main/dev/breeze/tests/test_selective_checks.py

You can get even more context by reading the description of our CI system (I keep it up-to-date regularly): https://github.com/apache/airflow/blob/main/CI.rst - including the diagrams showing how our jobs work and explaining some of the built-in optimisations, for example that we only build images once and re-use them across all the builds, using the super-efficient remote caching that docker buildkit added some time ago: https://github.com/apache/airflow/blob/main/CI_DIAGRAMS.md

I know it is a lot of information, but hey, you asked :). And I would really love to have someone else look at all of this, not only me :).
I am quite a bit of a SPOF for it, so if you would like to help and have more questions, I am happy to share everything (though it might take some time to grasp it all, I am afraid).
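To make the "selective checks" idea above concrete, here is a minimal, hypothetical sketch - the function name and path prefixes are assumptions, not the actual breeze implementation - of the kind of rule that decides whether Helm unit tests need to run at all for a given PR:

```python
# Hypothetical selective-check rule: run Helm unit tests only when the PR touches
# the chart or its tests. Path prefixes are assumptions for illustration.
from typing import Iterable

HELM_PREFIXES = ("chart/", "tests/charts/")

def should_run_helm_tests(changed_files: Iterable[str]) -> bool:
    """Return True only if at least one changed file belongs to the Helm chart."""
    return any(f.startswith(HELM_PREFIXES) for f in changed_files)

# A docs-only PR skips Helm tests entirely; a chart change triggers them.
print(should_run_helm_tests(["docs/apache-airflow/index.rst"]))                         # False
print(should_run_helm_tests(["chart/templates/webserver/webserver-deployment.yaml"]))   # True
```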
Force-pushed from a79fe72 to d533171
That's really a lot of information - I need a couple of days to read the resources you've shared and process them. But yeah, there are a lot of possible improvements we can try.
I will be glad to contribute to improving the CI of the project, which can have a significant impact on the development process. Let me check out the resources you've shared and look up some of the recently shared best practices for GHA runners.
Final results after some optimisation. Comparing it with:
Total time: 30 m 12 s of "build time" -> just 2 "build-time" minutes longer, but the feedback/elapsed time is down from 27 m to < 9 m in the case where our workers are not all fully occupied (so the build can go green 3x faster, as this was by far the slowest test). The allocation of tests to packages is somewhat arbitrary; I preferred to split them more-or-less equally to get better timings rather than be perfect in the package allocation. We could also join the security and webserver packages into one without great loss of elapsed time.
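As a back-of-the-envelope illustration of the trade-off (the per-package numbers below are rough values picked from this thread, not exact measurements): splitting one long serial job into several parallel jobs keeps the total machine time roughly constant while dividing the time a contributor actually waits.

```python
# Rough arithmetic only - illustrative per-package durations in minutes, assumed from
# the "4-6 minutes per test-package instead of 21 minutes" observation above.
serial_minutes = 21                 # one job running all helm tests
package_minutes = [6, 5, 4, 4, 4]   # assumed durations of the split packages

total_machine_minutes = sum(package_minutes)   # what we pay in runner ("build") time
feedback_minutes = max(package_minutes)        # what a contributor waits for (elapsed time)

print(f"machine time:  ~{total_machine_minutes} min (vs {serial_minutes} min before the split)")
print(f"feedback time: ~{feedback_minutes} min (vs {serial_minutes} min before the split)")
```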
Looking forward to it. BTW, some of the points you mentioned are already done (we are using both AMD and ARM hardware together to build multi-platform images, for one), but yeah, I would love to improve many of those you mentioned :)
cc: @ephraimbuddy @jedcunningham @dstandish WDYT about the split and optimisations? I know you've been developing a lot of those tests, and "feedback" time for all of them was a big concern for you as well. I would not want to make the split without your consent, as it is somewhat arbitrary (but it can be adapted in the future, and we can manually re-balance it if it ever falls out of balance). It is developed in such a way that it will automatically adapt to however many packages you have; the only (reasonable, IMHO) requirement is that we have no tests at the "top level" - only "common utils" at the top level (I could even enforce that one if needed).
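As a rough sketch of the "adapts automatically to however many packages you have" part - the directory names and JSON shape are assumptions, not the exact code of this PR - the CI matrix could be derived directly from the helm-test package directories:

```python
# Sketch: build the list of parallel "test packages" from the directory layout,
# so adding or removing a package automatically changes the CI matrix.
import json
from pathlib import Path

HELM_TESTS_ROOT = Path("tests/charts")  # assumed location of the per-package helm tests

def helm_test_packages() -> list[str]:
    # Every sub-directory that is a Python package counts as one parallel job;
    # no test modules are expected directly at the top level (only common utils).
    return sorted(
        p.name for p in HELM_TESTS_ROOT.iterdir()
        if p.is_dir() and (p / "__init__.py").exists()
    )

if __name__ == "__main__":
    # Emit JSON that a GitHub Actions job matrix could consume.
    print(json.dumps({"helm-test-package": helm_test_packages()}))
```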
Helm unit tests use template rendering, and the rendering uses a lot of CPU for the `helm template` command. We have a lot of those rendering tests (>800), so even running them in parallel on a multi-CPU machine does not reduce the elapsed time needed to execute them. However, each of the tests runs entirely independently, so we should be able to achieve a much shorter elapsed time if we run subsets of the tests on separate multi-CPU machines. This will not lower the job build time, but it should speed up the elapsed time and thus give faster feedback. This PR achieves that.
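A hypothetical sketch of what each parallel job then does - the environment variable name and test paths are assumptions, not necessarily what this PR implements - each matrix entry runs only its assigned package, so the ~800 CPU-bound renderings are spread over several machines:

```python
# Sketch of a single matrix job: run only the helm-test package assigned to it.
import os
import sys

import pytest

package = os.environ.get("HELM_TEST_PACKAGE")  # e.g. set to "webserver" by the matrix entry
if not package:
    sys.exit("HELM_TEST_PACKAGE is not set - nothing to run for this job")

# We still parallelise over the local CPUs with pytest-xdist ("-n auto"); the elapsed-time
# win comes from the packages running on separate machines at the same time.
sys.exit(pytest.main([f"tests/charts/{package}", "-n", "auto"]))
```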
Force-pushed from d533171 to 496db90
Merging, we can further move those tests around later if needed.
Sorry, I was traveling. High level, no objections here.
It's an amazing improvement!
🙇 Thanks 🙇
(cherry picked from commit 731ef3d)
Helm unit tests use template rendering, and the rendering uses a lot of CPU for the `helm template` command. We have a lot of those rendering tests (>800), so even running them in parallel on a multi-CPU machine does not reduce the elapsed time needed to execute them. However, each of the tests runs entirely independently, so we should be able to achieve a much shorter elapsed time if we run subsets of the tests on separate multi-CPU machines. This will not lower the job build time, but it should speed up the elapsed time and thus give faster feedback.
This PR achieves that.
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named `{pr_number}.significant.rst` or `{issue_number}.significant.rst`, in newsfragments.