
Parallelize Helm tests with multiple job runners #30672

Merged - potiuk merged 1 commit from parallelise-helm-tests-more into apache:main on Apr 17, 2023

Conversation

@potiuk (Member) commented Apr 16, 2023

Helm unit tests use template rendering, and the rendering done by the helm template command is CPU-heavy. We have a lot of those rendering tests (>800), so even running the tests in parallel on a single multi-CPU machine does not decrease the elapsed time to execute them.

However, each of the tests runs entirely independently, so we should be able to achieve a much shorter elapsed time if we run each subset of tests on a separate multi-CPU machine. This will not lower the total job build time, but it should shorten the elapsed time and thus give faster feedback.

This PR achieves that.
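
As a rough sketch of the idea (not the actual implementation in this PR): each CI job is handed one test package and runs only that subset, so the independent subsets execute on separate multi-CPU runners at the same time. The tests/charts/<package> layout, the pytest-xdist usage, and the helper name below are illustrative assumptions.

```python
# Sketch: run the Helm unit tests belonging to a single "test package",
# so that one CI job per package covers an independent slice of the >800 tests.
import sys

import pytest


def run_helm_test_package(package: str) -> int:
    """Run only the tests of one Helm test package, e.g. 'airflow_core'."""
    # "-n auto" spreads the tests over all CPUs of the runner (requires pytest-xdist).
    return pytest.main(["-n", "auto", f"tests/charts/{package}"])


if __name__ == "__main__":
    sys.exit(run_helm_test_package(sys.argv[1] if len(sys.argv) > 1 else "airflow_core"))
```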


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk (Member Author) commented Apr 16, 2023

Once this is complete, we should get back to ~20 minutes of elapsed feedback time EVEN if Helm unit tests are involved. The Helm tests themselves should go down to some 8 minutes of elapsed time vs. 22 minutes now (at the expense of using more machines).

@potiuk potiuk force-pushed the parallelise-helm-tests-more branch 3 times, most recently from c63d11e to f35513c on April 16, 2023 at 20:39
@potiuk (Member Author) commented Apr 16, 2023

It should also help to run Helm tests locally, per package. I tried to group the Helm tests into logical packages, each with a roughly similar number of tests, and you can easily run the tests in a given package by selecting it with (for example):

breeze testing helm-tests --helm-test-package airflow_core
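
For example (a sketch only - the loop, the subprocess wrapper and the full package list are assumptions for illustration; the package names match the ones quoted later in this thread), the packages could be run locally one after another like this:

```python
# Sketch: run every Helm test package locally, one after another, via breeze.
import subprocess

HELM_TEST_PACKAGES = ["airflow_aux", "airflow_core", "other", "security", "webserver"]

for package in HELM_TEST_PACKAGES:
    # Equivalent to typing the breeze command above once per package.
    subprocess.run(
        ["breeze", "testing", "helm-tests", "--helm-test-package", package],
        check=True,
    )
```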

@potiuk potiuk force-pushed the parallelise-helm-tests-more branch from f35513c to 8fccbd1 on April 16, 2023 at 20:53
@potiuk (Member Author) commented Apr 16, 2023

Looks like we have 4-6 minutes per "test-package" instead of 21 minutes for "all" -> not bad.

@hussein-awala (Member) left a comment


Looks good to me!

I have a concern regarding the use of a lot of parallelism in our workflows. While it can greatly enhance our development speed, there is a possibility that a small number of PRs could occupy all the runners, potentially delaying the progress of other PRs. It would be beneficial to have some kind of configuration in place to prevent this, such as limiting the number of runners a single PR can use. This would help ensure that all PRs have access to the necessary resources and can be processed efficiently.

I am not sure if we have any metrics or analytics available to measure the average duration in queue and the overall duration of a workflow. It would be helpful to have access to this information to evaluate the impact of any changes we make to our CI workflow. By tracking the average queue duration and the total duration of a workflow, we can better understand how changes affect our development process and identify areas for improvement.

For this one, LGTM.

@potiuk potiuk force-pushed the parallelise-helm-tests-more branch 2 times, most recently from d5e8408 to 8c9a64a on April 16, 2023 at 21:45
@potiuk (Member Author) commented Apr 16, 2023

OK. Not sure if you know what you've asked for, but well, you did ask, @hussein-awala :).

> I have a concern regarding the use of a lot of parallelism in our workflows. While it can greatly enhance our development speed, there is a possibility that a small number of PRs could occupy all the runners, potentially delaying the progress of other PRs. It would be beneficial to have some kind of configuration in place to prevent this, such as limiting the number of runners a single PR can use. This would help ensure that all PRs have access to the necessary resources and can be processed efficiently.

Looking at the available metrics, I don't think we need to do anything about it. I regularly check the metrics we have (see below), and every time I see an optimisation opportunity I implement it; however, I would love someone else to take a deep dive, understand our system and improve it. So if you would like to join me - I am all ears and happy to share everything about it. I think it would be best if you first try to understand how our system works in detail, and then we can discuss ways in which things can be improved. If, after getting a grasp of it, you have ideas on how to improve it and what that might take, I would love to see them.

The "noisy-neighbour" issue has been an ongoing debate in the Apache Software Foundation and Airflow for years now, and it used to be very bad at times. But with a lot of effort on our side, helping others to optimise their jobs, and especially with GitHub providing more public runners and us having sponsored self-hosted runners, the issue is practically non-existent currently IMHO.

Just for context: all non-committers use public runners (except for image building). All our public-runner jobs come from a shared pool of 900 concurrent jobs that the ASF shares between all of its projects (there are ~700 ASF projects using GitHub Actions). For Airflow committer jobs and image building we have self-hosted runners in an auto-scaling group of up to 35 runners (8 CPU, 64 GB) that can be scaled on demand. We run them on AWS-provided credits and, if needed when the credits run out, they are sponsored by Astronomer.

Here is one of the charts that shows how many ASF projects are using GitHub Actions:

[Screenshot: 2023-04-17 00:42:37]

ASF-wide metrics and solving the "noisy-neighbour" problem have been on the table for about 3 years now, and I doubt we can solve it quickly - though if you have some ideas, I am all ears. We've been discussing similar subjects at length (I will post some resources), and if you can add something to the discussion after reading those and getting the context, that would be great. Adding some metrics that could give us more value would also be great.

Also, we have a big change ahead: converting our ci-infra (developed initially by @ashb based on some AWS components - DynamoDB, Auto Scaling groups and a few others) to Kubernetes. Since the time we had to implement our own custom solution, a few ways (some even supported by GitHub) have appeared to deploy such runners on Kubernetes, and we also have a promise from Google to get free credits on their cloud if we manage to do it as a standard Kubernetes deployment, which would increase our CI capacity greatly.

The infrastructure of ours for the auto-scaling is in https://github.com/apache/airflow-ci-infra (I am happy to guide your way - it's not super-documented, but it is pretty up-to-date, and the code describes what we do pretty well).

Note - I am not going to discourage you, quite the opposite: I'd love to get a helping hand from someone who would start understanding and contributing to our system. So I will add some pointers so that you can take a look and see.

> I am not sure if we have any metrics or analytics available to measure the average duration in queue and the overall duration of a workflow. It would be helpful to have access to this information to evaluate the impact of any changes we make to our CI workflow. By tracking the average queue duration and the total duration of a workflow, we can better understand how changes affect our development process and identify areas for improvement.

I am so glad someone else would like to look at that :). I would love to get some PRs improving the test harness we have in CI. If you could also get some metrics and analyse them, that would be good, but I am afraid the metrics for GitHub Actions are very poor - if you could improve them, that would be great too. You can read some of the history of that below. There are also metrics that a friend of mine and contributor, @TobKed, developed for the whole Apache Software Foundation in the absence of proper GitHub Actions metrics. We have kept gathering those metrics for ~2 years with this "duct-tape" solution of ours, alongside continuous promises of better metrics from GitHub, which are not yet available (though promised). If you send me your gmail address, I can share access to the report (ping me on Slack). Here is an example "workflow queued average" chart for all ASF projects on public runners:

[Screenshot: 2023-04-17 00:19:15]

This one shows that Airflow is not even on the radar of queued workflows, because of the heavy optimisations we've implemented. It used to be much worse - we had a lot of trouble ~3 years ago - but then I implemented plenty of optimisations and checks, and @ashb implemented our self-hosted infrastructure (and we got credits to run it). Since then the ASF also went from 50 to 900 jobs sponsored by GitHub, which (at least temporarily) solved the noisy-neighbour problems we had, but we still try to use the runners in as optimised a way as possible (see below).

[Screenshot: 2023-04-17 00:26:17]

The docs describing the status of GitHub Actions in the ASF (I try to keep them updated, though recently there have not been many updates) are here: https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status

Regarding the optimizations we have - I think you should look at the "selective checks" part. We go to GREAT lengths to make sure that each PR only runs whatever it needs to run. Basically, PRs usually run only a very small selection of tests - only those that are very likely to be affected. For example, unit tests for the Helm chart are only run when the chart is modified. There are many more rules - for example, Provider tests are only run for a subset of providers: only the providers that are changed and those that have cross-dependencies with the changed providers. All of the rules are described in https://github.com/apache/airflow/blob/main/dev/breeze/SELECTIVE_CHECKS.md (which also reminds me that I should review and update it, as there are a few more such selective-check rules which are not documented there). Also, the vast majority of those selective checks (the ones with a bit more complex logic) are nicely tested in breeze via the (rather comprehensive) selective-checks unit tests: https://github.com/apache/airflow/blob/main/dev/breeze/tests/test_selective_checks.py
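
As a rough illustration of that kind of rule (a sketch only, not the actual breeze implementation - the path prefixes here are assumed for illustration), a path-based selective check deciding whether the Helm unit tests need to run at all could look like this:

```python
# Sketch of a selective-check style rule: decide from the list of files changed
# in a PR whether the Helm unit tests need to run at all.
from typing import Iterable

# Illustrative assumption: only these path prefixes can affect the Helm chart tests.
HELM_TRIGGER_PREFIXES = ("chart/", "helm_tests/", "tests/charts/")


def should_run_helm_tests(changed_files: Iterable[str]) -> bool:
    """Return True only if at least one changed file can affect the Helm chart tests."""
    return any(f.startswith(HELM_TRIGGER_PREFIXES) for f in changed_files)


if __name__ == "__main__":
    print(should_run_helm_tests(["airflow/models/dag.py"]))  # False -> Helm tests skipped
    print(should_run_helm_tests(["chart/values.yaml"]))      # True  -> Helm tests run
```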

You can get even more context by reading the description of our CI system (I keep it up-to-date regularly):

https://github.com/apache/airflow/blob/main/CI.rst

Including the charts showing how our jobs work (and explaining some of the built-in optimisations - for example, we only build images once and re-use them across all the builds, using the super-efficient remote caching that docker buildkit added some time ago):

https://github.com/apache/airflow/blob/main/CI_DIAGRAMS.md

I know it is a lot of information, but hey, you asked :).

And I would really love to have someone else look at all of this, not only me :). I am quite a SPOF for it, so if you would like to, and have more questions, I am happy to share everything (though it might take some time to grasp it, I am afraid).

@potiuk (Member Author) commented Apr 16, 2023

Also, here is a dashboard sample from last week for our auto-scaling group in AWS.

[Screenshot: 2023-04-17 00:49:41]

I also look there from time to time and I do not see any signs of excessive overloading.

@potiuk potiuk force-pushed the parallelise-helm-tests-more branch 3 times, most recently from a79fe72 to d533171 on April 16, 2023 at 23:40
@hussein-awala (Member) commented

That's really a lot of information; I need a couple of days to read the resources you've shared and process them.

But yeah, there are a lot of possible improvements we can try:

  • upgrading the docker buildx and compose versions in the runners image (the last update was one year ago in the CI repo)
  • reviewing the EC2 instance types used and fine-tuning the resources based on the metrics you shared; and since we're decoupling the tests into smaller jobs to parallelize them, we can use spot instances for these small jobs (if that is not already the case) to save up to 90% of the cost
  • moving to K8s runners for better auto-scaling and to be cloud-agnostic; in this case we can distribute the load between EKS and GKE (which will satisfy Google's condition) using label selectors, and we will be ready to use any new sponsored K8s cluster from another cloud provider
  • using a dedicated docker RUN cache (RUN --mount=type=cache) together with the GHA cache for the pip cache (and maybe other packages)
  • using Graviton2 with buildx to build amd64 images on arm64 (yes, instead of building the arm image on amd); it can reduce the cost by 40%, which we can use to pay for more instances

> So if you would like to join me - I am all ears and happy to share everything about it

I will be glad to contribute to improving the CI of the project, which can have a significant impact on the development process.

Let me check out the resources you've shared and look up some of the recently shared best practices for GHA runners.

@potiuk (Member Author) commented Apr 17, 2023

Final results after some optimisation.

  • airflow_aux: 7m 12s
  • airflow_core: 8m 38s
  • other: 6m 29s
  • security: 3m 38s
  • webserver: 4m 15s

Comparing it with:

  • all: 27m 58s (previously)

Total "build time": 30m 12s -> just ~2 "build-time" minutes longer, but the feedback/elapsed time goes down from ~28m to under 9m when our workers are not all fully occupied (so the build can get green ~3x faster, as this was by far the slowest test job).
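
A quick check of that arithmetic (a small script just summing the per-package times quoted above):

```python
# The per-package times sum to 30m 12s (~2 minutes more total build time than the
# previous single 27m 58s job), while the longest single job - i.e. the elapsed
# feedback time - drops to 8m 38s.
times = {"airflow_aux": 7 * 60 + 12, "airflow_core": 8 * 60 + 38,
         "other": 6 * 60 + 29, "security": 3 * 60 + 38, "webserver": 4 * 60 + 15}
total = sum(times.values())  # 1812 s = 30 m 12 s
previous = 27 * 60 + 58      # 1678 s = 27 m 58 s
print(total, total - previous, max(times.values()))  # 1812, 134 (~2 m), 518 (8 m 38 s)
```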

The allocation of tests to packages is somewhat arbitrary; I preferred to split them more or less equally to get better timings rather than to be perfect in the package allocation. We could also join the security and webserver packages into one without a great loss of elapsed time.

@potiuk (Member Author) commented Apr 17, 2023

> I will be glad to contribute to improving the CI of the project, which can have a significant impact on the development process.

Looking forward to it. BTW, some of the points you mentioned are already done (we are using both AMD and ARM hardware together to build multi-platform images, for one), but yeah, I would love to improve many of the things you mentioned :)

@potiuk (Member Author) commented Apr 17, 2023

cc: @ephraimbuddy @jedcunningham @dstandish

WDYT about the split and optimisations? (I know you've been developing a lot of those tests, and the "feedback" time for all of them was a big concern for you as well.) I would not want to make the split without your consent, as it is somewhat arbitrary (but it can be adapted in the future, and we can manually re-balance it if it falls out of balance).

It is developed in a way that will automatically adapt to however many packages you have; the only (reasonable, IMHO) requirement is that we have no tests at the "top level" - only "common utils" at the top level (I could even enforce that one if needed).
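
A minimal sketch of what that auto-adaptation could look like (the tests/charts root, the helper name and the exclusion rules are assumptions for illustration): discover the packages dynamically so the CI matrix grows or shrinks with them, and guard against test files sitting at the top level.

```python
# Sketch: discover the Helm "test packages" dynamically so the CI matrix adapts to
# however many packages exist, and fail fast if tests sit at the top level
# (only shared helpers/"common utils" are allowed there).
from pathlib import Path

HELM_TESTS_ROOT = Path("tests/charts")  # hypothetical root, for illustration only


def discover_helm_test_packages(root: Path = HELM_TESTS_ROOT) -> list[str]:
    stray_tests = sorted(p.name for p in root.glob("test_*.py"))
    if stray_tests:
        raise RuntimeError(f"Helm tests must live inside a package, found at top level: {stray_tests}")
    return sorted(p.name for p in root.iterdir() if p.is_dir() and not p.name.startswith("_"))


if __name__ == "__main__":
    # The resulting list can be fed to the CI workflow as a job matrix.
    print(discover_helm_test_packages())
```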

> Final results after some optimisation.
>
> airflow_aux: 7m 12s
> airflow_core: 8m 38s
> other: 6m 29s
> security: 3m 38s
> webserver: 4m 15s
>
> Comparing it with:
>
> all: 27m 58s (previously)

Helm Unit tests are using template rendering and the rendering
uses a lot of CPU for `helm template command`. We have a lot of
those rendering tests (>800) so even running the tests in parallel
on a multi-cpu machine does not lead to a decreased elapsed time
to execute the tests.

However, each of the tests is run entirely independently and we
should be able to achieve much faster elapsed time if we run
a subset of tests on separate, multi-CPU machines. This will not
lower the job build time, however it might speed up elapsed time
and thus get a faster feedback.

This PR achieves that.
@potiuk potiuk force-pushed the parallelise-helm-tests-more branch from d533171 to 496db90 on April 17, 2023 at 07:41
@potiuk (Member Author) commented Apr 17, 2023

Merging, we can further move those tests around later if needed.

@potiuk potiuk merged commit 731ef3d into apache:main Apr 17, 2023
@potiuk potiuk deleted the parallelise-helm-tests-more branch April 17, 2023 20:41
@jedcunningham (Member) commented

Sorry, I was traveling. High level, no objections here.

wookiist pushed a commit to wookiist/airflow that referenced this pull request Apr 19, 2023

@dnskr (Contributor) commented Apr 19, 2023

It's an amazing improvement!

@potiuk (Member Author) commented Apr 19, 2023

> It's an amazing improvement!

🙇 Thanks 🙇

@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Apr 23, 2023
ephraimbuddy pushed a commit that referenced this pull request Apr 23, 2023 (cherry picked from commit 731ef3d)

@ephraimbuddy ephraimbuddy added this to the Airflow 2.6.0 milestone Apr 23, 2023
Labels: area:dev-tools, changelog:skip, kind:documentation