This repository has been archived by the owner on May 14, 2024. It is now read-only.

GitHub Actions CI contention #237

Closed
briantist opened this issue Jun 4, 2023 · 32 comments

@briantist

Summary

As more ansible collections within the ansible-collections organization use GitHub Actions as their CI, we're seeing increased contention for CI runners. This has been painful in the past, even for some single collections, but is getting worse, primarily because concurrency limits are at the account, not repository, level.

See:

This means that all repositories in this org are sharing the 20 concurrent job limit. To use my small collection as an example, a single PR to community.hashi_vault generates 27 jobs for CI, plus 4 for docs build. If there are multiple PRs, or even several commits within a PR before the previous runs finish, the wait times grow and grow. This is after recently removing two older versions of ansible-core from the CI matrix.
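(As an aside: a concurrency group that cancels superseded runs softens the re-push case a little, though it does nothing for contention across repositories. A minimal sketch of what I mean, with an illustrative group name:)

```yaml
# Minimal sketch (illustrative, not any collection's actual workflow): when a
# new commit is pushed to the same branch/PR, cancel the still-queued or
# still-running workflow for the previous commit so it stops occupying
# org-wide runner slots.
concurrency:
  group: ci-${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```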

Doing releases, which generate many runs (from the release PR, the push from merging the release PR, the push of the tag, etc.), ends up taking something like 2 hours when the actual steps take only minutes.


New collections continue to default to GHA, and there seems to be a push to move existing CI out of Zuul and Azure into GHA (just two examples):

  • Move some tests to GH Actions: ansible-collections/community.vmware#1747
  • Add github action for sanity and unit tests: ansible-collections/amazon.aws#1393

I think this is a good thing, but we will see this problem compound as a result.


Suggestion

As shown in the billing page linked above, it is possible to get more concurrency with a non-free GitHub plan.

[Screenshot: GitHub Actions concurrent job limits by plan, from GitHub's billing documentation]

If we can get the ansible-collections organization (at a minimum) into a non-free plan, we can get increased concurrency, which would be a huge improvement for everyone. The more concurrency, the better.

This probably has to be done by Red Hat since they own the organization. My hope is that they already have a non-free account that the org could be moved into, and that it won't require a new sign-up/agreement or whatever, but I really don't know any details about that.

Potential cost

Disregarding the cost of a non-free plan itself, GitHub-hosted runners are normally billed at per-minute rates that vary by operating system.

Public repositories can use GHA for free, and so are not billed at per-minute rates; I think this still applies to public repos in a paid account.

Based on a number of assumptions that need to be confirmed, there's a chance that we can get increased concurrency without any additional spend. The assumptions are:

  • there is already a paid account that the organization can be moved into, and moving it will not increase the cost of that plan
  • public repositories in the paid account will still not need to pay for GHA hosted runners
  • the additional concurrency limits granted by the paid plan will also apply to the free runners in public repos

All of these assumptions need to be checked.


I really cannot overstate how helpful this will be for the community.

@felixfontein
Contributor

Some of the AZP-based collections (community.general, community.crypto, community.docker) now also make more use of GHA, since we should no longer use AZP for EOL versions of ansible-core/ansible-base/Ansible 2.9, which some of these collections still support. This increases the general CI load for all collections in gh.com/ansible-collections.

(And yes, CI currently feels incredibly slow in all the collections I'm maintaining, especially if there is more than one PR - in the same collection or across multiple collections - but also if there is a single PR; the same is probably true in other collections I don't actively watch.)

@mariolenz
Contributor

The integration tests for community.vmware running on Zuul take hours, so I can live with GH Actions taking several minutes. But I understand your problem, especially since it looks like even more tests will be moved to GH Actions. Quoting from ansible-collections/community.vmware#1746:

We are also in the process of migrating the other collections the team manages to Github Actions and while we don't have a migration plan for vmware.vmware_rest's CI yet we will need to evaluate this in the long term. The complexity of the Zuul platform is not meeting our needs for other collections and we'd like to reduce the amount that we depend on this system and have to maintain expertise in it.

@briantist
Author

The integration tests for community.vmware running on Zuul take hours, so I can live with GH Actions taking several minutes. But I understand your problem, especially since it looks like even more tests will be moved to GH Actions.

I definitely am not advocating for not moving things to GHA; it's certainly an improvement in this case. And I am less concerned with how long a particular test takes to run on GHA once it starts; the issue is how long the run itself sits queued because we hit our concurrency limit.

As another example, some of my individual job runs take 10+ minutes (while the whole of CI can take 30+); this is partially because the overall test time would be longer if I made the tests more parallel, due to queueing times. If we had massive concurrency (like the 500 limit we'd get with an enterprise account) I would restructure my CI runs and they would complete much faster overall, and be simpler in design.
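To make that tradeoff concrete, here is a rough sketch of the kind of bundling I mean (the target names are made up and this is not my actual workflow):

```yaml
# Rough sketch with made-up target names: under a tight concurrency limit,
# bundling several integration targets into one matrix entry keeps the job
# count (and therefore the queueing) down, at the cost of longer serial jobs.
name: bundled-integration-sketch
on: pull_request

jobs:
  integration:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        targets:
          - "target_a target_b target_c"  # several targets share one job
          - "target_d target_e"
    steps:
      - name: Run the bundled targets serially in one job
        run: echo "would run ansible-test integration ${{ matrix.targets }}"
        # With much higher concurrency, each target could be its own matrix
        # entry instead: simpler CI and a faster overall run.
```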

So the issue does not just result in long, painful wait times; it also redirects our limited engineering time and effort toward working around the problem, effort that could be better spent on the collections themselves, and often results in more complex and harder-to-maintain CI.

@cybette
Member

cybette commented Jun 5, 2023

Appreciate the write-up and additional details everyone! Just a note to say this is on our radar and we're looking into it.

@mariolenz
Contributor

I definitely am not advocating for not moving things to GHA; it's certainly an improvement in this case.

I didn't say you're advocating against moving things to GHA. After all, you've said:

New collections continue to default to GHA, and there seems to be a push to move existing CI out of Zuul and Azure into GHA (just two examples):

* [Move some tests to GH Actions ansible-collections/community.vmware#1747](https://github.com/ansible-collections/community.vmware/pull/1747)

* [Add github action for sanity and unit tests ansible-collections/amazon.aws#1393](https://github.com/ansible-collections/amazon.aws/pull/1393)

I think this is a good thing, but we will see this problem compound as a result.

What I wanted to say is: it looks like RH plans to move CI jobs from their own Zuul CI to GH Actions, which makes the current situation even worse.

@GregSutcliffe
Contributor

Just wanted to put an update here because I figure it might come up again soon. We've been talking about a few options internally, and I want to make sure we get early input before "suggestions" start looking like "decisions" :)

There's a couple of options we could go for:

  • The Community Team could pay for a non-free GitHub plan (probably at the Team level, as Enterprise is significantly more expensive) for the Ansible Community
    • Pros:
      • Concurrent runners are just for this org
      • We can have the Steering Committee as org admins
      • Maintainers for repos can be "outside collaborators" which will keep costs down
      • Pretty cheap ($4 / seat)
    • Cons:
      • Only upgrades us to 3000 mins/month (from 2000 for a free org)
      • Another org & account to maintain (minor burden)

This is pretty easy to justify, but is a 50% increase in CI minutes enough? If we feel the 50k minutes of Enterprise is needed, how many seats do we think we need (my estimate is ~20)?

Another option we're thinking about is self-hosted runners:

  • Pros:
    • Can be added to a free org, so no direct monetary cost
    • Allows a way for interested companies / people to "donate" hardware to the project (easier than donating money)
    • Gives us some flexibility in running intensive collections on specific hosts
  • Cons:
    • Need to source the runners
    • Need to administer the runners

@Spredzy @leogallego did I miss anything?

Very interested to hear some thoughts on either of these, or other possible solutions, please let us know!

@GregSutcliffe added the next_meeting ("Topics that need to be discussed in the next Community Meeting") label on Jun 28, 2023
@leogallego
Contributor

Great summary @GregSutcliffe, I would only add that there is the option to "Add Action Runners" for an additional cost, even if we use the Team plan. At the time of writing this comment:

GitHub-managed standard 2-core machines with default GitHub images:

  • Ubuntu Linux: $0.48/hr
  • Microsoft Windows: $0.96/hr
  • macOS: $4.80/hr

@felixfontein
Contributor

I think by far the cheapest thing is to use self-hosted runners - if you ignore the administration costs.

The main problem with self-hosted runners is (IMO) making them sufficiently secure. Having a system that creates a new VM for every CI job and shuts the VM down at the end of the run is the safest option (and also what GitHub itself does), but such a system doesn't just set itself up (I'm not sure how much of GitHub's public software covers this). We also definitely want some caching of docker/podman images, since almost every ansible-test run pulls at least one image, and we want to avoid the extreme amount of traffic this can easily generate.
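To illustrate only the easy part (a sketch that assumes runners are already registered with the labels "self-hosted" and "linux"): pointing a job at such runners is trivial; the ephemeral provisioning and the image caching are the actual work.

```yaml
# Sketch only: assumes ephemeral self-hosted runners already registered to the
# org with the labels "self-hosted" and "linux". The workflow side barely
# changes; the hard parts (per-job VM lifecycle, image caching via a
# pull-through registry mirror or pre-baked VM images) live outside of it.
name: self-hosted-sketch
on: pull_request

jobs:
  integration:
    runs-on: [self-hosted, linux]
    steps:
      - uses: actions/checkout@v3
      - name: Placeholder for the actual test command
        run: echo "ansible-test would run here"
```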

Another option might be using a third-party CI service (I don't really have experience with any, so I won't name anything here but AZP which is already used by some collections and by ansible-core itself).

@briantist
Author

briantist commented Jun 28, 2023

@GregSutcliffe thanks very much for looking into all this.

I want to first clarify the "minutes" in GHA, since I had a different understanding of how they work.

My understanding is that per-minute rates apply only to use within private repositories, and there are no per-minute charges for public repositories.

The number of minutes "included" in a plan only counts toward billable minutes (larger runners, or any runners in private repos). Similarly, this is what the multipliers refer to for macOS and Windows runners (they apply to the number of included minutes, not to the billing rate).

Evidence toward that:

GitHub Actions usage is free for standard GitHub-hosted runners in public repositories, and for self-hosted runners. For private repositories, each GitHub account receives a certain amount of free minutes and storage for use with GitHub-hosted runners, depending on the product used with the account. Any usage beyond the included amounts is controlled by spending limits. [...]

With this being the case, it's not the number of minutes that we need to worry about; our issue is purely the number of concurrent jobs that can be running at a given time, so that's basically the only metric we have to consider, IMO.

It's also possible I am not understanding this correctly... but if I am, then upgrading the plan should by far be the most straightforward option.


Re: the idea of using self-hosted runners: I personally think that the work involved will far exceed the benefit unless we've (you've?) engineered a pretty solid implementation with feature parity to GH-hosted runners, keeping in mind that this covers more than just Linux (though if we covered Linux it would take care of the bulk of jobs, so it would still be helpful).

I think by far the cheapest thing is to use self-hosted runners - if you ignore the administration costs.

Emphasis on the last bit.

That being said, I do like the "donation" option, but that can be done now without the need for making any changes to the GitHub plan. It's more of a policy thing so it feels like a conversation we could have, but unrelated to this.

Self-hosted runners do allow for larger/faster runners, and increased scaling (potentially), but I think it hinges entirely on having ephemeral runner pools.

I personally think it's a better cost-benefit and experience to go ahead and pay GitHub for more capacity, but that's easy to say when it's not my money ;)

If we did want to look at automating self-hosted ephemeral runners, I've seen this service around, but I am not affiliated and have never used it myself: https://cirun.io/

@Andersson007
Contributor

In addition, I created a PR against the collection_template GHA matrix template to make future maintainers aware of the limitations: ansible-collections/collection_template#62. FYI

@GregSutcliffe
Contributor

Thanks @briantist - I had missed that nuance. I think I agree that self-hosted runners are more work than we'd like, but I didn't just want to present a single fait-accompli "option" :)

I had a look at the billing data for ansible-collections, and at least according to GitHub, we've used exactly 0 minutes of CI:

[Screenshot: the ansible-collections Actions billing/usage page showing 0 minutes used]

I suspect this is because public repos don't count, as discussed above, so it's not logged - but that makes it hard to know what we've actually used. However, if we're sure that a Team account will help, then I think that's a fairly low-cost option anyway. Thoughts? I'll have a dig to see if I can get more "real" data, but if anyone already knows how, get in touch ;)

@briantist
Author

Indeed, showing 0 for usage is accurate for billing reasons, but it's not helpful data!

I'm not sure where to see actual usage in aggregate, but on a per-run level, you can look at any GHA run and click the Usage link in the lower left below the jobs. That will break down per-minute usage of each job and then give a total for the run, both for actual and billable usage. If that were available at a repository or organization level it would be nice, but I haven't been able to find it on projects I have more access to.
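If someone wants to pull those numbers programmatically, I believe the same per-run breakdown is exposed by the REST API; a rough sketch (the job name is made up, and I haven't wired this up anywhere):

```yaml
# Rough sketch: a job that asks the REST API for this run's usage breakdown
# (billable milliseconds per OS plus total run duration) and prints it, so the
# numbers at least land in the logs. Jobs still running at this point won't
# have final numbers yet, of course.
name: usage-report-sketch
on: workflow_dispatch

jobs:
  report-usage:
    runs-on: ubuntu-latest
    steps:
      - name: Show usage for this run
        env:
          GH_TOKEN: ${{ github.token }}
        run: gh api "repos/${{ github.repository }}/actions/runs/${{ github.run_id }}/timing"
```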

If you have a contact or rep at GitHub, it would be great to confirm our understanding of it all, and maybe they know how to get better data too.

If we're right that we only need to worry about concurrency, then based on the screenshot in my original post, a Team plan triples it from 20 to 60 concurrent jobs, which I think would be a big, noticeable improvement!

Re: self-hosted, thank you for including that option, you're right to include that in consideration and offer alternatives.

Thanks very much for all of this!

@Andersson007
Contributor

Also added a corresponding note to Collection requirements ansible/ansible-documentation#40, FYI

@mariolenz
Contributor

Sorry, I'm no expert on GHA but does this:

https://github.com/ansible-collections/collection_template/blob/21bae2b5c56c6e758e4f2780646b8277e1ad5ed9/.github/workflows/ansible-test.yml#L286-L287

mean that tests are run against ansible-core 2.15 with Python 2.7? Does 2.15 even support 2.7 still? I mean on the controller node. I might be wrong, but if I'm not this would be a test we can get rid of. And maybe there are some more.

@briantist
Author

Sorry, I'm no expert on GHA but does this:

https://github.com/ansible-collections/collection_template/blob/21bae2b5c56c6e758e4f2780646b8277e1ad5ed9/.github/workflows/ansible-test.yml#L286-L287

mean that tests are run against ansible-core 2.15 with Python 2.7? Does 2.15 even support 2.7 still? I mean on the controller node. I might be wrong, but if I'm not this would be a test we can get rid of. And maybe there are some more.

Thanks, good question @mariolenz. Yes, Python 2.7 is still supported on targets in 2.15: https://docs.ansible.com/ansible/latest/reference_appendices/release_and_maintenance.html#support-life


I think you won't find all that many collections testing that combination. It's a good thing to raise, and we should still look to reduce where we can, but the larger issue will not be meaningfully solved by trimming a few jobs here and there.

While the issue of start times in particular is getting worse due to more collections using GHA, contention has been a problem for a long time even for single collections like mine just due to having more than 20 jobs per run. Increased concurrency will be a big quality of life improvement and increase in velocity.

@mariolenz
Contributor

Thanks, good question @mariolenz. Yes, Python 2.7 is still supported on targets in 2.15: https://docs.ansible.com/ansible/latest/reference_appendices/release_and_maintenance.html#support-life

A lot of collections don't ssh somewhere and run there, but run on the controller and connect to a remote API. So they only need to run the tests against Python versions ansible-core supports on the controller node. Just wanted to mention it.

While the issue of start times in particular is getting worse due to more collections using GHA, contention has been a problem for a long time even for single collections like mine just due to having more than 20 jobs per run. Increased concurrency will be a big quality of life improvement and increase in velocity.

I understand the problem, and my suggestion wouldn't help much. Still, I thought I should mention that there might be some opportunity for improvements. Slight improvements, though... nothing to really fix the basic problem.

@briantist
Author

A lot of collections don't ssh somewhere and run there, but run on the controller and connect to a remote API. So they only need to run the tests against Python versions ansible-core supports on the controller node. Just wanted to mention it.

Interesting, I didn't know that! I thought the majority of collections tested against containers.

@mariolenz
Contributor

@briantist Take dellemc.openmanage as an example:

Dell OpenManage Ansible Modules allows data center and IT administrators to use RedHat Ansible to automate and orchestrate the configuration, deployment, and update of Dell PowerEdge Servers and modular infrastructure by leveraging the management automation capabilities in-built into the Integrated Dell Remote Access Controller (iDRAC), OpenManage Enterprise (OME) and OpenManage Enterprise Modular (OMEM).

The classic Ansible approach of copying the module to the target and running it there doesn't apply. I'm pretty sure it's technically impossible to run Python code on iDRAC, and quite sure that it's at least not supported with OME (I don't know about OMEM, though). So what those modules do is run on the controller and talk to a remote API. Since no code is executed on the target, there's no need to test against target Python versions.

I don't know how many collections work like this, but I think there are quite a lot. I should say that all collections dealing with cloud infrastructure don't run the modules directly "on" the cloud target; they talk to an API.

There are quite a few collections that automate things where the "natural" approach is to call an API (a lot of storage arrays, firewalls, network devices...), because you can't run Python code on the target, or it isn't supported, or at least it isn't best practice.

Classic Ansible:

  1. Run Playbook
  2. Copy modules to target
  3. Run modules on target

"API collections":

  1. Run Playbook
  2. Run modules on controller node (delegate_to: localhost) - see the sketch after this list
  3. One or more API calls
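To make that concrete, a typical task in such a collection looks something like this (illustrative only; hostnames, credentials and the VM name are placeholders):

```yaml
# Illustrative task (placeholder values): the module runs on the controller
# and only talks to the vCenter API, so nothing is copied to or executed on
# the target, and only the controller's Python matters.
- name: Gather info about a VM via the vCenter API
  community.vmware.vmware_guest_info:
    hostname: vcenter.example.com
    username: administrator@vsphere.local
    password: "{{ vcenter_password }}"
    datacenter: DC1
    name: my-vm
  delegate_to: localhost
```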

At least that's the usual workflow for me when using community.vmware. Of course, you can delegate to another host (not localhost / the controller node), but I don't know the Python requirements in that case: do the controller node's or the target's requirements for the Python version apply?

@felixfontein
Contributor

A lot of collections don't ssh somewhere and run there, but run on the controller and connect to a remote API. So they only need to run the tests against Python versions ansible-core supports on the controller node. Just wanted to mention it.

Interesting, I didn't know that! I thought the majority of collections tested against containers.

These are (usually) also tested in containers, it's just that no specific target container is needed (i.e. target = controller).

(There are also some special cases where such modules are run on a target != controller, namely when the machine/API you need to talk to isn't reachable from your machine, but only through some jump host. Then you can run ansible-playbook on your machine, while these modules run on the jump host :) I guess that isn't very common though - in fact it is probably very rare.)

@felixfontein
Contributor

mean that tests are run against ansible-core 2.15 with Python 2.7? Does 2.15 even support 2.7 still? I mean on the controller node. I might be wrong, but if I'm not this would be a test we can get rid of. And maybe there are some more.

This can be needed for all collections that have content intended to run on a target (and that don't have a restriction on the target Python which disallows 2.7). (You don't have to test every single supported ansible-core release with a Python 2.7 target, but at least some; whether 2.7 belongs on the list is up to the collection maintainers to decide.)
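To sketch what "at least some" could look like (purely illustrative; the variable names are made up and this is not the collection_template's exact layout):

```yaml
# Illustrative matrix: the oldest target Python is exercised against a single
# ansible-core release instead of every supported one, which trims a few jobs
# without giving up that coverage entirely.
strategy:
  matrix:
    ansible:
      - stable-2.14
      - stable-2.15
      - devel
    target_python:
      - "3.10"
    include:
      # keep one Python 2.7 target combination
      - ansible: stable-2.15
        target_python: "2.7"
```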

@briantist
Author

A lot of collections don't ssh somewhere and run there, but run on the controller and connect to a remote API. So they only need to run the tests against Python versions ansible-core supports on the controller node. Just wanted to mention it.

Interesting, I didn't know that! I thought the majority of collections tested against containers.

These are (usually) also tested in containers, it's just that no specific target container is needed (i.e. target = controller).

(There are also some special cases where such modules are run on a target != controller, namely when the machine/API you need to talk to isn't reachable from your machine, but only through some jump host. Then you can run ansible-playbook on your machine, while these modules run on the jump host :) I guess that isn't very common though - in fact it is probably very rare.)

I know about these, I just didn't know that the number was "a lot" ;)

@mariolenz
Contributor

@briantist I didn't take a closer look at how many collections don't run on the target but run on the controller and talk to an API. And anyway, "a lot" isn't really defined. But I'd say we're talking about 10 to 20% of the collections in the community package - maybe a bit more, but not less. However, this is just a guess on my side.

@briantist
Author

Thanks @mariolenz , really appreciate the info!

@samccann removed the next_meeting ("Topics that need to be discussed in the next Community Meeting") label on Jul 19, 2023
@briantist
Author

Hi @GregSutcliffe , wondering if there's any news on this?

@GregSutcliffe
Contributor

Oof, I lost track of this with all the other fun we've been having. Apologies!

I've re-read the posts I missed, but I don't see anything that changes the current plan, which I believe is:

  • Ask GH for some help to understand our numbers
  • Look at getting a Team plan in place to help with the concurrency

I'll find out who our GH contact is and get in touch with them.

@briantist
Author

Hey @GregSutcliffe , I know things have been busy with the Ansible forum rollout and such, just want to check in on this again because it's still quite an issue.

@GregSutcliffe
Contributor

GregSutcliffe commented Oct 2, 2023

Apologies, indeed it has been a busy month. I have just emailed a contact at GH, likely they are not the right person to speak to but they should be able to help me speak to whoever is. Will update once I know more, apologies for the delay.

@GregSutcliffe
Contributor

GregSutcliffe commented Oct 5, 2023

So, I have news :)

GitHub have kindly upgraded us to the Team plan for this org, which gives us 50 concurrent jobs instead of 20. Hopefully that will help things feel better right away. Seats should not be an issue; we have enough to cover all the org members, plus a bit of headroom, so I'm not worried about that.

We're also looking into why the usage report doesn't actually report usage (billing and usage are not the same thing). We'll give it a week or so on the new plan to see if data starts to come through, and then I'll check in with GitHub again if not. Once we have usage data, we'll have the tools to check what's going on if we start to hit issues again.

Thanks for your patience folks! Sorry it took so long, that's entirely on me - and obviously, big thanks to GitHub for the upgrade.

@felixfontein
Contributor

@GregSutcliffe that's really awesome news! :)

@briantist
Author

briantist commented Oct 5, 2023

🎉🎊🥳

@GregSutcliffe amazing! thank you so much! I can confirm I was able to run my CI today (30 jobs?) with no queued jobs, so the higher concurrency is definitely in effect.

@felixfontein I'm especially interested in your anecdotal experiences since you see so many more runs than I do, in many different collections

@felixfontein
Contributor

I don't have much anecdotal experience yet, but so far GHA feels a lot smoother than before.

@Andersson007
Contributor

Great news!
So I'm closing the issue. If anyone thinks this topic needs more discussion, just reopen it.
Thanks so much to everyone!
