Ephemeral mode with runner Controller does not work as expected #1044
Comments
@achebel This is expected behavior. There's really no "guarantee" that we receive a GitHub webhook for a queued job (which triggers scale up) or a completed job (which triggers scale down), so we can't rely solely on those triggers for scaling. Instead, we only use those webhook events to add and remove capacity reservations, which results in this behavior. Do you have any specific issue due to that behavior? |
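For context, the webhook-driven scaling described here is configured on a HorizontalRunnerAutoscaler: each matching workflow_job event adds a capacity reservation that expires after the configured duration. A minimal sketch, assuming the trigger form shown in ARC's documentation (metadata names are placeholders):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-autoscaler          # placeholder name
spec:
  scaleTargetRef:
    name: example-runnerdeploy      # placeholder RunnerDeployment name
  minReplicas: 0
  maxReplicas: 10
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}             # matches workflow_job webhook events
      amount: 1                     # each "queued" event adds one capacity reservation
      duration: "30m"               # the reservation expires even if the "completed" event is lost
```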
No specific issue, but what we expected is simply to instantiate a pod when a job is triggered and terminate the pod once the job is handled, without instantiating a new pod when no webhook event has arrived. |
@achebel Thanks. Yep, that's the expected behavior. It will scale down soon after you get "completed" |
Also, a capacity reservation eventually expires, so redundant runners won't take up your cluster resources for long even if the controller somehow fails to receive a workflow_job event with "completed" status. |
Yep, this is what we understood when setting up durations and scaleDownDelaySecondsAfterScaleOut. |
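For readers following along, both of those knobs live on the HorizontalRunnerAutoscaler spec; a hedged sketch with illustrative values only:

```yaml
# Fragment of a HorizontalRunnerAutoscaler spec; values are illustrative only.
spec:
  scaleDownDelaySecondsAfterScaleOut: 300   # hold off scale-down for 5 minutes after a scale-up
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "10m"                       # shorter duration means stale reservations expire sooner
```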
@achebel I hear you, but I'm not eager to add that myself, because webhooks are definitely not a reliable communication mechanism and can easily miss the "demand" when something goes wrong on GitHub, the network infrastructure in between, your K8s cluster, your load balancer, or finally actions-runner-controller's webhook server. With your "preferred" method, every missing webhook event would increase the discrepancy between the desired and the current state, which isn't what you'd want. I can't help thinking you're considering it too easy. It's not. I don't want to spend my spare time developing a feature that doesn't pay off and only helps you shoot yourself in the foot. But if you're going to contribute it, I'll definitely review it. You'd need thorough documentation noting every gotcha, though. |
If the feature does get developed by someone, it definitely should not be the default behaviour, as that would open a flood gate of support issues I imagine. The default should stay as it is.

@achebel Whilst I understand why your described behaviour makes sense logically (assuming you never have any connectivity issues, which seems very optimistic to me, but perhaps someone would be OK with hitting the re-run button every now and then when those crop up), why would you want this behaviour to begin with? The only reason I can think of is security. If that is the driver, then ARC, to me, is the wrong place to be putting that security: GitHub should be ensuring workflow runs can only run against runners with that label combo and/or runners attached to the runner groups associated with those repositories, at which point whether runners hang around or not shouldn't matter. If the runner has a powerful role attached that you need to protect, the runner should be at the repo level or in a runner group. Labels are client side and can be changed in a PR, and that change will be respected; as a result they don't provide any protection whatsoever and should not be relied upon for any level of security. The only other reason I can think of is so you can aggressively spin down nodes; however, scaling nodes is a slow process, so again this doesn't seem appropriate either. |
Rephrasing my previous statement: if you have an enterprise-ish requirement that forces you to run only the necessary number of runners, it can't be implemented reliably today due to missing GitHub APIs. You'd better ask GitHub about that first. More concretely, there should be something like an "assign this runner to that workflow job" API and a "list all workflow jobs that have no runner assigned yet" API. With those two APIs, it would be straightforward for us to keep only the necessary number of runners. |
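To make the proposal concrete, here is a sketch of what such endpoints could look like; these are purely hypothetical and do not exist in the GitHub REST API:

```yaml
# Purely hypothetical endpoints -- neither exists in the GitHub REST API today.
assign_runner_to_job:
  method: PUT
  path: "/repos/{owner}/{repo}/actions/jobs/{job_id}/runner"     # hypothetical
  purpose: Pin a specific self-hosted runner to a queued workflow job
list_unassigned_jobs:
  method: GET
  path: "/orgs/{org}/actions/jobs?status=queued&assigned=false"  # hypothetical
  purpose: List workflow jobs that no runner has been assigned to yet
```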
We should bug GitHub for those API endpoints! 🐛 😂 You can try using very low capacity reservation times (I have), but then all the issues @mumoshu describes come into play. Starvation can happen, but I think runners which are already running a job will in fact continue running (I think). I have not yet witnessed API calls getting lost on GKE; it's something to look out for in the future hopefully 👀 |
Thank you all for your feedback. |
@achebel Can you explain why, though? I understand the behaviour you were expecting, but not a reason for it beyond "because reasons". At the moment I'm struggling to see an actual issue with the current behaviour other than that it doesn't feel intuitive to you. Have you got a more specific reason for the current behaviour not working for you? I think it's probably worth expanding the docs a bit to make it clearer how the current logic works, as I can see how it might not seem as logical as the behaviour you've outlined without thinking more deeply on it. |
@mumoshu, why would you need such an API? |
At the moment the runners are […]. There are other difficulties; for example, we ourselves just started using this software. Yet right now it's not a big issue, because many of those jobs are relatively fast (no changes -> exit) and the waiting jobs get picked up. From what I can tell, GitHub webhooks are quite reliable. But even issues like the one I'm describing would benefit from an API to check "reality" from GitHub's point of view, just to safeguard against missing webhooks and starvation. Without such an API, using that event approach, I couldn't be using this software [edit: right now]. Stuff happens; sometimes webhooks don't arrive, and it doesn't matter whose fault it is (it could of course be GitHub's, a third party in between, or your/our own). Not having enough workers and having to be "hands on" babysitting this system wouldn't be good. Using a low reservation time for capacity and no scale-down backoff does pretty much what you describe, I think. |
@EricDales In other words, that's because GitHub webhooks (and webhooks in general), along with many webhook server implementations, are based on the implicit assumption that event delivery is guaranteed, which isn't true.
That only works when your event source is durable and consistent and you can freely replay the history. Webhooks aren't. |
@mumoshu, first I would like to warmly thank you for taking the time to answer; I really appreciate this discussion. I confirm that I want to rely on webhooks and ephemeral mode. I understand that actions-runner-controller is intended to work in different contexts (high-latency networks, ...), but if you consider webhook delivery unsafe, why did you implement webhook scaling? Today our company runs more than 50k Jenkins builds daily, relying on webhook deliveries, and 100% of webhooks are delivered. Even if there were some failures, applying SRE rules, a 5% error budget would still be acceptable. On a side note, a great improvement from GitHub would be a red light in the right column of a repo page warning that a webhook hasn't been delivered, plus a button to redeliver the payload (which is actually possible by digging into the repository settings, but a bit cumbersome). That could also be a commit status. @Link-, can you please submit that to the product management team? |
@EricDales To make it scale quickly in the case where it does receive the webhook event successfully. Not being 100% guaranteed doesn't mean we can't utilize it at all; 99% guaranteed means we have to design it so that the 1% of failures doesn't break it forever.
I hear you. But that doesn't mean it "won't" fail in the future, right? That's my point. BTW, if you're fine assuming delivery is 100%, that's OK: you can just make the capacity reservation expiration very short in your RunnerDeployment spec. Then you'll mostly get what you want. |
I think what is getting lost in this issue is that the solution is primarily aimed at github.com users. The docs do say https://github.com/actions-runner-controller/actions-runner-controller#github-enterprise-support:
github.com regularly has outages, with Actions impacted at least once a month, if not more often: https://www.githubstatus.com/history. Often when these outages hit, webhooks do fail. I'm sure that within a self-hosted GHES environment webhooks will be very reliable, but that isn't the environment primarily targeted. I'd again ask, though: what is the technical reasoning for wanting zero slack?
I would have a play with the reservation expiration settings first and see if you can achieve what you want. We can probably do a better job docs-wise of highlighting this sort of configuration, so we'll look at updating them. |
👋 everyone, @mumoshu @toast-gear, your and the other maintainers' time on this thread is invaluable, thank you 🙇♂️ @EricDales there's already a view in GitHub Apps, as well as at the Enterprise level, indicating the state of webhook deliveries. Webhooks can be redelivered with the exact same payload on demand in case of a failure, both on GitHub.com and GHES. I believe the reasons for wanting 0 slack are:
However, I also understand and empathise with @mumoshu's and @toast-gear's perspective, especially after having a look at the implementation. Irrespective of GitHub's SLAs and service reliability, webhooks are not transactional and are stateless by design. They can be looked at as part of an eventual consistency model. This controller has been implemented to guarantee this eventual consistency through scheduled reconciliation. Changing this behaviour could lead to inconsistencies. It would be interesting to explore whether there is a model that can give us the best of both worlds. Also, I think it would be great if we could determine whether the limitations are:
If it's #1, there could be ways around it; if it's #2, I'm afraid we have to either accept the current state or look for alternative solutions. |
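On the reconciliation point: the controller periodically re-syncs desired versus actual replicas regardless of webhook delivery, and the interval is configurable. A rough sketch of the relevant Helm values, assuming the chart keys are unchanged in your version:

```yaml
# values.yaml fragment for the actions-runner-controller Helm chart
# (key names assumed from the chart's documented values; verify against your chart version).
syncPeriod: 1m                  # how often the controller reconciles, independent of webhooks
githubWebhookServer:
  enabled: true                 # deploys the server that receives workflow_job webhook events
```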
Linking #911 (comment) |
Maybe both? 😅 For me, it's 1 because of 2. I don't believe it's feasible, and I'm not willing to spend my spare time just to prove its infeasibility. But if anyone actually tries to implement it, I'd happily advise or review. |
@achebel @genisd @Link- @EricDales @toast-gear Hey everyone! I think I've managed to improve ephemeral runners to scale on webhooks much as you've expected. Please read my lengthy comment in #911 (comment). At the very least, it's now very unlikely that you'll see something like the below in the next version of ARC.
Still, keep in mind that webhook delivery is not 100% guaranteed, so you'd better keep minReplicas at 1 instead of setting it to 0 if your workflow is rarely run but important. |
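A sketch of that safeguard on the autoscaler, with placeholder names:

```yaml
# Fragment of a HorizontalRunnerAutoscaler spec; names are placeholders.
spec:
  scaleTargetRef:
    name: example-runnerdeploy
  minReplicas: 1                # keep one warm runner in case a webhook delivery is lost
  maxReplicas: 10
```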
I'm testing current master right now (with webhooks only) |
I will test it on our side as well and give you feedback, Daniel 😊 |
Seems to be working fine here with webhooks only (minimum set to 1, not 0). I noticed one Kubernetes configuration issue, which might be slightly related and is easily fixed, in fact. |
It would be great if you could test #1127 which isn't merged to master yet 😄 |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
@toast-gear Can we close this as resolved? |
Definitely |
Describe the bug
After triggering a job:
1- The runner is started => OK, as expected
2- It registered itself to the organization (the runner was in idle status) => as expected
3- GitHub looked for an online and idle runner matching the job's runs-on label (the status of the runner switched from idle to active) => as expected
4- Then, after running the job, the runner was automatically unregistered from our GitHub instance and the runner was automatically shut down => OK, as expected
However, a few seconds later:
5- A new runner is provisioned based on the capacity reservation => not what we expected
Unfortunately, it seems that we are not able to disable this behavior? |
Expected behavior
When a job is handled, the pod is terminated and no new pod is instantiated unless a new job is fired.
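For reference, the runner side of such a setup would look roughly like the RunnerDeployment below, assuming organization-level ephemeral runners (names and labels are placeholders; on recent ARC versions ephemeral is already the default):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy     # placeholder name
spec:
  template:
    spec:
      organization: example-org  # placeholder: org-level registration, as in the report
      ephemeral: true            # the runner unregisters and its pod goes away after one job
      labels:
        - example-label          # placeholder runs-on label
```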