Ephemeral mode with runner Controller does not work as expected #1044

Closed
achebel opened this issue Jan 11, 2022 · 28 comments
Labels
pinned Misc issues / PRs we want to keep around

Comments

achebel commented Jan 11, 2022

Describe the bug
After triggering a job:

  1. The runner is started => OK, as expected
  2. It registered itself to the organization (the runner was in idle status) => as expected
  3. GitHub looked for an online and idle runner that matched the job's runs-on label (the runner's status switched from idle to active) => as expected
  4. After running the job, the runner was automatically unregistered from our GitHub instance and automatically shut down => OK, as expected

However, a few seconds later:

  5. A new runner is provisioned based on the capacity reservation => not what we expect

Unfortunately, it seems that we are not able to disable this behavior?

Checks

  • My actions-runner-controller version (v0.x.y) does support the feature
  • I'm using an unreleased version of the controller I built from HEAD of the default branch

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
When a job is handled, the pod is terminated and no further pod is instantiated unless a new job is triggered.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • Controller Version: 0.20.2
  • Deployment Method: Helm and Kustomize
  • Helm Chart Version: 0.13.2

Additional context
Add any other context about the problem here.

mumoshu (Collaborator) commented Jan 11, 2022

@achebel This is expected behavior. As there's really no guarantee that we receive a GitHub webhook for a queued job (which triggers scale up) and for a completed job (which triggers scale down), we can't rely solely on those triggers for scaling. Instead, we only use those webhook events to add and remove capacity reservations, which results in this behavior.

Do you have any specific issue due to that behavior?

achebel (Author) commented Jan 11, 2022

No specific issue, but what we expected is simply to instantiate a pod when a job is triggered and terminate the pod when the job is handled, without instantiating a new pod when no webhook event has arrived.
We were targeting autoscaling with webhook-driven scaling and
minReplicas: 0
maxReplicas: n
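
For reference, the kind of setup we are aiming at would look roughly like the sketch below (a sketch only; "example-runner-deployment" is a placeholder name and the exact field names may differ between ARC versions):

    apiVersion: actions.summerwind.dev/v1alpha1
    kind: HorizontalRunnerAutoscaler
    metadata:
      name: example-autoscaler
    spec:
      scaleTargetRef:
        name: example-runner-deployment  # the RunnerDeployment to scale
      minReplicas: 0                     # no idle runners while nothing is queued
      maxReplicas: 10                    # the "n" above
      scaleUpTriggers:
        - githubEvent:
            workflowJob: {}              # scale up on workflow_job webhook events
          amount: 1                      # add one runner per queued job
          duration: "5m"                 # how long the added capacity stays reserved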

mumoshu (Collaborator) commented Jan 11, 2022

@achebel Thanks. Yep, that's the expected behavior. It will scale down soon after you get the "completed" workflow_job webhook event, so there will be no long-hanging runner/pod left anyway.

mumoshu (Collaborator) commented Jan 11, 2022

Also, a capacity reservation eventually expires, so it won't result in redundant runners taking up your cluster resources for long, even if ARC somehow failed to receive a workflow_job event with "completed" status.

achebel (Author) commented Jan 11, 2022

Yep, this is what we understood when setting up durations and scaleDownDelaySecondsAfterScaleOut.
To be honest, what we would prefer is the capability to disable this capacity reservation in order to provision a pod only on demand.

mumoshu (Collaborator) commented Jan 11, 2022

@achebel I hear you, but I'm not eager to add that myself, because a webhook is definitely not a reliable communication mechanism, and it can easily miss the "demand" when something is wrong on GitHub's side, in the network infrastructure in between, in your K8s cluster, in your load balancer, or finally in actions-runner-controller's webhook server.

If you used your "preferred" method, every missed webhook event could increase the discrepancy between the desired and the current state, which isn't what you'd want.

I can't stop thinking that you're considering it too easy. It's not.

I don't want to spend my spare time developing a feature that doesn't pay off and only helps you shoot yourself in the foot.

If you're going to contribute it, though, I'll definitely review it. But you'd need thorough documentation noting every gotcha.

toast-gear (Collaborator) commented Jan 11, 2022

If you're going to contribute it, though, I'll definitely review it. But you'd need thorough documentation noting every gotcha.

If the feature does get developed by someone, it definitely should not be the default behaviour, as I imagine that would open up a floodgate of support issues. The default should be as it is.

@achebel Whilst I understand why your described behaviour logically makes sense (assuming you never have any connectivity issues, which seems very optimistic to me, but perhaps someone would be OK with hitting the re-run button every now and then when those crop up), why would you want this behaviour to begin with?

The only reason I can think of is perhaps security. If that is the driver, then ARC, to me, is the wrong place to be putting that security. GitHub should be ensuring workflow runs can only run against runners with that label combo and / or runners attached to the runner groups associated with those repositories, at which point whether runners hang around or not shouldn't matter. If the runner has a powerful role attached and you need to protect those runners, the runner should be at the repo level or in a runner group. Labels are client-side and can be changed in a PR, and that change will be respected; as a result they don't provide any protection whatsoever and so should not be relied upon for any level of security.

The only other reason I can think of is so you can aggressively spin down nodes. However, scaling nodes is a slow process, so again, this doesn't seem appropriate either.

mumoshu (Collaborator) commented Jan 11, 2022

Rephrasing my previous statement: if you have an enterprise-ish requirement that forces you to run only the necessary number of runners, it can't be implemented reliably today, due to missing GitHub APIs. You'd better ask GitHub about that first.

More concretely, there should be something like an "assign this runner to that workflow job" API and a "list all workflow jobs that no runners have been assigned to yet" API. With those two APIs, it would be straightforward for us to keep only the necessary number of runners.

genisd (Contributor) commented Jan 11, 2022

We should bug GitHub for those API endpoints 🐛 😂
They can display it to us on the page, so you'd think it should be doable.

You can try using very low capacity reservation times (I have), but then all the issues @mumoshu describes come into play. Starvation can happen, but I think runners which are running a job will in fact continue running. I have not yet witnessed API calls getting lost on GKE.

It's something to look out for in the future hopefully 👀

achebel (Author) commented Jan 12, 2022

Thank you all for your feedback.
Actually, what we expected is a behavior similar to the Kubernetes plugin for Jenkins:
https://plugins.jenkins.io/kubernetes/

toast-gear (Collaborator) commented Jan 12, 2022

@achebel Can you explain why, though? I understand the behaviour you were expecting, but not a reason for it beyond "because reasons". At the moment I'm struggling to see an actual issue with the current behaviour other than that it doesn't feel intuitive to you. Have you got a more specific reason why the current behaviour doesn't work for you? I think it's probably worth adding a bit more to the docs to make it clearer how the current logic works, as I can see how it might not seem as logical as the behaviour you've outlined without thinking more deeply on it.

EricDales commented:

Rephrasing my previous statement: if you have an enterprise-ish requirement that forces you to run only the necessary number of runners, it can't be implemented reliably today, due to missing GitHub APIs. You'd better ask GitHub about that first.

More concretely, there should be something like an "assign this runner to that workflow job" API and a "list all workflow jobs that no runners have been assigned to yet" API. With those two APIs, it would be straightforward for us to keep only the necessary number of runners.

@mumoshu, why would you need such an API?
If you apply an event-driven strategy, all you need to know is that there's a new workflow job happening. That means that, in response to that event, you need to instantiate a resource (a new runner). And that's it; everything else is taken care of by GitHub: job assignment and runner shutdown (thanks to the ephemeral option).

genisd (Contributor) commented Jan 12, 2022

At the moment the runners are pods in the k8s context.
So they are meant to keep running (from a Kubernetes point of view); Kubernetes thinks of them as a hosted "service/instance" (edit: easily fixed by using k8s Jobs).

There are other difficulties. For example, we ourselves just started using this software, and I have a pipeline which needs some 22 runners, but I consistently get only 20 or so.
I still need to triage/understand why that happens.

Yet right now it's not a big issue, because many of those jobs are relatively fast (no changes -> exit) and the waiting jobs get picked up.

From what I can tell, GitHub webhooks are quite reliable.

But even issues like the one I'm describing would benefit from an API to check "reality" from GitHub's point of view, just to safeguard against missed webhooks and starvation. Without such an API, using a purely event-driven approach, I couldn't be using this software [edit: right now].

Stuff happens; sometimes webhooks don't arrive, and it doesn't matter whose fault it is (it could of course be GitHub's, a third party's in between, or your/our own).

Not having enough workers and having to babysit this system "hands on" would not be good, I think.

Using a low capacity reservation time and no scale-down backoff does pretty much what you describe, I think.
You can try it.

mumoshu (Collaborator) commented Jan 12, 2022

@mumoshu, why would you need such an API?

@EricDales In other words, that's because GitHub webhooks (or any sort of webhook in general) and many implementations of webhook servers are based on the implicit assumption that event delivery is guaranteed, which isn't true.

If you apply an event-driven strategy, all you need to know is that there's a new workflow job happening. That means that, in response to that event, you need to instantiate a resource (a new runner).

That works only when your event source is durable and consistent and you can freely replay the history. Webhooks aren't.

EricDales commented:

@EricDales In other words, that's because GitHub webhooks (or any sort of webhook in general) and many implementations of webhook servers are based on the implicit assumption that event delivery is guaranteed, which isn't true.

@mumoshu, first I would like to warmly thank you for taking the time to answer; I really appreciate this discussion.

I confirm that I want to rely on webhooks and ephemeral mode. I understand that actions-runner-controller is intended to work in different contexts (high-latency networks, ...), but if you consider webhook delivery unsafe, why did you implement webhook scaling?

Today our company runs more than 50k Jenkins builds daily, relying on webhook deliveries, and 100% of webhooks are delivered. And even if there were some failures, applying SRE rules, a 5% error budget would still be acceptable.

On a side note, a great improvement from GitHub would be some red light in the right column of a repo page warning that a webhook hasn't been delivered, plus a button to redeliver the payload (which is actually possible by digging into the repository settings, but a bit cumbersome). That could also be a commit status. @Link-, can you please submit that to the product management team?

mumoshu (Collaborator) commented Jan 13, 2022

why did you implement webhook scaling?

@EricDales To make it scale quickly in case it does receive the webhook event successfully. Not being 100% guaranteed doesn't mean we can't utilize it at all. 99% guaranteed means we have to design it so that the 1% of failures don't break it forever.

Today our company runs more than 50k Jenkins builds daily, relying on webhook deliveries, and 100% of webhooks are delivered

I hear you. But that doesn't mean it "won't" fail in the future, right? That's my point.

BTW, in case you're fine assuming it's delivered 100% of the time, that's OK. You can just make the capacity reservation expiration very short in your RunnerDeployment spec. Then you'll mostly get what you want.
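
For example, something along these lines (a rough sketch rather than a tested config; with these field names the reservation lifetime is the scale-up trigger's duration, and exact fields may vary by version):

    spec:
      scaleDownDelaySecondsAfterScaleOut: 0  # no extra delay before scaling back down
      scaleUpTriggers:
        - githubEvent:
            workflowJob: {}
          amount: 1
          duration: "1m"                     # reservation expires quickly if unused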

toast-gear (Collaborator) commented Jan 13, 2022

I think what is getting lost in this issue is that the solution is primarily aimed at github.com users. The docs do say https://github.com/actions-runner-controller/actions-runner-controller#github-enterprise-support:

Note: The repository maintainers do not have an enterprise environment (cloud or server). 
Support for the enterprise specific feature set is community driven and on a best effort basis. 
PRs from the community are welcomed to add features and maintain support.

github.com regularly has outages, with Actions impacted at least once a month or more (https://www.githubstatus.com/history). Often when these outages hit, webhooks do fail. I'm sure webhooks will be very reliable within a self-hosted GHES environment, but that isn't the environment primarily targeted.

I'd ask again, though: what is the technical reasoning for wanting zero slack?

BTW, in case you're fine assuming it's delivered 100% of the time, that's OK. You can just make the capacity reservation expiration very short in your RunnerDeployment spec. Then you'll mostly get what you want.

I would have a play with the reservation expiration settings first and see if you can achieve what you want. We can probably do a better job of highlighting this sort of configuration in the docs, so we'll look at updating them.

Link- (Member) commented Jan 13, 2022

👋 everyone,

@mumoshu @toast-gear your and the other maintainers' time on this thread is invaluable, thank you 🙇‍♂️

@EricDales there's already a view in GitHub Apps as well as on the Enterprise level indicating the state of webhook deliveries. Webhooks can be redelivered with the exact same payload on demand in case of a failure, both on GitHub.com and GHES.

I believe the reasons for wanting 0 slack are:

  1. Minimise resource consumption by idle runners. For small setups that's fine, but when we scale to 1,000s or 10,000s of runners that adds up, and the waste becomes substantial even for small durations.
  2. (Ideally) it's not necessary, from an architectural perspective, since the runner service terminates as soon as the job run is completed with the --ephemeral flag (see the sketch after this list).
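
For context, a minimal RunnerDeployment using the ephemeral option might look like this (a sketch only; "your-org" is a placeholder and exact field support depends on the controller version):

    apiVersion: actions.summerwind.dev/v1alpha1
    kind: RunnerDeployment
    metadata:
      name: example-runner-deployment
    spec:
      template:
        spec:
          organization: your-org  # placeholder organization
          ephemeral: true         # the runner unregisters and its pod exits after one job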

However, I also understand and empathise with @mumoshu's and @toast-gear's perspective, especially after having a look at the implementation. Irrespective of GitHub's SLAs and service reliability, webhooks are not transactional and are stateless by design. They can be looked at as part of an eventual consistency model. This controller has been implemented to guarantee this eventual consistency through scheduled reconciliation.

Changing this behaviour could lead to inconsistencies because the desired state cannot be achieved, and I understand if @mumoshu and the team want to optimise for state integrity at the expense of resource cost.

It would be interesting to explore whether there is a model that can provide us with the best of both worlds.

Also, I think it'll be great if we can determine whether the limitations are:

  1. Lack of resources (time, effort, contributors, financial) to revisit the implementation
  2. Technical feasibility (the model just doesn't work that way)

If it's #1 there could be ways around it, if it's #2 I'm afraid we have to either accept the current state or look for alternative solutions.

mumoshu (Collaborator) commented Feb 18, 2022

Linking #911 (comment)

mumoshu (Collaborator) commented Feb 18, 2022

Also, I think it'll be great if we can determine whether the limitations are:

  1. Lack of resources (time, effort, contributors, financial) to revisit the implementation
  2. Technical feasibility (the model just doesn't work that way)

Maybe both? 😅

For me, it's 1 because of 2. I don't believe it's feasible, and I'm not willing to spend my spare time just to prove its infeasibility. But if anyone actually tries to implement it, I'd happily advise or review.

mumoshu (Collaborator) commented Feb 20, 2022

@achebel @genisd @Link- @EricDales @toast-gear Hey everyone! I think I've managed to improve ephemeral runners to scale on webhooks much as you expected.

Please read my lengthy comment in #911 (comment).

At least it's very unlikely that you'll see the behavior below in the next version of ARC.

However, a few seconds later:
5. A new runner is provisioned based on the capacity reservation => not what we expect
Unfortunately, it seems that we are not able to disable this behavior?

Still, keep in mind that webhook delivery is not 100% guaranteed, so you'd better keep minReplicas at 1 instead of setting it to 0 if your workflow is rarely run but important.

genisd (Contributor) commented Feb 22, 2022

I'm testing current master right now (with webhooks only)

achebel (Author) commented Feb 22, 2022 via email

genisd (Contributor) commented Feb 22, 2022

Seems to be working fine here with webhooks only (minimum set to 1, not 0).
And eh, just to be 1000% sure: I've only updated the actions-runner-controller deployment image, not the actions-runner-controller-github-webhook-server (perhaps I should?).

I noticed one Kubernetes configuration issue, which might be slightly related and is in fact easily fixed:
#1144

mumoshu (Collaborator) commented Feb 22, 2022

It would be great if you could test #1127, which isn't merged to master yet 😄

stale bot commented Mar 25, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Mar 25, 2022
toast-gear added the pinned (Misc issues / PRs we want to keep around) label on Apr 6, 2022
toast-gear removed the stale label on Apr 6, 2022
mumoshu (Collaborator) commented Apr 10, 2022

EffectiveTime, added in v0.22.0 as outlined in #911 (comment), should fix the original problem reported in this issue.

@toast-gear Can we close this as resolved?

toast-gear (Collaborator) commented:

Definitely
