Ephemeral mode with runner Controller does not work as expected #1044

Closed
achebel opened this issue Jan 11, 2022 · 28 comments
Labels
pinned Misc issues / PRs we want to keep around

Comments

achebel commented Jan 11, 2022

Describe the bug
After triggering a job:

  1. The runner is started => OK, as expected
  2. It registered itself to the organization (the runner was in idle status) => as expected
  3. GitHub looked for an online and idle runner that matched the job's runs-on label (the runner's status switched from idle to active) => as expected
  4. After running the job, the runner was automatically unregistered from our GitHub instance and automatically shut down => OK, as expected

However, a few seconds later:

  5. A new runner is provisioned based on the capacity reservation => not what we expect

Unfortunately, it seems that we are not able to disable this behavior?

Checks

  • My actions-runner-controller version (v0.x.y) does support the feature
  • I'm using an unreleased version of the controller I built from HEAD of the default branch

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
When a job is handled, the pod is terminated and no further pod is instantiated unless a new job is triggered.

Screenshots
If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

  • Controller Version: 0.20.2
  • Deployment Method: Helm and Kustomize
  • Helm Chart Version: 0.13.2

Additional context
Add any other context about the problem here.

mumoshu (Collaborator) commented Jan 11, 2022

@achebel This is expected behavior. As there's really no guarantee that we receive a GitHub webhook for a queued job (which triggers scale up) and for a completed job (which triggers scale down), we can't rely solely on those triggers for scaling. Instead, we only use those webhook events to add and remove capacity reservations, which results in this behavior.

Do you have any specific issue due to that behavior?

achebel (Author) commented Jan 11, 2022

No specific issue, but what we expected is simply to instantiate a pod when a job is triggered and terminate the pod when the job is handled, without instantiating a new pod when no webhook event has arrived.
We were targeting autoscaling with webhook-driven scaling and
minReplicas: 0
maxReplicas: n
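
For reference, the kind of setup we are aiming at would look roughly like the sketch below (a sketch only; "example-runner-deployment" is a placeholder name and the exact field names may differ between ARC versions):

    apiVersion: actions.summerwind.dev/v1alpha1
    kind: HorizontalRunnerAutoscaler
    metadata:
      name: example-autoscaler
    spec:
      scaleTargetRef:
        name: example-runner-deployment  # the RunnerDeployment to scale
      minReplicas: 0                     # no idle runners while nothing is queued
      maxReplicas: 10                    # the "n" above
      scaleUpTriggers:
        - githubEvent:
            workflowJob: {}              # scale up on workflow_job webhook events
          amount: 1                      # add one runner per queued job
          duration: "5m"                 # how long the added capacity stays reserved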

mumoshu (Collaborator) commented Jan 11, 2022

@achebel Thanks. Yep, that's the expected behavior. It will scale down soon after you get the "completed" workflow_job webhook event, so there will be no long-hanging runner/pod left anyway.

mumoshu (Collaborator) commented Jan 11, 2022

Also, a capacity reservation eventually expires, so it won't result in redundant runners taking up your cluster resources for long, even if ARC somehow failed to receive a workflow_job event with "completed" status.

achebel (Author) commented Jan 11, 2022

Yep, this is what we understood when setting up durations and scaleDownDelaySecondsAfterScaleOut.
To be honest, what we would prefer is the capability to disable this capacity reservation in order to provision a pod only on demand.

mumoshu (Collaborator) commented Jan 11, 2022

@achebel I hear you, but I'm not eager to add that myself, because a webhook is definitely not a reliable communication mechanism, and it can easily miss the "demand" when something is wrong on GitHub's side, in the network infrastructure in between, in your K8s cluster, in your load balancer, or finally in actions-runner-controller's webhook server.

If you used your "preferred" method, every missed webhook event could increase the discrepancy between the desired and the current state, which isn't what you'd want.

I can't stop thinking that you're considering it too easy. It's not.

I don't want to spend my spare time developing a feature that doesn't pay off and only helps you shoot yourself in the foot.

If you're going to contribute it, though, I'll definitely review it. But you'd need thorough documentation noting every gotcha.

toast-gear (Collaborator) commented Jan 11, 2022

If you're going to contribute it, though, I'll definitely review it. But you'd need thorough documentation noting every gotcha.

If the feature does get developed by someone, it definitely should not be the default behaviour, as I imagine that would open up a floodgate of support issues. The default should be as it is.

@achebel Whilst I understand why your described behaviour logically makes sense (assuming you never have any connectivity issues, which seems very optimistic to me, but perhaps someone would be OK with hitting the re-run button every now and then when those crop up), why would you want this behaviour to begin with?

The only reason I can think of is perhaps security. If that is the driver, then ARC, to me, is the wrong place to be putting that security. GitHub should be ensuring workflow runs can only run against runners with that label combo and / or runners attached to the runner groups associated with those repositories, at which point whether runners hang around or not shouldn't matter. If the runner has a powerful role attached and you need to protect those runners, the runner should be at the repo level or in a runner group. Labels are client-side and can be changed in a PR, and that change will be respected; as a result they don't provide any protection whatsoever and so should not be relied upon for any level of security.

The only other reason I can think of is so you can aggressively spin down nodes. However, scaling nodes is a slow process, so again, this doesn't seem appropriate either.

mumoshu (Collaborator) commented Jan 11, 2022

Rephrasing my previous statement: if you have an enterprise-ish requirement that forces you to run only the necessary number of runners, it can't be implemented reliably today, due to missing GitHub APIs. You'd better ask GitHub about that first.

More concretely, there should be something like an "assign this runner to that workflow job" API and a "list all workflow jobs that no runners have been assigned to yet" API. With those two APIs, it would be straightforward for us to keep only the necessary number of runners.

genisd (Contributor) commented Jan 11, 2022

We should bug GitHub for those API endpoints 🐛 😂
They can display it to us on the page, so you'd think it should be doable.

You can try using very low capacity reservation times (I have), but then all the issues @mumoshu describes come into play. Starvation can happen, but I think runners which are running a job will in fact continue running. I have not yet witnessed API calls getting lost on GKE.

It's something to look out for in the future hopefully 👀

achebel (Author) commented Jan 12, 2022

Thank you all for your feedback.
Actually, what we expected is a behavior similar to the Kubernetes plugin for Jenkins:
https://plugins.jenkins.io/kubernetes/

toast-gear (Collaborator) commented Jan 12, 2022

@achebel Can you explain why, though? I understand the behaviour you were expecting, but not a reason for it beyond "because reasons". At the moment I'm struggling to see an actual issue with the current behaviour other than that it doesn't feel intuitive to you. Have you got a more specific reason why the current behaviour doesn't work for you? I think it's probably worth adding a bit more to the docs to make it clearer how the current logic works, as I can see how it might not seem as logical as the behaviour you've outlined without thinking more deeply on it.

EricDales commented:

Rephrasing my previous statement: if you have an enterprise-ish requirement that forces you to run only the necessary number of runners, it can't be implemented reliably today, due to missing GitHub APIs. You'd better ask GitHub about that first.

More concretely, there should be something like an "assign this runner to that workflow job" API and a "list all workflow jobs that no runners have been assigned to yet" API. With those two APIs, it would be straightforward for us to keep only the necessary number of runners.

@mumoshu, why would you need such an API?
If you apply an event-driven strategy, all you need to know is that there's a new workflow job happening. That means that, in response to that event, you need to instantiate a resource (a new runner). And that's it; everything else is taken care of by GitHub: job assignment and runner shutdown (thanks to the ephemeral option).

genisd (Contributor) commented Jan 12, 2022

At the moment the runners are pods in the k8s context.
So they are meant to keep running (from a Kubernetes point of view); Kubernetes thinks of them as a hosted "service/instance" (edit: easily fixed by using k8s Jobs).

There are other difficulties. For example, we ourselves just started using this software, and I have a pipeline which needs some 22 runners, but I consistently get only 20 or so.
I still need to triage/understand why that happens.

Yet right now it's not a big issue, because many of those jobs are relatively fast (no changes -> exit) and the waiting jobs get picked up.

From what I can tell, GitHub webhooks are quite reliable.

But even issues like the one I'm describing would benefit from an API to check "reality" from GitHub's point of view, just to safeguard against missed webhooks and starvation. Without such an API, using a purely event-driven approach, I couldn't be using this software [edit: right now].

Stuff happens; sometimes webhooks don't arrive, and it doesn't matter whose fault it is (it could of course be GitHub's, a third party's in between, or your/our own).

Not having enough workers and having to babysit this system "hands on" would not be good, I think.

Using a low capacity reservation time and no scale-down backoff does pretty much what you describe, I think.
You can try it.

mumoshu (Collaborator) commented Jan 12, 2022

@mumoshu, why would you need such an API?

@EricDales In other words, that's because GitHub webhooks (or any sort of webhook in general) and many implementations of webhook servers are based on the implicit assumption that event delivery is guaranteed, which isn't true.

If you apply an event-driven strategy, all you need to know is that there's a new workflow job happening. That means that, in response to that event, you need to instantiate a resource (a new runner).

That works only when your event source is durable and consistent and you can freely replay the history. Webhooks aren't.

EricDales commented:

@EricDales In other words, that's because GitHub webhooks (or any sort of webhook in general) and many implementations of webhook servers are based on the implicit assumption that event delivery is guaranteed, which isn't true.

@mumoshu, first I would like to warmly thank you for taking the time to answer; I really appreciate this discussion.

I confirm that I want to rely on webhooks and ephemeral mode. I understand that actions-runner-controller is intended to work in different contexts (high-latency networks, ...), but if you consider webhook delivery unsafe, why did you implement webhook scaling?

Today our company runs more than 50k Jenkins builds daily, relying on webhook deliveries, and 100% of webhooks are delivered. And even if there were some failures, applying SRE rules, a 5% error budget would still be acceptable.

On a side note, a great improvement from GitHub would be some red light in the right column of a repo page warning that a webhook hasn't been delivered, plus a button to redeliver the payload (which is actually possible by digging into the repository settings, but a bit cumbersome). That could also be a commit status. @Link-, can you please submit that to the product management team?

mumoshu (Collaborator) commented Jan 13, 2022

why did you implement webhook scaling?

@EricDales To make it scale quickly in case it does receive the webhook event successfully. Not being 100% guaranteed doesn't mean we can't utilize it at all. 99% guaranteed means we have to design it so that the 1% of failures don't break it forever.

Today our company runs more than 50k Jenkins builds daily, relying on webhook deliveries, and 100% of webhooks are delivered

I hear you. But that doesn't mean it "won't" fail in the future, right? That's my point.

BTW, in case you're fine assuming it's delivered 100% of the time, that's OK. You can just make the capacity reservation expiration very short in your RunnerDeployment spec. Then you'll mostly get what you want.
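
For example, something along these lines (a rough sketch rather than a tested config; with these field names the reservation lifetime is the scale-up trigger's duration, and exact fields may vary by version):

    spec:
      scaleDownDelaySecondsAfterScaleOut: 0  # no extra delay before scaling back down
      scaleUpTriggers:
        - githubEvent:
            workflowJob: {}
          amount: 1
          duration: "1m"                     # reservation expires quickly if unused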

toast-gear (Collaborator) commented Jan 13, 2022

I think what is getting lost in this issue is that the solution is primarily aimed at github.com users. The docs do say https://github.com/actions-runner-controller/actions-runner-controller#github-enterprise-support:

Note: The repository maintainers do not have an enterprise environment (cloud or server). 
Support for the enterprise specific feature set is community driven and on a best effort basis. 
PRs from the community are welcomed to add features and maintain support.

github.com regularly has outages, with Actions impacted at least once a month or more (https://www.githubstatus.com/history). Often when these outages hit, webhooks do fail. I'm sure webhooks will be very reliable within a self-hosted GHES environment, but that isn't the environment primarily targeted.

I'd ask again, though: what is the technical reasoning for wanting zero slack?

BTW, in case you're fine assuming it's delivered 100% of the time, that's OK. You can just make the capacity reservation expiration very short in your RunnerDeployment spec. Then you'll mostly get what you want.

I would have a play with the reservation expiration settings first and see if you can achieve what you want. We can probably do a better job of highlighting this sort of configuration in the docs, so we'll look at updating them.

Link- (Member) commented Jan 13, 2022

👋 everyone,

@mumoshu @toast-gear your and the other maintainers' time on this thread is invaluable, thank you 🙇‍♂️

@EricDales there's already a view in GitHub Apps as well as on the Enterprise level indicating the state of webhook deliveries. Webhooks can be redelivered with the exact same payload on demand in case of a failure, both on GitHub.com and GHES.

I believe the reasons for wanting 0 slack are:

  1. Minimise resource consumption by idle runners. For small setups that's fine, but when we scale to 1,000s or 10,000s of runners that adds up, and the waste becomes substantial even for small durations.
  2. (Ideally) it's not necessary, from an architectural perspective, since the runner service terminates as soon as the job run is completed with the --ephemeral flag (see the sketch after this list).
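
For context, a minimal RunnerDeployment using the ephemeral option might look like this (a sketch only; "your-org" is a placeholder and exact field support depends on the controller version):

    apiVersion: actions.summerwind.dev/v1alpha1
    kind: RunnerDeployment
    metadata:
      name: example-runner-deployment
    spec:
      template:
        spec:
          organization: your-org  # placeholder organization
          ephemeral: true         # the runner unregisters and its pod exits after one job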

However, I also understand and empathise with @mumoshu's and @toast-gear's perspective, especially after having a look at the implementation. Irrespective of GitHub's SLAs and service reliability, webhooks are not transactional and are stateless by design. They can be looked at as part of an eventual consistency model. This controller has been implemented to guarantee this eventual consistency through scheduled reconciliation.

Changing this behaviour could lead to inconsistencies because the desired state cannot be achieved, and I understand if @mumoshu and the team want to optimise for state integrity at the expense of resource cost.

It would be interesting to explore whether there is a model that can provide us with the best of both worlds.

Also, I think it'll be great if we can determine whether the limitations are:

  1. Lack of resources (time, effort, contributors, financial) to revisit the implementation
  2. Technical feasibility (the model just doesn't work that way)

If it's #1 there could be ways around it, if it's #2 I'm afraid we have to either accept the current state or look for alternative solutions.

mumoshu (Collaborator) commented Feb 18, 2022

Linking #911 (comment)

mumoshu (Collaborator) commented Feb 18, 2022

Also, I think it'll be great if we can determine whether the limitations are:

  1. Lack of resources (time, effort, contributors, financial) to revisit the implementation
  2. Technical feasibility (the model just doesn't work that way)

Maybe both? 😅

For me, it's 1 because of 2. I don't believe it's feasible, and I'm not willing to spend my spare time just to prove its infeasibility. But if anyone actually tries to implement it, I'd happily advise or review.

mumoshu (Collaborator) commented Feb 20, 2022

@achebel @genisd @Link- @EricDales @toast-gear Hey everyone! I think I've managed to improve ephemeral runners to scale on webhooks much as you expected.

Please read my lengthy comment in #911 (comment).

At least it's very unlikely that you'll see the behavior below in the next version of ARC.

However, a few seconds later:
5. A new runner is provisioned based on the capacity reservation => not what we expect
Unfortunately, it seems that we are not able to disable this behavior?

Still, keep in mind that webhook delivery is not 100% guaranteed, so you'd better keep minReplicas at 1 instead of setting it to 0 if your workflow is rarely run but important.

genisd (Contributor) commented Feb 22, 2022

I'm testing current master right now (with webhooks only)

achebel (Author) commented Feb 22, 2022 via email

genisd (Contributor) commented Feb 22, 2022

Seems to be working fine here with webhooks only (minimum set to 1, not 0).
And eh, just to be 1000% sure: I've only updated the actions-runner-controller deployment image, not the actions-runner-controller-github-webhook-server (perhaps I should?).

I noticed one Kubernetes configuration issue, which might be slightly related and is in fact easily fixed:
#1144

mumoshu (Collaborator) commented Feb 22, 2022

It would be great if you could test #1127, which isn't merged to master yet 😄

stale bot commented Mar 25, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Mar 25, 2022
toast-gear added the pinned (Misc issues / PRs we want to keep around) label on Apr 6, 2022
toast-gear removed the stale label on Apr 6, 2022
mumoshu (Collaborator) commented Apr 10, 2022

EffectiveTime, added in v0.22.0 as outlined in #911 (comment), should fix the original problem reported in this issue.

@toast-gear Can we close this as resolved?

toast-gear (Collaborator) commented:

Definitely
