Draining nodes without interrupting busy runners #643
Comments
#470 (comment): I guess it's worth mentioning that in theory (but not guaranteed), quite soon jobs won't fail if there isn't a runner matching the specified labels (this will allow us to scale from 0 without a fake runner). Once that is here, you could in theory just set up a scheduled override, wait for the runners to naturally hit 0, and then do what you need to do? Doesn't help you now, obviously :D
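For reference, a scheduled override on the HorizontalRunnerAutoscaler would look roughly like this (a sketch from memory of the ARC docs, so double-check the field names; the resource names and dates are placeholders):

```sh
# Scale the runner deployment to zero during a planned maintenance window.
cat <<'EOF' | kubectl apply -f -
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-autoscaler
spec:
  scaleTargetRef:
    name: example-runnerdeployment
  minReplicas: 1
  maxReplicas: 10
  scheduledOverrides:
  # During this window no new runners are created; existing ones wind down.
  - startTime: "2021-08-01T00:00:00+00:00"
    endTime: "2021-08-01T06:00:00+00:00"
    minReplicas: 0
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: "0.75"
    scaleDownThreshold: "0.25"
EOF
```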
Hmm, that does sound interesting. I think we would still need a way to wait out in-flight jobs, but perhaps then we can even accomplish this without full horizontal autoscaler support: set up a scheduled override, wait for the runners to naturally hit zero, then do the maintenance.
A bit hacky though... would definitely have to run inside a maintenance window. I'd still like to see some safe drain support, because in general we'd like to run updates outside of maintenance windows (as our previous CI system did) or be able to take a node out of service at any time.
We're on GHES, so see you in a year or two 🤣.
Yeah, you'd need to stop new pods being scheduled and nodes being scaled out, but you can do that via regular kubectl commands, no code to be maintained :D
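Something like this, for example (the node name is a placeholder; cordoning only prevents new pods from being scheduled, it doesn't evict anything, and pausing node scale-out would be handled on the cluster-autoscaler side, e.g. by capping the node group size):

```sh
# Mark the node unschedulable so no new runner pods land on it;
# pods already running there (and their jobs) are left alone.
kubectl cordon <node-name>

# Once the node has no busy runners left, evict what remains and take it out of service.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Put the node back into rotation after maintenance.
kubectl uncordon <node-name>
```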
You could almost certainly do something with the scheduling feature and tolerations + taints to do a blue/green style rollover of runners, allowing for a seamless upgrade without the need for a window.
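As a very rough sketch of that blue/green idea (the pool label and taint key below are made up; the new runners' pod template would need a matching toleration or a nodeSelector for the new pool):

```sh
# Keep new runner pods off the old (blue) pool without touching running ones.
kubectl taint nodes -l pool=blue runner-rollover=true:NoSchedule

# Roll out the upgraded runners targeting the new (green) pool; jobs already
# executing on the blue pool keep running to completion.

# Once the blue pool has no busy runners left, drain and retire those nodes.
for node in $(kubectl get nodes -l pool=blue -o jsonpath='{.items[*].metadata.name}'); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done
```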
Definitely feasible... the only thing I'm missing is a way to gracefully terminate a runner (i.e. don't terminate if it's running a workflow). I suppose the root of the issue is that if the runner is given SIGTERM/SIGINT, it doesn't wait for the currently running job to finish. If the runner did have that guardrail, then a plain `kubectl drain` (with a long enough grace period) would probably just work.

Having typed this out just now, it occurs to me that it's probably better to address this pretty serious shortcoming in https://github.com/actions/runner itself rather than add to your maintenance burden :). On my end I'll probably still need a quick-and-dirty CLI to do the strategy I outlined in the OP, but I'll look into seeing what it takes to patch actions/runner instead.

I wonder if there's even some way we can probe whether the actions process is busy from the runner end and then set a preStop hook (querying the API, reading process memory/fds, etc.). Thanks for talking this through with me!
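For example, if it turns out the runner only spawns a `Runner.Worker` process while a job is executing (an assumption that would need verifying), the preStop hook could simply wait for that process to go away:

```sh
#!/usr/bin/env bash
# Hypothetical preStop hook: block pod termination while a job is in flight.
# Assumes pgrep is available in the runner image and that a Runner.Worker
# process only exists while a workflow job is actually running.
while pgrep -f Runner.Worker > /dev/null; do
  echo "runner is busy, delaying shutdown..."
  sleep 10
done
```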
@inahga Hey! Yeah, I thought the same thing and read the actions/runner code a bit. My conclusion at the time (a few months ago) was that there's no hook or any interface for that. So the only potential solution would be to programmatically unregister the runner at the very end of the workflow, if that doesn't break the running workflow itself.

Even if that's possible, all the changes you'd need would reside in your own Actions workflow definitions rather than in actions-runner-controller. That's why I hadn't put further effort into this; I thought I'd better focus on actions-runner-controller itself, assuming that's what people expect me to do.
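For example, the very last step of a workflow could try something like this (untested; it assumes the runner name matches the pod hostname and a token with admin scope, and OWNER/REPO are placeholders):

```sh
# Hypothetical final workflow step: look up this runner's ID by name, then
# unregister it. Whether this is safe to do from inside a still-running job
# is exactly the open question above.
RUNNER_ID=$(gh api "repos/OWNER/REPO/actions/runners" \
  --jq ".runners[] | select(.name == \"$(hostname)\") | .id")
gh api -X DELETE "repos/OWNER/REPO/actions/runners/${RUNNER_ID}"
```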
Thanks for the heads up! I was leaning towards querying the GitHub API from the preStop hook, using an in-cluster API proxy to protect the API credentials. So the preStop hook would execute a script something like:

```sh
# The cluster-local gateway URL is a placeholder for some in-cluster proxy of
# the GitHub API; the idea is to poll it until it reports this runner as idle
# (the response format checked here is assumed).
until curl -fsS "http://cluster-local.api.gateway/status/$(hostname)" | grep -q '"busy": *false'; do
  sleep 5
done
```

I can't think of a reason for this not to work... but I have said that many times before when working with Kubernetes 🤣
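One caveat with any preStop-based approach: the hook only gets as much time as the pod's terminationGracePeriodSeconds (30 seconds by default), so the runner pods would also need a grace period comfortably longer than our longest jobs, otherwise the kubelet kills the container mid-wait anyway.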
Sorry for reviving this issue, but we need this (drain nodes without cancelling jobs) at my organization too. We would probably be okay with using container lifecycle hooks for this, if those were configurable.
@Natanande Hey! Thanks for the ideas.

Container lifecycles aren't configurable at the moment for Runner and RunnerDeployment. But the experimental RunnerSet (#629), which I suppose is becoming the standard for deploying runners with actions-runner-controller, lets you configure container lifecycles today. For Runner and RunnerDeployment, I'd appreciate it if you could submit a PR to add it to the RunnerSpec!

Out of curiosity, how would you use container lifecycles for this use-case? Are you thinking of deferring stopping the dockerd container until the runner stops with it? Or do you think you need to defer stopping the runner container itself with some kind of preStop hook?

Thanks for your support!
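For example, on a RunnerSet you could wire a wait loop like the one sketched above into the runner container's preStop hook, roughly like this (untested; the field layout is from memory of the RunnerSet docs, and all names are placeholders):

```sh
cat <<'EOF' | kubectl apply -f -
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example-runnerset
spec:
  replicas: 2
  repository: example-org/example-repo   # placeholder
  # RunnerSet embeds a StatefulSet-style spec, so selector/serviceName/template are required.
  selector:
    matchLabels:
      app: example-runnerset
  serviceName: example-runnerset
  template:
    metadata:
      labels:
        app: example-runnerset
    spec:
      terminationGracePeriodSeconds: 3600   # give long jobs time to finish
      containers:
      - name: runner
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              - "while pgrep -f Runner.Worker > /dev/null; do sleep 10; done"
EOF
```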
@Natanande I've done a similar extension of the API in #580, if you need help getting started. I just haven't done this myself because I am on vacation, so no business-related OSS contributions for me 😉. I can do it after I return if you're not comfortable doing it.
Not to derail the thread, but @mumoshu, are there any plans for RunnerSet to support ephemeral runners?
@inahga I have not yet tested it, but theoretically it should just work 😄 Would you mind giving it a shot? Also, ephemeral runners are fundamentally unreliable while being stopped. We need #470 (comment) to improve that. You might have already read it, but I thought it was worth mentioning.
For sure!
Yep, we have the risks documented. Given the size of our org, the number and complexity of the builds, and the fact that our previous CI had good support for ephemeral runners, we really do need ephemeral. The autoscaling issues we're not worried about, since we're not using it; the way our cluster is set up there is no cost-saving benefit to autoscaling, so we just run at capacity and ensure the number of runners reflects the peak workload. As for the issue with...
Is your feature request related to a problem? Please describe.
We'd like to be able to `kubectl drain` a node but allow running jobs to run until completion. This is so that we can trigger low-urgency tasks such as Kubernetes updates or node maintenance. We'd like this to be (semi-)automated, since generally our node updates are automated.

Describe the solution you'd like
A means of draining nodes without interrupting running jobs.

Describe alternatives you've considered
`kubectl drain` simply evicts the pods, unsurprisingly 😢. `kubectl cordon` is plausible, since runners won't be rescheduled once they're consumed (we're running in ephemeral mode). But this could take a very long time, since a job has to run on the runner first.

I was planning on making a separate CLI tool that would follow roughly this logic:
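Roughly, as a sketch (assuming the busy flag from GitHub's list self-hosted runners API, that runner pod names match the registered runner names, and with NODE, OWNER/REPO, and the namespace as placeholders):

```sh
#!/usr/bin/env bash
# Hypothetical drain helper: cordon a node, wait for its runners to go idle,
# then drain it.
set -euo pipefail
NODE="$1"

# 1. Stop new runner pods from landing on the node (running pods are untouched).
kubectl cordon "$NODE"

# 2. Wait until GitHub reports every runner pod on this node as idle.
#    (A label selector for runner pods would make this tighter.)
while true; do
  busy=0
  for pod in $(kubectl get pods -n actions-runner-system -o name \
                 --field-selector spec.nodeName="$NODE"); do
    name="${pod#pod/}"
    if gh api "repos/OWNER/REPO/actions/runners" \
         --jq ".runners[] | select(.name == \"${name}\") | .busy" | grep -q true; then
      busy=1
    fi
  done
  if [ "$busy" -eq 0 ]; then
    break
  fi
  sleep 30
done

# 3. Safe to evict whatever is left and take the node out of service.
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
```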
It's also possible that a mutating admission webhook could be placed on whatever API `kubectl drain` uses, which is probably a better solution, since then we could just use `kubectl drain` rather than a separate utility.

Additional context
Would be happy to work on this myself and contribute back (since I'll have to work on it anyway 😄). Just wanted to check first if there was something I was missing or if anyone had a better solution.