Draining nodes without interrupting busy runners #643

Closed
inahga opened this issue Jun 21, 2021 · 11 comments

@inahga
Contributor

inahga commented Jun 21, 2021

Is your feature request related to a problem? Please describe.
We'd like to be able to kubectl drain a node while allowing in-progress jobs to run to completion. This is so we can trigger low-urgency tasks such as Kubernetes updates or node maintenance. We'd like this to be (semi-)automated, since our node updates are generally automated.

Describe the solution you'd like
A means of draining nodes, without interrupting running jobs.

Describe alternatives you've considered
kubectl drain simply evicts the pods, unsurprisingly 😢.

kubectl cordon is plausible, since runners won't be rescheduled once they're consumed (we're running in ephemeral mode). But this could take a very long time, since a job has to run on the runner first.

I was planning on making a separate CLI tool that would follow roughly this logic:

cordon node
for len(runners) > 0 {
    for runner := range runners {
        if runner is not busy according to GH API {
            evict runner
            remove from runners
        }
    }
}
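
For illustration, the same loop could be sketched in shell with kubectl and the GitHub REST API (which exposes a busy flag per runner). Everything here is a placeholder sketch: the node name, token handling, OWNER/REPO, the app=runner label, and the assumption that each runner registers under its pod name.

#!/usr/bin/env bash
# GITHUB_TOKEN is expected in the environment.
set -euo pipefail

NODE="$1"
kubectl cordon "$NODE"

while true; do
  # Runner pods still on the cordoned node (label selector is assumed).
  pods=$(kubectl get pods -l app=runner \
    --field-selector "spec.nodeName=$NODE" -o name)
  [ -z "$pods" ] && break

  for pod in $pods; do
    name=$(basename "$pod")
    # Look up the runner's busy flag via the REST API.
    busy=$(curl -fsS -H "Authorization: token $GITHUB_TOKEN" \
      "https://api.github.com/repos/OWNER/REPO/actions/runners?per_page=100" \
      | jq -r --arg n "$name" '.runners[] | select(.name == $n) | .busy')
    # Evict only runners that GitHub reports as idle.
    if [ "$busy" = "false" ]; then
      kubectl delete "$pod"
    fi
  done
  sleep 30
done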

It's also possible that a mutating admission webhook could be placed on whatever API kubectl drain uses, which is probably a better solution since we could then just use kubectl drain rather than a separate utility.

Additional context
Would be happy to work on this myself and contribute back (since I'll have to work on it anyway 😄). Just wanted to check first if there was something I was missing or if anyone had a better solution.

@toast-gear
Collaborator

toast-gear commented Jun 21, 2021

#470 (comment) I guess it's worth mentioning that in theory (but not guaranteed), quite soon jobs won't fail if there isn't a runner matching the specified labels (this will allow us to scale from 0 without a fake runner). Once that is here, you could in theory just set up a scheduled override and wait for the runners to naturally hit 0, then do what you need to do. Doesn't help you now, obviously :D.

@inahga
Contributor Author

inahga commented Jun 21, 2021

Hmm, that does sound interesting. I think we'd still need to kubectl cordon the node, but at least we'd have fewer replicas sitting around waiting to be consumed.

Perhaps then we can even accomplish this without full horizontal autoscaler support:

cordon node
scale down workers to some low number, thus terminating excess runners
scale back up
wait for runners on that node to be zero
update node
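
In rough kubectl terms (the node name, RunnerDeployment name, and replica counts are placeholders, and this assumes replicas can be patched directly on the RunnerDeployment), those steps might look like:

kubectl cordon NODE

# Scale down so excess idle runners are terminated, then scale back up;
# new pods can only land on schedulable nodes.
kubectl patch runnerdeployment example-runnerdeploy --type merge -p '{"spec":{"replicas":2}}'
kubectl patch runnerdeployment example-runnerdeploy --type merge -p '{"spec":{"replicas":10}}'

# Wait until no runner pods remain on the cordoned node, then update it.
kubectl get pods -l app=runner --field-selector spec.nodeName=NODE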

A bit hacky though... would definitely have to run inside a maintenance window. I'd still like to see some safe drain support, because in general we'd like to run updates outside of maintenance windows (as our previous CI system did) or be able to take a node out of service at any time.

We plan to ship most of this by the end of the summer and we plan to also ship it in GHES after we test it on dotcom.

We're on GHES, so see you in a year or two 🤣.

@toast-gear
Collaborator

toast-gear commented Jun 21, 2021

I think we'd still need to kubectl cordon the node, but at least we'd have fewer replicas sitting around waiting to be consumed.

yeh you'd need to stop new pods being scheduled and nodes being scaled out but you can do that via regular kubectl commands, no code to be maintained :D

A bit hacky though... would definitely have to run inside a maintenance window

You could almost certainly do something with the scheduling feature and tolerations + taints to do a green / blue style rollover of runners allowing for a seamless upgrade without the need for a window

@inahga
Contributor Author

inahga commented Jun 21, 2021

You could almost certainly do something with the scheduling feature and tolerations + taints to do a green / blue style rollover of runners allowing for a seamless upgrade without the need for a window

Definitely feasible... the only thing I'm missing is a way to gracefully terminate a runner (i.e. don't terminate it if it's running a workflow).

I suppose the root of the issue is that if the runner is given SIGTERM/SIGINT, it doesn't wait for the currently running job to finish. If the runner did have that guardrail, then a NoExecute toleration could be set against some taint, with a reasonable grace period on the runner pod spec, and all would be well.
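
As a rough sketch of that flow (the taint key and value are made up): the node gets a NoExecute taint, and the runner pod spec would carry a matching toleration with tolerationSeconds plus a generous terminationGracePeriodSeconds so an in-flight job has time to finish once the pod is evicted.

kubectl taint nodes NODE maintenance=true:NoExecute
# Remove the taint once the node update is done.
kubectl taint nodes NODE maintenance=true:NoExecute-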

Having typed this out just now, it occurs to me that it's probably better to address the pretty serious shortcoming in https://github.com/actions/runner itself rather than add to your maintenance burden :).

On my end I'll probably still need a quick-and-dirty CLI to do the strategy I outlined in the OP, but I'll look into seeing what it takes to patch actions/runner instead. I wonder even if there's some way we can probe whether the actions process is busy from the runner end and then set a preStop hook (querying the API, reading process memory/fds, etc.).

Thanks for talking this through with me!

@inahga inahga closed this as completed Jun 21, 2021
@mumoshu
Collaborator

mumoshu commented Jun 22, 2021

I wonder even if there's some way we can probe whether the actions process is busy from the runner end

@inahga Hey! Yeah I thought the same thing and read actions/runner code a bit. My conclusion at the time (a few months ago) was that there's no hook or any interface for that.

So, the only potential solution would be to programmatically unregister the runner at the very end of the workflow, if that doesn't break the running workflow itself.

Even if that's possible, all the changes you need would reside in your own Actions workflow definition rather than in actions-runner-controller. That's why I haven't put further effort into this matter; I thought I'd better focus on actions-runner-controller itself, assuming that's what people expect me to do.

@inahga
Contributor Author

inahga commented Jun 22, 2021

Thanks for the heads up!

I was leaning toward querying the GH API from the preStop hook, using an in-cluster API proxy to protect the admin:enterprise token.

So the preStop hook would execute a script something like:

# Poll the in-cluster proxy until it reports this runner as idle
# (the proxy's response format here is assumed).
until curl -fsS "http://cluster-local.api.gateway/status/$(hostname)" | grep -q idle; do
  sleep 5
done

I can't think of a reason for this not to work... but I have said that many times before when working with kubernetes 🤣

@Natanande
Contributor

Sorry for reviving this issue, but we need this (draining nodes without cancelling jobs) at my organization too. We would probably be okay with using the ps aux | grep hack described here until they provide a better interface, or maybe even checking for the existence of _diag/Worker* files. But to do this we would need to be able to set container lifecycles, and probably terminationGracePeriodSeconds too, for the runner/docker containers. Is this something that could be implemented?
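
For illustration, a preStop check along those lines might look like the sketch below; it uses pgrep in place of ps aux | grep, and both the Runner.Worker process name and the _diag/Worker_* location are heuristics rather than a stable interface of actions/runner.

# Wait while a job still appears to be in progress.
while pgrep -f Runner.Worker > /dev/null; do
  sleep 5
done
# Alternative heuristic mentioned above: check for Worker_* diagnostic logs
# under the runner's _diag directory (path inside the container is assumed).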

@mumoshu
Collaborator

mumoshu commented Jul 9, 2021

@Natanande Hey! Thanks for the ideas.

terminationGracePeriodSeconds already exists in the Runner spec, so you can set whatever duration you need and the runner pod will inherit it (this applies to the pod, so you won't need the runner/docker container distinction for this).
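
For example, a hedged sketch of setting that on an existing RunnerDeployment (the resource name and duration are placeholders, and the field path assumes it sits under spec.template.spec):

kubectl patch runnerdeployment example-runnerdeploy --type merge \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":3600}}}}'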

Container lifecycles aren't configurable at the moment for Runner and RunnerDeployment. But the experimental RunnerSet (#629), which I suppose is becoming the standard for deploying runners with actions-runner-controller, allows you to configure container lifecycles today.

For Runner and RunnerDeployment, I'd appreciate it if you could submit a PR to add it to the RunnerSpec!

Out of curiosity, how would you use container lifecycles for this use-case? Are you wanting to defer stopping the dockerd container until the runner stops along with it?

Or do you think you need to defer stopping the runner container with some sleep preStop hook, so that it has enough time to finish? But I thought actions/runner was able to wait for running jobs to finish before it stops. I'm not exactly sure, because it isn't documented anywhere upstream. If you have insights and ideas around it, I'm very interested to hear them.

Thanks for your support!

@inahga
Contributor Author

inahga commented Jul 9, 2021

@Natanande I've done a similar extension of the API in #580, if you need help getting started. I just haven't done this myself because I am on vacation, so no business-related OSS contributions for me 😉. I can do it after I return if you're not comfortable with it.

ps aux | grep sounds like a wonderful simplification over polling the API. I probably should have thought of it 😆. Thanks for the tip, would love to test and see how it works out.

But the experimental RunnerSet (#629), which I suppose is becoming the standard for deploying runners with actions-runner-controller, allows you to configure container lifecycles today.

Not to derail the thread, but @mumoshu are there any plans for the RunnerSet to be extended to support ephemeral runners? I am secretly very excited about RunnerSet for its stronger API, but I can't use it in production without ephemeral support. Would be happy to test/contribute if needed (after vacation 😄).

@mumoshu
Collaborator

mumoshu commented Jul 9, 2021

RunnerSet to be extended to support ephemeral runners?

@inahga I haven't tested it yet, but theoretically it should just work 😄 Would you mind giving it a shot?

Also, ephemeral runners are fundamentally unreliable while being stopped. We need #470 (comment) to improve it. You might have already read that but I thought it worth being mentioned.

@inahga
Contributor Author

inahga commented Jul 9, 2021

@inahga I haven't tested it yet, but theoretically it should just work 😄 Would you mind giving it a shot?

For sure!

Also, ephemeral runners are fundamentally unreliable while being stopped. We need #470 (comment) to improve it. You might have already read that but I thought it worth being mentioned.

Yep we have the risks documented. Given the size of our org, the number and complexity of the builds, and the fact that our previous CI had good support for ephemeral, we really do need ephemeral.

We're not worried about the autoscaling issues since we're not using autoscaling. The way our cluster is set up, there's no cost-saving benefit to autoscaling, so we just run at capacity and ensure the number of runners reflects the peak workload.

As for the issue with --once (#466) and the lack of a real ephemeral flag, we haven't actually encountered these issues yet. Our build count is well into the thousands by now (hundreds of them mine), all of varying complexity, and it seems to just work. I suspect some other changeset in actions/runner may have quietly fixed it? I haven't dug into it, though.
