
Question: is it possible to see the number of runs in a queue, so I could scale the number of virtual machines? #723

Closed
pekhota opened this issue Sep 28, 2020 · 8 comments


pekhota commented Sep 28, 2020

Hi, I couldn't find any information on the internet, so I decided to ask my question here. Is it possible to figure out the number of runs/messages/items in a queue so I can scale my virtual machines? Links or any ideas? Thank you.

@j3parker

GitHub doesn't expose any public APIs for this yet. You can listen to webhook events, but that only tells you that something went into the queue, not the status of the queue. You can use the APIs on a per-repo basis to check for queued jobs; I hear some people are scaling based on this. Obviously this isn't going to be perfectly accurate if you reuse runners (one job may finish and an old runner could pick up a new job), but it's probably good enough.
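
For example, here is a rough sketch of that per-repo check against the REST API (the owner/repo values and token handling are just placeholders):

```python
# Sketch: count queued workflow runs for one repo via the Actions REST API.
# Endpoint: GET /repos/{owner}/{repo}/actions/runs?status=queued
import os
import requests

def queued_run_count(owner: str, repo: str, token: str) -> int:
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/actions/runs",
        params={"status": "queued", "per_page": 100},
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github.v3+json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["total_count"]

if __name__ == "__main__":
    # "my-org"/"my-repo" are placeholders; GITHUB_TOKEN is a PAT with repo access.
    print(queued_run_count("my-org", "my-repo", os.environ["GITHUB_TOKEN"]))
```

You'd run something like this on a schedule and feed the count into whatever scaler you use.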

Here's what we do**:

The ASG is "spare" capacity: while a job is running, the instance is no longer in the ASG, which will typically cause the ASG to spin up a new runner. The ASG's desired capacity is the amount of idle capacity to keep around, ready to accept new jobs. If you set this to a fixed amount (e.g. 1) you can still have multiple jobs running in parallel, but the rate at which you can start jobs is limited by this number.

One simple way to auto-scale is to have an alarm that fires when the capacity of the ASG reaches zero. Zero means that you don't have a machine available to pick up new jobs, which means the next job might have to wait, which sucks.
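
A minimal sketch of that alarm with boto3, assuming busy runners detach from the ASG so GroupInServiceInstances approximates spare capacity (the alarm name, ASG name, and SNS topic ARN are placeholders; group metrics collection has to be enabled on the ASG for this metric to exist):

```python
# Sketch: alarm when the ASG's in-service (i.e. idle/spare) capacity drops to zero.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="actions-runners-no-spare-capacity",  # placeholder
    Namespace="AWS/AutoScaling",
    MetricName="GroupInServiceInstances",           # requires ASG group metrics enabled
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "actions-runners"}],  # placeholder ASG
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="LessThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:runner-scaling"],  # placeholder topic
)
```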

The exact formulas you use for this kind of scale-up/scale-down should be chosen based on how bursty your workload is and how much you want to avoid overhead from idle capacity (imagine scaling up the desired capacity a bunch because Dependabot made 1000 PRs on Monday morning -- that load is transient).

This method works pretty well even if you just stick with desired=1. It doesn't support "scale to zero", but "scale to 1" is pretty close. The VM needs to be repo-agnostic when you use org runners -- i.e. you can't have a per-repo IAM role or anything cool like that.

We run an org wide pool but also some per-repo pools for busy repos.

** We're waiting for ephemeral runners to land before doing all of this exactly -- for now we're just scaling based on time. Mostly our machines sit idle, but sometimes there is a burst of jobs, e.g. due to Dependabot, and that results in queueing. It's also not resilient to some jobs/repos taking a long time.

@j3parker

It would be really nice to have either queue introspection or a queue-based way to register runners for jobs:

  • It would make auto-scaling easier
  • It would make scale-to-zero a better experience, which would be good for per-repo runners even in a large org (people love to make millions of repos)
  • Tracking info about how long jobs take in the queue could be useful for cost/benefit analysis (how much are we actually saving with self-hosted runners when you account for spare/idle capacity, VM creation/deletion time etc.)
  • If we could see enough data about a job, e.g. that it's for the abc workflow on the xyz branch (maybe master/main), then we could decide to run it on a VM with such-and-such an IAM role in such-and-such a VPC... this would be a huge win for us... but right now you register a runner to either an org or a repo and it just picks up whatever is in the queue.


pekhota commented Sep 29, 2020

Well, one more question. I used the https://github.com/actions/virtual-environments repo to create an AWS AMI. The time to set up and provision a new instance is about 5 minutes. I can have a situation where there is no active self-hosted runner available, and at that moment all new jobs will fail with the message "No runner available". Is it possible to prevent that error so the job waits instead of failing?


pekhota commented Sep 29, 2020

And I forgot to say a huge THANK YOU for your deep and detailed response!


j3parker commented Sep 29, 2020

Well, one more question. I used the https://github.com/actions/virtual-environments repo to create an AWS AMI. The time to set up and provision a new instance is about 5 minutes. I can have a situation where there is no active self-hosted runner available, and at that moment all new jobs will fail with the message "No runner available". Is it possible to prevent that error so the job waits instead of failing?

One trick people have used is to register a runner and keep it offline as a stub. GitHub will hold on to the job in case that runner comes back online. Recently GitHub started garbage-collecting runner registrations that have been offline for too long, so it's a bit trickier to do, and it's clearly not a supported thing. :)

The way my company solves it is by never scaling to zero at the org level.


pekhota commented Sep 29, 2020

I had the same thoughts. Thank you.

What kind of spot instances do you use? Flexible ones or ones with a fixed duration? If flexible, there's a problem: if an action is in progress and you need to stop the instance within 2 minutes, that situation becomes very difficult to handle... so I guess you are using fixed ones, aren't you?

Sorry for the many questions. I'm working on building this kind of infrastructure, and any information would be helpful.

@j3parker

We use flexible ones. Most of our workflows probably run for less than 2 minutes. We don't yet handle the spot termination notice and stop the runner if it's idle (we should). We also don't have monitoring to track spot terminations and see how often they happen / how often they fail jobs... we should. So far no user reports, though.
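
For reference, a minimal sketch of what handling that notice could look like (the EC2 metadata endpoint is the real one for spot interruption notices; runner_is_idle() and stop_runner() are hypothetical hooks for whatever runner service you run):

```python
# Sketch: poll the EC2 instance metadata (IMDSv2) for a spot interruption notice
# and drain the runner if it's idle.
import time
import requests

METADATA = "http://169.254.169.254/latest"

def imds_token() -> str:
    # IMDSv2 session token
    return requests.put(
        f"{METADATA}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        timeout=2,
    ).text

def interruption_pending() -> bool:
    # /spot/instance-action returns 404 until a termination/stop is scheduled
    resp = requests.get(
        f"{METADATA}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
        timeout=2,
    )
    return resp.status_code == 200

def runner_is_idle() -> bool:
    return True  # hypothetical: query your runner's state

def stop_runner() -> None:
    pass  # hypothetical: e.g. stop the runner service so it doesn't take new jobs

if __name__ == "__main__":
    while True:
        if interruption_pending() and runner_is_idle():
            stop_runner()
            break
        time.sleep(5)
```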

My hope is that because our VMs are pretty short-lived, we typically aren't the first victims chosen for spot termination, but I don't know if AWS documents anything there.

We mitigate spot capacity issues by running in plenty of availability zones (eventually we'll use multiple regions) and with a variety of instance types.
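
For example, a rough sketch of an ASG configured that way with boto3 (the group name, launch template, subnets, and instance types are all placeholders):

```python
# Sketch: one ASG spread across several subnets/zones with multiple instance
# types, running entirely on spot capacity.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="actions-runners",                # placeholder
    MinSize=1,
    MaxSize=20,
    DesiredCapacity=1,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb,subnet-ccc",  # placeholder subnets in different AZs
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "actions-runner",    # placeholder launch template
                "Version": "$Latest",
            },
            # Several instance types so a capacity shortage in one pool hurts less.
            "Overrides": [
                {"InstanceType": "c5.xlarge"},
                {"InstanceType": "c5a.xlarge"},
                {"InstanceType": "m5.xlarge"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 0,      # all spot
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```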

The worst case of having to re-run a job isn't too bad if it's an infrequent thing.


pekhota commented Sep 30, 2020

Thanks! You've helped me a lot.

pekhota closed this as completed Sep 30, 2020