
Question: is it possible to see the number of runs in a queue, so I could scale the number of virtual machines? #723

Closed
pekhota opened this issue Sep 28, 2020 · 8 comments


pekhota commented Sep 28, 2020

Hi, I couldn't find any information on the internet, so I decided to ask my question here. Is it possible to figure out the number of runs/messages/items in a queue so I can scale my virtual machines? Links or any ideas? Thank you.

@j3parker

GitHub doesn't expose any public APIs for this yet. You can listen to webhook events, but that only tells you that something went into the queue, not the status of the queue. You can use the APIs on a per-repo basis to check for queued jobs; I hear some people are scaling based on this. Obviously this isn't going to be perfectly accurate if you reuse runners (one job may finish and an old runner could pick up a new job), but it's probably good enough.
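
For example, here is a rough sketch of that per-repo check against the REST API (the owner/repo values and token handling are just placeholders):

```python
# Sketch: count queued workflow runs for one repo via the Actions REST API.
# Endpoint: GET /repos/{owner}/{repo}/actions/runs?status=queued
import os
import requests

def queued_run_count(owner: str, repo: str, token: str) -> int:
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/actions/runs",
        params={"status": "queued", "per_page": 100},
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github.v3+json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["total_count"]

if __name__ == "__main__":
    # "my-org"/"my-repo" are placeholders; GITHUB_TOKEN is a PAT with repo access.
    print(queued_run_count("my-org", "my-repo", os.environ["GITHUB_TOKEN"]))
```

You'd run something like this on a schedule and feed the count into whatever scaler you use.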

Here's what we do**:

The ASG is "spare" capacity: while a job is running, the instance is no longer in the ASG, which will typically cause the ASG to spin up a new runner. The ASG's desired capacity is the amount of idle capacity to keep around, ready to accept new jobs. If you set this to a fixed amount (e.g. 1) you can still have multiple jobs running in parallel, but the rate at which you can start jobs is limited by this number.

One simple way to auto-scale is to have an alarm that fires when the capacity of the ASG reaches zero. Zero means that you don't have a machine available to pick up new jobs, which means the next job might have to wait, which sucks.
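
A minimal sketch of that alarm with boto3, assuming busy runners detach from the ASG so GroupInServiceInstances approximates spare capacity (the alarm name, ASG name, and SNS topic ARN are placeholders; group metrics collection has to be enabled on the ASG for this metric to exist):

```python
# Sketch: alarm when the ASG's in-service (i.e. idle/spare) capacity drops to zero.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="actions-runners-no-spare-capacity",  # placeholder
    Namespace="AWS/AutoScaling",
    MetricName="GroupInServiceInstances",           # requires ASG group metrics enabled
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "actions-runners"}],  # placeholder ASG
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="LessThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:runner-scaling"],  # placeholder topic
)
```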

The exact formulas you use for this kind of scale-up/scale-down should be chosen based on how bursty your workload is and how much you want to avoid overhead from idle capacity (imagine scaling up the desired capacity a bunch because Dependabot made 1000 PRs on Monday morning -- that load is transient).

This method works pretty well even if you just stick with desired=1. It doesn't support "scale to zero", but "scale to 1" is pretty close. The VM needs to be repo-agnostic when you use org runners -- i.e. you can't have a per-repo IAM role or anything cool like that.

We run an org wide pool but also some per-repo pools for busy repos.

** We're waiting for ephemeral runners to land before doing all of this exactly -- for now we're just scaling based on time. Mostly our machines sit idle, but sometimes there is a burst of jobs, e.g. due to Dependabot, and that results in queueing. It's also not resilient to some jobs/repos taking a long time.

@j3parker

It would be really nice to have either queue introspection or a queue-based way to register runners for jobs:

  • It would make auto-scaling easier
  • It would make scale-to-zero a better experience, which would be good for per-repo runners even in a large org (people love to make millions of repos)
  • Tracking info about how long jobs take in the queue could be useful for cost/benefit analysis (how much are we actually saving with self-hosted runners when you account for spare/idle capacity, VM creation/deletion time etc.)
  • If we could see enough data about a job, e.g. that it's for the abc workflow on the xyz branch (maybe master/main), then we could decide to run it on a VM with such-and-such an IAM role in such-and-such a VPC... this would be a huge win for us... but right now you register a runner to either an org or a repo and it just picks up whatever is in the queue.


pekhota commented Sep 29, 2020

Well, one more question. I used the https://github.com/actions/virtual-environments repo to create an AWS AMI. The time to set up and provision a new instance is about 5 minutes. I can have a situation where there is no active self-hosted runner available, and at that moment all new jobs will fail with the message "No runner available". Is it possible to prevent that error so the job waits instead of failing?


pekhota commented Sep 29, 2020

And I forgot to say a huge THANK YOU for your deep and detailed response!


j3parker commented Sep 29, 2020

Well, one more question. I used the https://github.com/actions/virtual-environments repo to create an AWS AMI. The time to set up and provision a new instance is about 5 minutes. I can have a situation where there is no active self-hosted runner available, and at that moment all new jobs will fail with the message "No runner available". Is it possible to prevent that error so the job waits instead of failing?

One trick people have used is to register a runner and keep it offline as a stub. GitHub will hold on to the job in case that runner comes back online. Recently GitHub started garbage-collecting runner registrations that have been offline for too long, so it's a bit trickier to do, and it's clearly not a supported thing. :)

The way my company solves it is by never scaling to zero at the org level.


pekhota commented Sep 29, 2020

I had the same thoughts. Thank you.

What kind of spot instances do you use? Flexible ones or ones with a fixed duration? If flexible, there's a problem: if an action is in progress and you need to stop the instance within 2 minutes, that situation becomes very difficult to handle... so I guess you are using fixed ones, aren't you?

Sorry for the many questions. I'm working on building this kind of infrastructure, and any information would be helpful.

@j3parker

We use flexible ones. Most of our workflows probably run for less than 2 minutes. We don't yet handle the spot termination notice and stop the runner if it's idle (we should). We also don't have monitoring to track spot terminations and see how often they happen / how often they fail jobs... we should. So far no user reports, though.
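
For reference, a minimal sketch of what handling that notice could look like (the EC2 metadata endpoint is the real one for spot interruption notices; runner_is_idle() and stop_runner() are hypothetical hooks for whatever runner service you run):

```python
# Sketch: poll the EC2 instance metadata (IMDSv2) for a spot interruption notice
# and drain the runner if it's idle.
import time
import requests

METADATA = "http://169.254.169.254/latest"

def imds_token() -> str:
    # IMDSv2 session token
    return requests.put(
        f"{METADATA}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        timeout=2,
    ).text

def interruption_pending() -> bool:
    # /spot/instance-action returns 404 until a termination/stop is scheduled
    resp = requests.get(
        f"{METADATA}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": imds_token()},
        timeout=2,
    )
    return resp.status_code == 200

def runner_is_idle() -> bool:
    return True  # hypothetical: query your runner's state

def stop_runner() -> None:
    pass  # hypothetical: e.g. stop the runner service so it doesn't take new jobs

if __name__ == "__main__":
    while True:
        if interruption_pending() and runner_is_idle():
            stop_runner()
            break
        time.sleep(5)
```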

My hope is that because our VMs are pretty short-lived, we typically aren't the first victims chosen for spot termination, but I don't know if AWS documents anything there.

We mitigate spot capacity issues by running in plenty of availability zones (eventually we'll use multiple regions) and with a variety of instance types.
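
For example, a rough sketch of an ASG configured that way with boto3 (the group name, launch template, subnets, and instance types are all placeholders):

```python
# Sketch: one ASG spread across several subnets/zones with multiple instance
# types, running entirely on spot capacity.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="actions-runners",                # placeholder
    MinSize=1,
    MaxSize=20,
    DesiredCapacity=1,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb,subnet-ccc",  # placeholder subnets in different AZs
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "actions-runner",    # placeholder launch template
                "Version": "$Latest",
            },
            # Several instance types so a capacity shortage in one pool hurts less.
            "Overrides": [
                {"InstanceType": "c5.xlarge"},
                {"InstanceType": "c5a.xlarge"},
                {"InstanceType": "m5.xlarge"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 0,      # all spot
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```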

The worst case of having to re-run a job isn't too bad if it's an infrequent thing.


pekhota commented Sep 30, 2020

Thanks! You've helped me a lot.

pekhota closed this as completed Sep 30, 2020