Question: is it possible to see the number of runs in a queue, so I could scale the number of virtual machines? #723
Comments
GitHub doesn't expose any public APIs for this yet. You can listen to webhook events, but those only tell you that something went into the queue, not the status of the queue. You can use the APIs on a per-repo basis to check for queued jobs, and I hear some people are scaling based on this. Obviously this isn't going to be perfectly accurate if you reuse runners (one job may finish and an old runner could pick up a new job), but it's probably good enough. Here's what we do**:
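For reference, a minimal sketch of that per-repo check, using the standard "list workflow runs" endpoint filtered to `status=queued`. The token, the repo list, and the idea of summing across repos are assumptions you'd adapt to your own setup:

```python
import os
import requests

# Hypothetical inputs: a personal access token with repo scope and the
# repositories you want to watch. Adjust to your own environment.
TOKEN = os.environ["GITHUB_TOKEN"]
REPOS = ["my-org/repo-a", "my-org/repo-b"]

def queued_run_count(repo: str) -> int:
    """Return the number of workflow runs currently queued for one repo."""
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/actions/runs",
        params={"status": "queued", "per_page": 1},
        headers={
            "Authorization": f"token {TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        timeout=10,
    )
    resp.raise_for_status()
    # total_count reflects every run matching the filter, not just this page.
    return resp.json()["total_count"]

total_queued = sum(queued_run_count(repo) for repo in REPOS)
print(f"queued runs across watched repos: {total_queued}")
```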
The ASG is "spare" capacity: while a job is running, that instance is no longer in the ASG, which will typically cause the ASG to spin up a new runner.

The ASG has a desired capacity. This is the amount of idle capacity to keep around, ready to accept new jobs. If you set this to a fixed amount (e.g. 1) you can still have multiple jobs running in parallel, but the rate at which you can start jobs is limited by this number.

One simple way to auto-scale is to have an alarm that fires when the capacity of the ASG reaches zero. Zero means you don't have a machine available to pick up new jobs, which means the next job might have to wait, which sucks. The exact formulas you use for this kind of scale-up/scale-down should be chosen based on how bursty your workload is and how much overhead from idle capacity you want to avoid: imagine scaling up desired capacity a lot because dependabot made 1000 PRs on Monday morning -- that load is transient. This method works pretty well even if you just stick with desired=1. It doesn't support "scale to zero", but "scale to 1" is pretty close. (A rough sketch of this scale-up logic follows below.)

The VM needs to be agnostic when you use org runners -- i.e. you can't have a per-repo IAM role or anything cool like that. We run an org-wide pool but also some per-repo pools for busy repos.

** We're waiting for ephemeral runners to land before doing all of this exactly -- for now we're just scaling based on time. Mostly our machines sit idle, but sometimes there is a burst of jobs, e.g. due to dependabot, and that results in queueing. It's also not resilient to some jobs/repos taking a long time.
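A rough sketch of the "scale up when spare capacity hits zero" idea described above. The ASG name and the cap are made-up placeholders, and a real setup would more likely wire this through a CloudWatch alarm and a scaling policy than a polling script:

```python
import boto3

# Hypothetical ASG name for the idle-runner pool.
ASG_NAME = "gha-runner-spare-pool"
MAX_IDLE = 5  # cap scale-up so a transient burst (e.g. dependabot) can't run away

asg = boto3.client("autoscaling")

group = asg.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)["AutoScalingGroups"][0]

# Instances that picked up a job have been detached from the group, so
# whatever remains InService is spare capacity waiting for work.
spare = sum(
    1 for i in group["Instances"] if i["LifecycleState"] == "InService"
)

if spare == 0:
    # Nothing is ready to take the next job: add one spare runner, up to the cap.
    asg.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=min(group["DesiredCapacity"] + 1, MAX_IDLE),
    )
```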
It would be really nice to have either queue introspection or a queue-based way to register runners for jobs.
Well, one more question. I used the https://github.com/actions/virtual-environments repo to create an AWS AMI. The time to set up and provision a new instance is about 5 minutes, so I can end up in a situation where there is no active self-hosted runner available. At that moment all new jobs fail with the message "No runner available". Is it possible to prevent that error so the job waits instead of failing?
And I forgot to say a huge THANK YOU for your deep and detailed response!
One trick that people have used is to register a runner and keep it offline as a stub: GitHub will hold on to the job in case that runner comes back online. Recently GitHub started garbage-collecting runner registrations after they've been offline for too long, so it's a bit trickier to do, and it's clearly not a supported thing. :) The way my company solves it is by never scaling to zero at the org level.
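One way to watch for the "scaled to zero" condition is to poll the org-level self-hosted runner list via the REST API. A minimal sketch, assuming a hypothetical org name and a token with admin:org scope; what you do when nothing is online is up to your scaling logic:

```python
import os
import requests

# Hypothetical org and token; the token needs admin:org scope to list
# self-hosted runners at the organization level.
ORG = "my-org"
TOKEN = os.environ["GITHUB_TOKEN"]

resp = requests.get(
    f"https://api.github.com/orgs/{ORG}/actions/runners",
    headers={
        "Authorization": f"token {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
    timeout=10,
)
resp.raise_for_status()
runners = resp.json()["runners"]

online = [r for r in runners if r["status"] == "online"]
if not online:
    # No online runner registered: new jobs would fail immediately with
    # "No runner available", so this is the point to keep (or spin up)
    # at least one runner.
    print("WARNING: no online runners registered for the org")
```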
I had the same thoughts. Thank you. What kind of spot instances do you use, flexible ones or ones with a fixed duration? If flexible, there is a problem: if an action is in progress and you need to stop the instance within 2 minutes, that situation becomes very difficult to handle... so I guess you are using fixed ones, aren't you? Sorry for the many questions. I'm working on building such an infrastructure and any information would be helpful.
We use flexible ones. Most of our workflows are probably less than 2 minutes. We don't yet handle the spot termination notice and stop the runner if it's idle (we should). We also don't have monitoring to track spot terminations and see how often they happen / how often they fail jobs... we should. So far no user reports, though. My hope is that because our VMs are pretty short-lived we typically aren't the first victim chosen for spot termination, but I don't know if AWS documents anything there. We mitigate spot capacity risk by running in plenty of zones (eventually we'll use multiple regions) and a variety of instance types. The worst case of having to rerun a job isn't too bad if it's an infrequent thing.
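Handling the spot termination notice mentioned above usually means polling the EC2 instance metadata service from inside the VM. A minimal sketch, assuming IMDSv1 is enabled (with IMDSv2 you would fetch a session token first); the drain_runner() hook is a placeholder for whatever "stop taking jobs / deregister" step your image uses:

```python
import time
import requests

# The spot interruption notice shows up at this metadata path roughly two
# minutes before reclaim; a 404 means no interruption is scheduled.
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    """True once AWS has scheduled this spot instance for reclaim."""
    try:
        resp = requests.get(METADATA_URL, timeout=2)
    except requests.RequestException:
        return False
    return resp.status_code == 200

def drain_runner() -> None:
    # Placeholder: stop the runner service if it is idle, or let the
    # current job finish and deregister the runner afterwards.
    pass

while True:
    if interruption_pending():
        drain_runner()
        break
    time.sleep(5)
```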
Thanks! You've helped me a lot.
Hi, I couldn't find any information on the internet, so I decided to ask my question here. Is it possible to figure out the number of runs/messages/items in a queue so I could scale my virtual machines? Links or any ideas? Thank you.