Scale down Github API responses inconsistency #3589
Comments
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.
@aakilin We see this a lot when we have many runners scaled up (1000+). It seems to get worse as the number of instances grows, but it can also happen when scaled down to lower levels. We have reason to believe this is a bug on GitHub's side, but we have no way to prove that other than implementing this kind of "double validate" logic and seeing whether that fixes it. One thing my colleague experimented with was turning
@mcaulifn also seemed to run into this in the aforementioned issue.
#1151 (comment)
Following up from my comment here. I can close this issue and keep the discussion in that existing issue if that's better. TL;DR: we are confident that the code that paginates through the runners from GitHub is working as advertised; we just think there is a chance that GitHub occasionally leaves certain runners out of its responses. We have logging around rate limiting, so we've ruled that out. Our logs indicate that runners are being terminated while they are healthy and busy, without memory pressure and without any clear errors in the agent logs.
I'm wondering if you'd consider a patch to:
terraform-aws-github-runner/lambdas/functions/control-plane/src/scale-runners/scale-down.ts
Lines 178 to 183 in f7e4935
That would basically "double-validate" before termination: we'd re-query the API to make sure the runner is still missing, and only then do the "hard" termination.
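To make the proposal concrete, here is a minimal sketch of the "double-validate" idea. It is not the code in `scale-down.ts`: the types, the function name `terminateConfirmedOrphans`, the injected helpers, and matching runners by name are all assumptions made for illustration, and the recheck delay is an arbitrary placeholder.

```typescript
// Hypothetical shape for illustration; not the module's actual type.
interface RunnerInfo {
  instanceId: string;
  name: string;
}

// Injected dependencies (all hypothetical) so the sketch does not depend on
// any real GitHub or AWS client.
interface ScaleDownDeps {
  // Names of all runners GitHub currently reports, after full pagination.
  listGitHubRunnerNames: () => Promise<string[]>;
  // Terminates the instance backing a runner.
  terminate: (instanceId: string) => Promise<void>;
  // Waits between the first and the second lookup.
  sleep: (ms: number) => Promise<void>;
}

// Terminates only runners that are missing from GitHub in two separate
// lookups. Returns the instance IDs that were terminated.
async function terminateConfirmedOrphans(
  ec2Runners: RunnerInfo[],
  deps: ScaleDownDeps,
  recheckDelayMs = 5000, // assumed value, not taken from the project
): Promise<string[]> {
  // First pass: anything GitHub does not report is only a suspect.
  const firstPass = new Set(await deps.listGitHubRunnerNames());
  const suspects = ec2Runners.filter((r) => !firstPass.has(r.name));
  if (suspects.length === 0) return [];

  // Second pass: re-query before doing anything destructive.
  await deps.sleep(recheckDelayMs);
  const secondPass = new Set(await deps.listGitHubRunnerNames());

  const terminated: string[] = [];
  for (const runner of suspects) {
    // The runner reappeared, so the first response was incomplete.
    if (secondPass.has(runner.name)) continue;
    await deps.terminate(runner.instanceId);
    terminated.push(runner.instanceId);
  }
  return terminated;
}
```

The second lookup costs one extra round of API calls, but only when the first pass finds at least one suspect, so the common case stays as cheap as it is today.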
Also just curious if other folks have run into this.