Scale down Github API responses inconsistency #3589

Closed
maschwenk opened this issue Oct 31, 2023 · 5 comments

maschwenk (Contributor) commented Oct 31, 2023

#1151 (comment)

Following up from my comment linked above. I can close this issue and keep the discussion in that existing issue if that's better. TL;DR: we are confident that the code paginating through the runners from GitHub is working as advertised; we just think there is a chance that GitHub is occasionally leaving certain runners out of its responses. We have logging around rate limiting, so we've ruled that out, but our logs indicate that runners are being terminated while they are healthily busy, not under memory pressure, and without any clear errors in the agent logs.

I'm wondering if you'd consider a patch to:

// Runner was not found in the GitHub API response: if the instance has been up
// longer than the allowed boot time, treat it as orphaned and terminate it.
if (bootTimeExceeded(ec2Runner)) {
  logger.info(`Runner '${ec2Runner.instanceId}' is orphaned and will be removed.`);
  terminateOrphan(ec2Runner.instanceId);
} else {
  logger.debug(`Runner ${ec2Runner.instanceId} has not yet booted.`);
}

That would basically "double-validate" before termination: we'd re-query the API to make sure the runner is still missing, and only then do the "hard" termination.
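For illustration, here is a minimal sketch of what that double-validation could look like. The helper names (listGitHubRunners, terminateOrphan), the runner shapes, and the name-matching are assumptions standing in for whatever the scale-down Lambda actually uses, not the module's real API:

// Sketch only: re-check the GitHub API before hard-terminating a suspected orphan.
interface Ec2RunnerInfo {
  instanceId: string;
}

interface GhRunnerInfo {
  name: string;
}

async function terminateOrphanWithDoubleCheck(
  ec2Runner: Ec2RunnerInfo,
  // Assumed helper that re-queries the GitHub API for registered runners.
  listGitHubRunners: () => Promise<GhRunnerInfo[]>,
  // Assumed helper that hard-terminates the EC2 instance.
  terminateOrphan: (instanceId: string) => Promise<void>,
): Promise<void> {
  // Second lookup: only terminate if the runner is *still* absent, so a single
  // inconsistent listing can't take down a busy runner.
  const ghRunners = await listGitHubRunners();
  // Assumes the GitHub runner name contains the instance id, as a stand-in for
  // however the Lambda actually matches EC2 instances to registered runners.
  const stillMissing = !ghRunners.some((runner) => runner.name.includes(ec2Runner.instanceId));

  if (stillMissing) {
    console.info(`Runner '${ec2Runner.instanceId}' is still missing after re-check and will be removed.`);
    await terminateOrphan(ec2Runner.instanceId);
  } else {
    console.debug(`Runner '${ec2Runner.instanceId}' showed up on re-check; skipping termination.`);
  }
}

Whether a single immediate re-check is enough, or whether it should wait a bit before re-querying, is up for discussion; the point is just not to trust a single listing.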

Also just curious if other folks have run into this.

github-actions bot commented Dec 1, 2023

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label Dec 1, 2023

aakilin commented Dec 5, 2023

We have the same issue on our side. The runners shut down while they are still working. I can't figure out why the Lambda decides that the worker isn't busy.

maschwenk (Contributor, Author) commented Dec 5, 2023

@aakilin We see this a lot when we have many runners scaled up (1000+). It seems to get worse as we add more instances, but it can also happen even when scaled down to lower levels. We have reason to believe this is a bug on GitHub's side, but we really have no way to prove that other than implementing this kind of "double validate" logic and seeing whether it fixes it.

One thing my colleague experimented with was turning disable_runner_autoupdate off, but that seemed to have no effect. We see these issues even when the runners are completely idle, which has led us to believe it's not an issue with the instance resources or anything like that.

maschwenk (Contributor, Author) commented:

@mcaulifn also seemed to be running into this in the issue linked above.

github-actions bot removed the Stale label Dec 6, 2023
github-actions bot commented Jan 5, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label Jan 5, 2024
github-actions bot closed this as not planned Jan 15, 2024