Scale down Github API responses inconsistency #3589

Closed
maschwenk opened this issue Oct 31, 2023 · 5 comments

maschwenk (Contributor) commented Oct 31, 2023

#1151 (comment)

Following up from my comment linked above. I can close this issue and keep the discussion in that existing issue if that's better. TL;DR: we are confident that the code paginating through the runners from GitHub is working as advertised; we just think there is a chance that GitHub is occasionally leaving certain runners out of its responses. We have logging around rate limiting, so we've ruled that out, but our logs indicate that runners are being terminated while they are healthily busy, not under memory pressure, and without any clear errors in the agent logs.

I'm wondering if you'd consider a patch to:

// Runner was not found in the GitHub API response: if the instance has been up
// longer than the allowed boot time, treat it as orphaned and terminate it.
if (bootTimeExceeded(ec2Runner)) {
  logger.info(`Runner '${ec2Runner.instanceId}' is orphaned and will be removed.`);
  terminateOrphan(ec2Runner.instanceId);
} else {
  logger.debug(`Runner ${ec2Runner.instanceId} has not yet booted.`);
}

That would basically "double-validate" before termination: we'd re-query the API to make sure the runner is still missing, and only then do the "hard" termination.
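For illustration, here is a minimal sketch of what that double-validation could look like. The helper names (listGitHubRunners, terminateOrphan), the runner shapes, and the name-matching are assumptions standing in for whatever the scale-down Lambda actually uses, not the module's real API:

// Sketch only: re-check the GitHub API before hard-terminating a suspected orphan.
interface Ec2RunnerInfo {
  instanceId: string;
}

interface GhRunnerInfo {
  name: string;
}

async function terminateOrphanWithDoubleCheck(
  ec2Runner: Ec2RunnerInfo,
  // Assumed helper that re-queries the GitHub API for registered runners.
  listGitHubRunners: () => Promise<GhRunnerInfo[]>,
  // Assumed helper that hard-terminates the EC2 instance.
  terminateOrphan: (instanceId: string) => Promise<void>,
): Promise<void> {
  // Second lookup: only terminate if the runner is *still* absent, so a single
  // inconsistent listing can't take down a busy runner.
  const ghRunners = await listGitHubRunners();
  // Assumes the GitHub runner name contains the instance id, as a stand-in for
  // however the Lambda actually matches EC2 instances to registered runners.
  const stillMissing = !ghRunners.some((runner) => runner.name.includes(ec2Runner.instanceId));

  if (stillMissing) {
    console.info(`Runner '${ec2Runner.instanceId}' is still missing after re-check and will be removed.`);
    await terminateOrphan(ec2Runner.instanceId);
  } else {
    console.debug(`Runner '${ec2Runner.instanceId}' showed up on re-check; skipping termination.`);
  }
}

Whether a single immediate re-check is enough, or whether it should wait a bit before re-querying, is up for discussion; the point is just not to trust a single listing.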

Also just curious if other folks have run into this.

github-actions bot commented Dec 1, 2023

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label Dec 1, 2023

aakilin commented Dec 5, 2023

We have the same issue on our side. The runners shut down while they are still working. I can't figure out why the Lambda decides that the worker isn't busy.

maschwenk (Contributor, Author) commented Dec 5, 2023

@aakilin We see this a lot when we have many runners scaled up (1000+). It seems to get worse as we add more instances, but it can also happen even when scaled down to lower levels. We have reason to believe this is a bug on GitHub's side, but we really have no way to prove that other than implementing this kind of "double validate" logic and seeing whether it fixes it.

One thing my colleague experimented with was turning disable_runner_autoupdate off, but that seemed to have no effect. We see these issues even when the runners are completely idle, which has led us to believe it's not an issue with the instance resources or anything like that.

maschwenk (Contributor, Author) commented:

@mcaulifn also seemed to be running into this in the issue linked above.

github-actions bot removed the Stale label Dec 6, 2023
github-actions bot commented Jan 5, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label Jan 5, 2024
github-actions bot closed this as not planned Jan 15, 2024