Jobs getting dropped #84
Thanks for taking the time to share your issues. We also run more or less the default config, for roughly 50 repos at the org level, and we have set the time until a runner instance gets killed to 20 minutes. Last week we had several issues, but they were related to changes at GitHub. We terminate the runner and remove the AWS instance if the instance has not been used for a minimum time in use, for us 20 minutes. There is no very clean way of cleaning up runners, so we simply try to remove the runner from GitHub; that removal only succeeds if the runner is not active, and when it succeeds we terminate the instance. You will see a log line like this in the scale-down lambda:
Is that the minimum_running_time_in_minutes set to 20? Mine was at 5. I tried raising it to 20 just in case, but I'm still regularly having runners cancelled in the middle of a job. It's happening more than half the time with my full-length workflows. I also have 70-ish "offline" runners in the self-hosted runners list.
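For readers following along: minimum_running_time_in_minutes is set on the Terraform module block itself. A minimal sketch, where the module source line and the omitted settings are assumptions rather than details from this issue:

```hcl
module "github_runner" {
  source = "philips-labs/github-runner/aws"

  # (required settings such as region, VPC, and GitHub app config omitted)

  # Keep an idle runner alive at least this long before the
  # scale-down lambda tries to deregister and terminate it.
  minimum_running_time_in_minutes = 20
}
```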
Oh, I just had a nice, isolated one from a cron run, so maybe I can provide more detailed context. The workflow started at midnight with two jobs, the second one needing the first:

- 00:08:32.207-04:00 - webhook receives a check_run
- Job 1 ran for 21m 42s

My minimum_running_time_in_minutes is set to 20. Might there be an issue with not realizing that jobs which run longer than minimum_running_time_in_minutes are actually still running, since no check_runs have been sent in that time? I'm going to try bumping it up to 40 and see what happens, though having to keep runners up for 40 minutes for workflows that only take 5 might be problematic. I also have no idea why it would spin up two instances for one workflow with two jobs where one depends on the other.
Tried 2 workflows with minimum_running_time_in_minutes at 40. Both were cancelled.
Your logs mention you are running out of spot instances, so that is at least one issue with your AWS setup. You can try requesting an increase to the maximum amount via the AWS console (or in code; see the sketch below).
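As an aside, the spot limit is an EC2 service quota, so the increase can also be requested in code via the Terraform AWS provider's aws_servicequotas_service_quota resource. A sketch; the quota code below is a best-guess assumption, so verify it in the Service Quotas console before applying:

```hcl
resource "aws_servicequotas_service_quota" "spot_requests" {
  service_code = "ec2"
  # Assumed code for "All Standard (A, C, D, H, I, M, R, T, Z)
  # Spot Instance Requests" -- confirm before applying.
  quota_code = "L-34B43A08"
  value      = 64
}
```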
Yeah, that's what it looks like, but this is the only thing we use spot instances for, and it had been more than 6 hours since the last one stopped, so that doesn't entirely make sense. I was just reading more about them, though, and found, buried away, that although the default max is 20, newer accounts might have a lower max than that. I think they might not know what "default" means.
Might not be what you want, but there is an option to disable spot (sketched below).
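The exact variable for this has changed across module versions, so the following is a sketch rather than the documented interface of the version discussed in this thread:

```hcl
module "github_runner" {
  source = "philips-labs/github-runner/aws"

  # Assumption: newer module versions expose this switch to run the
  # fleet on plain on-demand instances instead of spot.
  instance_target_capacity_type = "on-demand"
}
```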
It looks like the problem was indeed related to an absurdly low default spot instance cap. They upped it to 10, and everything works fine now. Closing. |
Thx for the feedback |
I'm hitting the same error on a build that takes a very long time to complete:
So far it's happened to me once after 11h23m and a second time after 8h29m. The spot instances remain under Settings > Actions > Self-hosted runners as permanently offline, so I'm guessing the spot instances got killed. I couldn't tell from the discussion above how @AlexMcConnell resolved this issue on his end, or whether I'm encountering the same issue or a similar one with a different underlying cause. @npalm, do you have any advice on how I can solve this or how to go about debugging it?
Sorry for jumping the gun in posting; I think I've figured out what's going on. Just spot instances being spot instances 😄 In my EC2 console I can see Status: instance-terminated-no-capacity. The AWS docs on Spot Instance interruptions explain:
Looks like AWS's native EC2 Auto Scaling optimizes around this by using the "capacity-optimized" allocation strategy when launching spot instances. I don't see any way to optimize for capacity with the current feature set of this codebase. @npalm, does my conclusion sound right to you?
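For reference, this is roughly what the capacity-optimized strategy looks like on a native Auto Scaling group in Terraform. It illustrates the AWS feature the comment refers to, not anything this module exposes; resource names and instance types are placeholders:

```hcl
resource "aws_autoscaling_group" "runners" {
  min_size            = 0
  max_size            = 10
  vpc_zone_identifier = var.subnet_ids # placeholder

  mixed_instances_policy {
    instances_distribution {
      # 100% spot, placed into the deepest capacity pools to reduce
      # interruptions like instance-terminated-no-capacity.
      on_demand_percentage_above_base_capacity = 0
      spot_allocation_strategy                 = "capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.runner.id # placeholder
      }
      # Listing several instance types gives the strategy more pools to pick from.
      override { instance_type = "m5.4xlarge" }
      override { instance_type = "m5a.4xlarge" }
    }
  }
}
```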
@michaelstepner your conclusion sounds right. Created an issue to also support on-demand. @Kostiantyn-Vorobiov is working on PR #586 to support fallback instances.
Summary
I'm not sure if this is one problem that lies with the runners, or two problems with one of them being GitHub's. Or maybe it's all GitHub's fault and it's not communicating properly over the webhooks. IDK.
In the past week, I have regularly been seeing jobs getting cancelled or just never happening. The first thing I'm seeing looks like some sort of timing issue between the termination order being given to the spot request and the runner picking up another job: my jobs are getting "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled" when no one has done anything.
The other thing I'm seeing is jobs that just never run and yet the workflow fails.
Steps to reproduce
I don't know. It happens all the time with my pipeline, across all of my workflows and jobs. All of my jobs run bash scripts, which in turn run Docker containers for everything. I do have a few differences from the default settings: I have the instance type set to m5.4xlarge, and I have a post_install script that provides ECR access (the script itself isn't reproduced here; a sketch follows):
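The actual script isn't included in the issue. For context, a post-install hook granting ECR access typically just logs the instance's Docker daemon into the registry; a minimal sketch, where the variable name, account ID, and region are all assumptions:

```hcl
module "github_runner" {
  source = "philips-labs/github-runner/aws"

  instance_type = "m5.4xlarge"

  # Hypothetical hook: log Docker into ECR so jobs can pull private
  # images. Account ID and region are placeholders.
  userdata_post_install = <<-EOT
    aws ecr get-login-password --region us-east-1 |
      docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
  EOT
}
```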
I just thought to try updating the lambda zips, since I'm based straight on the GitHub repo and haven't done that since the last time I ran terraform init. So I'll give that a shot.