
Jobs getting dropped #84

Closed
AlexMcConnell opened this issue Jul 23, 2020 · 12 comments

@AlexMcConnell

Summary

I'm not sure if this is one problem with the runners, or two problems with one of them being GitHub. Or maybe it's all GitHub's fault and it's not communicating properly with the webhooks. IDK.

In the past week, I have regularly been seeing jobs getting cancelled or just not happening. The first thing I'm seeing: it looks like there might be some sort of timing issue between the termination order being given to the spot instance and that instance picking up another job. My jobs are getting "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled" when no one has done anything.

[Screenshot: Screen Shot 2020-07-23 at 12 46 39 PM]

The other thing I'm seeing is jobs that just never run and yet the workflow fails.

[Screenshot: Screen Shot 2020-07-23 at 1 16 37 PM]

Steps to reproduce

I don't know. It happens all the time with my pipeline, across all of my workflows and jobs. All of my jobs run bash scripts, which in turn run Docker containers for everything. I do have a few differences from the default settings: the instance type is set to m5.4xlarge, and I have a post_install script that provides ECR access:

# Configure Docker to authenticate to ECR via the amazon-ecr-credential-helper
mkdir -p /home/ec2-user/.docker
cat > /home/ec2-user/.docker/config.json <<'EOF'
{
  "credsStore": "ecr-login"
}
EOF
amazon-linux-extras enable docker
yum install -y amazon-ecr-credential-helper
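
For context, the non-default parts of the module configuration roughly look like this (a sketch only; the exact variable names, e.g. for the post-install hook, depend on the module version you use):

module "github_runner" {
  source = "github.com/philips-labs/terraform-aws-github-runner"

  # ... webhook, GitHub app, and networking settings left at their defaults ...

  # The two deviations from the defaults mentioned above (illustrative names):
  instance_type         = "m5.4xlarge"
  userdata_post_install = file("${path.module}/post-install.sh") # the ECR login script above
}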

I just thought to try updating the lambda zips, since my setup is based straight on the GitHub repo and I haven't done that since the last time I ran terraform init. So I'll give that a shot.

@npalm
Member

npalm commented Jul 24, 2020

Thanks for taking the time to share your issues. We also run more or less the default config, for roughly 50 repos at the org level. We have set the time until a runner instance gets killed to 20 minutes. The last week we had several issues, but they were related to changes at GitHub.

We terminate the runner and remove the AWS instance if an instance is not used after a minimum time in use, for us 20 minutes. There is no very clean way of cleaning up runners, so we simply try to de-register the runner from GitHub; when that succeeds, we remove the instance. De-registering the runner only succeeds if the runner is not active. You will see a log line in the scale-down lambda like DEBUG Runner 'i-1234567890abcd' cannot be de-registered, most likely the runner is active. I just checked this function, and it works as expected. So most likely it is an issue on the side of GitHub.

@AlexMcConnell
Author

Is that the minimum_running_time_in_minutes set to 20? Mine was at 5. I tried upping it to 20 just in case, but I'm still regularly having runners cancelled in the middle of a job. It's happening more than half the time with my full-length workflows.
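
In other words, what I changed is (a sketch of just the relevant line):

module "github_runner" {
  # ...
  # Runners are not considered for scale-down until they have been up at least this long.
  minimum_running_time_in_minutes = 20
}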

I also have 70ish "offline" runners in the self-hosted runners list.

@AlexMcConnell
Author

AlexMcConnell commented Jul 29, 2020

Oh, I just had a nice, isolated example, since it was a cron run, so maybe I can provide more detailed context.

Workflow started at midnight with two jobs, the second one needing the first:

00:08:32.207-04:00 - Webhook receives a check_run
00:09:13.177-04:00 - Scale up attempts to start an instance and receives "ERROR InsufficientInstanceCapacity"
00:10:10.156-04:00 - Scale up reports that it has created an instance
00:11:08.897-04:00 - Scale up attempts to start an instance and receives "ERROR InsufficientInstanceCapacity"
00:12:10.553-04:00 - Scale up attempts to start an instance and receives "ERROR InsufficientInstanceCapacity"
00:13:09.479-04:00 - Scale up reports that it has created an instance????
00:34:21.360-04:00 - Webhook receives a check_run
00:34:22.396-04:00 - Webhook receives a check_run
00:35:02.212-04:00 - Scale up shows 2 runners running, tries to start an instance, and receives "ERROR MaxSpotInstanceCountExceeded"
00:35:25.855-04:00 - Scale down - Runner 'i-00182d9960d370e91' cannot be de-registered
00:35:27.895-04:00 - Scale down - AWS runner instance 'i-0aa1baba356fea399' is terminated and GitHub runner 'i-0aa1baba356fea399' is de-registered.
00:35:29.908-04:00 - Scale down - Runner 'i-00182d9960d370e91' cannot be de-registered
00:35:57.311-04:00 - Scale up shows 1 runner running, tries to start an instance, and receives "ERROR MaxSpotInstanceCountExceeded"
00:35:57.311-04:00 - Scale up shows 1 runner running, tries to start an instance, and receives "ERROR MaxSpotInstanceCountExceeded"
00:37:33.738-04:00 - Webhook receives a check_run
00:37:56.032-04:00 - Scale up shows 0 queued workflows and stops

Job 1 ran for 21m 42s
Job 2 ran for 3m 2s before being cancelled due to receiving a shutdown signal.

My minimum_running_time_in_minutes is set to 20.

Might there be an issue with not realizing that jobs that run longer than the minimum_running_time_in_minutes are actually still running, since no check_runs have been sent in that time? I'm going to try bumping it up to 40 and see what happens, though having to keep them up for 40 minutes for those workflows that only take 5 might be problematic.

I have no idea why it would spin up two instances for one workflow that has two jobs, one dependent on the other, though.

@AlexMcConnell
Author

Tried 2 workflows with minimum_running_time_in_minutes at 40. Both were cancelled.

@npalm
Member

npalm commented Jul 30, 2020

Your logs mention you are running out of spot instances, so that is at least partly an issue with your AWS setup. You can request an increase of the max amount via the AWS console.
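
If you prefer to keep that limit in code instead of clicking through the console, the AWS provider also has a Service Quotas resource, roughly like this (a sketch; double-check the quota code for your account before applying):

# Sketch: request a higher spot instance quota from Terraform instead of the console.
resource "aws_servicequotas_service_quota" "spot_instance_requests" {
  service_code = "ec2"
  # "All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests";
  # verify this code in the Service Quotas console for your account.
  quota_code   = "L-34B43A08"
  value        = 64 # desired limit
}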

@AlexMcConnell
Author

Yeah, that's what it looks like, but this is the only thing we use spot instances for, and it had been more than 6 hours since the last one stopped, so that doesn't entirely make sense. I was just reading more about them, though, and found buried away in the docs that although the default max is 20, newer accounts might have a lower max than that. I think they might not know what "default" means.

@npalm
Member

npalm commented Jul 31, 2020

Might not be what you want, but there is an option to disable spot:

variable "market_options" {
The variable is just not exported via the root module yet.
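
The instances themselves are created by the scale-up lambda rather than by Terraform, but the choice that variable controls is the same one you would make on a launch template. A rough illustration (not the module's actual code):

resource "aws_launch_template" "runner" {
  name_prefix   = "github-runner-"
  image_id      = "ami-xxxxxxxx" # placeholder AMI
  instance_type = "m5.4xlarge"

  # Request spot capacity. Dropping this block gives plain on-demand instances:
  # no spot capacity/limit errors or interruptions, but at on-demand pricing.
  instance_market_options {
    market_type = "spot"
  }
}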

@AlexMcConnell
Author

AlexMcConnell commented Aug 3, 2020

It looks like the problem was indeed related to an absurdly low default spot instance cap. They upped it to 10, and everything works fine now. Closing.

@npalm
Member

npalm commented Aug 3, 2020

Thx for the feedback

@michaelstepner

I'm hitting the same error on a build that takes a very long time to complete:

The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

So far it's happened to me once after 11h23m and a second time after 8h29m. The spot instances remain listed under Settings > Actions > Self-hosted runners as permanently offline, so I'm guessing the spot instances got killed.

I couldn't tell from the discussion above how @AlexMcConnell resolved this issue on his end, or whether I'm encountering the same issue or a similar one with a different underlying cause.

@npalm do you have any advice on how I can solve this, or how to go about debugging it?

@michaelstepner

Sorry for jumping the gun in posting; I think I've figured out what's going on. Just spot instances being spot instances 😄

In my EC2 console I can see Status: instance-terminated-no-capacity

AWS docs on Spot Instance interruptions explain:

Capacity – If there are not enough unused EC2 instances to meet the demand for On-Demand Instances, Amazon EC2 interrupts Spot Instances. The order in which the instances are interrupted is determined by Amazon EC2.

Looks like AWS's native EC2 Auto Scaling works around this by using the "capacity-optimized" allocation strategy when launching Spot Instances.

I don't see any way to optimize for capacity with the current feature set of this codebase. @npalm does my conclusion sound right to you?
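
For reference, the AWS-native behaviour I mean is an Auto Scaling group setting, roughly like this in Terraform (a sketch of the AWS feature, not something this module does today):

resource "aws_autoscaling_group" "runners" {
  name                = "github-runners"
  min_size            = 0
  max_size            = 4
  vpc_zone_identifier = ["subnet-xxxxxxxx"] # placeholder subnet

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 0
      # Let EC2 pick the spot pools least likely to be interrupted.
      spot_allocation_strategy = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.runner.id # any runner launch template
      }
      # Several interchangeable instance types give EC2 more pools to choose from.
      override {
        instance_type = "m5.4xlarge"
      }
      override {
        instance_type = "m5a.4xlarge"
      }
    }
  }
}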

@npalm
Member

npalm commented Mar 9, 2021

@michaelstepner your conclusion sounds right. I created an issue to also support on-demand instances. @Kostiantyn-Vorobiov is working on PR #586 to support fallback instances.
