
Jobs getting dropped #84

Closed
AlexMcConnell opened this issue Jul 23, 2020 · 12 comments

@AlexMcConnell

Summary

I'm not sure if this is one problem with the runners, or two problems with one of them being GitHub. Or maybe it's all GitHub's fault and it's not communicating properly with the webhooks. IDK.

In the past week, I have regularly been seeing jobs getting cancelled or just not happening. The first thing I'm seeing: it looks like there might be some sort of timing issue between the termination order being given to the spot instance and that instance picking up another job. My jobs are getting "The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled" when no one has done anything.

[Screenshot: Screen Shot 2020-07-23 at 12 46 39 PM]

The other thing I'm seeing is jobs that just never run and yet the workflow fails.

[Screenshot: Screen Shot 2020-07-23 at 1 16 37 PM]

Steps to reproduce

I don't know. It happens all the time with my pipeline, across all of my workflows and jobs. All of my jobs run bash scripts, which in turn run Docker containers for everything. I do have a few differences from the default settings: the instance type is set to m5.4xlarge, and I have a post_install script that provides ECR access:

# Configure Docker to authenticate to ECR via the amazon-ecr-credential-helper
mkdir -p /home/ec2-user/.docker
cat > /home/ec2-user/.docker/config.json <<'EOF'
{
  "credsStore": "ecr-login"
}
EOF
amazon-linux-extras enable docker
yum install -y amazon-ecr-credential-helper
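
For context, the non-default parts of the module configuration roughly look like this (a sketch only; the exact variable names, e.g. for the post-install hook, depend on the module version you use):

module "github_runner" {
  source = "github.com/philips-labs/terraform-aws-github-runner"

  # ... webhook, GitHub app, and networking settings left at their defaults ...

  # The two deviations from the defaults mentioned above (illustrative names):
  instance_type         = "m5.4xlarge"
  userdata_post_install = file("${path.module}/post-install.sh") # the ECR login script above
}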

I just thought to try updating the lambda zips, since my setup is based straight on the GitHub repo and I haven't done that since the last time I ran terraform init. So I'll give that a shot.

@npalm
Member

npalm commented Jul 24, 2020

Thanks for taking the time to share your issues. We also run more or less the default config, for roughly 50 repos at the org level. We have set the time until a runner instance gets killed to 20 minutes. The last week we had several issues, but they were related to changes at GitHub.

We terminate the runner and remove the AWS instance if an instance is not used after a minimum time in use, for us 20 minutes. There is no very clean way of cleaning up runners, so we simply try to de-register the runner from GitHub; when that succeeds, we remove the instance. De-registering the runner only succeeds if the runner is not active. You will see a log line in the scale-down lambda like DEBUG Runner 'i-1234567890abcd' cannot be de-registered, most likely the runner is active. I just checked this function, and it works as expected. So most likely it is an issue on the side of GitHub.

@AlexMcConnell
Author

Is that the minimum_running_time_in_minutes set to 20? Mine was at 5. I tried upping it to 20 just in case, but I'm still regularly having runners cancelled in the middle of a job. It's happening more than half the time with my full-length workflows.
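
In other words, what I changed is (a sketch of just the relevant line):

module "github_runner" {
  # ...
  # Runners are not considered for scale-down until they have been up at least this long.
  minimum_running_time_in_minutes = 20
}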

I also have 70ish "offline" runners in the self-hosted runners list.

@AlexMcConnell
Author

AlexMcConnell commented Jul 29, 2020

Oh, I just had a nice, isolated example, since it was a cron run, so maybe I can provide more detailed context.

Workflow started at midnight with two jobs, the second one needing the first:

00:08:32.207-04:00 - Webhook receives a check_run
00:09:13.177-04:00 - Scale up attempts to start an instance and receives "ERROR InsufficientInstanceCapacity"
00:10:10.156-04:00 - Scale up reports that it has created an instance
00:11:08.897-04:00 - Scale up attempts to start an instance and receives "ERROR InsufficientInstanceCapacity"
00:12:10.553-04:00 - Scale up attempts to start an instance and receives "ERROR InsufficientInstanceCapacity"
00:13:09.479-04:00 - Scale up reports that it has created an instance????
00:34:21.360-04:00 - Webhook receives a check_run
00:34:22.396-04:00 - Webhook receives a check_run
00:35:02.212-04:00 - Scale up shows 2 runners running, tries to start an instance, and receives "ERROR MaxSpotInstanceCountExceeded"
00:35:25.855-04:00 - Scale down - Runner 'i-00182d9960d370e91' cannot be de-registered
00:35:27.895-04:00 - Scale down - AWS runner instance 'i-0aa1baba356fea399' is terminated and GitHub runner 'i-0aa1baba356fea399' is de-registered.
00:35:29.908-04:00 - Scale down - Runner 'i-00182d9960d370e91' cannot be de-registered
00:35:57.311-04:00 - Scale up shows 1 runner running, tries to start an instance, and receives "ERROR MaxSpotInstanceCountExceeded"
00:35:57.311-04:00 - Scale up shows 1 runner running, tries to start an instance, and receives "ERROR MaxSpotInstanceCountExceeded"
00:37:33.738-04:00 - Webhook receives a check_run
00:37:56.032-04:00 - Scale up shows 0 queued workflows and stops

Job 1 ran for 21m 42s
Job 2 ran for 3m 2s before being cancelled due to receiving a shutdown signal.

My minimum_running_time_in_minutes is set to 20.

Might there be an issue with not realizing that jobs that run longer than the minimum_running_time_in_minutes are actually still running, since no check_runs have been sent in that time? I'm going to try bumping it up to 40 and see what happens, though having to keep them up for 40 minutes for those workflows that only take 5 might be problematic.

I have no idea why it would spin up two instances for one workflow that has two jobs, one dependent on the other, though.

@AlexMcConnell
Author

Tried 2 workflows with minimum_running_time_in_minutes at 40. Both were cancelled.

@npalm
Member

npalm commented Jul 30, 2020

Your logs mention you are running out of spot instances, so that is at least partly an issue with your AWS setup. You can request an increase of the max amount via the AWS console.
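
If you prefer to keep that limit in code instead of clicking through the console, the AWS provider also has a Service Quotas resource, roughly like this (a sketch; double-check the quota code for your account before applying):

# Sketch: request a higher spot instance quota from Terraform instead of the console.
resource "aws_servicequotas_service_quota" "spot_instance_requests" {
  service_code = "ec2"
  # "All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests";
  # verify this code in the Service Quotas console for your account.
  quota_code   = "L-34B43A08"
  value        = 64 # desired limit
}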

@AlexMcConnell
Author

Yeah, that's what it looks like, but this is the only thing we use spot instances for, and it had been more than 6 hours since the last one stopped, so that doesn't entirely make sense. I was just reading more about them, though, and found buried away in the docs that although the default max is 20, newer accounts might have a lower max than that. I think they might not know what "default" means.

@npalm
Member

npalm commented Jul 31, 2020

Might not be what you want, but there is an option to disable spot:

variable "market_options" {
The variable is just not exported via the root module yet.
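
The instances themselves are created by the scale-up lambda rather than by Terraform, but the choice that variable controls is the same one you would make on a launch template. A rough illustration (not the module's actual code):

resource "aws_launch_template" "runner" {
  name_prefix   = "github-runner-"
  image_id      = "ami-xxxxxxxx" # placeholder AMI
  instance_type = "m5.4xlarge"

  # Request spot capacity. Dropping this block gives plain on-demand instances:
  # no spot capacity/limit errors or interruptions, but at on-demand pricing.
  instance_market_options {
    market_type = "spot"
  }
}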

@AlexMcConnell
Author

AlexMcConnell commented Aug 3, 2020

It looks like the problem was indeed related to an absurdly low default spot instance cap. They upped it to 10, and everything works fine now. Closing.

@npalm
Member

npalm commented Aug 3, 2020

Thx for the feedback

@michaelstepner

I'm hitting the same error on a build that takes a very long time to complete:

The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

So far it's happened to me once after 11h23m and a second time after 8h29m. The spot instances remain listed under Settings > Actions > Self-hosted runners as permanently offline, so I'm guessing the spot instances got killed.

I couldn't tell from the discussion above how @AlexMcConnell resolved this issue on his end, or whether I'm encountering the same issue or a similar one with a different underlying cause.

@npalm do you have any advice on how I can solve this, or how to go about debugging it?

@michaelstepner

Sorry for jumping the gun in posting; I think I've figured out what's going on. Just spot instances being spot instances 😄

In my EC2 console I can see Status: instance-terminated-no-capacity

AWS docs on Spot Instance interruptions explain:

Capacity – If there are not enough unused EC2 instances to meet the demand for On-Demand Instances, Amazon EC2 interrupts Spot Instances. The order in which the instances are interrupted is determined by Amazon EC2.

Looks like AWS's native EC2 Auto Scaling works around this by using the "capacity-optimized" allocation strategy when launching Spot Instances.

I don't see any way to optimize for capacity with the current feature set of this codebase. @npalm does my conclusion sound right to you?
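
For reference, the AWS-native behaviour I mean is an Auto Scaling group setting, roughly like this in Terraform (a sketch of the AWS feature, not something this module does today):

resource "aws_autoscaling_group" "runners" {
  name                = "github-runners"
  min_size            = 0
  max_size            = 4
  vpc_zone_identifier = ["subnet-xxxxxxxx"] # placeholder subnet

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 0
      # Let EC2 pick the spot pools least likely to be interrupted.
      spot_allocation_strategy = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.runner.id # any runner launch template
      }
      # Several interchangeable instance types give EC2 more pools to choose from.
      override {
        instance_type = "m5.4xlarge"
      }
      override {
        instance_type = "m5a.4xlarge"
      }
    }
  }
}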

@npalm
Member

npalm commented Mar 9, 2021

@michaelstepner your conclusion sounds right. I created an issue to also support on-demand instances. @Kostiantyn-Vorobiov is working on PR #586 to support fallback instances.
