-
Notifications
You must be signed in to change notification settings - Fork 271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mark last instance unhealthy if BuildkiteTerminateInstanceAfterJob
is enabled
#1245
Conversation
There's a situation that can occur when using the TerminateAfterJob mode of the agent, where when the ASG gets down to its MinSize, after running a job, the agent will attempt to self-terminate using the `aws autoscaling terminate-instance-in-autoscaling-group` command, but becuase that action would cause the ASG to go below its minsize, the call fails. This fails the entire systemd `ExecPostStop` hook, which fails the entire unit, which causes systemd to restart the unit. This means that in certain circumstances (when the instance attempted to self-terminate but the call failed), single-job instances can be caused to retain state from previous jobs, which is the opposite of what we want for these instances. This PR changes the autoterminate script so that if the instance is in TerminateAfterJob mode, if the call to terminate fails, the instance will call `shutdown` on itself, causing the instance to be removed from the ASG no matter what EC2 has to say about it
Then it will appear in the job logs, so customers will know it is set by the elastic stack.
The latter affects every subsequently started service (at least). The former only affects the service that needs the env variable.
…estart it if needed
aws autoscaling set-instance-health \ | ||
--region "$1" \ | ||
--instance-id "$2" \ | ||
--health-status Unhealthy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ps1 includes --no-should-respect-grace-period
- I think this should have it too, if the goal is to emulate shutdown
(and where someone has configured a grace period).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like I copy-pasted that from your PR 😅.
The grace period is set by the HealthCheckGracePeriod
field of an AWS::AutoScaling::AutoScalingGroup
, but that is not set here:
elastic-ci-stack-for-aws/templates/aws-stack.yml
Lines 1253 to 1304 in 5b02d95
AgentAutoScaleGroup: | |
Type: AWS::AutoScaling::AutoScalingGroup | |
DependsOn: | |
- IAMPolicies | |
- VpcComplete | |
Properties: | |
VPCZoneIdentifier: !If [ "CreateVpcResources", [ !Ref Subnet0, !Ref Subnet1 ], !Ref Subnets ] | |
MixedInstancesPolicy: | |
InstancesDistribution: | |
OnDemandPercentageAboveBaseCapacity: !Ref OnDemandPercentage | |
SpotAllocationStrategy: !Ref SpotAllocationStrategy | |
LaunchTemplate: | |
LaunchTemplateSpecification: | |
LaunchTemplateId: !Ref AgentLaunchTemplate | |
Version: !GetAtt "AgentLaunchTemplate.LatestVersionNumber" | |
Overrides: | |
- InstanceType: !Select [ "0", !Split [ ",", !Join [ ",", [ !Ref InstanceTypes, "", "", "" ] ] ] ] | |
- !If | |
- UseInstanceType2 | |
- InstanceType: !Select [ "1", !Split [ ",", !Join [ ",", [ !Ref InstanceTypes, "", "", "" ] ] ] ] | |
- !Ref "AWS::NoValue" | |
- !If | |
- UseInstanceType3 | |
- InstanceType: !Select [ "2", !Split [ ",", !Join [ ",", [ !Ref InstanceTypes, "", "", "" ] ] ] ] | |
- !Ref "AWS::NoValue" | |
- !If | |
- UseInstanceType4 | |
- InstanceType: !Select [ "3", !Split [ ",", !Join [ ",", [ !Ref InstanceTypes, "", "", "" ] ] ] ] | |
- !Ref "AWS::NoValue" | |
MinSize: !Ref MinSize | |
MaxSize: !Ref MaxSize | |
Cooldown: 60 | |
MetricsCollection: | |
- Granularity: 1Minute | |
Metrics: | |
- GroupMinSize | |
- GroupMaxSize | |
- GroupInServiceInstances | |
- GroupTerminatingInstances | |
- GroupPendingInstances | |
- GroupDesiredCapacity | |
TerminationPolicies: | |
- OldestLaunchConfiguration | |
- ClosestToNextInstanceHour | |
NewInstancesProtectedFromScaleIn: true | |
CreationPolicy: | |
ResourceSignal: | |
Timeout: !If [ UseDefaultInstanceCreationTimeout, !If [ UseWindowsAgents, PT10M, PT5M ], !Ref InstanceCreationTimeout ] | |
Count: !Ref MinSize | |
UpdatePolicy: | |
AutoScalingReplacingUpdate: | |
WillReplace: true |
So the default of 0s
is what all stacks have by default. Given this, I think if the user has gone out of their way to set a grace period, we should not interfere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm split on this. On the one hand, that's a fairly good argument for respecting the grace period. But on the other, why would they configure BuildkiteTerminateInstanceAfterJob
at all (if they also want a grace period, in which to... recover the agent?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDK, but if they explicitly did both, they may have thought of a reason we haven't.
This is a different implementation of the same goal as #1225, but we found when testing that PR that the instances would NOT terminate on linux because the buildkite-agent user does not have permission to shutdown the instance. Rather than granting it this permisison, in this PR we mark the instance as unhealthy if
BuildkiteTerminateInstanceAfterJob
is enabled.Closes: #1225