Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark last instance unhealthy if BuildkiteTerminateInstanceAfterJob is enabled #1245

Merged
merged 13 commits into from
Oct 23, 2023

Conversation

triarius
Copy link
Contributor

This is a different implementation of the same goal as #1225, but we found when testing that PR that the instances would NOT terminate on linux because the buildkite-agent user does not have permission to shutdown the instance. Rather than granting it this permisison, in this PR we mark the instance as unhealthy if

  1. The idle period has expired and the agent has shutdown.
  2. The call to the AWS API to terminate the instance and decrement the desired capacity fails.
  3. BuildkiteTerminateInstanceAfterJob is enabled.

Closes: #1225

moskyb and others added 12 commits October 20, 2023 16:35
There's a situation that can occur when using the TerminateAfterJob mode of the agent, where when the ASG gets down to its MinSize, after running a job, the agent will attempt to self-terminate using the `aws autoscaling terminate-instance-in-autoscaling-group` command, but becuase that action would cause the ASG to go below its minsize, the call fails. This fails the entire systemd `ExecPostStop` hook, which fails the entire unit, which causes systemd to restart the unit.

This means that in certain circumstances (when the instance attempted to self-terminate but the call failed), single-job instances can be caused to retain state from previous jobs, which is the opposite of what we want for these instances.

This PR changes the autoterminate script so that if the instance is in TerminateAfterJob mode, if the call to terminate fails, the instance will call `shutdown` on itself, causing the instance to be removed from the ASG no matter what EC2 has to say about it
Then it will appear in the job logs, so customers will know it is set by
the elastic stack.
The latter affects every subsequently started service (at least). The
former only affects the service that needs the env variable.
@triarius triarius requested a review from a team October 22, 2023 02:23
aws autoscaling set-instance-health \
--region "$1" \
--instance-id "$2" \
--health-status Unhealthy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ps1 includes --no-should-respect-grace-period - I think this should have it too, if the goal is to emulate shutdown (and where someone has configured a grace period).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I copy-pasted that from your PR 😅.

The grace period is set by the HealthCheckGracePeriod field of an AWS::AutoScaling::AutoScalingGroup, but that is not set here:

AgentAutoScaleGroup:
Type: AWS::AutoScaling::AutoScalingGroup
DependsOn:
- IAMPolicies
- VpcComplete
Properties:
VPCZoneIdentifier: !If [ "CreateVpcResources", [ !Ref Subnet0, !Ref Subnet1 ], !Ref Subnets ]
MixedInstancesPolicy:
InstancesDistribution:
OnDemandPercentageAboveBaseCapacity: !Ref OnDemandPercentage
SpotAllocationStrategy: !Ref SpotAllocationStrategy
LaunchTemplate:
LaunchTemplateSpecification:
LaunchTemplateId: !Ref AgentLaunchTemplate
Version: !GetAtt "AgentLaunchTemplate.LatestVersionNumber"
Overrides:
- InstanceType: !Select [ "0", !Split [ ",", !Join [ ",", [ !Ref InstanceTypes, "", "", "" ] ] ] ]
- !If
- UseInstanceType2
- InstanceType: !Select [ "1", !Split [ ",", !Join [ ",", [ !Ref InstanceTypes, "", "", "" ] ] ] ]
- !Ref "AWS::NoValue"
- !If
- UseInstanceType3
- InstanceType: !Select [ "2", !Split [ ",", !Join [ ",", [ !Ref InstanceTypes, "", "", "" ] ] ] ]
- !Ref "AWS::NoValue"
- !If
- UseInstanceType4
- InstanceType: !Select [ "3", !Split [ ",", !Join [ ",", [ !Ref InstanceTypes, "", "", "" ] ] ] ]
- !Ref "AWS::NoValue"
MinSize: !Ref MinSize
MaxSize: !Ref MaxSize
Cooldown: 60
MetricsCollection:
- Granularity: 1Minute
Metrics:
- GroupMinSize
- GroupMaxSize
- GroupInServiceInstances
- GroupTerminatingInstances
- GroupPendingInstances
- GroupDesiredCapacity
TerminationPolicies:
- OldestLaunchConfiguration
- ClosestToNextInstanceHour
NewInstancesProtectedFromScaleIn: true
CreationPolicy:
ResourceSignal:
Timeout: !If [ UseDefaultInstanceCreationTimeout, !If [ UseWindowsAgents, PT10M, PT5M ], !Ref InstanceCreationTimeout ]
Count: !Ref MinSize
UpdatePolicy:
AutoScalingReplacingUpdate:
WillReplace: true

So the default of 0s is what all stacks have by default. Given this, I think if the user has gone out of their way to set a grace period, we should not interfere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm split on this. On the one hand, that's a fairly good argument for respecting the grace period. But on the other, why would they configure BuildkiteTerminateInstanceAfterJob at all (if they also want a grace period, in which to... recover the agent?)

Copy link
Contributor Author

@triarius triarius Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK, but if they explicitly did both, they may have thought of a reason we haven't.

@triarius triarius merged commit d80bf0d into main Oct 23, 2023
1 check passed
@triarius triarius deleted the pdp-1828-take-over-terminate-instance-pr branch October 23, 2023 02:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants