Mark last instance unhealthy if `BuildkiteTerminateInstanceAfterJob` is enabled #1245

triarius · 2023-10-22T02:20:29Z

This is a different implementation of the same goal as #1225. When testing that PR, we found that instances would NOT terminate on Linux because the buildkite-agent user does not have permission to shut down the instance. Rather than granting it this permission, this PR marks the instance as unhealthy if all of the following hold:

- The idle period has expired and the agent has shut down.
- The call to the AWS API to terminate the instance and decrement the desired capacity fails.
- `BuildkiteTerminateInstanceAfterJob` is enabled.
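
As a rough illustration of the three conditions above, the fallback could be sketched as follows. This is not the script shipped in the stack: the function name `terminate_or_mark_unhealthy` and the status strings it prints are made up for this example, and it assumes it runs only after the idle period has expired and the agent has stopped.

```shell
# Hypothetical helper, not the script shipped in the stack.
# Usage: terminate_or_mark_unhealthy REGION INSTANCE_ID
terminate_or_mark_unhealthy() {
  # Ask the ASG to terminate this instance and shrink the desired capacity.
  if aws autoscaling terminate-instance-in-auto-scaling-group \
      --region "$1" \
      --instance-id "$2" \
      --should-decrement-desired-capacity; then
    echo "terminated"
    return 0
  fi

  # The call above fails when it would take the ASG below MinSize.
  # Outside terminate-after-job mode, leave the instance alone.
  if [ "${BUILDKITE_TERMINATE_INSTANCE_AFTER_JOB:-false}" != "true" ]; then
    echo "left-running"
    return 0
  fi

  # Mark the instance unhealthy so the ASG replaces it with a fresh one.
  aws autoscaling set-instance-health \
    --region "$1" \
    --instance-id "$2" \
    --health-status Unhealthy \
    && echo "marked-unhealthy"
}
```

Because the replacement is driven by the ASG rather than by the instance itself, this path should not need shutdown permission for the buildkite-agent user.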

Closes: #1225

There's a situation that can occur when using the TerminateAfterJob mode of the agent: when the ASG is down to its MinSize, after running a job the agent will attempt to self-terminate using the `aws autoscaling terminate-instance-in-auto-scaling-group` command, but because that action would take the ASG below its MinSize, the call fails. This fails the entire systemd `ExecStopPost` hook, which fails the entire unit, which causes systemd to restart the unit. This means that in certain circumstances (when the instance attempted to self-terminate but the call failed), single-job instances can retain state from previous jobs, which is the opposite of what we want for these instances. This PR changes the autoterminate script so that, if the instance is in TerminateAfterJob mode and the call to terminate fails, the instance will call `shutdown` on itself, causing the instance to be removed from the ASG no matter what EC2 has to say about it.

DrJosh9000 · 2023-10-22T23:16:38Z

packer/linux/conf/buildkite-agent/scripts/terminate-instance

+  aws autoscaling set-instance-health \
+    --region "$1" \
+    --instance-id "$2" \
+    --health-status Unhealthy


The ps1 includes `--no-should-respect-grace-period`. I think this should have it too, if the goal is to emulate shutdown (and in case someone has configured a grace period).

triarius · 2023-10-23

Looks like I copy-pasted that from your PR 😅.

The grace period is set by the HealthCheckGracePeriod field of an AWS::AutoScaling::AutoScalingGroup, but that is not set here:

elastic-ci-stack-for-aws/templates/aws-stack.yml

Lines 1253 to 1304 in 5b02d95

  AgentAutoScaleGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    DependsOn:
      - IAMPolicies
      - VpcComplete
    Properties:
      VPCZoneIdentifier: !If [ "CreateVpcResources", [ !Ref Subnet0, !Ref Subnet1 ], !Ref Subnets ]
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandPercentageAboveBaseCapacity: !Ref OnDemandPercentage
          SpotAllocationStrategy: !Ref SpotAllocationStrategy
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref AgentLaunchTemplate
            Version: !GetAtt "AgentLaunchTemplate.LatestVersionNumber"
          Overrides:
            - InstanceType: !Select [ "0", !Split [ ",", !Join [ ",", [ !Ref InstanceTypes, "", "", "" ] ] ] ]
            - !If
              - UseInstanceType2
              - InstanceType: !Select [ "1", !Split [ ",", !Join [ ",", [ !Ref InstanceTypes, "", "", "" ] ] ] ]
              - !Ref "AWS::NoValue"
            - !If
              - UseInstanceType3
              - InstanceType: !Select [ "2", !Split [ ",", !Join [ ",", [ !Ref InstanceTypes, "", "", "" ] ] ] ]
              - !Ref "AWS::NoValue"
            - !If
              - UseInstanceType4
              - InstanceType: !Select [ "3", !Split [ ",", !Join [ ",", [ !Ref InstanceTypes, "", "", "" ] ] ] ]
              - !Ref "AWS::NoValue"
      MinSize: !Ref MinSize
      MaxSize: !Ref MaxSize
      Cooldown: 60
      MetricsCollection:
        - Granularity: 1Minute
          Metrics:
            - GroupMinSize
            - GroupMaxSize
            - GroupInServiceInstances
            - GroupTerminatingInstances
            - GroupPendingInstances
            - GroupDesiredCapacity
      TerminationPolicies:
        - OldestLaunchConfiguration
        - ClosestToNextInstanceHour
      NewInstancesProtectedFromScaleIn: true
    CreationPolicy:
      ResourceSignal:
        Timeout: !If [ UseDefaultInstanceCreationTimeout, !If [ UseWindowsAgents, PT10M, PT5M ], !Ref InstanceCreationTimeout ]
        Count: !Ref MinSize
    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: true

So all stacks get the default grace period of 0s. Given this, I think if the user has gone out of their way to set a grace period, we should not interfere.
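
For anyone who wants to check what a deployed stack's ASG actually has, a lookup along these lines may help. It is only a sketch: `asg_grace_period` is a made-up helper name, and the ASG name has to be supplied by the caller.

```shell
# Hypothetical helper. Usage: asg_grace_period REGION ASG_NAME
asg_grace_period() {
  # Prints the group's HealthCheckGracePeriod in seconds.
  aws autoscaling describe-auto-scaling-groups \
    --region "$1" \
    --auto-scaling-group-names "$2" \
    --query 'AutoScalingGroups[0].HealthCheckGracePeriod' \
    --output text
}
```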

DrJosh9000 · 2023-10-23

I'm split on this. On the one hand, that's a fairly good argument for respecting the grace period. But on the other, why would they configure BuildkiteTerminateInstanceAfterJob at all (if they also want a grace period, in which to... recover the agent?)

triarius · 2023-10-23

IDK, but if they explicitly did both, they may have thought of a reason we haven't.
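
The two options weighed in this thread differ by a single flag on `set-instance-health`. The sketch below shows both side by side; the `mark_unhealthy` name and its `respect`/`ignore` argument are invented for illustration and do not appear in the PR.

```shell
# Hypothetical helper. Usage: mark_unhealthy REGION INSTANCE_ID [respect|ignore]
mark_unhealthy() {
  # By default, set-instance-health respects the group's HealthCheckGracePeriod.
  grace_flag="--should-respect-grace-period"
  if [ "${3:-respect}" = "ignore" ]; then
    # Take effect immediately, even inside a configured grace period.
    grace_flag="--no-should-respect-grace-period"
  fi

  aws autoscaling set-instance-health \
    --region "$1" \
    --instance-id "$2" \
    --health-status Unhealthy \
    "$grace_flag"
}
```

With the default grace period of 0s the two forms should behave the same; they only diverge on stacks where someone has set `HealthCheckGracePeriod` explicitly.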

moskyb and others added 12 commits October 20, 2023 16:35

Fix missing equals

7eec130

FFFFFFFF

4e95386

Accursed plusses

11a5832

Only terminate on shutdown when BuildkiteTerminateInstanceAfterJob is true

916dfe1

Also set BUILDKITE_TERMINATE_INSTANCE_AFTER_JOB in cfn-env

cee477c

Then it will appear in the job logs, so customers will know it is set by the elastic stack.

Use systemd service overrides instead of systemctl set-env

f05546d

The latter affects every subsequently started service (at least). The former only affects the service that needs the env variable.

Ensure BUILDKITE_TERMINATE_INSTANCE_AFTER_JOB is shown in job logs on Windows

d9fe204

Mark instance unhealthy instead of shutting down

f9d7731

Mark as unhealthy on windows

d2927cf

Modify terminate-instance on windows to stop the agent earlier, but restart it if needed

c3ebaa5

Remove unused condition

5b02d95
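
Commit f05546d above scopes the environment variable to one service through a systemd override. A minimal sketch of writing such a drop-in is below; the helper name `write_env_override`, the file name `10-env.conf`, and the unit name in the usage line are assumptions for illustration, not taken from the PR.

```shell
# Hypothetical helper. Usage: write_env_override DROPIN_DIR NAME VALUE
write_env_override() {
  mkdir -p "$1" || return 1
  # A drop-in's [Service] section is merged into the unit it overrides,
  # so the variable is only visible to that one service.
  printf '[Service]\nEnvironment="%s=%s"\n' "$2" "$3" > "$1/10-env.conf"
}
```

On a host this might be called as `write_env_override /etc/systemd/system/buildkite-agent.service.d BUILDKITE_TERMINATE_INSTANCE_AFTER_JOB true`, followed by `systemctl daemon-reload` so systemd picks up the new file.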

triarius requested a review from a team October 22, 2023 02:23

DrJosh9000 approved these changes Oct 22, 2023

Respect grace period on windows

198db6e

triarius requested a review from DrJosh9000 October 23, 2023 01:17

triarius merged commit d80bf0d into main Oct 23, 2023
1 check passed

triarius deleted the pdp-1828-take-over-terminate-instance-pr branch October 23, 2023 02:57

triarius mentioned this pull request Oct 23, 2023

Update changelog for v6.9.0 #1248

Merged
