`autoscaling` deploy: re-enable ASG scaling before final stabilisation check #1345
Conversation
Force-pushed from 6c8f8f7 to 7550f3b.
@jacobwinch: I think it's probably safe to allow scaling actions to run again as soon as we've asked AWS to terminate the instances running the old build. At this point there is no risk of AWS trying to scale down the 'wrong' instances (i.e. the ones running the new build), and we know it is safe to launch more instances running the newest build (as at least one instance running this build has already passed the health check). With that in mind, the only downside of the suggested approach is that while we're blocking and waiting for the instances to terminate, we could probably be allowing the app to start scaling again if we were to split the 'waiting work' into a separate task. We're probably talking about a pretty small time window here, though, so if it's considerably easier to implement the version you've suggested then I think that would still be a big improvement!
Force-pushed from 7550f3b to 4900a69.
You've convinced me @jacobwinch, so I've added some WIP to this PR to implement a […]. Not quite tidied up yet, but you probably get the idea. Some code from […]
Force-pushed from feaf4f0 to 3560ae2.
Force-pushed from 20fa5f2 to fc4a264.
This aims to address, to some extent, issue #1342 - the problem that *apps can not auto-scale* until an autoscaling deploy has successfully completed. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker.

Ever since #83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy (`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an additional `WaitForStabilization` was added as a penultimate deploy step, with the aim of ensuring that the cull of old instances has _completed_ before the deploy ends. However, the `WaitForStabilization` step was added _before_ `ResumeAlarmNotifications`, rather than _after_ - and if the ASG instances are already overloaded and recycling, the ASG will _never_ stabilise, because it _needs to scale up_ to handle the load it's experiencing.

In this change, we introduce a new task, `WaitForCullToComplete`, that can establish whether the cull has completed or not, regardless of whether the ASG is scaling - it simply checks that there are no remaining instances tagged for termination. Consequently, once we've executed `CullInstancesWithTerminationTag` to _request_ that old instances terminate, we can immediately allow scaling with `ResumeAlarmNotifications`, and run `WaitForCullToComplete` _afterwards_. With this change in place, the Ophan outage would have been shortened from 1 hour to ~2 minutes - a much better outcome!

Common code between `CullInstancesWithTerminationTag` and `WaitForCullToComplete` has been factored out into a new `CullSummary` class.
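To make the "no remaining instances tagged for termination" check concrete, here's a minimal sketch of the `CullSummary` idea, assuming the AWS SDK v2 clients; the tag key `riff-raff:terminate` is invented here for illustration, and the real class shape and key live in magenta-lib:

```scala
import scala.jdk.CollectionConverters._
import software.amazon.awssdk.services.autoscaling.model.AutoScalingGroup
import software.amazon.awssdk.services.ec2.Ec2Client
import software.amazon.awssdk.services.ec2.model.DescribeInstancesRequest

// Sketch only: finds the ASG's instances that still carry the termination tag.
// Both CullInstancesWithTerminationTag (to decide who to cull) and
// WaitForCullToComplete (to see whether anyone is left) can share this logic.
case class CullSummary(ec2: Ec2Client, asg: AutoScalingGroup) {
  val TerminationTagKey = "riff-raff:terminate" // hypothetical key

  def instancesTaggedForTermination: Seq[String] = {
    val instanceIds = asg.instances().asScala.map(_.instanceId()).toSeq
    if (instanceIds.isEmpty) Seq.empty
    else
      ec2
        .describeInstances(DescribeInstancesRequest.builder().instanceIds(instanceIds: _*).build())
        .reservations().asScala.toSeq
        .flatMap(_.instances().asScala)
        .filter(_.tags().asScala.exists(_.key() == TerminationTagKey))
        .map(_.instanceId())
  }

  // The cull is complete when no tagged instances remain in the ASG.
  def cullComplete: Boolean = instancesTaggedForTermination.isEmpty
}
```

Because the check only looks at tagged instances, it gives the same answer whether or not scaling actions are adding or removing other instances at the same time.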
Force-pushed from fc4a264 to 0f75e11.
```scala
WaitForStabilization(
  autoScalingGroup,
  secondsToWait(pkg, target, reporter),
  target.region
)
```
We're retaining this here because it does offer proof that the new boxes can handle our traffic without the old boxes. In normal deploys it will complete almost instantly; if auto-scaling alarms change the size of the ASG, it could take a little longer to settle.
Jacob Winch pointed out that `WaitForCullToComplete` is a repeating check, and so needs to fetch up-to-date `AutoScalingGroupInfo` in order to know which instances currently exist and what state they're in - I was missing a call to `ASG.refresh(asg)`. This is easy to miss in tasks like `WaitForCullToComplete` that extend `magenta.tasks.PollingCheck`: the `magenta.tasks.ASGTask.execute(asg: AutoScalingGroup, ...)` method puts an `AutoScalingGroup` into scope via its method parameter, and that value is inevitably out-of-date by the time a later poll runs... could be something to refactor in a later PR! We also decided that, as polling checks inevitably involve network calls, it makes sense to put an exception-catching guard in the `magenta.tasks.PollingCheck.check()` method.
```scala
val checkResult =
  Try(theCheck).recover { case NonFatal(e) =>
    reporter.info(e.getMessage)
    false
  }.get // '.get' allows a fatal exception, like an OutOfMemoryError, to kill the process, which is what we want!
```
@jacobwinch & I decided that, as polling checks invariably involve network calls, it makes sense to put an exception-catching guard around the invocation of `theCheck` - if a non-fatal exception is received, we probably don't want to kill the deploy, just attempt the check again after the next `sleepyTime`.
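Returning to the `ASG.refresh` point above: here's a hedged sketch of the refresh-per-poll fix using the AWS SDK v2 - the signature and error handling of riff-raff's real `ASG.refresh` may well differ:

```scala
import software.amazon.awssdk.services.autoscaling.AutoScalingClient
import software.amazon.awssdk.services.autoscaling.model.{AutoScalingGroup, DescribeAutoScalingGroupsRequest}

object ASG {
  // Re-describe the ASG so a repeating check sees current instance state,
  // rather than the snapshot captured when the task started executing.
  // Assumes the ASG still exists; a deleted group would make get(0) throw.
  def refresh(asg: AutoScalingGroup, client: AutoScalingClient): AutoScalingGroup =
    client
      .describeAutoScalingGroups(
        DescribeAutoScalingGroupsRequest.builder()
          .autoScalingGroupNames(asg.autoScalingGroupName())
          .build()
      )
      .autoScalingGroups()
      .get(0)
}
```

A repeating check would call this at the top of every poll, so `WaitForCullToComplete` always reasons about the instances that exist now rather than those that existed when the task began.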
This looks good to me 👍

I ran one additional test to confirm that automatically scaling up while Riff-Raff is running `WaitForCullToComplete` works successfully.
Test Setup

- Redeployed this branch to Riff-Raff CODE
- Started sending some artificial traffic to a test service, with the aim of forcing the scale up action to coincide with the Riff-Raff deployment
- Started a deployment of the test service via Riff-Raff CODE
Observations

- 12:16:56 - Deployment starts: `[12:16:56] deploy for devx::cdk-playground (build 801) to stage PROD`
- 12:17:05 - Scaling actions are disabled by Riff-Raff: `[12:17:05] task SuspendAlarmNotifications Suspending Alarm Notifications - playground-PROD-cdk-playground-AutoScalingGroupCdkplaygroundASGD6E49F0F-9KOUOLFGBKZB will no longer scale on any configured alarms`
- 12:18:41[^1] - Alarm used for automatically scaling up enters alarm state. Automatic scale up does not happen yet because it is still blocked by Riff-Raff.
- 12:19:39 - Scaling actions are re-enabled by Riff-Raff: `[12:19:39] task ResumeAlarmNotifications Resuming Alarm Notifications - playground-PROD-cdk-playground-AutoScalingGroupCdkplaygroundASGD6E49F0F-9KOUOLFGBKZB will scale on any configured alarms`
- 12:19:40 - New `WaitForCullToComplete` task starts running: `[12:19:40] task WaitForCullToComplete Check that all instances tagged for termination in playground-PROD-cdk-playground-AutoScalingGroupCdkplaygroundASGD6E49F0F-9KOUOLFGBKZB have been terminated`
- 12:19:45 - The ASG scales up automatically
- 12:21:41 - New `WaitForCullToComplete` task (and overall deployment) completes

Footnotes

[^1]: The screenshot here shows a UTC timestamp; I've converted to BST to make it easier to follow the timeline.
Magnificent! Great test, and gratifying to see it work correctly! I shall now merge 👍
This aims to address, to an extent, issue #1342 - that in an `autoscaling` deploy, apps cannot auto-scale until the original number of instances in the ASG is capable of successfully handling the level of traffic, as checked by `WaitForStabilization`. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker, when a large increase in pageview traffic during a deploy meant the original number of instances could never have handled the load.

Disabling auto-scaling

Ever since #83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy (`ResumeAlarmNotifications`) once deployment has successfully completed.
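For reference, suspending alarm-driven scaling corresponds to the `AlarmNotification` scaling process in the AWS Auto Scaling API. Here's a hedged sketch of the suspend/resume pair with the AWS SDK v2 - riff-raff's actual task implementations may differ:

```scala
import software.amazon.awssdk.services.autoscaling.AutoScalingClient
import software.amazon.awssdk.services.autoscaling.model.{ResumeProcessesRequest, SuspendProcessesRequest}

object ScalingAlarms {
  // Stop CloudWatch alarms from triggering scaling actions on the given ASG...
  def suspendAlarmNotifications(client: AutoScalingClient, asgName: String): Unit =
    client.suspendProcesses(
      SuspendProcessesRequest.builder()
        .autoScalingGroupName(asgName)
        .scalingProcesses("AlarmNotification")
        .build()
    )

  // ...and allow them again once it's safe for the ASG to scale.
  def resumeAlarmNotifications(client: AutoScalingClient, asgName: String): Unit =
    client.resumeProcesses(
      ResumeProcessesRequest.builder()
        .autoScalingGroupName(asgName)
        .scalingProcesses("AlarmNotification")
        .build()
    )
}
```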
In December 2016, with #403, an additional `WaitForStabilization` was added as a penultimate deploy step, which served two purposes. However, the `WaitForStabilization` step was placed before `ResumeAlarmNotifications` - so the original number of EC2 instances in the ASG must be capable of supporting the latest level of traffic, otherwise `WaitForStabilization` and the entire deploy will fail, leaving auto-scaling indefinitely disabled - potentially waiting hours for human intervention to fix the problem.

The old-instance cull must complete
By putting `ResumeAlarmNotifications` before `WaitForStabilization`, the Ophan outage would have been shortened from 1 hour to ~2 minutes, but as @jacobwinch points out, making only that change could lead to a race condition:

- once `ResumeAlarmNotifications` executes, a 'low CPU' alarm could immediately decide to reduce the desired size of the ASG, randomly removing new instances
- `WaitForStabilization` could then see that the ASG is at its desired size, even if it happens that the termination of the old servers has not yet completed

Consequently, with this change we are also introducing a new task, `WaitForCullToComplete`, that checks specifically that EC2 instances tagged for termination have been removed from the ASG - a more specific check than the old `WaitForStabilization` assumption that the ASG being at desired capacity definitely means all the old instances are gone.
The final sequence of tasks in an `autoscaling` deploy now looks like this:

riff-raff/magenta-lib/src/main/scala/magenta/deployment_type/AutoScaling.scala, lines 199 to 215 in 0fb840e
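The embedded permalink doesn't render here, but based on the discussion above, the tail of the task list is ordered roughly like this - a hedged sketch with stand-in types, not the real magenta-lib definitions:

```scala
// Stand-in types, for illustration only; the real tasks take constructor
// arguments (ASG info, region, reporter, etc.) that are elided here.
sealed trait DeployTask
case object CullInstancesWithTerminationTag extends DeployTask // ask AWS to terminate old-build instances
case object ResumeAlarmNotifications extends DeployTask        // re-enable scaling immediately afterwards
case object WaitForCullToComplete extends DeployTask           // poll until no tagged instances remain
case object WaitForStabilization extends DeployTask            // final settle check, now free to scale up

object AutoScalingDeploy {
  val tailOfTaskList: List[DeployTask] = List(
    CullInstancesWithTerminationTag,
    ResumeAlarmNotifications,
    WaitForCullToComplete,
    WaitForStabilization
  )
}
```

The key property is that `ResumeAlarmNotifications` now runs as soon as the cull has been *requested*, so the ASG is free to scale while the remaining steps wait for the old instances to disappear.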
Common logic for finding EC2 instances tagged for termination, used by both `CullInstancesWithTerminationTag` & `WaitForCullToComplete`, has been factored out into the new `CullSummary` class.

Testing
I've deployed this branch to Riff Raff CODE as build 3385, and then used that to deploy the Ophan Dashboard to CODE, successfully! You can see that the new `WaitForCullToComplete` step took place after `ResumeAlarmNotifications`, and the `WaitForStabilization` step completed immediately after that.

cc @guardian/ophan