
autoscaling deploy: re-enable ASG scaling before final stabilisation check #1345

Merged (2 commits) on Jun 5, 2024

Commits on Jun 4, 2024

  1. Re-enable ASG scaling before final stabilisation

    This aims to address, to some extent, issue #1342 -
    the problem that *apps cannot auto-scale* until an autoscaling deploy has
    successfully completed. On 22nd May 2024, this inability to auto-scale led
    to a severe outage in the Ophan Tracker.
    
    Ever since #83 in
    April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
    (`SuspendAlarmNotifications`), and only re-enabled them (`ResumeAlarmNotifications`)
    at the end of the deploy, once deployment has successfully completed.
    
    In December 2016, with #403, an
    additional `WaitForStabilization` was added as a penultimate deploy step,
    with the aim of ensuring that the cull of old instances has _completed_
    before the deploy ends. However, the `WaitForStabilization` step was added _before_
    `ResumeAlarmNotifications`, rather than _after_, and if the ASG instances are
    already overloaded and recycling, the ASG will _never_ stabilise, because it _needs
    to scale up_ to handle the load it's experiencing.
    
    In this change, we introduce a new task, `WaitForCullToComplete`, that can establish
    whether the cull has completed or not, regardless of whether the ASG is scaling -
    it simply checks that there are no remaining instances tagged for termination.
    Consequently, once we've executed `CullInstancesWithTerminationTag` to _request_ old
    instances terminate, we can immediately allow scaling with `ResumeAlarmNotifications`,
    and then `WaitForCullToComplete` _afterwards_.
    
    With this change in place, the Ophan outage would have been shortened from
    1 hour to ~2 minutes, a much better outcome!
    
    Common code between `CullInstancesWithTerminationTag` and `WaitForCullToComplete` has
    been factored out into a new `CullSummary` class.
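
    The "cull is complete" check described above can be sketched roughly as follows
    (a minimal Scala sketch: the `Instance` model, the tag name, and the method names
    on `CullSummary` are illustrative assumptions, not the actual Riff Raff code):

    ```scala
    // Hypothetical simplified model -- names here are assumptions, not Riff Raff's API.
    case class Instance(id: String, tags: Map[String, String])

    class CullSummary(instances: Seq[Instance]) {
      // Instances previously tagged by CullInstancesWithTerminationTag
      def instancesTaggedForTermination: Seq[Instance] =
        instances.filter(_.tags.contains("Magenta-Terminate"))
    }

    // The cull is complete when no instance still carries the termination tag --
    // regardless of whether the ASG is simultaneously scaling up or down.
    def cullComplete(instances: Seq[Instance]): Boolean =
      new CullSummary(instances).instancesTaggedForTermination.isEmpty
    ```

    Because the check only looks at tagged instances, the deploy can run
    `CullInstancesWithTerminationTag`, then `ResumeAlarmNotifications`, and only
    then wait on this condition - scaling activity no longer blocks stabilisation.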
    rtyley committed Jun 4, 2024
    SHA: 0f75e11
  2. Give WaitForCullToComplete up-to-date ASG info!

    Jacob Winch pointed out that `WaitForCullToComplete` is a repeating check, and
    so needs to get up-to-date `AutoScalingGroupInfo` in order for it to know what
    instances currently exist and what their state is! I was missing a call to
    `ASG.refresh(asg)`.
    
    This is easy to miss in tasks like `WaitForCullToComplete` that extend
    `magenta.tasks.PollingCheck`: the
    `magenta.tasks.ASGTask.execute(asg: AutoScalingGroup, ...)` method
    puts an `AutoScalingGroup` into scope via its method parameter, and that snapshot
    is inevitably out-of-date... could be something to refactor in a later PR!
    
    We also decided that as polling checks inevitably involve network calls, it makes
    sense to put an exception-catching guard in the `magenta.tasks.PollingCheck.check()`
    method.
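
    The guard might look something like this (an illustrative Scala sketch - the
    trait shape, `reporter` parameter, and message text are assumptions, not the
    actual `magenta.tasks.PollingCheck` signature):

    ```scala
    import scala.util.{Failure, Success, Try}

    trait PollingCheck {
      // One iteration of a repeating check. `theCheck` typically makes network
      // calls (e.g. AWS API requests) that can throw transiently, so treat an
      // exception as "not yet successful" and retry, rather than letting it
      // abort the whole deploy.
      def check(reporter: String => Unit)(theCheck: => Boolean): Boolean =
        Try(theCheck) match {
          case Success(result) => result
          case Failure(e) =>
            reporter(s"Check failed (will retry): ${e.getMessage}")
            false
        }
    }
    ```

    With this guard, a transient failure in one poll simply counts as an
    unsuccessful iteration and the polling loop carries on.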
    rtyley committed Jun 4, 2024
    SHA: 0fb840e