
autoscaling deploy: re-enable ASG scaling before final stabilisation check #1345

Merged (2 commits) on Jun 5, 2024

Commits on Jun 4, 2024

  1. Re-enable ASG scaling before final stabilisation

    This aims to address, to some extent, issue #1342 -
    the problem that *apps cannot auto-scale* until an autoscaling deploy has
    successfully completed. On 22nd May 2024, this inability to auto-scale led
    to a severe outage in the Ophan Tracker.
    
    Ever since #83 in
    April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
    (`SuspendAlarmNotifications`), and only re-enabled them (`ResumeAlarmNotifications`)
    at the end of the deploy, once deployment has successfully completed.
    
    In December 2016, with #403, an
    additional `WaitForStabilization` was added as a penultimate deploy step,
    with the aim of ensuring that the cull of old instances has _completed_
    before the deploy ends. However, the `WaitForStabilization` step was added _before_
    `ResumeAlarmNotifications`, rather than _after_, and if the ASG instances are
    already overloaded and recycling, the ASG will _never_ stabilise, because it _needs
    to scale up_ to handle the load it's experiencing.
    
    In this change, we introduce a new task, `WaitForCullToComplete`, that can establish
    whether the cull has completed or not, regardless of whether the ASG is scaling -
    it simply checks that there are no remaining instances tagged for termination.
    Consequently, once we've executed `CullInstancesWithTerminationTag` to _request_ old
    instances terminate, we can immediately allow scaling with `ResumeAlarmNotifications`,
    and then `WaitForCullToComplete` _afterwards_.
    
    With this change in place, the Ophan outage would have been shortened from
    1 hour to ~2 minutes, a much better outcome!
    
    Common code between `CullInstancesWithTerminationTag` and `WaitForCullToComplete` has
    been factored out into a new `CullSummary` class.
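
    The "cull is complete" check described above can be sketched roughly as follows
    (a minimal Scala sketch: the `Instance` model, the tag name, and the method names
    on `CullSummary` are illustrative assumptions, not the actual Riff Raff code):

    ```scala
    // Hypothetical simplified model -- names here are assumptions, not Riff Raff's API.
    case class Instance(id: String, tags: Map[String, String])

    class CullSummary(instances: Seq[Instance]) {
      // Instances previously tagged by CullInstancesWithTerminationTag
      def instancesTaggedForTermination: Seq[Instance] =
        instances.filter(_.tags.contains("Magenta-Terminate"))
    }

    // The cull is complete when no instance still carries the termination tag --
    // regardless of whether the ASG is simultaneously scaling up or down.
    def cullComplete(instances: Seq[Instance]): Boolean =
      new CullSummary(instances).instancesTaggedForTermination.isEmpty
    ```

    Because the check only looks at tagged instances, the deploy can run
    `CullInstancesWithTerminationTag`, then `ResumeAlarmNotifications`, and only
    then wait on this condition - scaling activity no longer blocks stabilisation.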
    rtyley committed Jun 4, 2024
    SHA: 0f75e11
  2. Give WaitForCullToComplete up-to-date ASG info!

    Jacob Winch pointed out that `WaitForCullToComplete` is a repeating check, and
    so needs to get up-to-date `AutoScalingGroupInfo` in order for it to know what
    instances currently exist and what their state is! I was missing a call to
    `ASG.refresh(asg)`.
    
    This is easy to miss in tasks like `WaitForCullToComplete` that extend
    `magenta.tasks.PollingCheck`: the
    `magenta.tasks.ASGTask.execute(asg: AutoScalingGroup, ...)` method
    puts an `AutoScalingGroup` into scope via its method parameter, and that snapshot
    is inevitably out-of-date... could be something to refactor in a later PR!
    
    We also decided that as polling checks inevitably involve network calls, it makes
    sense to put an exception-catching guard in the `magenta.tasks.PollingCheck.check()`
    method.
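
    The guard might look something like this (an illustrative Scala sketch - the
    trait shape, `reporter` parameter, and message text are assumptions, not the
    actual `magenta.tasks.PollingCheck` signature):

    ```scala
    import scala.util.{Failure, Success, Try}

    trait PollingCheck {
      // One iteration of a repeating check. `theCheck` typically makes network
      // calls (e.g. AWS API requests) that can throw transiently, so treat an
      // exception as "not yet successful" and retry, rather than letting it
      // abort the whole deploy.
      def check(reporter: String => Unit)(theCheck: => Boolean): Boolean =
        Try(theCheck) match {
          case Success(result) => result
          case Failure(e) =>
            reporter(s"Check failed (will retry): ${e.getMessage}")
            false
        }
    }
    ```

    With this guard, a transient failure in one poll simply counts as an
    unsuccessful iteration and the polling loop carries on.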
    rtyley committed Jun 4, 2024
    SHA: 0fb840e