Bug: autoscaler-agent scaling algorithm is too volatile for larger computes #729

Open
3 tasks
sharnoff opened this issue Jan 8, 2024 · 1 comment
Labels
c/autoscaling/autoscaler-agent Component: autoscaling: autoscaler-agent t/bug Issue Type: Bug

Comments

@sharnoff
Member

sharnoff commented Jan 8, 2024

Problem description / Motivation

One of the blockers for allowing larger computes (ref neondatabase/cloud#9103) is improving the scaling algorithm.

Currently, the scaling algorithm (a) recalculates the "goal" CU every 5s from updated metrics, and (b) does not take past metrics into account when calculating the "goal" CU. As a result:

  1. It's easy to cause the goal CU to oscillate, resulting in a lot of effort spent scaling, with little net benefit
  2. As computes get larger, the same percentage change in metrics is more likely to produce a change in the (integer) goal CU, so each 5s metrics update is more likely to prompt scaling, and by a larger amount
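
To make the second point concrete, here is a minimal Python sketch. It assumes a simplified goal formula (round the load up to a whole number of CUs), which is an illustration and not the agent's actual calculation; `goal_cu` and `distinct_goals` are hypothetical names.

```python
import math


def goal_cu(load: float) -> int:
    """Hypothetical goal: the smallest whole number of CUs covering the load."""
    return max(1, math.ceil(load))


def distinct_goals(base_load: float, swing: float = 0.10, steps: int = 20) -> set:
    """Collect the goal CUs seen as load sweeps over base_load * (1 +/- swing)."""
    goals = set()
    for i in range(steps + 1):
        factor = (1 - swing) + 2 * swing * i / steps
        goals.add(goal_cu(base_load * factor))
    return goals


# The same +/-10% swing in the metric, at two very different compute sizes.
small = distinct_goals(1.5)   # load stays within 1.35 .. 1.65
large = distinct_goals(30.5)  # load stays within 27.45 .. 33.55

print(sorted(small))  # a single goal CU
print(sorted(large))  # seven different goal CUs
```

Under these assumptions the small compute never leaves a goal of 2 CU, while the large one moves across 28 to 34 CU for the same relative change.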

In a perfect world, maybe this'd be fine. But in practice, the process of scaling actually consumes resources, and so is generally something we want to avoid doing frivolously.

See also: https://neondb.slack.com/archives/C03ETHV2KD1/p1704319422570509?thread_ts=1704316837.680979

Feature idea(s) / DoD

Scaling algorithm should be more stable over some time period, under some conditions.

This isn't a super well-defined goal — so this issue mostly just exists to track some improvement.

Implementation ideas

There are a couple of directions we could take this.

One is to still not include any scaling history, and instead limit the size of each change (e.g., to no more than 1 CU at a time) and introduce rate-limiting on scaling. This wouldn't necessarily stop oscillation, but may reduce its impact.
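
A rough sketch of this first direction, in Python for brevity. The class name, the 1 CU step and the 30s cooldown are all assumptions made for illustration, not values from the agent.

```python
from dataclasses import dataclass


@dataclass
class RateLimitedScaler:
    current_cu: int
    max_step: int = 1                    # assumed: change by at most 1 CU at a time
    min_interval_s: float = 30.0         # assumed: cooldown between changes
    last_change_s: float = float("-inf")

    def next_cu(self, goal_cu: int, now_s: float) -> int:
        """Move toward goal_cu, limited in both step size and frequency."""
        if goal_cu == self.current_cu:
            return self.current_cu
        if now_s - self.last_change_s < self.min_interval_s:
            return self.current_cu       # still cooling down, ignore the goal
        delta = goal_cu - self.current_cu
        step = max(-self.max_step, min(self.max_step, delta))
        self.current_cu += step
        self.last_change_s = now_s
        return self.current_cu


scaler = RateLimitedScaler(current_cu=4)
print(scaler.next_cu(8, now_s=0.0))   # moves one step toward the goal
print(scaler.next_cu(8, now_s=5.0))   # inside the cooldown, so no change
print(scaler.next_cu(2, now_s=30.0))  # cooldown over, steps back down
```

Note that the last call undoes the first one: the goal can still flip direction, it just does so less often and in smaller steps. A real version would probably also want to treat upscaling and downscaling differently, since delaying an upscale is more costly.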

The other is to include some history around recent metrics so that we have a longer time period to use for decision-making. This solution would probably be harder to implement, but is likely easier to understand and more likely to produce better outcomes.
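
One possible shape for the history-based direction, again as a Python sketch. It assumes a plain mean over the last six 5s samples and the same simplified "round the load up" goal; the window size, the choice of a mean, and the `WindowedGoal` name are all hypothetical.

```python
import math
from collections import deque


class WindowedGoal:
    """Compute the goal CU from the mean of the most recent load samples."""

    def __init__(self, window: int = 6):  # assumed: 6 samples * 5s = 30s
        self.samples = deque(maxlen=window)

    def update(self, load: float) -> int:
        self.samples.append(load)         # the oldest sample drops out when full
        mean = sum(self.samples) / len(self.samples)
        return max(1, math.ceil(mean))


# A load that hovers around a CU boundary.
loads = [3.6, 4.2, 3.6, 4.2, 3.6, 4.2]

raw_goals = [math.ceil(x) for x in loads]
windowed = WindowedGoal()
smoothed_goals = [windowed.update(x) for x in loads]

print(raw_goals)       # flips between 4 and 5 on every update
print(smoothed_goals)  # stays at 4
```

The trade-off is that a mean also delays the response to a genuine spike. Taking the maximum over the window instead would keep upscaling fast and only slow down downscaling.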

One possibly annoying piece of this is that we may need to change a substantial portion of the tests for pkg/agent/core. We probably want a way to override the "goal CU" and directly provide that.
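
As a sketch of what such an override could look like (in Python, with entirely hypothetical names; this is not the structure of pkg/agent/core), the goal calculation can be passed in as a function so that tests supply the goal directly:

```python
import math
from typing import Callable, Dict

Metrics = Dict[str, float]  # hypothetical metrics shape


def goal_from_metrics(metrics: Metrics) -> int:
    """Hypothetical default: one CU per unit of load average."""
    return max(1, math.ceil(metrics["load_average"]))


class ScalingState:
    def __init__(
        self,
        min_cu: int,
        max_cu: int,
        goal_fn: Callable[[Metrics], int] = goal_from_metrics,
    ):
        self.min_cu = min_cu
        self.max_cu = max_cu
        self.goal_fn = goal_fn  # tests can replace this with a fixed goal

    def desired_cu(self, metrics: Metrics) -> int:
        """Clamp whatever goal_fn returns to the allowed CU range."""
        goal = self.goal_fn(metrics)
        return max(self.min_cu, min(self.max_cu, goal))


# Production-style use: the goal comes from metrics.
print(ScalingState(1, 8).desired_cu({"load_average": 2.3}))

# Test-style use: the goal is provided directly, independent of metrics.
print(ScalingState(1, 8, goal_fn=lambda _m: 12).desired_cu({}))
```

With this split, tests for the surrounding state machine would not need to change when the way metrics are interpreted changes.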

Tasks

Pre-requisites

  1. Omrigan

Implementation

Follow-ups

@sharnoff sharnoff added t/bug Issue Type: Bug c/autoscaling/autoscaler-agent Component: autoscaling: autoscaler-agent labels Jan 8, 2024
Omrigan added a commit to Omrigan/autoscaling that referenced this issue Oct 10, 2024
sharnoff added a commit that referenced this issue Nov 4, 2024
This is kind of a second take on #737, and a pre-req to #729 so that we
can freely change how metrics are interpreted without needing to rewrite
our unit tests in 'pkg/agent/core/state_test.go'.
@sharnoff
Member Author

sharnoff commented Nov 8, 2024

There's an open RFC that will partially address this issue here: https://www.notion.so/neondatabase/131f189e004780b2915ef2fdb95bae6a

In short: the approach should reduce volatility by ~60% from what we have today, but it's only a fractional decrease — probably insufficient for very volatile workloads on much larger computes.
