Problem description / Motivation
One of the blockers for allowing larger computes (ref neondatabase/cloud#9103) is improving the scaling algorithm.
Currently, the scaling algorithm (a) recalculates the "goal" CU every 5s from updated metrics, and (b) does not take past metrics into account when calculating the "goal" CU. As a result:
It's easy to cause the goal CU to oscillate, resulting in a lot of effort spent scaling, with little net benefit.
As computes get larger, the same percentage change in metrics is more likely to produce a change in the (integer) goal CU, meaning that each 5s metrics update is more likely to prompt scaling, and by a larger amount.
In a perfect world, maybe this'd be fine. But in practice, the process of scaling actually consumes resources, and so is generally something we want to avoid doing frivolously.
Feature idea(s) / DoD
The scaling algorithm should be more stable over some time period, under some conditions.
This isn't a super well-defined goal, so this issue mostly just exists to track some improvement.
Implementation ideas
There are a couple of directions we could take this.
One is to still not include any scaling history, and instead limit the size of a change (e.g., by no more than 1 CU at a time) and introduce rate-limiting on scaling. This wouldn't necessarily stop oscillation, but may reduce the impact.
The other is to include some history of recent metrics so that we have a longer time period to use for decision-making. This solution would probably be harder to implement, but it is likely easier to understand and more likely to produce better outcomes.
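As a rough illustration of the history-based direction, the sketch below keeps a fixed-size window of recent samples and derives the goal CU from their mean instead of from the latest sample alone. The `window` type, the `goalCU` helper, and the idea that one CU absorbs `loadPerCU` units of load are all assumptions made for the example, not how the agent currently computes its goal.

```go
package main

import "math"

// window is a hypothetical fixed-size buffer of recent metric samples.
type window struct {
	samples []float64
	size    int
}

// add records a sample, discarding the oldest once the window is full.
func (w *window) add(v float64) {
	w.samples = append(w.samples, v)
	if len(w.samples) > w.size {
		w.samples = w.samples[1:]
	}
}

// mean returns the average of the recorded samples (0 if there are none).
func (w *window) mean() float64 {
	if len(w.samples) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range w.samples {
		sum += v
	}
	return sum / float64(len(w.samples))
}

// goalCU converts a load figure into an integer CU count, assuming
// (hypothetically) that each CU can absorb loadPerCU units of load.
func goalCU(load, loadPerCU float64) int {
	return int(math.Ceil(load / loadPerCU))
}
```

For samples alternating between 3.75 and 4.25 with `loadPerCU` of 1, the per-sample goal alternates between 4 and 5 CU, while the goal computed from a 4-sample window stays at 4. A plain mean is only one possible choice; a maximum over the window or a weighted average would trade responsiveness against stability differently.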
One possibly annoying piece of this is that we may need to change a substantial portion of the tests for pkg/agent/core. We probably want a way to override the "goal CU" and directly provide that.
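One way such an override could look, as a sketch only: the state carries an optional hook that, when set, supplies the goal CU directly, so tests no longer depend on how metrics are interpreted. `Metrics`, `State`, and `GoalCUOverride` are hypothetical stand-ins here, not the real types in pkg/agent/core, and the fallback calculation is a placeholder.

```go
package main

// Metrics is a hypothetical stand-in for the agent's metrics type.
type Metrics struct {
	LoadAverage float64
}

// State is a hypothetical stand-in for the scaling state.
type State struct {
	// GoalCUOverride, when non-nil, replaces the metrics-based
	// calculation. Intended for tests.
	GoalCUOverride func() int
}

// goalCU returns the overridden goal if a hook is set; otherwise it
// falls back to a placeholder calculation from metrics.
func (s *State) goalCU(m Metrics) int {
	if s.GoalCUOverride != nil {
		return s.GoalCUOverride()
	}
	// Placeholder: one CU per unit of load average, rounded up,
	// with a minimum of 1 CU.
	goal := int(m.LoadAverage)
	if float64(goal) < m.LoadAverage {
		goal++
	}
	if goal < 1 {
		goal = 1
	}
	return goal
}
```

A test would then set `GoalCUOverride` to return the CU it wants to exercise and leave the metrics untouched, which keeps the test stable if the metrics-to-goal logic later changes.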
Tasks
This is kind of a second take on #737, and a pre-req to #729 so that we
can freely change how metrics are interpreted without needing to rewrite
our unit tests in 'pkg/agent/core/state_test.go'.
In short: the approach should reduce volatility by ~60% from what we have today, but it's only a fractional decrease — probably insufficient for very volatile workloads on much larger computes.
See also: https://neondb.slack.com/archives/C03ETHV2KD1/p1704319422570509?thread_ts=1704316837.680979
Pre-requisites
Implementation
Follow-ups