Background
The autoscaler-agent calculates the "goal CU" based on demand for CPU, using the guest kernel's 1-minute load average metric.
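As a concrete (hypothetical) sketch of that calculation: the function and parameter names below, the target load fraction, and the CPU-per-CU conversion are assumptions for illustration, not the agent's actual code.

```go
// Rough sketch (not the actual autoscaler-agent code): deriving a compute
// unit (CU) goal from the guest's 1-minute load average, so that the load
// stays at or below some target fraction of the provisioned CPUs.
package main

import (
	"fmt"
	"math"
)

// goalCU returns the number of CUs needed for the observed load average,
// given an assumed target fraction and an assumed CPUs-per-CU ratio.
func goalCU(loadAvg1Min, loadAverageFractionTarget, cpusPerCU float64) uint32 {
	goalCPUs := loadAvg1Min / loadAverageFractionTarget
	return uint32(math.Ceil(goalCPUs / cpusPerCU))
}

func main() {
	// e.g. a load of 3.6 with a 0.9 target and 1 CPU per CU -> 4 CUs.
	fmt.Println(goalCU(3.6, 0.9, 1.0))
}
```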
Load average is an exponentially weighted moving average, updated every 5 seconds, of the instantaneous number of running or runnable tasks at each sample point; in other words, it's an average of the run queue size.
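The update rule behind that moving average can be modeled like the sketch below (a simplified floating-point version for illustration; the kernel does the equivalent in fixed-point arithmetic).

```go
// Simplified model of how the 1-minute load average evolves: every 5
// seconds the current run-queue length is folded into an exponentially
// weighted moving average with decay factor e^(-5/60).
package main

import (
	"fmt"
	"math"
)

func main() {
	const tick = 5.0    // seconds between updates
	const window = 60.0 // 1-minute averaging window
	decay := math.Exp(-tick / window)

	load := 0.0
	// Assume a steady queue of 2 running/runnable tasks for 10 minutes.
	for t := 0.0; t < 600.0; t += tick {
		n := 2.0 // instantaneous queue length at this sample
		load = load*decay + n*(1.0-decay)
	}
	fmt.Printf("1-minute load average: %.2f\n", load) // converges to 2.00
}
```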
For workloads with spiky parallelism, this can result in dramatic over-estimations if we interpret it as "demand" for CPU time. If there are 4x as many tasks as CPUs, each task may contribute 4x as much as it should to our measure of "demand" (because fair scheduling keeps every task in the queue for 4x as long).
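To make that 4x figure concrete, here's a toy calculation with an assumed burst pattern (1 CPU, bursts of 4 one-second tasks arriving together every 16 seconds); the numbers are illustrative only.

```go
// Toy calculation (illustration only) of why spiky parallelism inflates
// load average: 4 tasks, each needing 1 CPU-second, arrive together on a
// 1-CPU machine every 16 seconds. Under fair scheduling all 4 stay
// runnable until the burst drains, so the queue length is 4 for a quarter
// of the period and 0 otherwise.
package main

import "fmt"

func main() {
	const period = 16.0     // seconds between bursts
	const tasks = 4.0       // tasks per burst
	const workPerTask = 1.0 // CPU-seconds each task needs
	const cpus = 1.0

	// Actual demand: CPU-seconds of work per second of wall clock.
	demand := tasks * workPerTask / period // 0.25 CPUs

	// How long the burst keeps all tasks runnable: the whole burst takes
	// tasks*workPerTask/cpus seconds to drain under fair scheduling.
	busy := tasks * workPerTask / cpus // 4 seconds with queue length 4

	// Time-averaged queue length, which is what load average tracks.
	avgQueue := (tasks*busy + 0.0*(period-busy)) / period // 1.0

	fmt.Printf("average CPU demand: %.2f CPUs\n", demand)
	fmt.Printf("average queue length (≈ load average): %.2f\n", avgQueue)
	fmt.Printf("over-estimation factor: %.1fx\n", avgQueue/demand) // 4.0x
}
```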
In practice we believe this issue is quite rare (hence why this is marked as "tech debt"), but it's still worth addressing.
For more on load average, refer to:
Example of a user hitting this: https://neondb.slack.com/archives/C03TN5G758R/p1728409813336859
Implementation ideas
Don't use load average...?
It'd still be useful to get a measure of how much demand for CPU there is, but load average clearly doesn't give us that (and unfortunately CPU time won't, either).