Load average can over-estimate demand for CPU #1114

Open
sharnoff opened this issue Oct 16, 2024 · 1 comment
Labels
a/tech_debt Area: related to tech debt

Comments

@sharnoff
Member

sharnoff commented Oct 16, 2024

Background

The autoscaler-agent calculates the "goal CU" based on demand for CPU, using the guest kernel's 1-minute load average metric.

Load average is an exponentially weighted moving average, updated every 5 seconds, of the instantaneous number of running or runnable tasks (on Linux, tasks in uninterruptible sleep are counted as well). In other words, it's an average of the queue size.
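To make the averaging concrete, here is a minimal sketch of that update rule. It assumes a plain floating-point EWMA with a 60-second time constant sampled every 5 seconds; the kernel itself uses fixed-point arithmetic, and the names `update_load` and `load_after` are hypothetical, chosen only for illustration.

```python
import math

SAMPLE_INTERVAL_S = 5.0  # assumed sampling period, per the description above
WINDOW_S = 60.0          # assumed time constant for the 1-minute average


def update_load(load, n_active, interval=SAMPLE_INTERVAL_S, window=WINDOW_S):
    """Fold one sample of the active task count into the running average."""
    decay = math.exp(-interval / window)
    return load * decay + n_active * (1.0 - decay)


def load_after(samples, load=0.0):
    """Apply a sequence of per-interval task counts, starting from `load`."""
    for n_active in samples:
        load = update_load(load, n_active)
    return load
```

Note that the only input is the number of tasks in the queue at each sample. How much CPU time those tasks actually need never enters the calculation.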

For workloads that are spiky in their parallelism, this can result in dramatic over-estimation if we interpret load average as "demand" for CPU time. If there are 4x as many tasks as CPUs, each task may contribute 4x as much as it should to our measure of "demand" (because fair scheduling would result in every task being in the queue for 4x as long).
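The 4x figure can be checked with a small simulation. This is a sketch under idealized assumptions: perfectly fair sharing, identical tasks that all arrive at once, and no scheduling overhead. The function `simulate_burst` is hypothetical and not part of the autoscaler-agent.

```python
def simulate_burst(n_tasks, n_cpus, work_per_task, dt=0.01):
    """Return (queue_seconds, cpu_seconds) for one burst under fair sharing.

    queue_seconds is the time-integral of the queue size, which is what
    load average tracks. cpu_seconds is the CPU time actually consumed.
    """
    remaining = [work_per_task] * n_tasks
    queue_seconds = 0.0
    cpu_seconds = 0.0
    while remaining:
        active = len(remaining)
        # Each task gets a whole CPU if there are enough, else an equal share.
        rate = min(1.0, n_cpus / active)
        queue_seconds += active * dt
        cpu_seconds += active * rate * dt
        remaining = [r - rate * dt for r in remaining]
        # Drop finished tasks, tolerating floating-point residue.
        remaining = [r for r in remaining if r > 1e-9]
    return queue_seconds, cpu_seconds


if __name__ == "__main__":
    queue_s, cpu_s = simulate_burst(n_tasks=16, n_cpus=4, work_per_task=1.0)
    print(f"queue-seconds / cpu-seconds = {queue_s / cpu_s:.2f}")
```

With 16 tasks on 4 CPUs the ratio comes out at about 4: the queue-based measure counts roughly four times the CPU time the burst actually needs. With as many CPUs as tasks, the ratio is about 1.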

In practice we believe this issue is quite rare (hence the "tech debt" label), but it's still worth addressing.

For more on load average, refer to:

Example of a user hitting this: https://neondb.slack.com/archives/C03TN5G758R/p1728409813336859

Implementation ideas

Don't use load average...?

It'd still be useful to get a measure of how much demand for CPU there is, but load average clearly doesn't give us that (and unfortunately CPU time won't, either, since it can never exceed what the available CPUs can supply).

sharnoff added the a/tech_debt (Area: related to tech debt) label Oct 16, 2024
@sharnoff
Member Author

sharnoff commented Nov 8, 2024

There is an open RFC that would fix this issue: https://www.notion.so/neondatabase/131f189e004780b2915ef2fdb95bae6a
