kvserver: distribute COCKROACH_SCHEDULER_CONCURRENCY across stores #102859
Conversation
I feel like this is okay to backport, since I think this was always the intended behavior, and we've seen the existing behavior cause problems on nodes with many stores. But we can consider gating this behind an environment variable for the backports.
Force-pushed from 67a1b1f to a9ed7be.
pkg/server/server.go (outdated)

    if storeTestingKnobs := cfg.TestingKnobs.Store; storeTestingKnobs != nil {
        storeCfg.TestingKnobs = *storeTestingKnobs.(*kvserver.StoreTestingKnobs)
    }
    storeCfg.SetDefaults()
Could imagine this would break a few tests. I guess it hasn't, since CI looks green. Nice!
Yeah, most tests use `testStoreConfig()`, which already calls `SetDefaults()`.
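For readers skimming the diff above, here is a minimal sketch of the pattern being discussed, using simplified hypothetical types rather than the actual CockroachDB code: defaulting moves out of the store constructor so the server can fill in per-store values first, while test helpers keep applying defaults themselves.

```go
package kvsketch

// Hypothetical, simplified stand-ins for StoreConfig and Store.
type storeConfig struct {
	raftSchedulerConcurrency int // 0 means "not set"
}

type store struct {
	cfg storeConfig
}

// setDefaults fills in unset fields; analogous in spirit to
// StoreConfig.SetDefaults() in the diff above.
func (c *storeConfig) setDefaults() {
	if c.raftSchedulerConcurrency == 0 {
		c.raftSchedulerConcurrency = 96 // stand-in for the real default
	}
}

// newStore no longer applies defaults itself; callers such as the server
// (or a test helper like testStoreConfig()) are expected to call
// setDefaults() before constructing the store.
func newStore(cfg storeConfig) *store {
	return &store{cfg: cfg}
}
```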
Any thoughts on gating this behind an environment variable for backports, or shipping it as-is? Alternatively, scaling up the worker count somewhat with multiple stores?
Because the outage here seemed to be related to quadratic behavior of Raft probes due to reproposals (#102956) and the inefficient application of these probes (#102953), those backports seem like a safer and more effective fix.

Backporting this PR to 23.1 seems fine, because we already shard the Raft scheduler in 23.1, putting a limit on the number of workers available to a range (16 per shard). Backporting to 22.2 is riskier, because we'd be significantly reducing the available workers per range. I'm therefore only backporting to 23.1 for now. If some users need this in 22.2, they can accomplish the same effect by setting `COCKROACH_SCHEDULER_CONCURRENCY` manually.
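For context on the sharding mentioned above, here is a rough sketch of how a sharded Raft scheduler bounds the workers any single range can occupy. The shard size of 16 comes from the comment; the types and the range-to-shard mapping are assumptions made for illustration, not the actual implementation.

```go
package kvsketch

// shardSize is the per-shard worker count cited above for 23.1.
const shardSize = 16

type schedulerShard struct {
	workers int
	// ...per-shard queue and mutex elided.
}

// newShards splits the total worker budget into fixed-size shards.
func newShards(totalWorkers int) []schedulerShard {
	n := (totalWorkers + shardSize - 1) / shardSize
	if n < 1 {
		n = 1
	}
	shards := make([]schedulerShard, n)
	for i := range shards {
		shards[i].workers = shardSize
	}
	return shards
}

// shardFor assigns a range to a shard. The modulo mapping is an assumption
// for illustration; the point is that a hot range can saturate at most one
// shard's workers (16) rather than the whole pool.
func shardFor(shards []schedulerShard, rangeID int64) *schedulerShard {
	return &shards[int(rangeID)%len(shards)]
}
```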
bors r+

Oops, forgot to address Pavel's comment. bors r-
Canceled.
Force-pushed from a9ed7be to a448edc.
bors r+
Build succeeded.
Commits

kvserver: ensure at least 1 Raft scheduler worker

Previously, it was possible to create a Raft scheduler with no workers by setting `COCKROACH_SCHEDULER_CONCURRENCY=0`. This patch ensures we always create at least 1 worker per scheduler.

Release note: None
kvserver: pass Raft scheduler workers via StoreConfig

Release note: None
kvserver: move StoreConfig.SetDefaults() out of NewStore()

Release note: None
kvserver: distribute COCKROACH_SCHEDULER_CONCURRENCY across stores

`COCKROACH_SCHEDULER_CONCURRENCY` defaults to 8 per CPU core, capped at 96, to avoid putting excessive pressure on the Go scheduler. However, it was applied individually to each store, which means that nodes with e.g. 10 stores would run up to 960 workers. This can lead to scheduler thrashing, as well as excessive memory usage, since it also serves to bound the amount of data pulled into memory for concurrent Raft ready processing.

This patch instead divides the worker count across stores.
Resolves #102838.
Epic: none
Release note (ops change): the default Raft scheduler concurrency, controlled by `COCKROACH_SCHEDULER_CONCURRENCY` and defaulting to 8 per CPU core capped at 96, is now divided evenly across stores instead of applying individually per store. This avoids excessive Go scheduler pressure and memory usage on nodes with many stores. The common case of 1 store per node is not affected.
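To make the release note concrete, here is a small sketch of the resulting arithmetic. The function and names are hypothetical, not the actual CockroachDB code, but the constants (8 workers per CPU core, capped at 96, floor of 1 worker per scheduler) come from the commit messages above; the exact rounding is an assumption.

```go
package kvsketch

import "runtime"

// defaultTotalConcurrency mirrors the documented default: 8 workers per CPU
// core, capped at 96.
func defaultTotalConcurrency() int {
	c := 8 * runtime.GOMAXPROCS(0)
	if c > 96 {
		c = 96
	}
	return c
}

// perStoreSchedulerWorkers divides the node-wide budget across stores, never
// dropping below 1 worker per scheduler (the first commit's fix for
// COCKROACH_SCHEDULER_CONCURRENCY=0).
func perStoreSchedulerWorkers(totalConcurrency, numStores int) int {
	if numStores < 1 {
		numStores = 1
	}
	w := totalConcurrency / numStores
	if w < 1 {
		w = 1
	}
	return w
}

// Example: on a 16-core node the default budget is min(8*16, 96) = 96.
// With 1 store, that store still gets 96 workers (unchanged); with 10 stores,
// each store now gets 9 workers, instead of 10 x 96 = 960 workers as before.
```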