kvserver: support multiple priority ranges in Raft scheduler #101023
Conversation
Force-pushed from 2aea6b7 to 37b809d
pkg/kv/kvserver/scheduler.go (Outdated)

// Priority shard at index 0.
s.shards = append(s.shards, newRaftSchedulerShard(priorityWorkers, maxTicks))

// Regular shards.
numShards := 1
if shardSize > 0 && numWorkers > shardSize {
	numShards = (numWorkers-1)/shardSize + 1
Does numShards include or exclude the priority shard?

With this arithmetic, it looks like numShards is at least 2 (numWorkers > shardSize => numWorkers-1 >= shardSize => (numWorkers-1)/shardSize + 1 >= 2).

So, with the append loop below, we'll end up with at least 3 shards. But it looked like we want the minimum number of shards to be 2 (priority + non-priority)?
It's excluding the priority shard. If numWorkers <= shardSize then we keep numShards = 1 from above, but it shouldn't make any difference if we fall through either (except for 0 cases) -- the form (x-1)/y+1 is a trick to do ceiling division, i.e. 5/2 rounds up to 3 rather than truncating to 2. Either way, we'll end up with numShards being 1 at minimum. Added comments for this.
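For concreteness, here is a minimal, self-contained Go sketch of the shard-count arithmetic being discussed; the function name and example values are illustrative, not the exact code in scheduler.go.

```go
package main

import "fmt"

// numRegularShards returns the number of non-priority shards needed to split
// numWorkers workers into shards of at most shardSize workers each.
// (x-1)/y + 1 is integer ceiling division: e.g. 5 workers with a shard size
// of 2 yields ceil(5/2) == 3 rather than the truncated 5/2 == 2.
func numRegularShards(numWorkers, shardSize int) int {
	numShards := 1
	if shardSize > 0 && numWorkers > shardSize {
		numShards = (numWorkers-1)/shardSize + 1
	}
	return numShards
}

func main() {
	fmt.Println(numRegularShards(5, 2))  // 3
	fmt.Println(numRegularShards(4, 2))  // 2
	fmt.Println(numRegularShards(2, 16)) // 1 (numWorkers <= shardSize)
}
```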
Force-pushed from d25e86e to bb04272
I don't remember the full rationale for this, but I'd like to know if we still need it. If we are using all expiration-based leases, then the liveness range is less important, as the expiration time won't be used at all. If we use epoch-based leases, then the liveness range is still the most important to keep alive. So maybe the right direction is to leave this for liveness only in 23.2, and if we make more progress on removing epoch leases, to drop this prioritization completely. Was there a specific problem that this is solving?
Previously, expiration leases for the meta and liveness ranges were eagerly extended in the store lease renewer. Now, expiration leases are extended by the Raft scheduler, and no longer fall back to extensions during request processing in the latter half of the lease interval. There is some concern that this additional dependency on the Raft scheduler could contribute to meta/liveness unavailability under overload -- especially with the removal of quiescence, which puts significantly more pressure on the scheduler. You're right that this likely becomes less important with expiration leases, but the liveness and meta ranges still serve a fundamental role in the system, and starvation could worsen unavailability or instability even with expiration leases. Given that, I think it makes sense to prioritize these during scheduling, although I suppose the same argument could be made for other SQL-level system ranges. This prioritization was originally introduced in #56860 following several escalations where liveness range scheduler starvation caused prolonged cluster outages.
This mostly all looks good, although I'm not familiar enough with the internals of the Raft loop to really stamp this as good. Maybe @pavelkalinnikov can do that. Finally, I don't love the way we are overloading shard 0 as the "high priority" shard, but it's probably OK also.
Reviewed 9 of 9 files at r4, 7 of 7 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @pavelkalinnikov)
pkg/kv/kvserver/replica_init.go
line 400 at r5 (raw file):
r.mu.state.Desc = desc

// Prioritize the NodeLiveness Range in the Raft scheduler above all other
nit: Change this to say "at high priority"
pkg/kv/kvserver/scheduler.go
line 35 at r5 (raw file):
const priorityWorkers = 2

var priorityIDsValue = unsafe.Pointer(new(bool))
nit: I don't understand this pattern of using unsafe.Pointer; at a minimum, add a comment to describe it better.
pkg/kv/kvserver/scheduler.go
line 176 at r5 (raw file):
b := raftSchedulerBatchPool.Get().(*raftSchedulerBatch)
if len(b.rangeIDs) != numShards {
	b.rangeIDs = make([][]roachpb.RangeID, numShards)

Don't we lose the previous b.rangeIDs here by recreating this? It seems like you should have copied the existing ones over. The old b.priorityIDs would not be lost, though. This seems wrong (but I could be missing something). Have you tested calling this with a different numShards?
pkg/kv/kvserver/scheduler.go
line 183 at r5 (raw file):
// Cache the priority range IDs in an owned map, since we expect this to be
// very small or empty and we do a lookup for every Add() call.
priorityIDs.Range(func(id int64, _ unsafe.Pointer) bool {

nit: I didn't know about int_map.Range. Should we create a PR to convert this to use generics? I'm not sure if there is a perf benefit. Also, we should consider adding int_map.KeyRange, since it would avoid an unnecessary load of the value.
Sure, I'll get a review from @pavelkalinnikov.
I don't love the way we are overloading shard 0 as the "high priority" shard, but it's probably OK also.

What would you suggest? Maybe a separate priorityShard field or some such? I think it'll become a bit unwieldy, because we'd have to special-case or duplicate code every time we loop over shards, and the shard addressing becomes two-dimensional. Any better ideas?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @nvanbenschoten, and @pavelkalinnikov)
pkg/kv/kvserver/replica_init.go
line 400 at r5 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
nit: Change this to say "at high priority"
Done, is something like this what you had in mind?
pkg/kv/kvserver/scheduler.go
line 35 at r5 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
nit: I don't understand this pattern of using unsafe.Pointer, at a minimum, add a comment to describe it better.
IntMap takes unsafe.Pointer values. We don't care about the map value, so we can reuse the same value for every key, avoiding additional allocations. They don't really matter here though, tbh. Added a comment.
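For illustration, a minimal sketch of the shared-sentinel pattern. The map here is a simple mutex-guarded stand-in for the concurrent IntMap, so only the reuse of a single unsafe.Pointer value across keys is the point.

```go
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

// The map value type is unsafe.Pointer, but the scheduler only cares about
// key membership. Reusing one sentinel pointer for every key avoids a fresh
// allocation per Store call.
var priorityIDsValue = unsafe.Pointer(new(bool))

// intMap stands in for the concurrent int64-keyed map used in scheduler.go;
// a mutex-guarded map keeps this sketch self-contained.
type intMap struct {
	mu sync.Mutex
	m  map[int64]unsafe.Pointer
}

func (m *intMap) Store(key int64, value unsafe.Pointer) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.m == nil {
		m.m = map[int64]unsafe.Pointer{}
	}
	m.m[key] = value
}

func (m *intMap) Load(key int64) (unsafe.Pointer, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	v, ok := m.m[key]
	return v, ok
}

func main() {
	var priorityIDs intMap
	priorityIDs.Store(1, priorityIDsValue) // same sentinel value for every key
	priorityIDs.Store(3, priorityIDsValue)
	_, isPriority := priorityIDs.Load(3)
	fmt.Println(isPriority) // true
}
```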
pkg/kv/kvserver/scheduler.go
line 176 at r5 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
Don't we lose the previous b.rangeIDs here by recreating this? It seems like you should have copied the existing ones over. The old b.priorityIDs would not be lost though. This seems wrong (but I could be missing something). Have you tested calling this with a different numShards?
We do. However, numShards only ever changes in tests, where it doesn't matter that we're constructing a new slice instead of reusing the existing one -- in practice, this path was only taken when we got a newly created batch with rangeIDs = nil.

But I added some code to reuse the slice where possible, to make it obviously correct.
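A rough sketch of the slice-reuse idea, under the assumption that the batch keeps a per-shard slice of range IDs; the names and types here are stand-ins rather than the actual batch code.

```go
package main

import "fmt"

// rangeID stands in for roachpb.RangeID in this sketch.
type rangeID int64

// resizeShards grows or shrinks the per-shard ID slices to numShards, reusing
// the existing backing arrays when possible so that pooled batches don't
// reallocate on the common path where numShards is unchanged.
func resizeShards(rangeIDs [][]rangeID, numShards int) [][]rangeID {
	if cap(rangeIDs) >= numShards {
		rangeIDs = rangeIDs[:numShards]
	} else {
		rangeIDs = make([][]rangeID, numShards)
	}
	for i := range rangeIDs {
		rangeIDs[i] = rangeIDs[i][:0] // reset length, keep each shard's capacity
	}
	return rangeIDs
}

func main() {
	b := make([][]rangeID, 4)
	b = resizeShards(b, 2)      // reuses the existing backing array
	fmt.Println(len(b), cap(b)) // 2 4
}
```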
pkg/kv/kvserver/scheduler.go
line 183 at r5 (raw file):
Should we create a PR to convert this to use generics?
Yeah, it could probably be done if we handle all of the atomic pointer fiddling internally. It might hamper performance though, because generics prevent certain compiler optimizations.
In any case, that seems like a distraction right now -- we have more important things to do, but I wrote up #103147.
we should consider adding int_map.KeyRange since it would avoid an unnecessary load of the value
We'd really want an IntSet or some such; we don't care about the value here at all. I can write up a quick prototype and see if it matters at all performance-wise, but I suspect latency will be dominated by the actual enqueueing rather than this tiny little loop here. It might matter slightly for the IntMap lookup when we enqueue individual ranges, though.
Force-pushed from bb04272 to e3fb687
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @nvanbenschoten, and @pavelkalinnikov)
pkg/kv/kvserver/scheduler.go
line 183 at r5 (raw file):
I can write up a quick prototype and see if it matters at all
Gave this a quick shot, but we still need to atomically load the value to look for a deletion marker in order to handle concurrent deletions. Fetching an IntMap value is just doing an atomic pointer load, which is exactly equivalent to looking for a set deletion marker, so there's nothing to save by getting rid of the value (we'd be loading an atomic bool rather than a pointer, but I benchmarked them and the performance is the same).
There might be a better way to design a lock-free set, but I'm not going to do that now (could probably nerd-snipe @pavelkalinnikov into it).
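To illustrate the point about the atomic load: a toy sketch of a set-style entry that still has to atomically load a pointer and compare it against a deletion marker, so dropping the stored value wouldn't remove that load. This is only an illustration under that assumption, not the IntMap implementation.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// deleted is a sentinel marking a logically removed entry, similar in spirit
// to the deletion markers used by sync.Map-style structures.
var deleted = unsafe.Pointer(new(bool))

// entry holds the value pointer for one key. Membership checks must atomically
// load p and compare it against the deletion marker, so a value-less "IntSet"
// would still pay for the same atomic load.
type entry struct {
	p unsafe.Pointer
}

func (e *entry) present() bool {
	p := atomic.LoadPointer(&e.p)
	return p != nil && p != deleted
}

func main() {
	var e entry
	atomic.StorePointer(&e.p, unsafe.Pointer(new(bool)))
	fmt.Println(e.present()) // true
	atomic.StorePointer(&e.p, deleted)
	fmt.Println(e.present()) // false
}
```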
This patch adds `priority=false|true` variants to measure enqueue performance when there are 0 or 1 priority ranges, respectively (the common cases, given that only the liveness range currently has priority).

Epic: none
Release note: None
Force-pushed from e3fb687 to 973d3f8
Added.
LGTM. Looks like some tests need fixing.
Previously, the Raft scheduler only supported a single priority range (the liveness range). However, other critical system ranges should also get priority, in particular when we move responsibility for expiration-based lease extensions to the Raft scheduler.

This patch implements prioritization of multiple ranges. It does so by creating a separate Raft scheduler shard at index 0 with 2 worker goroutines reserved for priority ranges, configurable via `COCKROACH_SCHEDULER_PRIORITY_SHARD_SIZE`. This has the added benefit of avoiding scheduler starvation by non-priority ranges (although unlikely in practice), and of amortizing the priority check costs at enqueue time rather than incurring them for every dequeue. These goroutines do have a minor cost, especially considering many nodes may not have any priority ranges at all, but the added complexity and risk of dynamically sizing this worker pool down to 0 does not appear justified to avoid 2 idle goroutines.

Benchmarks show some additional enqueue overhead, but this approaches 0% when there are no priority ranges (`priority=false`) and 12% when there is a single priority range (`priority=true`). In absolute terms the overhead is negligible, considering we only enqueue ticks every 500 ms.

```
name                                                                               old time/op  new time/op  delta
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=1/workers=64-24       158ns ± 3%   201ns ± 2%  +26.58%  (p=0.000 n=10+9)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=10/workers=64-24      696ns ± 3%   777ns ± 2%  +11.55%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=100/workers=64-24     5.69µs ± 2%  6.12µs ± 1%   +7.59%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=1000/workers=64-24    59.9µs ± 1%  62.4µs ± 1%   +4.19%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=10000/workers=64-24   622µs ± 1%   650µs ± 1%    +4.49%  (p=0.000 n=9+9)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=100000/workers=64-24  8.30ms ± 6%  8.15ms ± 1%        ~  (p=0.211 n=10+9)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=1/workers=64-24        158ns ± 1%   259ns ± 2%  +64.58%  (p=0.000 n=9+10)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=10/workers=64-24       696ns ± 2%   985ns ± 4%  +41.54%  (p=0.000 n=9+10)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=100/workers=64-24      5.68µs ± 2%  6.96µs ± 2%  +22.65%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=1000/workers=64-24     59.8µs ± 1%  70.1µs ± 1%  +17.34%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=10000/workers=64-24    625µs ± 1%   725µs ± 1%   +15.96%  (p=0.000 n=10+9)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=100000/workers=64-24   8.23ms ± 5%  9.27ms ± 3%  +12.56%  (p=0.000 n=10+10)
```

Epic: none
Release note: None
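As a rough illustration of the enqueue-time routing described above (the actual scheduler code differs; the names and hashing scheme here are assumptions): priority ranges are routed to the reserved shard at index 0, and all other ranges are spread across the remaining shards, so the priority check is paid once per enqueue rather than on every dequeue.

```go
package main

import "fmt"

// rangeID stands in for roachpb.RangeID in this sketch.
type rangeID int64

const priorityShardIdx = 0

// shardIndex routes a range to a scheduler shard: priority ranges go to the
// reserved shard 0, everything else hashes across the remaining shards.
// numShards includes the priority shard and must be at least 2 here.
func shardIndex(id rangeID, numShards int, isPriority func(rangeID) bool) int {
	if isPriority(id) {
		return priorityShardIdx
	}
	return 1 + int(uint64(id)%uint64(numShards-1))
}

func main() {
	priority := map[rangeID]bool{2: true} // e.g. the liveness range
	isPriority := func(id rangeID) bool { return priority[id] }
	for _, id := range []rangeID{1, 2, 3, 7} {
		fmt.Printf("range %d -> shard %d\n", id, shardIndex(id, 4, isPriority))
	}
}
```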
Previously, only the node liveness range was prioritized in the Raft scheduler. This patch also prioritizes scheduling for the meta ranges. Epic: none Release note: None
Force-pushed from 973d3f8 to a69d0ea
TFTR! I'll give @andrewbaptist a chance to respond here before borsing. I also had a look at Telemetry, which is a moderately large cluster (8 nodes with 100k ranges and 35k replicas per node), and it had 4 meta ranges in addition to liveness. With RF=5 on system ranges, each node thus had 3 replicas to process in the priority shard. Since clusters will generally not have more than 50k replicas per node or so (15 TB at 300 MB average range size), I think we can assume that typically the priority shard will have no more than 5 low-traffic replicas, which should be manageable for 2 workers.
bors r+

👎 Rejected by code reviews

bors r+

Build succeeded:
kvserver: add priority range to BenchmarkSchedulerEnqueueRaftTicks

This patch adds `priority=false|true` variants to measure enqueue performance when there are 0 or 1 priority ranges, respectively (the common cases, given that only the liveness range currently has priority).

Epic: none
Release note: None
kvserver: support multiple priority ranges in Raft scheduler
Previously, the Raft scheduler only supported a single priority range (the liveness range). However, other critical system ranges should also get priority, in particular when we move responsibility for expiration-based lease extensions to the Raft scheduler.
This patch implements prioritization of multiple ranges. It does so by creating a separate Raft scheduler shard at index 0 with 2 worker goroutines reserved for priority ranges, configurable via `COCKROACH_SCHEDULER_PRIORITY_SHARD_SIZE`. This has the added benefit of avoiding scheduler starvation by non-priority ranges (although unlikely in practice), and of amortizing the priority check costs at enqueue time rather than incurring them for every dequeue. These goroutines do have a minor cost, especially considering many nodes may not have any priority ranges at all, but the added complexity and risk of dynamically sizing this worker pool down to 0 does not appear justified to avoid 2 idle goroutines.

Benchmarks show some additional enqueue overhead, but this approaches 0% when there are no priority ranges (`priority=false`) and 12% when there is a single priority range (`priority=true`). In absolute terms the overhead is negligible, considering we only enqueue ticks every 500 ms.

Epic: none
Release note: None
kvserver: prioritize meta ranges in Raft scheduler
Previously, only the node liveness range was prioritized in the Raft scheduler. This patch also prioritizes scheduling for the meta ranges.
Resolves #101013.
Epic: none
Release note: None