kvserver: support multiple priority ranges in Raft scheduler #101023
Conversation
Force-pushed from 2aea6b7 to 37b809d
pkg/kv/kvserver/scheduler.go (Outdated)

// Priority shard at index 0.
s.shards = append(s.shards, newRaftSchedulerShard(priorityWorkers, maxTicks))

// Regular shards.
numShards := 1
if shardSize > 0 && numWorkers > shardSize {
	numShards = (numWorkers-1)/shardSize + 1
Does numShards include or exclude the priority shard?

With this arithmetic, it looks like numShards is at least 2 (numWorkers > shardSize => numWorkers-1 >= shardSize => (numWorkers-1)/shardSize + 1 >= 2).

So, with the append loop below, we'll end up with at least 3 shards. But it looked like we want the minimum number of shards to be 2 (priority + non-priority)?
It's excluding the priority shard. If numWorkers <= shardSize then we keep numShards = 1 from above, but it shouldn't make any difference if we fall through either (except for 0 cases) -- the form (x-1)/y+1 is a trick to do ceiling division, i.e. 5/2 rounds up to 3 rather than truncating to 2. Either way, we'll end up with numShards being 1 at minimum. Added comments for this.
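For concreteness, here is a minimal, self-contained Go sketch of the shard-count arithmetic being discussed; the function name and example values are illustrative, not the exact code in scheduler.go.

```go
package main

import "fmt"

// numRegularShards returns the number of non-priority shards needed to split
// numWorkers workers into shards of at most shardSize workers each.
// (x-1)/y + 1 is integer ceiling division: e.g. 5 workers with a shard size
// of 2 yields ceil(5/2) == 3 rather than the truncated 5/2 == 2.
func numRegularShards(numWorkers, shardSize int) int {
	numShards := 1
	if shardSize > 0 && numWorkers > shardSize {
		numShards = (numWorkers-1)/shardSize + 1
	}
	return numShards
}

func main() {
	fmt.Println(numRegularShards(5, 2))  // 3
	fmt.Println(numRegularShards(4, 2))  // 2
	fmt.Println(numRegularShards(2, 16)) // 1 (numWorkers <= shardSize)
}
```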
Force-pushed from d25e86e to bb04272
I don't remember the full rationale for this, but I'd like to know if we still need it. If we are using all expiration-based leases, then the liveness range is less important, as the expiration time won't be used at all. If we use epoch-based leases, then the liveness range is still the most important to keep alive. So maybe the right direction is to leave this for liveness only in 23.2, and if we make more progress on removing epoch leases, to drop this prioritization completely. Was there a specific problem that this is solving?
Previously, expiration leases for the meta and liveness ranges were eagerly extended in the store lease renewer. Now, expiration leases are extended by the Raft scheduler, and no longer fall back to extensions during request processing in the latter half of the lease interval. There is some concern that this additional dependency on the Raft scheduler could contribute to meta/liveness unavailability under overload -- especially with the removal of quiescence, which puts significantly more pressure on the scheduler. You're right that this likely becomes less important with expiration leases, but the liveness and meta ranges still serve a fundamental role in the system, and starvation could worsen unavailability or instability even with expiration leases. Given that, I think it makes sense to prioritize these during scheduling, although I suppose the same argument could be made for other SQL-level system ranges. This prioritization was originally introduced in #56860 following several escalations where liveness range scheduler starvation caused prolonged cluster outages.
This mostly all looks good, although I'm not familiar enough with the internals of the Raft loop to really stamp this as good. Maybe @pavelkalinnikov can do that. Finally, I don't love the way we are overloading shard 0 as the "high priority" shard, but it's probably OK also.
Reviewed 9 of 9 files at r4, 7 of 7 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @pavelkalinnikov)
pkg/kv/kvserver/replica_init.go
line 400 at r5 (raw file):
r.mu.state.Desc = desc

// Prioritize the NodeLiveness Range in the Raft scheduler above all other
nit: Change this to say "at high priority"
pkg/kv/kvserver/scheduler.go
line 35 at r5 (raw file):
const priorityWorkers = 2

var priorityIDsValue = unsafe.Pointer(new(bool))
nit: I don't understand this pattern of using unsafe.Pointer; at a minimum, add a comment to describe it better.
pkg/kv/kvserver/scheduler.go
line 176 at r5 (raw file):
b := raftSchedulerBatchPool.Get().(*raftSchedulerBatch)
if len(b.rangeIDs) != numShards {
	b.rangeIDs = make([][]roachpb.RangeID, numShards)

Don't we lose the previous b.rangeIDs here by recreating this? It seems like you should have copied the existing ones over. The old b.priorityIDs would not be lost, though. This seems wrong (but I could be missing something). Have you tested calling this with a different numShards?
pkg/kv/kvserver/scheduler.go
line 183 at r5 (raw file):
// Cache the priority range IDs in an owned map, since we expect this to be
// very small or empty and we do a lookup for every Add() call.
priorityIDs.Range(func(id int64, _ unsafe.Pointer) bool {

nit: I didn't know about int_map.Range. Should we create a PR to convert this to use generics? I'm not sure if there is a perf benefit. Also, we should consider adding int_map.KeyRange, since it would avoid an unnecessary load of the value.
Sure, I'll get a review from @pavelkalinnikov.
I don't love the way we are overloading shard 0 as the "high priority" shard, but it's probably OK also.

What would you suggest? Maybe a separate priorityShard field or some such? I think it'll become a bit unwieldy, because we'd have to special-case or duplicate code every time we loop over shards, and the shard addressing becomes two-dimensional. Any better ideas?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @nvanbenschoten, and @pavelkalinnikov)
pkg/kv/kvserver/replica_init.go
line 400 at r5 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
nit: Change this to say "at high priority"
Done, is something like this what you had in mind?
pkg/kv/kvserver/scheduler.go
line 35 at r5 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
nit: I don't understand this pattern of using unsafe.Pointer, at a minimum, add a comment to describe it better.
IntMap takes unsafe.Pointer values. We don't care about the map value, so we can reuse the same value for every key, avoiding additional allocations. They don't really matter here though, tbh. Added a comment.
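For illustration, a minimal sketch of the shared-sentinel pattern. The map here is a simple mutex-guarded stand-in for the concurrent IntMap, so only the reuse of a single unsafe.Pointer value across keys is the point.

```go
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

// The map value type is unsafe.Pointer, but the scheduler only cares about
// key membership. Reusing one sentinel pointer for every key avoids a fresh
// allocation per Store call.
var priorityIDsValue = unsafe.Pointer(new(bool))

// intMap stands in for the concurrent int64-keyed map used in scheduler.go;
// a mutex-guarded map keeps this sketch self-contained.
type intMap struct {
	mu sync.Mutex
	m  map[int64]unsafe.Pointer
}

func (m *intMap) Store(key int64, value unsafe.Pointer) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.m == nil {
		m.m = map[int64]unsafe.Pointer{}
	}
	m.m[key] = value
}

func (m *intMap) Load(key int64) (unsafe.Pointer, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	v, ok := m.m[key]
	return v, ok
}

func main() {
	var priorityIDs intMap
	priorityIDs.Store(1, priorityIDsValue) // same sentinel value for every key
	priorityIDs.Store(3, priorityIDsValue)
	_, isPriority := priorityIDs.Load(3)
	fmt.Println(isPriority) // true
}
```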
pkg/kv/kvserver/scheduler.go
line 176 at r5 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
Don't we lose the previous b.rangeIDs here by recreating this? It seems like you should have copied the existing ones over. The old b.priorityIDs would not be lost though. This seems wrong (but I could be missing something). Have you tested calling this with a different numShards?
We do. However, numShards only ever changes in tests, where it doesn't matter that we're constructing a new slice instead of reusing the existing one -- in practice, this path was only taken when we got a newly created batch with rangeIDs = nil.

But I added some code to reuse the slice where possible, to make it obviously correct.
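A rough sketch of the slice-reuse idea, under the assumption that the batch keeps a per-shard slice of range IDs; the names and types here are stand-ins rather than the actual batch code.

```go
package main

import "fmt"

// rangeID stands in for roachpb.RangeID in this sketch.
type rangeID int64

// resizeShards grows or shrinks the per-shard ID slices to numShards, reusing
// the existing backing arrays when possible so that pooled batches don't
// reallocate on the common path where numShards is unchanged.
func resizeShards(rangeIDs [][]rangeID, numShards int) [][]rangeID {
	if cap(rangeIDs) >= numShards {
		rangeIDs = rangeIDs[:numShards]
	} else {
		rangeIDs = make([][]rangeID, numShards)
	}
	for i := range rangeIDs {
		rangeIDs[i] = rangeIDs[i][:0] // reset length, keep each shard's capacity
	}
	return rangeIDs
}

func main() {
	b := make([][]rangeID, 4)
	b = resizeShards(b, 2)      // reuses the existing backing array
	fmt.Println(len(b), cap(b)) // 2 4
}
```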
pkg/kv/kvserver/scheduler.go
line 183 at r5 (raw file):
Should we create a PR to convert this to use generics?
Yeah, it could probably be done if we handle all of the atomic pointer fiddling internally. It might hamper performance though, because generics prevent certain compiler optimizations.
In any case, that seems like a distraction right now -- we have more important things to do, but I wrote up #103147.
we should consider adding int_map.KeyRange since it would avoid an unnecessary load of the value
We'd really want an IntSet or some such; we don't care about the value here at all. I can write up a quick prototype and see if it matters at all performance-wise, but I suspect latency will be dominated by the actual enqueueing rather than this tiny little loop here. It might matter slightly for the IntMap lookup when we enqueue individual ranges, though.
Force-pushed from bb04272 to e3fb687
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @nvanbenschoten, and @pavelkalinnikov)
pkg/kv/kvserver/scheduler.go
line 183 at r5 (raw file):
I can write up a quick prototype and see if it matters at all
Gave this a quick shot, but we still need to atomically load the value to look for a deletion marker in order to handle concurrent deletions. Fetching an IntMap value is just doing an atomic pointer load, which is exactly equivalent to looking for a set deletion marker, so there's nothing to save by getting rid of the value (we'd be loading an atomic bool rather than a pointer, but I benchmarked them and the performance is the same).
There might be a better way to design a lock-free set, but I'm not going to do that now (could probably nerd-snipe @pavelkalinnikov into it).
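To illustrate the point about the atomic load: a toy sketch of a set-style entry that still has to atomically load a pointer and compare it against a deletion marker, so dropping the stored value wouldn't remove that load. This is only an illustration under that assumption, not the IntMap implementation.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// deleted is a sentinel marking a logically removed entry, similar in spirit
// to the deletion markers used by sync.Map-style structures.
var deleted = unsafe.Pointer(new(bool))

// entry holds the value pointer for one key. Membership checks must atomically
// load p and compare it against the deletion marker, so a value-less "IntSet"
// would still pay for the same atomic load.
type entry struct {
	p unsafe.Pointer
}

func (e *entry) present() bool {
	p := atomic.LoadPointer(&e.p)
	return p != nil && p != deleted
}

func main() {
	var e entry
	atomic.StorePointer(&e.p, unsafe.Pointer(new(bool)))
	fmt.Println(e.present()) // true
	atomic.StorePointer(&e.p, deleted)
	fmt.Println(e.present()) // false
}
```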
This patch adds `priority=false|true` variants to measure enqueue performance when there are 0 or 1 priority ranges, respectively (the common cases, given that only the liveness range currently has priority).

Epic: none
Release note: None
Force-pushed from e3fb687 to 973d3f8
Added.
LGTM. Looks like some tests need fixing.
Previously, the Raft scheduler only supported a single priority range (the liveness range). However, other critical system ranges should also get priority, in particular when we move responsibility for expiration-based lease extensions to the Raft scheduler.

This patch implements prioritization of multiple ranges. It does so by creating a separate Raft scheduler shard at index 0 with 2 worker goroutines reserved for priority ranges, configurable via `COCKROACH_SCHEDULER_PRIORITY_SHARD_SIZE`. This has the added benefit of avoiding scheduler starvation by non-priority ranges (although unlikely in practice), and of amortizing the priority check costs at enqueue time rather than incurring them for every dequeue. These goroutines do have a minor cost, especially considering many nodes may not have any priority ranges at all, but the added complexity and risk of dynamically sizing this worker pool down to 0 does not appear justified to avoid 2 idle goroutines.

Benchmarks show some additional enqueue overhead, but this approaches 0% when there are no priority ranges (`priority=false`) and 12% when there is a single priority range (`priority=true`). In absolute terms the overhead is negligible, considering we only enqueue ticks every 500 ms.

```
name                                                                               old time/op  new time/op  delta
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=1/workers=64-24       158ns ± 3%   201ns ± 2%  +26.58%  (p=0.000 n=10+9)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=10/workers=64-24      696ns ± 3%   777ns ± 2%  +11.55%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=100/workers=64-24     5.69µs ± 2%  6.12µs ± 1%   +7.59%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=1000/workers=64-24    59.9µs ± 1%  62.4µs ± 1%   +4.19%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=10000/workers=64-24   622µs ± 1%   650µs ± 1%    +4.49%  (p=0.000 n=9+9)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=100000/workers=64-24  8.30ms ± 6%  8.15ms ± 1%        ~  (p=0.211 n=10+9)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=1/workers=64-24        158ns ± 1%   259ns ± 2%  +64.58%  (p=0.000 n=9+10)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=10/workers=64-24       696ns ± 2%   985ns ± 4%  +41.54%  (p=0.000 n=9+10)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=100/workers=64-24      5.68µs ± 2%  6.96µs ± 2%  +22.65%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=1000/workers=64-24     59.8µs ± 1%  70.1µs ± 1%  +17.34%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=10000/workers=64-24    625µs ± 1%   725µs ± 1%   +15.96%  (p=0.000 n=10+9)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=100000/workers=64-24   8.23ms ± 5%  9.27ms ± 3%  +12.56%  (p=0.000 n=10+10)
```

Epic: none
Release note: None
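As a rough illustration of the enqueue-time routing described above (the actual scheduler code differs; the names and hashing scheme here are assumptions): priority ranges are routed to the reserved shard at index 0, and all other ranges are spread across the remaining shards, so the priority check is paid once per enqueue rather than on every dequeue.

```go
package main

import "fmt"

// rangeID stands in for roachpb.RangeID in this sketch.
type rangeID int64

const priorityShardIdx = 0

// shardIndex routes a range to a scheduler shard: priority ranges go to the
// reserved shard 0, everything else hashes across the remaining shards.
// numShards includes the priority shard and must be at least 2 here.
func shardIndex(id rangeID, numShards int, isPriority func(rangeID) bool) int {
	if isPriority(id) {
		return priorityShardIdx
	}
	return 1 + int(uint64(id)%uint64(numShards-1))
}

func main() {
	priority := map[rangeID]bool{2: true} // e.g. the liveness range
	isPriority := func(id rangeID) bool { return priority[id] }
	for _, id := range []rangeID{1, 2, 3, 7} {
		fmt.Printf("range %d -> shard %d\n", id, shardIndex(id, 4, isPriority))
	}
}
```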
Previously, only the node liveness range was prioritized in the Raft scheduler. This patch also prioritizes scheduling for the meta ranges. Epic: none Release note: None
Force-pushed from 973d3f8 to a69d0ea
TFTR! I'll give @andrewbaptist a chance to respond here before borsing. I also had a look at Telemetry, which is a moderately large cluster (8 nodes with 100k ranges and 35k replicas per node), and it had 4 meta ranges in addition to liveness. With RF=5 on system ranges, each node thus had 3 replicas to process in the priority shard. Since clusters will generally not have more than 50k replicas per node or so (15 TB at 300 MB average range size), I think we can assume that typically the priority shard will have no more than 5 low-traffic replicas, which should be manageable for 2 workers.
bors r+

👎 Rejected by code reviews

bors r+

Build succeeded:
kvserver: add priority range to BenchmarkSchedulerEnqueueRaftTicks

This patch adds `priority=false|true` variants to measure enqueue performance when there are 0 or 1 priority ranges, respectively (the common cases, given that only the liveness range currently has priority).

Epic: none
Release note: None
kvserver: support multiple priority ranges in Raft scheduler
Previously, the Raft scheduler only supported a single priority range (the liveness range). However, other critical system ranges should also get priority, in particular when we move responsibility for expiration-based lease extensions to the Raft scheduler.
This patch implements prioritization of multiple ranges. It does so by creating a separate Raft scheduler shard at index 0 with 2 worker goroutines reserved for priority ranges, configurable via `COCKROACH_SCHEDULER_PRIORITY_SHARD_SIZE`. This has the added benefit of avoiding scheduler starvation by non-priority ranges (although unlikely in practice), and of amortizing the priority check costs at enqueue time rather than incurring them for every dequeue. These goroutines do have a minor cost, especially considering many nodes may not have any priority ranges at all, but the added complexity and risk of dynamically sizing this worker pool down to 0 does not appear justified to avoid 2 idle goroutines.

Benchmarks show some additional enqueue overhead, but this approaches 0% when there are no priority ranges (`priority=false`) and 12% when there is a single priority range (`priority=true`). In absolute terms the overhead is negligible, considering we only enqueue ticks every 500 ms.

Epic: none
Release note: None
kvserver: prioritize meta ranges in Raft scheduler
Previously, only the node liveness range was prioritized in the Raft scheduler. This patch also prioritizes scheduling for the meta ranges.
Resolves #101013.
Epic: none
Release note: None