
kvserver: support multiple priority ranges in Raft scheduler #101023

Merged
3 commits merged into cockroachdb:master on May 17, 2023

Conversation

@erikgrinaker erikgrinaker (Contributor) commented Apr 9, 2023

kvserver: add priority range to BenchmarkSchedulerEnqueueRaftTicks

This patch adds priority=false|true variants to measure enqueue performance with 0 or 1 priority ranges, respectively (the common cases, given that only the liveness range currently has priority).

Epic: none
Release note: None

kvserver: support multiple priority ranges in Raft scheduler

Previously, the Raft scheduler only supported a single priority range (the liveness range). However, other critical system ranges should also get priority, in particular when we move responsibility for expiration-based lease extensions to the Raft scheduler.

This patch implements prioritization of multiple ranges. It does so by creating a separate Raft scheduler shard at index 0, with 2 worker goroutines reserved for priority ranges, configurable via COCKROACH_SCHEDULER_PRIORITY_SHARD_SIZE. This has the added benefit of avoiding scheduler starvation by non-priority ranges (although that is unlikely in practice), and of amortizing the priority check cost at enqueue time rather than incurring it on every dequeue. These goroutines do have a minor cost, especially considering that many nodes may not have any priority ranges at all, but the added complexity and risk of dynamically sizing this worker pool down to 0 does not appear justified just to avoid 2 idle goroutines.
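For illustration only, here is a minimal sketch of the enqueue-time shard routing this describes; the function name and the modulo spreading are hypothetical simplifications, not code from the patch.

```go
package main

import "fmt"

// shardIndex routes a range to a scheduler shard: priority ranges go to the
// dedicated shard at index 0, all others are spread across the remaining
// shards. numShards counts all shards, including the priority shard.
func shardIndex(rangeID int64, numShards int, isPriority bool) int {
	if isPriority {
		return 0 // reserved priority shard with its own worker goroutines
	}
	// Spread non-priority ranges across shards [1, numShards).
	return 1 + int(uint64(rangeID)%uint64(numShards-1))
}

func main() {
	fmt.Println(shardIndex(2, 5, true))   // 0: e.g. the liveness range
	fmt.Println(shardIndex(42, 5, false)) // 1..4: a regular range
}
```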

Benchmarks show some additional enqueue overhead, but this approaches 0% when there are no priority ranges (priority=false) and 12% when there is a single priority range (priority=true). In absolute terms the overhead is negligible, considering we only enqueue ticks every 500 ms.

name                                                                               old time/op  new time/op  delta
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=1/workers=64-24        158ns ± 3%   201ns ± 2%  +26.58%  (p=0.000 n=10+9)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=10/workers=64-24       696ns ± 3%   777ns ± 2%  +11.55%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=100/workers=64-24     5.69µs ± 2%  6.12µs ± 1%   +7.59%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=1000/workers=64-24    59.9µs ± 1%  62.4µs ± 1%   +4.19%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=10000/workers=64-24    622µs ± 1%   650µs ± 1%   +4.49%  (p=0.000 n=9+9)
SchedulerEnqueueRaftTicks/collect=true/priority=false/ranges=100000/workers=64-24  8.30ms ± 6%  8.15ms ± 1%     ~     (p=0.211 n=10+9)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=1/workers=64-24         158ns ± 1%   259ns ± 2%  +64.58%  (p=0.000 n=9+10)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=10/workers=64-24        696ns ± 2%   985ns ± 4%  +41.54%  (p=0.000 n=9+10)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=100/workers=64-24      5.68µs ± 2%  6.96µs ± 2%  +22.65%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=1000/workers=64-24     59.8µs ± 1%  70.1µs ± 1%  +17.34%  (p=0.000 n=10+10)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=10000/workers=64-24     625µs ± 1%   725µs ± 1%  +15.96%  (p=0.000 n=10+9)
SchedulerEnqueueRaftTicks/collect=true/priority=true/ranges=100000/workers=64-24   8.23ms ± 5%  9.27ms ± 3%  +12.56%  (p=0.000 n=10+10)

Epic: none
Release note: None

kvserver: prioritize meta ranges in Raft scheduler

Previously, only the node liveness range was prioritized in the Raft scheduler. This patch also prioritizes scheduling for the meta ranges.

Resolves #101013.

Epic: none
Release note: None

@erikgrinaker erikgrinaker self-assigned this Apr 9, 2023
@erikgrinaker erikgrinaker requested a review from a team as a code owner April 9, 2023 17:47
@erikgrinaker erikgrinaker requested a review from a team April 9, 2023 17:47
@cockroach-teamcity (Member): This change is Reviewable

@erikgrinaker erikgrinaker changed the title from "Raft scheduler priority" to "kvserver: support multiple priority ranges in Raft scheduler" on Apr 9, 2023
@erikgrinaker erikgrinaker force-pushed the raft-scheduler-priority branch 2 times, most recently from 2aea6b7 to 37b809d on April 9, 2023 17:57
pkg/kv/kvserver/scheduler.go:

// Priority shard at index 0.
s.shards = append(s.shards, newRaftSchedulerShard(priorityWorkers, maxTicks))

// Regular shards.
numShards := 1
if shardSize > 0 && numWorkers > shardSize {
	numShards = (numWorkers-1)/shardSize + 1
}
Collaborator

Does numShards include the priority shard or not?

With this arithmetic, it looks like numShards is at least 2 (numWorkers > shardSize => numWorkers-1 >= shardSize => (numWorkers-1)/shardSize + 1 >= 2).

So, with the append loop below, we'll end up with at least 3 shards. But it looked like we want the minimum number of shards to be 2 (priority + non-priority)?

@erikgrinaker erikgrinaker (Contributor, Author) commented Apr 11, 2023

It excludes the priority shard. If numWorkers <= shardSize then we keep numShards = 1 from above, but it shouldn't make any difference if we fall through either (except for the 0 cases) -- the form (x-1)/y+1 is a trick to do ceiling division, i.e. it yields 3 for 5/2 instead of the 2 that integer division gives. Either way, we'll end up with numShards being 1 at minimum. Added comments for this.
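A minimal, self-contained sketch of the ceiling-division trick and the shard-count logic quoted above (the standalone function is illustrative, not the actual newRaftScheduler code):

```go
package main

import "fmt"

// numRegularShards returns the number of non-priority shards for the given
// worker count and shard size, mirroring the quoted logic. (x-1)/y + 1 is
// ceiling division for x > 0.
func numRegularShards(numWorkers, shardSize int) int {
	numShards := 1
	if shardSize > 0 && numWorkers > shardSize {
		numShards = (numWorkers-1)/shardSize + 1
	}
	return numShards
}

func main() {
	fmt.Println(numRegularShards(16, 16)) // 1: fits in a single shard
	fmt.Println(numRegularShards(64, 16)) // 4: divides evenly
	fmt.Println(numRegularShards(65, 16)) // 5: rounds up, not down
}
```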

@andrewbaptist (Collaborator)

I don't remember the full rationale for this, but I'd like to know if we still need this. If we are using all expiration-based leases, then the liveness range is less important as the expiration time won't be used at all. If we use epoch-based leases, then the liveness range is still the most important to keep alive. So maybe the right direction is to leave this for liveness only in 23.2 and if we make more progress on removing epoch leases to drop this prioritization completely.

Was there a specific problem that this is solving?

@erikgrinaker erikgrinaker (Contributor, Author) commented Apr 29, 2023

Previously, expiration leases for the meta and liveness ranges were eagerly extended in the store lease renewer. Now, expiration leases are extended by the Raft scheduler, and no longer fall back to extensions during request processing in the latter half of the lease interval. There is some concern that this additional dependency on the Raft scheduler could contribute to meta/liveness unavailability under overload -- especially with the removal of quiescence which puts significantly more pressure on the scheduler.

You're right that this likely becomes less important with expiration leases, but the liveness and meta ranges still serve a fundamental role in the system, and starvation could worsen unavailability or instability even with expiration leases. Given that, I think it makes sense to prioritize these during scheduling, although I suppose the same argument could be made for other SQL-level system ranges.

This prioritization was originally introduced in #56860, following several escalations where liveness range scheduler starvation caused prolonged cluster outages.

@andrewbaptist andrewbaptist (Collaborator) left a comment

This mostly all looks good, although I'm not familiar enough with the internals of the Raft loop to really stamp this as good. Maybe @pavelkalinnikov can do that. Finally, I don't love the way we are overloading shard 0 as the "high priority" shard, but it's probably OK.

Reviewed 9 of 9 files at r4, 7 of 7 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @pavelkalinnikov)


pkg/kv/kvserver/replica_init.go line 400 at r5 (raw file):

	r.mu.state.Desc = desc

	// Prioritize the NodeLiveness Range in the Raft scheduler above all other

nit: Change this to say "at high priority"


pkg/kv/kvserver/scheduler.go line 35 at r5 (raw file):

const priorityWorkers = 2

var priorityIDsValue = unsafe.Pointer(new(bool))

nit: I don't understand this pattern of using unsafe.Pointer; at a minimum, add a comment to describe it better.


pkg/kv/kvserver/scheduler.go line 176 at r5 (raw file):

b := raftSchedulerBatchPool.Get().(*raftSchedulerBatch)
if len(b.rangeIDs) != numShards {
	b.rangeIDs = make([][]roachpb.RangeID, numShards)

Don't we lose the previous b.rangeIDs here by recreating this? It seems like you should have copied the existing ones over. The old b.priorityIDs would not be lost though. This seems wrong (but I could be missing something). Have you tested calling this with a different numShards?


pkg/kv/kvserver/scheduler.go line 183 at r5 (raw file):

	// Cache the priority range IDs in an owned map, since we expect this to be
	// very small or empty and we do a lookup for every Add() call.
	priorityIDs.Range(func(id int64, _ unsafe.Pointer) bool {

nit: I didn't know about int_map.Range. Should we create a PR to convert this to use generics? I'm not sure if there is a perf benefit. Also, we should consider adding int_map.KeyRange, since it would avoid an unnecessary load of the value.

@erikgrinaker erikgrinaker (Contributor, Author) left a comment

Sure, I'll get a review from @pavelkalinnikov.

I don't love the way we are overloading shard 0 as the "high priority" shard, but it's probably OK.

What would you suggest? Maybe a separate priorityShard field or some such? I think it'll become a bit unwieldy, because we have to special-case or duplicate code every time we loop over shards, and the shard addressing becomes two-dimensional. Any better ideas?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @nvanbenschoten, and @pavelkalinnikov)


pkg/kv/kvserver/replica_init.go line 400 at r5 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

nit: Change this to say "at high priority"

Done, is something like this what you had in mind?


pkg/kv/kvserver/scheduler.go line 35 at r5 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

nit: I don't understand this pattern of using unsafe.Pointer, at a minimum, add a comment to describe it better.

IntMap takes unsafe.Pointer values. We don't care about the map value, so we can reuse the same value for every key, avoiding additional allocations. They don't really matter here though, tbh. Added a comment.
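A minimal sketch of the shared-sentinel pattern described above, using a hypothetical rangeIDSet type backed by sync.Map as a stand-in for the lock-free IntMap (not the actual kvserver code):

```go
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

// sentinel is the single shared value stored for every key; it is never read.
var sentinel = unsafe.Pointer(new(bool))

// rangeIDSet is a stand-in for an IntMap used as a set of range IDs: only key
// membership matters, so every key maps to the same sentinel pointer and no
// per-key value allocation is needed.
type rangeIDSet struct {
	m sync.Map // int64 -> unsafe.Pointer
}

func (s *rangeIDSet) Add(id int64)           { s.m.Store(id, sentinel) }
func (s *rangeIDSet) Contains(id int64) bool { _, ok := s.m.Load(id); return ok }

func main() {
	var s rangeIDSet
	s.Add(1) // e.g. the liveness range
	s.Add(2) // e.g. a meta range
	fmt.Println(s.Contains(1), s.Contains(3)) // true false
}
```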


pkg/kv/kvserver/scheduler.go line 176 at r5 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

Don't we lose the previous b.rangeIDs here by recreating this? It seems like you should have copied the existing ones over. The old b.priorityIDs would not be lost though. This seems wrong (but I could be missing something). Have you tested calling this with a different numShards?

We do. However, numShards only ever changes in tests, where it doesn't matter that we're constructing a new slice instead of reusing the existing one -- in practice, this path was only taken when we got a newly created batch with rangeIDs = nil.

But I added some code to reuse the slice where possible, to make it obviously correct.
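A hypothetical sketch of what "reuse the slice where possible" can look like for a pooled batch (simplified names, not the actual raftSchedulerBatch code): resize the outer per-shard slice only when needed, keeping existing inner slices so their capacity survives pool round-trips.

```go
package main

import "fmt"

// resizeShardSlices resizes the outer per-shard slice to numShards, reusing
// the existing backing array and inner slices when possible.
func resizeShardSlices(rangeIDs [][]int64, numShards int) [][]int64 {
	if cap(rangeIDs) >= numShards {
		rangeIDs = rangeIDs[:numShards]
	} else {
		rangeIDs = append(rangeIDs[:cap(rangeIDs)], make([][]int64, numShards-cap(rangeIDs))...)
	}
	for i := range rangeIDs {
		rangeIDs[i] = rangeIDs[i][:0] // clear stale contents but keep capacity
	}
	return rangeIDs
}

func main() {
	b := make([][]int64, 4)
	b[0] = append(b[0], 7) // simulate a pooled batch with stale contents
	b = resizeShardSlices(b, 4)
	fmt.Println(len(b), len(b[0])) // 4 0
}
```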


pkg/kv/kvserver/scheduler.go line 183 at r5 (raw file):

Should we create a PR to convert this to use generics?

Yeah, it could probably be done if we handle all of the atomic pointer fiddling internally. It might hamper performance though, because generics prevent certain compiler optimizations.

In any case, that seems like a distraction right now -- we have more important things to do, but I wrote up #103147.

we should consider adding int_map.KeyRange since it would avoid an unnecessary load of the value

We'd really want an IntSet or some such; we don't care about the value here at all. I can write up a quick prototype and see if it matters at all performance-wise, but I suspect latency will be dominated by the actual enqueueing rather than this tiny little loop here. It might matter slightly for the IntMap lookup when we enqueue individual ranges though.

@erikgrinaker erikgrinaker force-pushed the raft-scheduler-priority branch from bb04272 to e3fb687 on May 11, 2023 22:30
@erikgrinaker erikgrinaker requested a review from pav-kv May 11, 2023 22:30
@erikgrinaker erikgrinaker (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist, @nvanbenschoten, and @pavelkalinnikov)


pkg/kv/kvserver/scheduler.go line 183 at r5 (raw file):

I can write up a quick prototype and see if it matters at all

Gave this a quick shot, but we still need to atomically load the value to look for a deletion marker in order to handle concurrent deletions. Fetching an IntMap value is just doing an atomic pointer load, which is exactly equivalent to looking for a set deletion marker, so there's nothing to save by getting rid of the value (we'd be loading an atomic bool rather than a pointer, but I benchmarked them and the performance is the same).

There might be a better way to design a lock-free set, but I'm not going to do that now (could probably nerd-snipe @pavelkalinnikov into it).
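A minimal sketch of the comparison above, with hypothetical entry types (not the actual IntMap internals): checking presence via an atomic value-pointer load versus via an atomic deletion-marker load. Both are a single atomic load, which is why dropping the map value saves nothing.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// entryWithValue stores a value pointer; nil means "deleted".
type entryWithValue struct {
	val atomic.Pointer[bool]
}

func (e *entryWithValue) present() bool { return e.val.Load() != nil }

// entryWithMarker stores no value, only a deletion marker.
type entryWithMarker struct {
	deleted atomic.Bool
}

func (e *entryWithMarker) present() bool { return !e.deleted.Load() }

func main() {
	var a entryWithValue
	a.val.Store(new(bool))
	var b entryWithMarker
	fmt.Println(a.present(), b.present()) // true true, each via one atomic load
}
```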

@erikgrinaker erikgrinaker force-pushed the raft-scheduler-priority branch from e3fb687 to 973d3f8 on May 14, 2023 22:35
@erikgrinaker erikgrinaker (Contributor, Author)

Added COCKROACH_SCHEDULER_PRIORITY_SHARD_SIZE to configure the number of priority workers per store (minimum 1), just in case we'll need to bump it.
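A hypothetical sketch (not the actual CockroachDB code) of how such an environment override could be read, with the default of 2 priority workers and the documented floor of 1, using only the standard library:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// priorityShardSize returns the number of priority workers per store.
func priorityShardSize() int {
	const defaultWorkers = 2
	v := os.Getenv("COCKROACH_SCHEDULER_PRIORITY_SHARD_SIZE")
	if v == "" {
		return defaultWorkers
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return defaultWorkers
	}
	if n < 1 {
		n = 1 // enforce the documented minimum of 1 priority worker
	}
	return n
}

func main() {
	fmt.Println(priorityShardSize())
}
```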

@pav-kv pav-kv (Collaborator) left a comment

LGTM. Looks like some tests need fixing.

@erikgrinaker erikgrinaker force-pushed the raft-scheduler-priority branch from 973d3f8 to a69d0ea on May 15, 2023 13:26
@erikgrinaker erikgrinaker (Contributor, Author) commented May 15, 2023

TFTR! I'll give @andrewbaptist a chance to respond here before borsing.

I also had a look at Telemetry, which is a moderately large cluster (8 nodes with 100k ranges and 35k replicas per node), and it had 4 meta ranges in addition to liveness. With RF=5 on system ranges, that's 25 priority replicas spread across 8 nodes, so each node had about 3 replicas to process in the priority shard. Since clusters will generally not have more than 50k replicas per node or so (15 TB at 300 MB average range size), I think we can assume that the priority shard will typically hold no more than 5 low-traffic replicas, which should be manageable for 2 workers.

@erikgrinaker erikgrinaker (Contributor, Author)

bors r+

@craig craig bot (Contributor) commented May 17, 2023

👎 Rejected by code reviews

@erikgrinaker erikgrinaker (Contributor, Author)

bors r+

@craig craig bot (Contributor) commented May 17, 2023

Build succeeded:

@craig craig bot merged commit 1e86a91 into cockroachdb:master May 17, 2023
@erikgrinaker erikgrinaker deleted the raft-scheduler-priority branch May 30, 2023 12:17