kv: add raft.scheduler.latency histogram metric #56943
Conversation
Force-pushed from 8171519 to 27a0fa8
#56860 just merged, so this is now rebased.
57007: kv: eliminate heap allocations throughout Raft messaging and processing logic r=nvanbenschoten a=nvanbenschoten

This PR contains a few improvements within the Raft messaging and processing code to avoid common heap allocations that stuck out during tests in #56860. This will improve the general efficiency of the Raft pipeline and will specifically reduce the cost of handling large coalesced heartbeats. This will reduce the impact that these heartbeats have on the [scheduler latency](#56943) of other active ranges.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@petermattis do you have any interest in reviewing this change? It's not a problem if not, though I think you still have the most context on this code.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
    // range ID for further processing.
    s.mu.queue.Push(id)
    s.mu.cond.Signal()

I believe the intention behind this call to `Signal` is to be able to wake up another worker so we possibly get more concurrency. The queue may not have been empty before `id` was added, so this worker will pick up a different `id` to process.
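To illustrate the push-then-signal pattern being discussed, here is a minimal sketch using `sync.Cond`. The `rangeQueue` type and its fields are hypothetical stand-ins, not the actual `scheduler.go` code:

```go
package scheduler

import "sync"

// rangeQueue is a toy stand-in for the scheduler's range-ID queue; the
// names here are illustrative and not the CockroachDB code.
type rangeQueue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	queue []int64
}

func newRangeQueue() *rangeQueue {
	q := &rangeQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// enqueue pushes a range ID and calls Signal, which wakes at most one
// goroutine blocked in cond.Wait (unlike Broadcast, which wakes all of
// them). Waking one idle worker per newly queued ID is what gives the
// scheduler its concurrency.
func (q *rangeQueue) enqueue(id int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.queue = append(q.queue, id)
	q.cond.Signal()
}
```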
pkg/kv/kvserver/scheduler.go, line 159 at r2 (raw file):
    type raftScheduleState struct {
        flags raftScheduleFlags
        begin int64 // nanoseconds

Is there a reason you're recording the start time as nanos rather than using `time.Time`? Using `time.Time` would get you the monotonic clock handling, which I don't think you get if you're using nanos.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis)
pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
Wouldn't the goroutine that enqueued any other range IDs have already signaled the condition variable a sufficient number of times? So in the case where there was one other range in the queue, shouldn't one other worker have already been awoken? This was so deliberate before that I feel like I'm missing something.
> so this worker will pick up a different `id` to process.
This is interesting. I'd actually expect this to be undesirable, all else being equal, due to cache locality effects. Are there reasons why we'd want shuffling between worker goroutines and range IDs?
pkg/kv/kvserver/scheduler.go, line 159 at r2 (raw file):
I didn't want to stick a pointer (`time.Time.loc`) or an extra 64-bit int (`time.Time.ext`) in the `raftScheduleState` because of the impact that doing so would have on the size of the hashmap and on its visibility to the GC.
> Using `time.Time` would get you the monotonic clock handling, which I don't think you get if you're using nanos.
This is a good question. It doesn't look like `UnixNano` provides monotonic clock handling.
The histogram seems to take the abs value of negative values, so the worst that this would lead to is inaccurate latency values during a clock jump. I don't see a huge issue with that, given the trade-offs, but I'm curious whether you feel differently.
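To make the trade-off concrete, here is a minimal sketch contrasting the two options; the type and method names are illustrative, not the code under review:

```go
package scheduler

import "time"

// Option A: store wall-clock nanoseconds. Compact (a single int64 in
// the per-range state, nothing for the GC to trace), but the later
// subtraction uses the wall clock, so a clock jump between enqueue and
// dequeue can skew the measured latency.
type stateNanos struct {
	begin int64 // nanoseconds since the Unix epoch
}

func (s *stateNanos) start() {
	s.begin = time.Now().UnixNano()
}

func (s *stateNanos) elapsed() time.Duration {
	return time.Duration(time.Now().UnixNano() - s.begin)
}

// Option B: store a time.Time. time.Since uses the monotonic reading
// embedded in the value, so it is immune to wall-clock jumps, at the
// cost of a larger struct: an extra int64 (ext) plus a *Location
// pointer (loc) that the GC must follow.
type stateTime struct {
	begin time.Time
}

func (s *stateTime) start() {
	s.begin = time.Now()
}

func (s *stateTime) elapsed() time.Duration {
	return time.Since(s.begin)
}
```

The choice above trades measurement robustness against per-range memory and GC overhead, which is the crux of the discussion in this thread.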
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
> Wouldn't the goroutine that enqueued any other range IDs have already signaled the condition variable a sufficient number of times? So in the case where there was one other range in the queue, shouldn't one other worker have already been awoken? This was so deliberate before that I feel like I'm missing something.
We only signal if `enqueue1Locked` actually adds to the queue. If we call `enqueue1Locked` while a range is being processed, we don't signal. Whether this is all worthwhile is something I'm not sure of.
> This is interesting. I'd actually expect this to be undesirable, all else being equal, due to cache locality effects. Are there reasons why we'd want shuffling between worker goroutines and range IDs?
No, we wouldn't want the ID to be shuffled, but how to prevent that? This code is quite old and I've forgotten much of the rationale behind it. It is quite possibly due for a rethink.
pkg/kv/kvserver/scheduler.go, line 159 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

> I didn't want to stick a pointer (`time.Time.loc`) or an extra 64-bit int (`time.Time.ext`) in the `raftScheduleState` because of the impact that doing so would have on the size of the hashmap and on its visibility to the GC.
>
> > Using `time.Time` would get you the monotonic clock handling, which I don't think you get if you're using nanos.
>
> This is a good question. It doesn't look like `UnixNano` provides monotonic clock handling. The histogram seems to take the abs value of negative values, so the worst that this would lead to is inaccurate latency values during a clock jump. I don't see a huge issue with that, given the trade-offs, but I'm curious whether you feel differently.
Ack. Ok. Probably doesn't matter much as clock jumps should be rare in comparison to these latency calculations.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis)
pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
> We only signal if `enqueue1Locked` actually adds to the queue. If we call `enqueue1Locked` while a range is being processed, we don't signal. Whether this is all worthwhile is something I'm not sure of.
Right, but do we need to signal again if we know that a range is currently being processed? Doesn't the fact that the range is being processed (`newState&stateQueued != 0`) indicate that a goroutine is already awake and won't go back to sleep before either processing the same range again or grabbing a new range, allowing a different goroutine that was awoken for the new range to process the repeated range?
Put another way, it seems like we want `awake_workers = min(max_workers, num_ranges)` at all times for optimal concurrency. If `awake_workers = cur_awake_workers + num_signals`, then this means we need `num_signals = min(max_workers, num_ranges) - cur_awake_workers`. So if we re-enqueue a range that's currently being processed, then `num_ranges` doesn't change, so we shouldn't need another signal if we never allow the worker that processed the range to go back to sleep.
> No, we wouldn't want the ID to be shuffled, but how to prevent that?
Yeah, this seems hard to prevent, especially if we want to ensure fairness between goroutines.
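A hedged sketch of the invariant described above, with invented types and names rather than the actual scheduler code: signal only when a range newly becomes runnable, and do nothing extra when a range that is already queued or already being processed is re-enqueued.

```go
package scheduler

import "sync"

// rangeState tracks where a range is in its scheduling lifecycle. The
// type and constant names are illustrative, not the actual scheduler
// code.
type rangeState int

const (
	stateIdle rangeState = iota
	stateQueued
	stateProcessing
	stateProcessingAndQueued // re-enqueued while being processed
)

type scheduler struct {
	mu    sync.Mutex
	cond  *sync.Cond
	state map[int64]rangeState
	queue []int64
}

// enqueueLocked queues id and signals only when a new, not currently
// processed range becomes runnable. Re-enqueuing a range that is being
// processed doesn't change num_ranges: the worker handling it will see
// the flag and process it again itself, so no extra signal is needed.
func (s *scheduler) enqueueLocked(id int64) {
	switch s.state[id] {
	case stateIdle:
		s.state[id] = stateQueued
		s.queue = append(s.queue, id)
		s.cond.Signal()
	case stateProcessing:
		// Mark it for another pass, but don't push or signal: the
		// worker currently processing id re-checks its state when it
		// finishes and will run it again itself.
		s.state[id] = stateProcessingAndQueued
	case stateQueued, stateProcessingAndQueued:
		// Already pending; nothing to do.
	}
}
```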
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
> Put another way, it seems like we want `awake_workers = min(max_workers, num_ranges)` at all times for optimal concurrency. If `awake_workers = cur_awake_workers + num_signals`, then this means we need `num_signals = min(max_workers, num_ranges) - cur_awake_workers`. So if we re-enqueue a range that's currently being processed, then `num_ranges` doesn't change, so we shouldn't need another signal if we never allow the worker that processed the range to go back to sleep.
This looks right to me and is a nice way of explaining why the signal isn't necessary here. Mind translating this into a comment?
The current worker is already going to loop back around and try to pull off the queue before waiting, so it can guarantee that someone will process the range. This commit makes me a little nervous because the code seems so deliberate, so I may be missing something, but I've stared at this for a while and don't see what that could be.
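Continuing the sketch above (same illustrative types, not the actual code), the dequeue side could look roughly like this: the worker loops back to the queue before waiting, which is what lets the enqueue side skip the extra signal.

```go
// worker dequeues and processes range IDs. After finishing a range it
// first checks whether the range was re-enqueued while it was being
// processed and, if so, runs it again itself; otherwise it goes
// straight back to the queue and only calls Wait once the queue is
// empty.
func (s *scheduler) worker(process func(id int64)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for {
		for len(s.queue) == 0 {
			s.cond.Wait()
		}
		id := s.queue[0]
		s.queue = s.queue[1:]
		for {
			s.state[id] = stateProcessing
			s.mu.Unlock()
			process(id) // run Raft processing outside the lock
			s.mu.Lock()
			if s.state[id] != stateProcessingAndQueued {
				// Not re-enqueued while we were processing; done.
				s.state[id] = stateIdle
				break
			}
			// Re-enqueued while processing: loop and run it again
			// rather than waking another worker.
		}
	}
}
```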
This commit adds a new `raft.scheduler.latency` metric, which monitors the latency between initially enqueuing a range in the Raft scheduler and that range being picked up by a worker and processed. This metric would be helpful to watch in cases where Raft slows down due to many active ranges.

Release note (ui change): A new metric called `raft.scheduler.latency` was added which monitors the latency for operations to be picked up and processed by the Raft scheduler.
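For readers unfamiliar with how such a metric is wired up, here is a rough sketch of the recording path. It uses the Prometheus client library and invented names purely for illustration; CockroachDB has its own metric package, and this is not the actual implementation.

```go
package scheduler

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// schedulerLatency stands in for the raft.scheduler.latency histogram:
// the queueing delay between a range being enqueued in the Raft
// scheduler and a worker picking it up.
var schedulerLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "raft_scheduler_latency_seconds",
	Help: "Delay between enqueueing a range and a worker processing it.",
})

func init() {
	prometheus.MustRegister(schedulerLatency)
}

// recordLatency would be called by a worker as it dequeues a range
// whose enqueue time was recorded as wall-clock nanoseconds (the begin
// field discussed in this thread).
func recordLatency(beginNanos int64) {
	waited := time.Duration(time.Now().UnixNano() - beginNanos)
	schedulerLatency.Observe(waited.Seconds())
}
```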
Force-pushed from 27a0fa8 to 4c53af1
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @petermattis)
pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
> > Put another way, it seems like we want `awake_workers = min(max_workers, num_ranges)` at all times for optimal concurrency. If `awake_workers = cur_awake_workers + num_signals`, then this means we need `num_signals = min(max_workers, num_ranges) - cur_awake_workers`. So if we re-enqueue a range that's currently being processed, then `num_ranges` doesn't change, so we shouldn't need another signal if we never allow the worker that processed the range to go back to sleep.
>
> This looks right to me and is a nice way of explaining why the signal isn't necessary here. Mind translating this into a comment?
Done.
Build succeeded.