
kv: add raft.scheduler.latency histogram metric #56943

Merged

Conversation

nvanbenschoten
Member

@nvanbenschoten nvanbenschoten commented Nov 20, 2020

This commit adds a new raft.scheduler.latency metric, which monitors
the latency between initially enqueuing a range in the Raft scheduler
and that range being picked up by a worker and processed. This metric
would be helpful to watch in cases where Raft slows down due to many
active ranges.

Release note (ui change): A new metric called raft.scheduler.latency
was added which monitors the latency for operations to be picked up
and processed by the Raft scheduler.
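For readers unfamiliar with how this kind of measurement is wired up, here is a minimal, self-contained sketch of the idea (illustrative names only, not the actual kvserver code): the enqueue path stamps the range with the current time, and the worker records the elapsed time into a histogram when it dequeues the range.

```go
package main

import (
	"fmt"
	"time"
)

// scheduledRange is a stand-in for the per-range scheduler state.
type scheduledRange struct {
	id    int64
	begin time.Time // stamped when the range is enqueued
}

// latencyHistogram is a stand-in for the real histogram metric.
type latencyHistogram struct{ samples []time.Duration }

func (h *latencyHistogram) RecordValue(d time.Duration) {
	h.samples = append(h.samples, d)
}

func main() {
	var hist latencyHistogram
	queue := make(chan scheduledRange, 16)

	// Enqueue: remember when the range entered the queue.
	queue <- scheduledRange{id: 42, begin: time.Now()}

	// Worker: the time between enqueue and dequeue is the scheduler latency.
	r := <-queue
	hist.RecordValue(time.Since(r.begin))
	fmt.Println("raft.scheduler.latency sample:", hist.samples[0])
}
```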

@cockroach-teamcity
Member

This change is Reviewable

@nvanbenschoten
Member Author

#56860 just merged, so this is now rebased on master.

craig bot pushed a commit that referenced this pull request Nov 24, 2020
57007: kv: eliminate heap allocations throughout Raft messaging and processing logic r=nvanbenschoten a=nvanbenschoten

This PR contains a few improvements within the Raft messaging and processing code to avoid common heap allocations that stuck out during tests in #56860. This will improve the general efficiency of the Raft pipeline and will specifically reduce the cost of handling large coalesced heartbeats. This will reduce the impact that these heartbeats have on the [scheduler latency](#56943) of other active ranges.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@nvanbenschoten
Member Author

@petermattis do you have any interest in reviewing this change? It's not a problem if not, though I think you still have the most context on this code.

Collaborator

@petermattis petermattis left a comment


Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)


pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):

			// range ID for further processing.
			s.mu.queue.Push(id)
			s.mu.cond.Signal()

I believe the intention behind this call to Signal is to be able to wake up another worker so we possibly get more concurrency. The queue may not have been empty before id was added, so this worker will pick up a different id to process.
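For context, the pattern under discussion is the usual condition-variable worker pool: each Signal is meant to wake at most one sleeping worker per queued item. A simplified, self-contained sketch of that pattern (hypothetical code, not the actual scheduler.go implementation):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type scheduler struct {
	mu    sync.Mutex
	cond  *sync.Cond
	queue []int64 // pending range IDs
}

func newScheduler() *scheduler {
	s := &scheduler{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

func (s *scheduler) enqueue(id int64) {
	s.mu.Lock()
	s.queue = append(s.queue, id)
	// Wake one sleeping worker for the newly queued item. If other items are
	// already queued, their enqueuers signaled for them at the time.
	s.cond.Signal()
	s.mu.Unlock()
}

func (s *scheduler) worker(name string) {
	s.mu.Lock()
	for {
		for len(s.queue) == 0 {
			s.cond.Wait()
		}
		id := s.queue[0]
		s.queue = s.queue[1:]
		s.mu.Unlock()
		fmt.Println(name, "processing range", id) // real work happens outside the lock
		s.mu.Lock()
	}
}

func main() {
	s := newScheduler()
	go s.worker("w1")
	go s.worker("w2")
	s.enqueue(1)
	s.enqueue(2)
	time.Sleep(100 * time.Millisecond) // demo only
}
```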


pkg/kv/kvserver/scheduler.go, line 159 at r2 (raw file):

type raftScheduleState struct {
	flags raftScheduleFlags
	begin int64 // nanoseconds

Is there a reason you're recording the start time as nanos rather than using time.Time? Using time.Time would get you monotonic clock handling, which I don't think you get if you're just storing nanos.
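To make the distinction concrete, a small sketch: a time.Time captured with time.Now() carries a monotonic reading that time.Since uses, whereas a stored UnixNano value is wall-clock only, so a clock jump between the two readings skews the computed duration.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	start := time.Now()            // wall clock + monotonic reading
	startNanos := start.UnixNano() // wall clock only; the monotonic reading is dropped

	time.Sleep(10 * time.Millisecond)

	monotonic := time.Since(start)                                // immune to wall-clock jumps
	wallOnly := time.Duration(time.Now().UnixNano() - startNanos) // skewed if the clock jumps

	fmt.Println(monotonic, wallOnly)
}
```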

Member Author

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @petermattis)


pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
Wouldn't the goroutine that enqueued any other range IDs have already signaled the condition variable a sufficient number of times? So in the case where there was one other range in the queue, shouldn't one other worker have already been awoken? This was so deliberate before that I feel like I'm missing something.

so this worker will pick up a different id to process.

This is interesting. I'd actually expect this to be undesirable, all else being equal, due to cache locality effects. Are there reasons why we'd want shuffling between worker goroutines and range IDs?


pkg/kv/kvserver/scheduler.go, line 159 at r2 (raw file):
I didn't want to stick a pointer (time.Time.loc) or an extra 64-bit int (time.Time.ext) in the raftScheduleState because of the impact that doing so would have on the size of the hashmap and on its visibility to the GC.

Using time.Time would get you monotonic clock handling, which I don't think you get if you're just storing nanos.

This is a good question. It doesn't look like UnixNano provides monotonic clock handling.

The histogram seems to take the abs value of negative values, so the worst that this would lead to is inaccurate latency values during a clock jump. I don't see a huge issue with that, given the trade-offs, but I'm curious whether you feel differently.
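For anyone curious about the size argument, here is a quick sketch of the trade-off using illustrative stand-in types (not the real raftScheduleState): a time.Time adds a *Location pointer and an extra 64-bit field, doubling the value size and making the map values visible to the GC.

```go
package main

import (
	"fmt"
	"time"
	"unsafe"
)

type stateWithNanos struct {
	flags int
	begin int64 // nanoseconds; pointer-free, so map values need no GC scanning
}

type stateWithTime struct {
	flags int
	begin time.Time // carries wall, ext, and a *Location pointer
}

func main() {
	fmt.Println(unsafe.Sizeof(stateWithNanos{})) // 16 on 64-bit platforms
	fmt.Println(unsafe.Sizeof(stateWithTime{}))  // 32 on 64-bit platforms
}
```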

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)


pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):

Wouldn't the goroutine that enqueued any other range IDs have already signaled the condition variable a sufficient number of times? So in the case where there was one other range in the queue, shouldn't one other worker have already been awoken? This was so deliberate before that I feel like I'm missing something.

We only signal if enqueue1Locked actually adds to the queue. If we call enqueue1Locked while a range is being processed we don't signal. Whether this is all worthwhile is something I'm not sure of.

This is interesting. I'd actually expect this to be undesirable, all else being equal, due to cache locality effects. Are there reasons why we'd want shuffling between worker goroutines and range IDs?

No, we wouldn't want the ID to be shuffled, but how to prevent that? This code is quite old and I've forgotten much of the rationale behind it. It is quite possibly due for a rethink.


pkg/kv/kvserver/scheduler.go, line 159 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I didn't want to stick a pointer (time.Time.loc) or an extra 64-bit int (time.Time.ext) in the raftScheduleState because of the impact that doing so would have on the size of the hashmap and on its visibility to the GC.

Using time.Time would get you monotonic clock handling, which I don't think you get if you're just storing nanos.

This is a good question. It doesn't look like UnixNano provides monotonic clock handling.

The histogram seems to take the abs value of negative values, so the worst that this would lead to is inaccurate latency values during a clock jump. I don't see a huge issue with that, given the trade-offs, but I'm curious whether you feel differently.

Ack. Ok. Probably doesn't matter much as clock jumps should be rare in comparison to these latency calculations.

Member Author

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @petermattis)


pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):

We only signal if enqueue1Locked actually adds to the queue. If we call enqueue1Locked while a range is being processed we don't signal. Whether this is all worthwhile is something I'm not sure of.

Right, but do we need to signal again if we know that a range is currently being processed? Doesn't the fact that the range is being processed (newState&stateQueued != 0) indicate that a goroutine is already awake and won't go back to sleep before either processing the same range again or grabbing a new range, allowing a different goroutine that was awoken for the new range to process the repeated range?

Put another way, it seems like we want awake_workers = min(max_workers, num_ranges) at all times for optimal concurrency. If awake_workers = cur_awake_workers + num_signals, then this means we need num_signals = min(max_workers, num_ranges) - cur_awake_workers. So if we re-enqueue a range that's currently being processed, then num_ranges doesn't change, so we shouldn't need another signal if we never allow the worker that processed the range to go back to sleep.
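To illustrate the rule being argued for here, a simplified, self-contained sketch (hypothetical code, not the actual scheduler.go implementation): a range re-enqueued while it is being processed is pushed back onto the queue by its worker without a Signal, because that worker re-checks the queue before sleeping, so the pending work cannot be stranded and the awake-worker count already covers the number of pending ranges.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type sched struct {
	mu     sync.Mutex
	cond   *sync.Cond
	queue  []int64
	queued map[int64]bool // id is in the queue or currently being processed
	again  map[int64]bool // id was re-enqueued while being processed
}

func newSched() *sched {
	s := &sched{queued: map[int64]bool{}, again: map[int64]bool{}}
	s.cond = sync.NewCond(&s.mu)
	return s
}

func (s *sched) enqueue(id int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.queued[id] {
		// Already queued or mid-processing: a signal was already sent for it,
		// or its worker will re-check the queue before sleeping.
		s.again[id] = true
		return
	}
	s.queued[id] = true
	s.queue = append(s.queue, id)
	s.cond.Signal() // a genuinely new unit of work: wake one sleeping worker
}

func (s *sched) worker() {
	s.mu.Lock()
	for {
		for len(s.queue) == 0 {
			s.cond.Wait()
		}
		id := s.queue[0]
		s.queue = s.queue[1:]
		s.again[id] = false
		s.mu.Unlock()

		fmt.Println("processing range", id) // real work happens outside the lock

		s.mu.Lock()
		if s.again[id] {
			// Re-enqueued while we were processing: push it back WITHOUT a
			// Signal. This worker loops to the top and re-checks the queue
			// before waiting, so the range cannot be stranded.
			s.again[id] = false
			s.queue = append(s.queue, id)
		} else {
			s.queued[id] = false
		}
	}
}

func main() {
	s := newSched()
	go s.worker()
	s.enqueue(7)
	s.enqueue(7)                       // arrives while queued/processing: no extra signal needed
	time.Sleep(100 * time.Millisecond) // demo only
}
```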

No, we wouldn't want the ID to be shuffled, but how to prevent that?

Yeah, this seems hard to prevent, especially if we want to ensure fairness between goroutines.

Collaborator

@petermattis petermattis left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)


pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):

Put another way, it seems like we want awake_workers = min(max_workers, num_ranges) at all times for optimal concurrency. If awake_workers = cur_awake_workers + num_signals, then this means we need num_signals = min(max_workers, num_ranges) - cur_awake_workers. So if we re-enqueue a range that's currently being processed, then num_ranges doesn't change, so we shouldn't need another signal if we never allow the worker that processed the range to go back to sleep.

This looks right to me and is a nice way of explaining why the signal isn't necessary here. Mind translating this into a comment?

The current worker is already going to loop back around and try to pull
off the queue before waiting, so it can guarantee that someone will
process the range.

This commit makes me a little nervous because the code seems so
deliberate, so I may be missing something, but I've stared at this for a
while and don't see what that could be.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/raftSchedulerLatency branch from 27a0fa8 to 4c53af1 on December 1, 2020 18:32
Member Author

@nvanbenschoten nvanbenschoten left a comment


TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @petermattis)


pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Put another way, it seems like we want awake_workers = min(max_workers, num_ranges) at all times for optimal concurrency. If awake_workers = cur_awake_workers + num_signals, then this means we need num_signals = min(max_workers, num_ranges) - cur_awake_workers. So if we re-enqueue a range that's currently being processed, then num_ranges doesn't change, so we shouldn't need another signal if we never allow the worker that processed the range to go back to sleep.

This looks right to me and is a nice way of explaining why the signal isn't necessary here. Mind translating this into a comment?

Done.

@craig
Contributor

craig bot commented Dec 1, 2020

Build succeeded.

@craig craig bot merged commit 5740050 into cockroachdb:master Dec 1, 2020
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/raftSchedulerLatency branch December 2, 2020 01:01