kv: add raft.scheduler.latency histogram metric #56943
Conversation
Force-pushed from 8171519 to 27a0fa8
#56860 just merged, so this is now rebased.
57007: kv: eliminate heap allocations throughout Raft messaging and processing logic r=nvanbenschoten a=nvanbenschoten

This PR contains a few improvements within the Raft messaging and processing code to avoid common heap allocations that stuck out during tests in #56860. This will improve the general efficiency of the Raft pipeline and will specifically reduce the cost of handling large coalesced heartbeats. This will reduce the impact that these heartbeats have on the [scheduler latency](#56943) of other active ranges.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@petermattis do you have any interest in reviewing this change? It's not a problem if not, though I think you still have the most context on this code.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
    // range ID for further processing.
    s.mu.queue.Push(id)
    s.mu.cond.Signal()

I believe the intention behind this call to `Signal` is to be able to wake up another worker so we possibly get more concurrency. The queue may not have been empty before `id` was added, so this worker will pick up a different `id` to process.
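To illustrate the push-then-signal pattern being discussed, here is a minimal sketch using `sync.Cond`. The `rangeQueue` type and its fields are hypothetical stand-ins, not the actual `scheduler.go` code:

```go
package scheduler

import "sync"

// rangeQueue is a toy stand-in for the scheduler's range-ID queue; the
// names here are illustrative and not the CockroachDB code.
type rangeQueue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	queue []int64
}

func newRangeQueue() *rangeQueue {
	q := &rangeQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// enqueue pushes a range ID and calls Signal, which wakes at most one
// goroutine blocked in cond.Wait (unlike Broadcast, which wakes all of
// them). Waking one idle worker per newly queued ID is what gives the
// scheduler its concurrency.
func (q *rangeQueue) enqueue(id int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.queue = append(q.queue, id)
	q.cond.Signal()
}
```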
pkg/kv/kvserver/scheduler.go, line 159 at r2 (raw file):
    type raftScheduleState struct {
        flags raftScheduleFlags
        begin int64 // nanoseconds

Is there a reason you're recording the start time as nanos rather than using `time.Time`? Using `time.Time` would get you the monotonic clock handling, which I don't think you get if you're using nanos.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis)
pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
Wouldn't the goroutine that enqueued any other range IDs have already signaled the condition variable a sufficient number of times? So in the case where there was one other range in the queue, shouldn't one other worker have already been awoken? This was so deliberate before that I feel like I'm missing something.
> so this worker will pick up a different `id` to process.
This is interesting. I'd actually expect this to be undesirable, all else being equal, due to cache locality effects. Are there reasons why we'd want shuffling between worker goroutines and range IDs?
pkg/kv/kvserver/scheduler.go, line 159 at r2 (raw file):
I didn't want to stick a pointer (`time.Time.loc`) or an extra 64-bit int (`time.Time.ext`) in the `raftScheduleState` because of the impact that doing so would have on the size of the hashmap and on its visibility to the GC.
> Using `time.Time` would get you the monotonic clock handling, which I don't think you get if you're using nanos.
This is a good question. It doesn't look like `UnixNano` provides monotonic clock handling.
The histogram seems to take the abs value of negative values, so the worst that this would lead to is inaccurate latency values during a clock jump. I don't see a huge issue with that, given the trade-offs, but I'm curious whether you feel differently.
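To make the trade-off concrete, here is a minimal sketch contrasting the two options; the type and method names are illustrative, not the code under review:

```go
package scheduler

import "time"

// Option A: store wall-clock nanoseconds. Compact (a single int64 in
// the per-range state, nothing for the GC to trace), but the later
// subtraction uses the wall clock, so a clock jump between enqueue and
// dequeue can skew the measured latency.
type stateNanos struct {
	begin int64 // nanoseconds since the Unix epoch
}

func (s *stateNanos) start() {
	s.begin = time.Now().UnixNano()
}

func (s *stateNanos) elapsed() time.Duration {
	return time.Duration(time.Now().UnixNano() - s.begin)
}

// Option B: store a time.Time. time.Since uses the monotonic reading
// embedded in the value, so it is immune to wall-clock jumps, at the
// cost of a larger struct: an extra int64 (ext) plus a *Location
// pointer (loc) that the GC must follow.
type stateTime struct {
	begin time.Time
}

func (s *stateTime) start() {
	s.begin = time.Now()
}

func (s *stateTime) elapsed() time.Duration {
	return time.Since(s.begin)
}
```

The choice above trades measurement robustness against per-range memory and GC overhead, which is the crux of the discussion in this thread.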
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
> Wouldn't the goroutine that enqueued any other range IDs have already signaled the condition variable a sufficient number of times? So in the case where there was one other range in the queue, shouldn't one other worker have already been awoken? This was so deliberate before that I feel like I'm missing something.
We only signal if `enqueue1Locked` actually adds to the queue. If we call `enqueue1Locked` while a range is being processed, we don't signal. Whether this is all worthwhile is something I'm not sure of.
> This is interesting. I'd actually expect this to be undesirable, all else being equal, due to cache locality effects. Are there reasons why we'd want shuffling between worker goroutines and range IDs?
No, we wouldn't want the ID to be shuffled, but how to prevent that? This code is quite old and I've forgotten much of the rationale behind it. It is quite possibly due for a rethink.
pkg/kv/kvserver/scheduler.go, line 159 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

> I didn't want to stick a pointer (`time.Time.loc`) or an extra 64-bit int (`time.Time.ext`) in the `raftScheduleState` because of the impact that doing so would have on the size of the hashmap and on its visibility to the GC.
>
> > Using `time.Time` would get you the monotonic clock handling, which I don't think you get if you're using nanos.
>
> This is a good question. It doesn't look like `UnixNano` provides monotonic clock handling. The histogram seems to take the abs value of negative values, so the worst that this would lead to is inaccurate latency values during a clock jump. I don't see a huge issue with that, given the trade-offs, but I'm curious whether you feel differently.
Ack. Ok. Probably doesn't matter much as clock jumps should be rare in comparison to these latency calculations.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis)
pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
> We only signal if `enqueue1Locked` actually adds to the queue. If we call `enqueue1Locked` while a range is being processed, we don't signal. Whether this is all worthwhile is something I'm not sure of.
Right, but do we need to signal again if we know that a range is currently being processed? Doesn't the fact that the range is being processed (`newState&stateQueued != 0`) indicate that a goroutine is already awake and won't go back to sleep before either processing the same range again or grabbing a new range, allowing a different goroutine that was awoken for the new range to process the repeated range?
Put another way, it seems like we want `awake_workers = min(max_workers, num_ranges)` at all times for optimal concurrency. If `awake_workers = cur_awake_workers + num_signals`, then this means we need `num_signals = min(max_workers, num_ranges) - cur_awake_workers`. So if we re-enqueue a range that's currently being processed, then `num_ranges` doesn't change, so we shouldn't need another signal if we never allow the worker that processed the range to go back to sleep.
> No, we wouldn't want the ID to be shuffled, but how to prevent that?
Yeah, this seems hard to prevent, especially if we want to ensure fairness between goroutines.
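A hedged sketch of the invariant described above, with invented types and names rather than the actual scheduler code: signal only when a range newly becomes runnable, and do nothing extra when a range that is already queued or already being processed is re-enqueued.

```go
package scheduler

import "sync"

// rangeState tracks where a range is in its scheduling lifecycle. The
// type and constant names are illustrative, not the actual scheduler
// code.
type rangeState int

const (
	stateIdle rangeState = iota
	stateQueued
	stateProcessing
	stateProcessingAndQueued // re-enqueued while being processed
)

type scheduler struct {
	mu    sync.Mutex
	cond  *sync.Cond
	state map[int64]rangeState
	queue []int64
}

// enqueueLocked queues id and signals only when a new, not currently
// processed range becomes runnable. Re-enqueuing a range that is being
// processed doesn't change num_ranges: the worker handling it will see
// the flag and process it again itself, so no extra signal is needed.
func (s *scheduler) enqueueLocked(id int64) {
	switch s.state[id] {
	case stateIdle:
		s.state[id] = stateQueued
		s.queue = append(s.queue, id)
		s.cond.Signal()
	case stateProcessing:
		// Mark it for another pass, but don't push or signal: the
		// worker currently processing id re-checks its state when it
		// finishes and will run it again itself.
		s.state[id] = stateProcessingAndQueued
	case stateQueued, stateProcessingAndQueued:
		// Already pending; nothing to do.
	}
}
```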
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @petermattis)
pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
> Put another way, it seems like we want `awake_workers = min(max_workers, num_ranges)` at all times for optimal concurrency. If `awake_workers = cur_awake_workers + num_signals`, then this means we need `num_signals = min(max_workers, num_ranges) - cur_awake_workers`. So if we re-enqueue a range that's currently being processed, then `num_ranges` doesn't change, so we shouldn't need another signal if we never allow the worker that processed the range to go back to sleep.
This looks right to me and is a nice way of explaining why the signal isn't necessary here. Mind translating this into a comment?
The current worker is already going to loop back around and try to pull off the queue before waiting, so it can guarantee that someone will process the range. This commit makes me a little nervous because the code seems so deliberate, so I may be missing something, but I've stared at this for a while and don't see what that could be.
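Continuing the sketch above (same illustrative types, not the actual code), the dequeue side could look roughly like this: the worker loops back to the queue before waiting, which is what lets the enqueue side skip the extra signal.

```go
// worker dequeues and processes range IDs. After finishing a range it
// first checks whether the range was re-enqueued while it was being
// processed and, if so, runs it again itself; otherwise it goes
// straight back to the queue and only calls Wait once the queue is
// empty.
func (s *scheduler) worker(process func(id int64)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for {
		for len(s.queue) == 0 {
			s.cond.Wait()
		}
		id := s.queue[0]
		s.queue = s.queue[1:]
		for {
			s.state[id] = stateProcessing
			s.mu.Unlock()
			process(id) // run Raft processing outside the lock
			s.mu.Lock()
			if s.state[id] != stateProcessingAndQueued {
				// Not re-enqueued while we were processing; done.
				s.state[id] = stateIdle
				break
			}
			// Re-enqueued while processing: loop and run it again
			// rather than waking another worker.
		}
	}
}
```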
This commit adds a new `raft.scheduler.latency` metric, which monitors the latency between initially enqueuing a range in the Raft scheduler and that range being picked up by a worker and processed. This metric would be helpful to watch in cases where Raft slows down due to many active ranges.

Release note (ui change): A new metric called `raft.scheduler.latency` was added which monitors the latency for operations to be picked up and processed by the Raft scheduler.
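For readers unfamiliar with how such a metric is wired up, here is a rough sketch of the recording path. It uses the Prometheus client library and invented names purely for illustration; CockroachDB has its own metric package, and this is not the actual implementation.

```go
package scheduler

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// schedulerLatency stands in for the raft.scheduler.latency histogram:
// the queueing delay between a range being enqueued in the Raft
// scheduler and a worker picking it up.
var schedulerLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "raft_scheduler_latency_seconds",
	Help: "Delay between enqueueing a range and a worker processing it.",
})

func init() {
	prometheus.MustRegister(schedulerLatency)
}

// recordLatency would be called by a worker as it dequeues a range
// whose enqueue time was recorded as wall-clock nanoseconds (the begin
// field discussed in this thread).
func recordLatency(beginNanos int64) {
	waited := time.Duration(time.Now().UnixNano() - beginNanos)
	schedulerLatency.Observe(waited.Seconds())
}
```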
Force-pushed from 27a0fa8 to 4c53af1
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @petermattis)
pkg/kv/kvserver/scheduler.go, line 281 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
> > Put another way, it seems like we want `awake_workers = min(max_workers, num_ranges)` at all times for optimal concurrency. If `awake_workers = cur_awake_workers + num_signals`, then this means we need `num_signals = min(max_workers, num_ranges) - cur_awake_workers`. So if we re-enqueue a range that's currently being processed, then `num_ranges` doesn't change, so we shouldn't need another signal if we never allow the worker that processed the range to go back to sleep.
>
> This looks right to me and is a nice way of explaining why the signal isn't necessary here. Mind translating this into a comment?
Done.
Build succeeded.