storage: truncate aggressively only after 4mb of logs #32437
Glad to see this change! The old behavior was always surprising. I'm interested in hearing the results of your testing.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained
Reviewable status: complete! 2 of 0 LGTMs obtained
pkg/storage/raft_log_queue.go, line 108 at r1 (raw file):
// and end transaction operations. If the estimated raft log size becomes
// larger than the replica size, we're better off recovering the replica
// using a snapshot.
This comment needs updating. It also explains the rationale behind the old heuristic. A snapshot can be cheaper than sending Raft log entries, though applying Raft log entries can be done in parallel while snapshots are serialized.
Another thought of something that can be done in this area is to use a size-based quota system for concurrent snapshot application. Instead of limiting the number of concurrent snapshots based on count, we'd limit based on bytes so that a large number of tiny snapshots could be allowed concurrently.
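The byte-based limit suggested above could look roughly like the following. This is a minimal sketch and not CockroachDB code: the `byteQuota` type, its methods, and the 8mb capacity are all hypothetical, chosen only to show how many tiny snapshots can be admitted at once while a large one has to wait.

```go
package main

import (
	"fmt"
	"sync"
)

// byteQuota is a hypothetical admission gate that bounds the total number of
// snapshot bytes being applied at once, instead of the number of snapshots.
type byteQuota struct {
	mu       sync.Mutex
	capacity int64 // maximum bytes in flight (illustrative value, an assumption)
	inUse    int64 // bytes currently reserved
}

func newByteQuota(capacity int64) *byteQuota {
	return &byteQuota{capacity: capacity}
}

// tryAcquire reserves size bytes if they fit under the capacity. A snapshot
// larger than the whole capacity is admitted only when nothing else is in
// flight, so that it is not rejected forever.
func (q *byteQuota) tryAcquire(size int64) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.inUse > 0 && q.inUse+size > q.capacity {
		return false
	}
	q.inUse += size
	return true
}

// release returns size bytes to the quota once a snapshot has been applied.
func (q *byteQuota) release(size int64) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.inUse -= size
}

func main() {
	q := newByteQuota(8 << 20) // 8mb budget, assumed for the example

	admitted := 0
	for i := 0; i < 100; i++ {
		if q.tryAcquire(64 << 10) { // one hundred 64kb snapshots
			admitted++
		}
	}
	fmt.Println("tiny snapshots admitted:", admitted)
	fmt.Println("4mb snapshot admitted:", q.tryAcquire(4<<20))
}
```

A count-based limit would treat each of the hundred tiny snapshots the same as the 4mb one; here they share a byte budget. A real implementation would presumably block and queue callers rather than return false.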
5c8b086 to 50a619c
Whenever the "max raft log size" is exceeded, log truncations become more aggressive in that they aim at the quorum commit index, potentially cutting off followers (which then need Raft snapshots).
The effective threshold log size is 4mb for replicas larger than 4mb and the replica size otherwise. This latter case can be problematic since replicas can be persistently small despite having steady log progress (for example, range 4 receives node status updates which are large inline puts). If in such a range a follower falls behind just slightly, it'll need a snapshot. This isn't in itself the biggest deal since the snapshot is fairly rare (the required log entries are usually already in transit to the follower) and would be small, but it's not ideal.
Always use a 4mb threshold instead. Note that we also truncate the log to the minimum replicated index if the log size is above 64kb. This is similarly aggressive but respects followers (until they fall behind by 4mb or more).
My expectation is that this will not functionally change anything. It might leave behind a little bit more Raft log on quiescent ranges, but I think the solution here is performing "one last truncation" for ranges that are quiescent to make sure they shed the remainder of their Raft log.
Touches cockroachdb#32046.
Release note: None
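The two-tier rule in the commit message can be sketched as follows. This is a simplified illustration and not the code in pkg/storage/raft_log_queue.go: the function and constant names are hypothetical, and only the 64kb and 4mb values and the two truncation targets are taken from the text above.

```go
package main

import "fmt"

const (
	// Thresholds quoted in the commit message; the names are assumptions.
	staleLogSize int64 = 64 << 10 // above this, truncate to the minimum replicated index
	maxLogSize   int64 = 4 << 20  // above this, truncate to the quorum commit index
)

// oldAggressiveThreshold models the previous behavior: 4mb for replicas
// larger than 4mb and the replica size otherwise.
func oldAggressiveThreshold(replicaSize int64) int64 {
	if replicaSize < maxLogSize {
		return replicaSize
	}
	return maxLogSize
}

// truncationTarget returns the index to truncate to and whether to truncate
// at all. Truncating to quorumCommit may cut off followers; truncating to
// minReplicated respects them.
func truncationTarget(logSize, aggressiveThreshold int64, quorumCommit, minReplicated uint64) (uint64, bool) {
	switch {
	case logSize > aggressiveThreshold:
		return quorumCommit, true
	case logSize > staleLogSize:
		return minReplicated, true
	default:
		return 0, false
	}
}

func main() {
	// A persistently small replica (128kb) with 256kb of Raft log, and a
	// follower that is slightly behind (index 90 versus quorum commit 100).
	const replicaSize = 128 << 10
	const logSize = 256 << 10

	oldIdx, _ := truncationTarget(logSize, oldAggressiveThreshold(replicaSize), 100, 90)
	newIdx, _ := truncationTarget(logSize, maxLogSize, 100, 90)
	fmt.Println(oldIdx, newIdx) // prints "100 90"
}
```

In this example the old threshold truncates past the lagging follower, which then needs a snapshot, while the fixed 4mb threshold only truncates up to what every follower already has.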
50a619c to 53cecf1
TFTRs! Holding off on the merge until I see benchmark parity.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/storage/raft_log_queue.go, line 108 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This comment needs updating. It also explains the rationale behind the old heuristic. A snapshot can be cheaper than sending Raft log entries, though applying Raft log entries can be done in parallel while snapshots are serialized.
Another thought of something that can be done in this area is to use a size-based quota system for concurrent snapshot application. Instead of limiting the number of concurrent snapshots based on count, we'd limit based on bytes so that a large number of tiny snapshots could be allowed concurrently.
Comment updated.
I agree that the snapshot queue needs a reworking independently of this change. Added your remark to #32046 (comment).
Just to be clear though, I want to reach a state in which we never cause Raft snapshots when they're not necessary irrespective of whether they get the system in a bad state or not. Think of this as a stability metric driving my particular string of investigations right now. Any snapshot we're not expecting to see is a problem. (There may be diminishing returns at some point, but I don't think I'm there yet).
@nvanbenschoten here are some (bad, because I used an individual cluster for each run) numbers:
I might've gotten lucky with the machines for the PR runs. I wouldn't really expect to see a perf difference. Going to run some more where I vary the stale trunc threshold, but going to merge this first (once it passes CI).
bors r=nvanbenschoten,petermattis
32437: storage: truncate aggressively only after 4mb of logs r=nvanbenschoten,petermattis a=tbg
cc @nvanbenschoten. I'm going to run some kv95 experiments in which I vary the 64kb threshold in both directions to see if there are any effects on performance in doing so.
----
Whenever the "max raft log size" is exceeded, log truncations become more aggressive in that they aim at the quorum commit index, potentially cutting off followers (which then need Raft snapshots).
The effective threshold log size is 4mb for replicas larger than 4mb and the replica size otherwise. This latter case can be problematic since replicas can be persistently small despite having steady log progress (for example, range 4 receives node status updates which are large inline puts). If in such a range a follower falls behind just slightly, it'll need a snapshot. This isn't in itself the biggest deal since the snapshot is fairly rare (the required log entries are usually already in transit to the follower) and would be small, but it's not ideal.
Always use a 4mb threshold instead. Note that we also truncate the log to the minimum replicated index if the log size is above 64kb. This is similarly aggressive but respects followers (until they fall behind by 4mb or more).
My expectation is that this will not functionally change anything. It might leave behind a little bit more Raft log on quiescent ranges, but I think the solution here is performing "one last truncation" for ranges that are quiescent to make sure they shed the remainder of their Raft log.
Touches #32046.
Release note: None
Co-authored-by: Tobias Schottdorf
1-node numbers from my gceworker also show that if anything, things have gotten better.
PS @nvanbenschoten it's quite likely I didn't do exactly what you had in mind. By running the roachtest, I think my table gets pre-split into 1000 pieces, so each individual Raft log will see only a small amount of activity. Is that what you wanted or was I supposed to run this against a single instance? I unfortunately lost track of where we talked about this initially.
Build succeeded
I wasn't even considering a perf benchmark because we shouldn't be snapshotting at all during steady-state load. I was more interested in whether this helped with some of your other testing around snapshots and overly aggressive log truncation.
Oh, sorry. No, it won't help with that, not beyond this fix. Ok, nothing to do then.