-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: smarten logic to wait for split trigger #32782
Conversation
54fac73
to
c3828b0
Compare
c3828b0
to
a1d3052
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very happy to see this! With this extra information, we'll be able to make much more informed decisions on the follower side.
Reviewed 1 of 7 files at r1, 6 of 6 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5214 at r2 (raw file):
_ = maybeDropMsgApp if status := r.raftStatusRLocked(); status != nil { if pr, ok := status.Progress[msg.To]; ok && pr.State == raft.ProgressStateProbe {
How often do we expect a follower to be in this state? I'd hate for this to blow up Raft message sizes in cases where we're not even going to use the key.
This does get me wondering about whether this is the right place to pass this information. I'm out of the loop with these change so I'm probably off, but what happens if we add similar logic to ignore/reject snapshots. Is the idea that once we reject a MsgApp (and enter ProgressStateSnapshot
) it's hard to get back to the right state without an eventual snapshot?
pkg/storage/replica.go, line 5218 at r2 (raw file):
log.Fatalf(ctx, "msg.Context nonempty for MsgApp: %v", msg.Context) } msg.Context = r.descRLocked().StartKey
Should we make the Context
a structured message so we can extend it later without compatibility concerns?
pkg/storage/split_trigger_helper.go, line 58 at r2 (raw file):
func maybeDropMsgApp( ctx context.Context, r interface {
This is pretty fragile. I would pull it into a named interface.
pkg/storage/split_trigger_helper.go, line 86 at r2 (raw file):
// replica): // // sender (leader) [a--lhs--b)[b---rhs----c)
Nice diagram!
pkg/storage/split_trigger_helper.go, line 124 at r2 (raw file):
// message, resulting in a snapshot, as intended. verbose := verboseRaftLoggingEnabled()
nit: this all seems a little verbose (unintended pun). You're only using eventf
twice below, which doesn't really justify the cost of creating it.
pkg/storage/split_trigger_helper.go, line 145 at r2 (raw file):
// the above analysis) in which a split trigger just isn't coming. If // there are, the idea is that we notice this log message and improve // the heuristics.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5214 at r1 (raw file):
// below for more context: _ = maybeDropMsgApp if status := r.raftStatusRLocked(); status != nil {
@nvanbenschoten I'm unhappy about the allocation here and I know you've disliked this kind of thing as well. On master
I'm planning to solve this problem with a change in etcd/raft
, though that likely won't be backportable (and I'd like to at least be able to consider this change for a backport, for we've backported its predecessor). How bad is this allocation in the meantime (and on 2.1)?
edit: noticed I hadn't actually sent this comment earlier.
pkg/storage/replica.go, line 5214 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
How often do we expect a follower to be in this state? I'd hate for this to blow up Raft message sizes in cases where we're not even going to use the key.
This does get me wondering about whether this is the right place to pass this information. I'm out of the loop with these change so I'm probably off, but what happens if we add similar logic to ignore/reject snapshots. Is the idea that once we reject a MsgApp (and enter
ProgressStateSnapshot
) it's hard to get back to the right state without an eventual snapshot?
Yeah, we don't want to enter the snapshot queue erroneously. It may be possible to get Raft out of that state cleanly (at least after appropriate upstream changes), but I want raftsnapshot.pending
to be a flat line at some point. We'll send at most one MsgApp per heartbeat interval to a follower that is being probed, so perf shouldn't be a concern here. (in steady state, followers are never expected to be in that state, this only occurs around cluster restarts and replication changes and such).
I morally also want to send something along in MsgAppResp instead, but it's tricky. The uninitialized follower is really dumb and doesn't know anything about itself and the world, so in particular it has no idea where in the key space it should be. The leader (who is the right-hand-side) on the other hand doesn't know about the left-hand-side's bound (and in particular I've often seen its corresponding left-hand-side replica not be leader any more in a rebalancing unit test I have laying around in a WIP that attempted to go this route).
Let me know if you see something I missed. As far as I can tell, we'll need information to go from the leader to the follower and back (the "back" being the rejection in this PR, we could also send an explicit context if we find a reason for it).
pkg/storage/replica.go, line 5218 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we make the
Context
a structured message so we can extend it later without compatibility concerns?
That sounds like a good idea.
df9ac85
to
5bcdb4b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, PTAL.
Dismissed @nvanbenschoten from 2 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5218 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
That sounds like a good idea.
Done, sort of. I actually took a step back and reconsidered where I was putting this information. The reason I chose the Context
field is because it was there and so there wasn't any migration headache. However, it's a bit of a wart especially since it gets passed into Raft and we'd have to put protos inside protos. So what I've done instead is put it on RaftMessageRequest
(as StartKey
). Old versions receiving the hint simply won't use it, and since RaftMessageRequest doesn't inform any downstream-of-Raft-logic, there shouldn't be a problem.
pkg/storage/split_trigger_helper.go, line 58 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is pretty fragile. I would pull it into a named interface.
I don't understand why it's fragile, but I understand that you didn't like the looks of it. Done.
pkg/storage/split_trigger_helper.go, line 124 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: this all seems a little verbose (unintended pun). You're only using
eventf
twice below, which doesn't really justify the cost of creating it.
Yeah, there were more at some point. Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave it another pass. One question I'm left with is wondering how you expect this to interact with repeated chains of splits and merges. I think it should all be fine, but it's worth exploring that and writing down a comment somewhere.
Reviewed 2 of 6 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5214 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Yeah, we don't want to enter the snapshot queue erroneously. It may be possible to get Raft out of that state cleanly (at least after appropriate upstream changes), but I want
raftsnapshot.pending
to be a flat line at some point. We'll send at most one MsgApp per heartbeat interval to a follower that is being probed, so perf shouldn't be a concern here. (in steady state, followers are never expected to be in that state, this only occurs around cluster restarts and replication changes and such).I morally also want to send something along in MsgAppResp instead, but it's tricky. The uninitialized follower is really dumb and doesn't know anything about itself and the world, so in particular it has no idea where in the key space it should be. The leader (who is the right-hand-side) on the other hand doesn't know about the left-hand-side's bound (and in particular I've often seen its corresponding left-hand-side replica not be leader any more in a rebalancing unit test I have laying around in a WIP that attempted to go this route).
Let me know if you see something I missed. As far as I can tell, we'll need information to go from the leader to the follower and back (the "back" being the rejection in this PR, we could also send an explicit context if we find a reason for it).
Thanks for the explanation. I'd leave a comment pointing at https://github.com/etcd-io/etcd/blob/7f450bf6967638673dd88fd4e730b01d1303d5ff/raft/progress.go#L41, which backs up what you said about a single message per heartbeat interval.
pkg/storage/replica.go, line 5218 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Done, sort of. I actually took a step back and reconsidered where I was putting this information. The reason I chose the
Context
field is because it was there and so there wasn't any migration headache. However, it's a bit of a wart especially since it gets passed into Raft and we'd have to put protos inside protos. So what I've done instead is put it onRaftMessageRequest
(asStartKey
). Old versions receiving the hint simply won't use it, and since RaftMessageRequest doesn't inform any downstream-of-Raft-logic, there shouldn't be a problem.
👍 that's much better.
pkg/storage/split_trigger_helper.go, line 58 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I don't understand why it's fragile, but I understand that you didn't like the looks of it. Done.
Fragile in that we can't perform static implementation assertions and can't "talk" about the type outside of using it.
pkg/storage/split_trigger_helper.go, line 66 at r3 (raw file):
// Run the cheapest check first. If the leader doesn't think this replica is // probing, it won't set msg.Context (the common case). if msg.Type != raftpb.MsgApp || startKey == nil {
nit: len(startKey) == 0
feels a little safer since we're pulling this out of a proto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add the comment tomorrow. I think I managed to address the rest.
Dismissed @nvanbenschoten from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5214 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
@nvanbenschoten I'm unhappy about the allocation here and I know you've disliked this kind of thing as well. On
master
I'm planning to solve this problem with a change inetcd/raft
, though that likely won't be backportable (and I'd like to at least be able to consider this change for a backport, for we've backported its predecessor). How bad is this allocation in the meantime (and on 2.1)?edit: noticed I hadn't actually sent this comment earlier.
Ping @nvanbenschoten
pkg/storage/replica.go, line 5214 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Thanks for the explanation. I'd leave a comment pointing at https://github.com/etcd-io/etcd/blob/7f450bf6967638673dd88fd4e730b01d1303d5ff/raft/progress.go#L41, which backs up what you said about a single message per heartbeat interval.
Done.
pkg/storage/split_trigger_helper.go, line 66 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit:
len(startKey) == 0
feels a little safer since we're pulling this out of a proto.
But if this is happening to the first range, the start key will intentionally be empty. Does TestProtoZeroNilSlice (added in this revision) address your concern?
10c7f79
to
210a61a
Compare
Ok, all set. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5214 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ping @nvanbenschoten
I was thinking that we would only hit the allocation when we need to add the start key, but you're right that we'll hit it for every Raft message. That seems like a pretty big issue. RawNode.Status
allocates at least three times and we regularly see it come up on allocation profiles. Calling it this frequently is likely to have a measurable performance impact. We would either need to prove that this is not the case or make that upstream change (RawNode.UnsafeStatus()
?) before merging this one. I'd strongly lean towards the latter.
I'm not sure about the backport problem. Seems the like the work we're doing here is much more tricky than whatever accessor you add to etcd/raft
, so I don't expect to see this backported. If it is though, I'd push to backport that change as well.
pkg/storage/split_trigger_helper.go, line 66 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
But if this is happening to the first range, the start key will intentionally be empty. Does TestProtoZeroNilSlice (added in this revision) address your concern?
Yep, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5214 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I was thinking that we would only hit the allocation when we need to add the start key, but you're right that we'll hit it for every Raft message. That seems like a pretty big issue.
RawNode.Status
allocates at least three times and we regularly see it come up on allocation profiles. Calling it this frequently is likely to have a measurable performance impact. We would either need to prove that this is not the case or make that upstream change (RawNode.UnsafeStatus()
?) before merging this one. I'd strongly lean towards the latter.I'm not sure about the backport problem. Seems the like the work we're doing here is much more tricky than whatever accessor you add to
etcd/raft
, so I don't expect to see this backported. If it is though, I'd push to backport that change as well.
Ok, I sent a PR upstream: etcd-io/etcd#10309
Once this (or whatever its final form is) is in, I'll bump the dep and update this PR.
210a61a
to
71e038b
Compare
Is this ready for another look? |
@nvanbenschoten yep |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for following through with this. Two minor nits but otherwise ready to merge.
Reviewed 6 of 6 files at r4, 1 of 1 files at r5, 1 of 1 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/raft.proto, line 73 at r4 (raw file):
repeated RaftHeartbeat heartbeat_resps = 7 [(gogoproto.nullable) = false]; // Optionally, the start key of the sending replica. This is only populated
nit (take it or leave it) now that I'm reading over this again: name this range_start_key
and move it right below range_id
.
pkg/storage/replica.go, line 315 at r6 (raw file):
// aren't held up by this latch, which could result in reads on the RHS // executed through the LHS after this is valid. For more detail, see: // https://github.com/cockroachdb/cockroach/issues/32583
nit: period at end.
We previously "waited" for the split trigger quite literally, by dropping an outgoing rejecting MsgAppResp for a few seconds. The downside of doing so was that it would also artificially delay Raft snapshots that were actually required. With the introduction of a mechanism that delays split that would cause Raft snapshots this is a lesser problem than before, but it's still ugly. The revamped mechanism in this commit goes the extra mile to make an informed decision on whether a split trigger is expected to apply: 1. when sending MsgApp to a follower that is being probed, send the start key along with the message 2. when receiving said MsgApp on an uninitialized replica, check whether the store has a replica that currently owns the start key. This replica is the one in charge of applying the split trigger, 3. unless it's waiting for GC (so trigger a check for that). There's also a time-based escape hatch to avoid delaying snapshots indefinitely should there be a flaw in the above logic. Release note: None
It exposes a bug (likely in range merges), cockroachdb#32784. Release note: None
Release note: None
7329f96
to
2e3af6b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR! Going to wait for TC to green this before I merge.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/raft.proto, line 73 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit (take it or leave it) now that I'm reading over this again: name this
range_start_key
and move it right belowrange_id
.
No bother at all, done.
bors r=nvanbenschoten |
32782: storage: smarten logic to wait for split trigger r=nvanbenschoten a=tbg We previously "waited" for the split trigger quite literally, by dropping an outgoing rejecting MsgAppResp for a few seconds. The downside of doing so was that it would also artificially delay Raft snapshots that were actually required. With the introduction of a mechanism that delays split that would cause Raft snapshots this is a lesser problem than before, but it's still ugly. The revamped mechanism in this commit goes the extra mile to make an informed decision on whether a split trigger is expected to apply: 1. when sending MsgApp to a follower that is being probed, send the start key along with the message 2. when receiving said MsgApp on an uninitialized replica, check whether the store has a replica that currently owns the start key. This replica is the one in charge of applying the split trigger, 3. unless it's waiting for GC (so trigger a check for that). There's also a time-based escape hatch to avoid delaying snapshots indefinitely should there be a flaw in the above logic. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
Build succeeded |
The admin split path must accommodate a scenario where a range with not-yet-replicated followers is being manually split multiple times (eg. IMPORT during TPCC test fixtures). This scenario results in a bunch of replicas that all need to be populated with snapshots. To avoid backing up the raft snapshot queue, a heuristic was put in place (cockroachdb#32594) to delay the admin split if there is another snapshot being applied already. As shown by investigation in a failing test, there is a mismatch between the configured max delay for this heuristic (20s) and the actual duration of the snapshot application - the latter is limited by the max bandwidth for snapshots, 2 MB/s resulting in 32s applications in the worst case. We (Tobias and I) suspect that the heuristic thus fails to wait long enough to have the protective effect it was designed to provide. The current patch increases this delay to exceed this snapshot application duration estimate to about 50s. Note that this scenario is not likely to occur now that cockroachdb#32782 has been merged (this reduces the need for raft snapshots during splits); however in release-2.1 where that patch was not applied, the scenario is more likely. Release note (bug fix): resolve a cluster degradation scenario that could occur during IMPORT/RESTORE operations, manifested through a high number of pending Raft snapshots.
The admin split path must accommodate a scenario where a range with not-yet-replicated followers is being manually split multiple times (eg. IMPORT during TPCC test fixtures). This scenario results in a bunch of replicas that all need to be populated with snapshots. To avoid backing up the raft snapshot queue, a heuristic was put in place (cockroachdb#32594) to delay the admin split if there is another snapshot being applied already. As shown by investigation in a failing test, there is a mismatch between the configured max delay for this heuristic (20s) and the actual duration of the snapshot application - the latter is limited by the max bandwidth for snapshots, 2 MB/s resulting in 32s applications in the worst case. We (Tobias and I) suspect that the heuristic thus fails to wait long enough to have the protective effect it was designed to provide. The current patch increases this delay to exceed this snapshot application duration estimate to about 50s. Note that this scenario is not likely to occur now that cockroachdb#32782 has been merged (this reduces the need for raft snapshots during splits); however in release-2.1 where that patch was not applied, the scenario is more likely. Release note (bug fix): resolve a cluster degradation scenario that could occur during IMPORT/RESTORE operations, manifested through a high number of pending Raft snapshots.
33573: roachtest: add MinVersion to scrub tests r=lucy-zhang a=lucy-zhang The `scrub` tests have been passing on master, but failing on 2.1 because there was a bug (fixed in #32908). Release note: None 33582: storage: bump RaftDelaySplitToSuppressSnapshotTicks r=knz a=knz The admin split path must accommodate a scenario where a range with not-yet-replicated followers is being manually split multiple times (eg. IMPORT during TPCC test fixtures). This scenario results in a bunch of replicas that all need to be populated with snapshots. To avoid backing up the raft snapshot queue, a heuristic was put in place (#32594) to delay the admin split if there is another snapshot being applied already. As shown by investigation in a failing test, there is a mismatch between the configured max delay for this heuristic (20s) and the actual duration of the snapshot application - the latter is limited by the max bandwidth for snapshots, 2 MB/s resulting in 32s applications in the worst case. We (Tobias and I) suspect that the heuristic thus fails to wait long enough to have the protective effect it was designed to provide. The current patch increases this delay to exceed this snapshot application duration estimate to about 50s. Note that this scenario is not likely to occur now that #32782 has been merged (this reduces the need for raft snapshots during splits); however in release-2.1 where that patch was not applied, the scenario is more likely. Release note (bug fix): resolve a cluster degradation scenario that could occur during IMPORT/RESTORE operations, manifested through a high number of pending Raft snapshots. Co-authored-by: Lucy Zhang <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]>
Bursts of splits (i.e. a sequence of splits for which each split splits the right-hand side of a prior split) can cause issues. This is because splitting a range in which a replica needs a snapshot results in two ranges in which a replica needs a snapshot where additionally there needs to be a sequencing between the two snapshots (one cannot apply a snapshot for the post-split replica until the pre-split replica has moved out of the way). The result of a long burst of splits such as occurring in RESTORE and IMPORT operations is then an overload of the snapshot queue with lots of wasted work, unavailable followers with operations hanging on them, and general mayhem in the logs. Since bulk operations also write a lot of data to the raft logs, log truncations then create an additional snapshot burden; in short, everything will be unhappy for a few hours and the cluster may effectively be unavailable. This isn't news to us and in fact was a big problem "back in 2018". When we first started to understand the issue, we introduced a mechanism that would delay splits (cockroachdb#32594) with the desired effect of ensuring that, all followers had caught up to ~all of the previous splits. This helped, but didn't explain why we were seeing snapshots in the first place. Investigating more, we realized that snapshots were sometimes spuriously requested by an uninitialized replica on the right-hand side which was contacted by another member of the right-hand side that had already been initialized by the split executing on the left-hand side; this snapshot was almost always unnecessary since the local left-hand side would usually initialize the right-hand side moments later. To address this, in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth of requests to an uninitialized range, and the mechanism was improved in cockroachdb#32782 and will now only do this if a local neighboring replica is expected to perform the split soon. With all this in place, you would've expected us to have all bases covered but it turns out that we are still running into issues prior to this PR. Concretely, whenever the aforementioned mechanism drops a message from the leader (a MsgApp), the leader will only contact the replica every second until it responds. It responds when it has been initialized via its left neighbor's splits and the leader reaches out again, i.e. an average of ~500ms after being initialized. However, by that time, it is itself already at the bottom of a long chain of splits, and the 500ms delay is delaying how long it takes for the rest of the chain to get initialized. Since the delay compounds on each link of the chain, the depth of the chain effectively determines the total delay experienced at the end. This would eventually exceed the patience of the mechanism that would suppress the snapshots, and snapshots would be requested. We would descend into madness similar to that experienced in the absence of the mechanism in the first place. The mechanism in cockroachdb#32594 could have helped here, but unfortunately it did not, as it routinely missed the fact that followers were not initialized yet. This is because during a split burst, the replica orchestrating the split was typically only created an instant before, and its raft group hadn't properly transitioned to leader status yet. This meant that in effect it wasn't delaying the splits at all. This commit adjusts the logic to delay splits to avoid this problem. While clamoring for leadership, the delay is upheld. Once collapsed into a definite state, the existing logic pretty much did the right thing, as it waited for the right-hand side to be in initialized. Release note (bug fix): Fixed a scenario in which a rapid sequence of splits could trigger a storm of Raft snapshots. This would be accompanied by log messages of the form "would have dropped incoming MsgApp, but allowing due to ..." and tended to occur as part of RESTORE/IMPORT operations.
Bursts of splits (i.e. a sequence of splits for which each split splits the right-hand side of a prior split) can cause issues. This is because splitting a range in which a replica needs a snapshot results in two ranges in which a replica needs a snapshot where additionally there needs to be a sequencing between the two snapshots (one cannot apply a snapshot for the post-split replica until the pre-split replica has moved out of the way). The result of a long burst of splits such as occurring in RESTORE and IMPORT operations is then an overload of the snapshot queue with lots of wasted work, unavailable followers with operations hanging on them, and general mayhem in the logs. Since bulk operations also write a lot of data to the raft logs, log truncations then create an additional snapshot burden; in short, everything will be unhappy for a few hours and the cluster may effectively be unavailable. This isn't news to us and in fact was a big problem "back in 2018". When we first started to understand the issue, we introduced a mechanism that would delay splits (cockroachdb#32594) with the desired effect of ensuring that, all followers had caught up to ~all of the previous splits. This helped, but didn't explain why we were seeing snapshots in the first place. Investigating more, we realized that snapshots were sometimes spuriously requested by an uninitialized replica on the right-hand side which was contacted by another member of the right-hand side that had already been initialized by the split executing on the left-hand side; this snapshot was almost always unnecessary since the local left-hand side would usually initialize the right-hand side moments later. To address this, in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth of requests to an uninitialized range, and the mechanism was improved in cockroachdb#32782 and will now only do this if a local neighboring replica is expected to perform the split soon. With all this in place, you would've expected us to have all bases covered but it turns out that we are still running into issues prior to this PR. Concretely, whenever the aforementioned mechanism drops a message from the leader (a MsgApp), the leader will only contact the replica every second until it responds. It responds when it has been initialized via its left neighbor's splits and the leader reaches out again, i.e. an average of ~500ms after being initialized. However, by that time, it is itself already at the bottom of a long chain of splits, and the 500ms delay is delaying how long it takes for the rest of the chain to get initialized. Since the delay compounds on each link of the chain, the depth of the chain effectively determines the total delay experienced at the end. This would eventually exceed the patience of the mechanism that would suppress the snapshots, and snapshots would be requested. We would descend into madness similar to that experienced in the absence of the mechanism in the first place. The mechanism in cockroachdb#32594 could have helped here, but unfortunately it did not, as it routinely missed the fact that followers were not initialized yet. This is because during a split burst, the replica orchestrating the split was typically only created an instant before, and its raft group hadn't properly transitioned to leader status yet. This meant that in effect it wasn't delaying the splits at all. This commit adjusts the logic to delay splits to avoid this problem. While clamoring for leadership, the delay is upheld. Once collapsed into a definite state, the existing logic pretty much did the right thing, as it waited for the right-hand side to be in initialized. Release note (bug fix): Fixed a scenario in which a rapid sequence of splits could trigger a storm of Raft snapshots. This would be accompanied by log messages of the form "would have dropped incoming MsgApp, but allowing due to ..." and tended to occur as part of RESTORE/IMPORT operations.
Bursts of splits (i.e. a sequence of splits for which each split splits the right-hand side of a prior split) can cause issues. This is because splitting a range in which a replica needs a snapshot results in two ranges in which a replica needs a snapshot where additionally there needs to be a sequencing between the two snapshots (one cannot apply a snapshot for the post-split replica until the pre-split replica has moved out of the way). The result of a long burst of splits such as occurring in RESTORE and IMPORT operations is then an overload of the snapshot queue with lots of wasted work, unavailable followers with operations hanging on them, and general mayhem in the logs. Since bulk operations also write a lot of data to the raft logs, log truncations then create an additional snapshot burden; in short, everything will be unhappy for a few hours and the cluster may effectively be unavailable. This isn't news to us and in fact was a big problem "back in 2018". When we first started to understand the issue, we introduced a mechanism that would delay splits (cockroachdb#32594) with the desired effect of ensuring that, all followers had caught up to ~all of the previous splits. This helped, but didn't explain why we were seeing snapshots in the first place. Investigating more, we realized that snapshots were sometimes spuriously requested by an uninitialized replica on the right-hand side which was contacted by another member of the right-hand side that had already been initialized by the split executing on the left-hand side; this snapshot was almost always unnecessary since the local left-hand side would usually initialize the right-hand side moments later. To address this, in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth of requests to an uninitialized range, and the mechanism was improved in cockroachdb#32782 and will now only do this if a local neighboring replica is expected to perform the split soon. With all this in place, you would've expected us to have all bases covered but it turns out that we are still running into issues prior to this PR. Concretely, whenever the aforementioned mechanism drops a message from the leader (a MsgApp), the leader will only contact the replica every second until it responds. It responds when it has been initialized via its left neighbor's splits and the leader reaches out again, i.e. an average of ~500ms after being initialized. However, by that time, it is itself already at the bottom of a long chain of splits, and the 500ms delay is delaying how long it takes for the rest of the chain to get initialized. Since the delay compounds on each link of the chain, the depth of the chain effectively determines the total delay experienced at the end. This would eventually exceed the patience of the mechanism that would suppress the snapshots, and snapshots would be requested. We would descend into madness similar to that experienced in the absence of the mechanism in the first place. The mechanism in cockroachdb#32594 could have helped here, but unfortunately it did not, as it routinely missed the fact that followers were not initialized yet. This is because during a split burst, the replica orchestrating the split was typically only created an instant before, and its raft group hadn't properly transitioned to leader status yet. This meant that in effect it wasn't delaying the splits at all. This commit adjusts the logic to delay splits to avoid this problem. While clamoring for leadership, the delay is upheld. Once collapsed into a definite state, the existing logic pretty much did the right thing, as it waited for the right-hand side to be in initialized. Release note (bug fix): Fixed a scenario in which a rapid sequence of splits could trigger a storm of Raft snapshots. This would be accompanied by log messages of the form "would have dropped incoming MsgApp, but allowing due to ..." and tended to occur as part of RESTORE/IMPORT operations.
Bursts of splits (i.e. a sequence of splits for which each split splits the right-hand side of a prior split) can cause issues. This is because splitting a range in which a replica needs a snapshot results in two ranges in which a replica needs a snapshot where additionally there needs to be a sequencing between the two snapshots (one cannot apply a snapshot for the post-split replica until the pre-split replica has moved out of the way). The result of a long burst of splits such as occurring in RESTORE and IMPORT operations is then an overload of the snapshot queue with lots of wasted work, unavailable followers with operations hanging on them, and general mayhem in the logs. Since bulk operations also write a lot of data to the raft logs, log truncations then create an additional snapshot burden; in short, everything will be unhappy for a few hours and the cluster may effectively be unavailable. This isn't news to us and in fact was a big problem "back in 2018". When we first started to understand the issue, we introduced a mechanism that would delay splits (cockroachdb#32594) with the desired effect of ensuring that, all followers had caught up to ~all of the previous splits. This helped, but didn't explain why we were seeing snapshots in the first place. Investigating more, we realized that snapshots were sometimes spuriously requested by an uninitialized replica on the right-hand side which was contacted by another member of the right-hand side that had already been initialized by the split executing on the left-hand side; this snapshot was almost always unnecessary since the local left-hand side would usually initialize the right-hand side moments later. To address this, in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth of requests to an uninitialized range, and the mechanism was improved in cockroachdb#32782 and will now only do this if a local neighboring replica is expected to perform the split soon. With all this in place, you would've expected us to have all bases covered but it turns out that we are still running into issues prior to this PR. Concretely, whenever the aforementioned mechanism drops a message from the leader (a MsgApp), the leader will only contact the replica every second until it responds. It responds when it has been initialized via its left neighbor's splits and the leader reaches out again, i.e. an average of ~500ms after being initialized. However, by that time, it is itself already at the bottom of a long chain of splits, and the 500ms delay is delaying how long it takes for the rest of the chain to get initialized. Since the delay compounds on each link of the chain, the depth of the chain effectively determines the total delay experienced at the end. This would eventually exceed the patience of the mechanism that would suppress the snapshots, and snapshots would be requested. We would descend into madness similar to that experienced in the absence of the mechanism in the first place. The mechanism in cockroachdb#32594 could have helped here, but unfortunately it did not, as it routinely missed the fact that followers were not initialized yet. This is because during a split burst, the replica orchestrating the split was typically only created an instant before, and its raft group hadn't properly transitioned to leader status yet. This meant that in effect it wasn't delaying the splits at all. This commit adjusts the logic to delay splits to avoid this problem. While clamoring for leadership, the delay is upheld. Once collapsed into a definite state, the existing logic pretty much did the right thing, as it waited for the right-hand side to be in initialized. Release note (bug fix): Fixed a scenario in which a rapid sequence of splits could trigger a storm of Raft snapshots. This would be accompanied by log messages of the form "would have dropped incoming MsgApp, but allowing due to ..." and tended to occur as part of RESTORE/IMPORT operations.
64060: kvserver: fix delaying of splits with uninitialized followers r=erikgrinaker a=tbg Bursts of splits (i.e. a sequence of splits for which each split splits the right-hand side of a prior split) can cause issues. This is because splitting a range in which a replica needs a snapshot results in two ranges in which a replica needs a snapshot where additionally there needs to be a sequencing between the two snapshots (one cannot apply a snapshot for the post-split replica until the pre-split replica has moved out of the way). The result of a long burst of splits such as occurring in RESTORE and IMPORT operations is then an overload of the snapshot queue with lots of wasted work, unavailable followers with operations hanging on them, and general mayhem in the logs. Since bulk operations also write a lot of data to the raft logs, log truncations then create an additional snapshot burden; in short, everything will be unhappy for a few hours and the cluster may effectively be unavailable. This isn't news to us and in fact was a big problem "back in 2018". When we first started to understand the issue, we introduced a mechanism that would delay splits (#32594) with the desired effect of ensuring that, all followers had caught up to ~all of the previous splits. This helped, but didn't explain why we were seeing snapshots in the first place. Investigating more, we realized that snapshots were sometimes spuriously requested by an uninitialized replica on the right-hand side which was contacted by another member of the right-hand side that had already been initialized by the split executing on the left-hand side; this snapshot was almost always unnecessary since the local left-hand side would usually initialize the right-hand side moments later. To address this, in #32594 we started unconditionally dropping the first ~seconds worth of requests to an uninitialized range, and the mechanism was improved in #32782 and will now only do this if a local neighboring replica is expected to perform the split soon. With all this in place, you would've expected us to have all bases covered but it turns out that we are still running into issues prior to this PR. Concretely, whenever the aforementioned mechanism drops a message from the leader (a MsgApp), the leader will only contact the replica every second until it responds. It responds when it has been initialized via its left neighbor's splits and the leader reaches out again, i.e. an average of ~500ms after being initialized. However, by that time, it is itself already at the bottom of a long chain of splits, and the 500ms delay is delaying how long it takes for the rest of the chain to get initialized. Since the delay compounds on each link of the chain, the depth of the chain effectively determines the total delay experienced at the end. This would eventually exceed the patience of the mechanism that would suppress the snapshots, and snapshots would be requested. We would descend into madness similar to that experienced in the absence of the mechanism in the first place. The mechanism in #32594 could have helped here, but unfortunately it did not, as it routinely missed the fact that followers were not initialized yet. This is because during a split burst, the replica orchestrating the split was typically only created an instant before, and its raft group hadn't properly transitioned to leader status yet. This meant that in effect it wasn't delaying the splits at all. This commit adjusts the logic to delay splits to avoid this problem. While clamoring for leadership, the delay is upheld. Once collapsed into a definite state, the existing logic pretty much did the right thing, as it waited for the right-hand side to be in initialized. Closes #61396. cc @cockroachdb/kv Release note (bug fix): Fixed a scenario in which a rapid sequence of splits could trigger a storm of Raft snapshots. This would be accompanied by log messages of the form "would have dropped incoming MsgApp, but allowing due to ..." and tended to occur as part of RESTORE/IMPORT operations. Co-authored-by: Tobias Grieger <[email protected]>
Bursts of splits (i.e. a sequence of splits for which each split splits the right-hand side of a prior split) can cause issues. This is because splitting a range in which a replica needs a snapshot results in two ranges in which a replica needs a snapshot where additionally there needs to be a sequencing between the two snapshots (one cannot apply a snapshot for the post-split replica until the pre-split replica has moved out of the way). The result of a long burst of splits such as occurring in RESTORE and IMPORT operations is then an overload of the snapshot queue with lots of wasted work, unavailable followers with operations hanging on them, and general mayhem in the logs. Since bulk operations also write a lot of data to the raft logs, log truncations then create an additional snapshot burden; in short, everything will be unhappy for a few hours and the cluster may effectively be unavailable. This isn't news to us and in fact was a big problem "back in 2018". When we first started to understand the issue, we introduced a mechanism that would delay splits (cockroachdb#32594) with the desired effect of ensuring that, all followers had caught up to ~all of the previous splits. This helped, but didn't explain why we were seeing snapshots in the first place. Investigating more, we realized that snapshots were sometimes spuriously requested by an uninitialized replica on the right-hand side which was contacted by another member of the right-hand side that had already been initialized by the split executing on the left-hand side; this snapshot was almost always unnecessary since the local left-hand side would usually initialize the right-hand side moments later. To address this, in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth of requests to an uninitialized range, and the mechanism was improved in cockroachdb#32782 and will now only do this if a local neighboring replica is expected to perform the split soon. With all this in place, you would've expected us to have all bases covered but it turns out that we are still running into issues prior to this PR. Concretely, whenever the aforementioned mechanism drops a message from the leader (a MsgApp), the leader will only contact the replica every second until it responds. It responds when it has been initialized via its left neighbor's splits and the leader reaches out again, i.e. an average of ~500ms after being initialized. However, by that time, it is itself already at the bottom of a long chain of splits, and the 500ms delay is delaying how long it takes for the rest of the chain to get initialized. Since the delay compounds on each link of the chain, the depth of the chain effectively determines the total delay experienced at the end. This would eventually exceed the patience of the mechanism that would suppress the snapshots, and snapshots would be requested. We would descend into madness similar to that experienced in the absence of the mechanism in the first place. The mechanism in cockroachdb#32594 could have helped here, but unfortunately it did not, as it routinely missed the fact that followers were not initialized yet. This is because during a split burst, the replica orchestrating the split was typically only created an instant before, and its raft group hadn't properly transitioned to leader status yet. This meant that in effect it wasn't delaying the splits at all. This commit adjusts the logic to delay splits to avoid this problem. While clamoring for leadership, the delay is upheld. Once collapsed into a definite state, the existing logic pretty much did the right thing, as it waited for the right-hand side to be in initialized. Release note (bug fix): Fixed a scenario in which a rapid sequence of splits could trigger a storm of Raft snapshots. This would be accompanied by log messages of the form "would have dropped incoming MsgApp, but allowing due to ..." and tended to occur as part of RESTORE/IMPORT operations.
Bursts of splits (i.e. a sequence of splits for which each split splits the right-hand side of a prior split) can cause issues. This is because splitting a range in which a replica needs a snapshot results in two ranges in which a replica needs a snapshot where additionally there needs to be a sequencing between the two snapshots (one cannot apply a snapshot for the post-split replica until the pre-split replica has moved out of the way). The result of a long burst of splits such as occurring in RESTORE and IMPORT operations is then an overload of the snapshot queue with lots of wasted work, unavailable followers with operations hanging on them, and general mayhem in the logs. Since bulk operations also write a lot of data to the raft logs, log truncations then create an additional snapshot burden; in short, everything will be unhappy for a few hours and the cluster may effectively be unavailable. This isn't news to us and in fact was a big problem "back in 2018". When we first started to understand the issue, we introduced a mechanism that would delay splits (cockroachdb#32594) with the desired effect of ensuring that, all followers had caught up to ~all of the previous splits. This helped, but didn't explain why we were seeing snapshots in the first place. Investigating more, we realized that snapshots were sometimes spuriously requested by an uninitialized replica on the right-hand side which was contacted by another member of the right-hand side that had already been initialized by the split executing on the left-hand side; this snapshot was almost always unnecessary since the local left-hand side would usually initialize the right-hand side moments later. To address this, in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth of requests to an uninitialized range, and the mechanism was improved in cockroachdb#32782 and will now only do this if a local neighboring replica is expected to perform the split soon. With all this in place, you would've expected us to have all bases covered but it turns out that we are still running into issues prior to this PR. Concretely, whenever the aforementioned mechanism drops a message from the leader (a MsgApp), the leader will only contact the replica every second until it responds. It responds when it has been initialized via its left neighbor's splits and the leader reaches out again, i.e. an average of ~500ms after being initialized. However, by that time, it is itself already at the bottom of a long chain of splits, and the 500ms delay is delaying how long it takes for the rest of the chain to get initialized. Since the delay compounds on each link of the chain, the depth of the chain effectively determines the total delay experienced at the end. This would eventually exceed the patience of the mechanism that would suppress the snapshots, and snapshots would be requested. We would descend into madness similar to that experienced in the absence of the mechanism in the first place. The mechanism in cockroachdb#32594 could have helped here, but unfortunately it did not, as it routinely missed the fact that followers were not initialized yet. This is because during a split burst, the replica orchestrating the split was typically only created an instant before, and its raft group hadn't properly transitioned to leader status yet. This meant that in effect it wasn't delaying the splits at all. This commit adjusts the logic to delay splits to avoid this problem. While clamoring for leadership, the delay is upheld. Once collapsed into a definite state, the existing logic pretty much did the right thing, as it waited for the right-hand side to be in initialized. Release note (bug fix): Fixed a scenario in which a rapid sequence of splits could trigger a storm of Raft snapshots. This would be accompanied by log messages of the form "would have dropped incoming MsgApp, but allowing due to ..." and tended to occur as part of RESTORE/IMPORT operations.
We previously "waited" for the split trigger quite literally, by
dropping an outgoing rejecting MsgAppResp for a few seconds. The
downside of doing so was that it would also artificially delay Raft
snapshots that were actually required. With the introduction of a
mechanism that delays split that would cause Raft snapshots this is a
lesser problem than before, but it's still ugly.
The revamped mechanism in this commit goes the extra mile to make
an informed decision on whether a split trigger is expected to apply:
start key along with the message
the store has a replica that currently owns the start key. This replica
is the one in charge of applying the split trigger,
There's also a time-based escape hatch to avoid delaying snapshots
indefinitely should there be a flaw in the above logic.
Release note: None