storage: delay manual splits that would result in more snapshots #32594
Conversation
Force-pushed from 063f372 to 21da18d.
Reviewed 3 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica_command.go, line 282 at r2 (raw file):
}
if tPostWait == (time.Time{}) {
	tPostWait = timeutil.Now()
I'm not understanding what is motivating this handling of tPostWait. I think you want a boolean indicating whether you ever had a time.After call, and if yes then output the "delayed split for" message. As it is, it appears you'll always output that message.
pkg/storage/replica_command.go, line 393 at r2 (raw file):
log.Infof(ctx, "initiating a split of this range at key %s [r%d]%s", extra, splitKey, rightDesc.RangeID)
Is the argument reordering intentional? Seems like that messes with the log message strangely.
Force-pushed from 02d4795 to 3f1a1f4.
The small fry missing from this change aside, assuming this addresses the problem, is this what we want? Does any alternative approach come to mind? I concede defeat on avoiding Raft snapshots. The RaftPostSplitSuppressSnapshotTicks mechanism makes snapshots much worse when it delays without the split trigger showing up, and yet it seems impossible to do much better. I tried (unpushed WIP) to move that detection to the receiver (i.e. the Raft leader of the uninitialized RHS) and this works very well unless the Raft leader of the RHS isn't colocated with that of the LHS. This is mostly true today (though not in the test that I added for this mechanism, which is now unfixably flaky), and I imagine that in the future we won't want splits to pin the RHS and LHS leaseholder together as is done today. (And even if that were somehow circumvented, the problem is still that a single Raft snapshot, for whatever reason, is multiplied by whatever number of splits hit the range in that state, unless there's a mechanism such as this one.)
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica_command.go, line 282 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I'm not understanding what is motivating this handling of tPostWait. I think you want a boolean indicating whether you ever had a time.After call, and if yes then output the "delayed split for" message. As it is, it appears you'll always output that message.
I think this works as intended. Note that we initialize tPostWait := tPreWait and zero it out only when we're actually waiting (i.e. when ticks > 0, though that should really be ticks = 1). I agree that a bool is saner though, done.
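For concreteness, here is a minimal sketch of the boolean approach settled on above. The helper name delayForTicks and the overall structure are hypothetical, not the code in replica_command.go; the point is only that the log suffix is built when the loop actually waited.

package main

import (
	"fmt"
	"time"
)

// delayForTicks stands in for the real wait loop; it reports whether it
// actually slept at least once.
func delayForTicks(ticks int, tick time.Duration) (waited bool) {
	for i := 0; i < ticks; i++ {
		time.Sleep(tick)
		waited = true
	}
	return waited
}

func main() {
	tPreWait := time.Now()
	var extra string
	// A plain boolean is clearer than zeroing and re-populating a tPostWait
	// timestamp: the message is emitted only if we actually waited.
	if waited := delayForTicks(2, 10*time.Millisecond); waited {
		extra += fmt.Sprintf("; delayed split for %s to avoid Raft snapshot", time.Since(tPreWait))
	}
	fmt.Println("initiating a split of this range" + extra)
}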
pkg/storage/replica_command.go, line 393 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Is the argument reordering intentional? Seems like that messes with the log message strangely.
No, and I just fixed this because the output looked really confusing.
Force-pushed from 3f1a1f4 to 0e46399.
The other mitigation we might try is to speed up Raft snapshots when they are very small. For restore / import, the Raft snapshots should be mostly empty. For a range with real data that requires a snapshot, splitting will double the number of snapshots, but each snapshot should be half the size.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica_command.go, line 290 at r3 (raw file):
}
if elapsed := timeutil.Since(tPreWait); waited {
Nit: this should probably be if waited { extra += fmt.Sprintf("...", timeutil.Since(tPreWait)) }
Reviewable status: complete! 1 of 0 LGTMs obtained
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/replica_command.go, line 235 at r3 (raw file):
// up. If we run out of time, we'll resume the split no matter what.
_ = r.maybeDropMsgAppResp
maxDelaySplitToAvoidSnapshotTicks := 5 + r.store.cfg.RaftPostSplitSuppressSnapshotTicks
I'm not sure this is long enough to wait. If the Raft snapshot queue is backed up it can take minutes to clear. So this just seems like it will wait for a bit, then add more work to the queue.
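To put numbers on this concern, here is the tick-to-wall-clock arithmetic with assumed values (a 200ms Raft tick interval and a suppression setting of 20 ticks are illustrative, not read out of this PR): the wait comes out to roughly 5 seconds, while a backed-up Raft snapshot queue can take minutes to drain.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed values for illustration only; the real numbers come from the
	// store config.
	const raftTickInterval = 200 * time.Millisecond
	const raftPostSplitSuppressSnapshotTicks = 20

	maxDelaySplitToAvoidSnapshotTicks := 5 + raftPostSplitSuppressSnapshotTicks
	fmt.Println(time.Duration(maxDelaySplitToAvoidSnapshotTicks) * raftTickInterval) // prints 5s
}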
I don't love this but if it can put an end to the saga of raft snapshots I'm for it.
One more idea is whether we can tune the order in which splits happen. Would reversing the order of splits make us less likely to hit pathological cases? Or if both ascending and descending are problematic, maybe a pattern like binary search to avoid hitting the same ranges repeatedly?
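For illustration only, here is one shape the binary-search-style ordering could take (nothing like this is part of the PR): issue the midpoint split first, then recurse into each half, so consecutive splits rarely land on the same remaining range.

package main

import "fmt"

// binarySplitOrder reorders split keys so that the midpoint is split first,
// then the midpoints of each half, and so on. Consecutive splits therefore
// tend to hit different ranges instead of repeatedly splitting the
// right-hand side of the previous split.
func binarySplitOrder(keys []string) []string {
	var out []string
	var recurse func(lo, hi int)
	recurse = func(lo, hi int) {
		if lo > hi {
			return
		}
		mid := (lo + hi) / 2
		out = append(out, keys[mid])
		recurse(lo, mid-1)
		recurse(mid+1, hi)
	}
	recurse(0, len(keys)-1)
	return out
}

func main() {
	keys := []string{"a", "b", "c", "d", "e", "f", "g"}
	fmt.Println(binarySplitOrder(keys)) // [d b a c f e g]
}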
Reviewed 1 of 1 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/replica_command.go, line 234 at r3 (raw file):
// plus a bit of padding to give a snapshot a chance to catch the follower
// up. If we run out of time, we'll resume the split no matter what.
_ = r.maybeDropMsgAppResp
Is this just to provide a grep/guru target or did you mean to call this method? (if it's just a guru target say so in the comment)
Note that the settings in this PR delay splits by up to 5s. In testing, @tbg and I have found this to be insufficient for getting rid of Raft snapshot backlog. I've been testing with settings that delay splits for up to 3.5m. The longest split I've seen delayed is 34.5s, though that comes with the benefit of no Raft snapshot backlog. Is that still palatable? One thing that leaves me a bit dissatisfied is that I don't have a clear picture of what is going on here. This is my own failing: I haven't taken the time to trace through how these Raft snapshots are accumulating. I'm hoping to either do that, or wait until @tbg produces a clear writeup.
As long as that worst-case only happens either during large restore/imports or during network distress, I think it's bearable.
I second what @petermattis said: it's still not 100% clear what's going on. I think we might actually be seeing priority inversion at work here because preemptive snapshots routinely take 15+ seconds. Putting logging like in this PR in place has been helpful. I'd like to confirm this (or a similar theory) before merging this, and while we're there, the Raft snapshot queue (which very likely is the bottleneck here) needs to get better introspection.
Force-pushed from 4173ec2 to f4040c1.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/replica_command.go, line 235 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I'm not sure this is long enough to wait. If the Raft snapshot queue is backed up it can take minutes to clear. So this just seems like it will wait for a bit, then add more work to the queue.
In my experimentation with this change, bumping this value significantly produces a benefit. In particular, without this change I see a number of messages about delaying a split for 5s only to let it through despite it possibly causing a Raft snapshot. If I bump the value I see splits delayed for longer (17.2s in one example), but no spike in Raft snapshots.
pkg/storage/replica_command.go, line 296 at r9 (raw file):
	return false, nil
})
r.raftMu.Unlock()
Don't you want to do this before the select?
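As a sketch of the ordering being asked about (hypothetical helper, not the actual Replica code): check the condition while holding the mutex, unlock, and only then block in the select, so the lock is never held across the wait.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// waitForCondition polls done() under mu, releasing the lock before blocking.
func waitForCondition(ctx context.Context, mu *sync.Mutex, done func() bool) bool {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		mu.Lock()
		ok := done()
		mu.Unlock() // release before blocking in the select below
		if ok {
			return true
		}
		select {
		case <-ticker.C:
		case <-ctx.Done():
			return false
		}
	}
}

func main() {
	var mu sync.Mutex
	start := time.Now()
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	ok := waitForCondition(ctx, &mu, func() bool { return time.Since(start) > 50*time.Millisecond })
	fmt.Println("condition met:", ok)
}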
No code updates in this round of comments. More tomorrow.
Using the backoff only for certain AdminSplit requests (those issued by backup/restore) is something we can possibly backport, so I'm planning to make that change (after asserting that it fixes all the problems and making sure that long backoffs are traced back to a root cause via a log message).
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/replica_command.go, line 234 at r3 (raw file):
(see the _ assignment below) What should I change this to? (This helps navigate and also prevents naming rot.) _ = ... // guru assignment?
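For readers unfamiliar with the idiom, a self-contained sketch (placeholder type and method; only the pattern matters): the blank assignment keeps a compile-checked, navigable reference to the method named in the comment, without calling it.

package main

import "fmt"

type replica struct{}

// maybeDropMsgAppResp is a stand-in for the real method referenced by the
// comment above the delay logic.
func (r *replica) maybeDropMsgAppResp() {}

func delaySplitToAvoidSnapshot(r *replica) {
	// The blank assignment exists only so that the reference is checked by
	// the compiler and found by grep/guru; the method is not called here.
	_ = (*replica).maybeDropMsgAppResp

	fmt.Println("delaying split (sketch)")
}

func main() {
	delaySplitToAvoidSnapshot(&replica{})
}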
pkg/storage/replica_command.go, line 235 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
In my experimentation with this change, bumping this value significantly produces a benefit. In particular, without this change I see a number of messages about delaying a split for 5s only to let it through despite it possibly causing a Raft snapshot. If I bump the value I see splits delayed for longer (17.2s in one example), but no spike in Raft snapshots.
I've seen the same. I think ~15s is sort of a magic number because that's how long the average snapshot will be blocked. A preemptive snapshot is usually ~30 MB in this workload, and they're rate limited to 2 MB/s and can block a Raft snapshot to the same target. So a Raft snapshot will often have to wait ~15s.
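Spelling out the arithmetic behind that magic number (snapshot size and rate limit as assumed in the comment above):

package main

import (
	"fmt"
	"time"
)

func main() {
	const snapshotMB = 30.0  // assumed typical preemptive snapshot size in this workload
	const rateMBPerSec = 2.0 // assumed snapshot rate limit
	blocked := time.Duration(snapshotMB/rateMBPerSec) * time.Second
	fmt.Println(blocked) // prints 15s: roughly how long a Raft snapshot may wait behind it
}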
pkg/storage/replica_command.go, line 296 at r9 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Don't you want to do this before the select?
Good point, will do.
(I'm also going to experiment with splitting the receiver-side snapshot semaphore into two, so that we always allow one Raft snapshot to be in flight, even if there's already a preemptive one).
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
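A toy sketch of the receiver-side semaphore idea mentioned above (not cockroach code; the names are made up): give Raft snapshots their own slot so an in-flight preemptive snapshot can no longer starve them.

package main

import (
	"fmt"
	"sync"
)

// snapshotSemaphores gives preemptive and Raft snapshots independent slots,
// so each kind only waits on snapshots of the same kind.
type snapshotSemaphores struct {
	preemptive chan struct{}
	raft       chan struct{}
}

func newSnapshotSemaphores() *snapshotSemaphores {
	return &snapshotSemaphores{
		preemptive: make(chan struct{}, 1),
		raft:       make(chan struct{}, 1),
	}
}

// acquire blocks until a slot of the requested kind is free and returns a
// release func.
func (s *snapshotSemaphores) acquire(isRaft bool) func() {
	sem := s.preemptive
	if isRaft {
		sem = s.raft
	}
	sem <- struct{}{}
	return func() { <-sem }
}

func main() {
	s := newSnapshotSemaphores()
	releasePreemptive := s.acquire(false) // a slow preemptive snapshot is in flight
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		releaseRaft := s.acquire(true) // the Raft snapshot is not blocked by it
		fmt.Println("raft snapshot applied")
		releaseRaft()
	}()
	wg.Wait()
	releasePreemptive()
}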
For the record (I think I only mentioned this privately so far): testing this PR has discovered what I think is a Raft bug, where a follower isn't tracked as replicating after it has applied a snapshot that is up-to-date with the leader's index. This is mostly inconsequential, but would trip up the wait logic. I'll link to the upstream PR once it's out (and I still think I found a bug).
Force-pushed from 7b0f789 to 417df15.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/replica_command.go, line 234 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
(see the _ assignment below) What should I change this to? (This helps navigate and also prevents naming rot.) _ = ... // guru assignment?
Done.
pkg/storage/replica_command.go, line 290 at r3 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: this should probably be if waited { extra += fmt.Sprintf("...", timeutil.Since(tPreWait)) }
Done, I think.
pkg/storage/replica_command.go, line 296 at r9 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Good point, will do.
Done.
Force-pushed from 417df15 to 2eb4f12.
It now differentiates between the case in which the follower needs a snapshot vs the one in which it is still being probed. We're not expecting the probing case once the bug described here is fixed: cockroachdb#32594 (comment) Release note: None
This PR needs lots of cleanup love, but I think the basic mechanism now works. I've run the first few minutes of restore2TB repeatedly and while I always see snapshots and sometimes see the backoff mechanism "give up" (14s in my current testing branch), the reason is always because a follower still needs a snapshot. I definitely saw the Raft bug pop up in almost every run until I had tuned all the Raft proposal knobs required (need to wake the leader), and it's been reliably gone since. I think we're getting somewhere here!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
Progress! Let me know when you want me to give this another thorough look.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/replica_command.go, line 305 at r23 (raw file):
if pr.State != raft.ProgressStateReplicate {
	extra += fmt.Sprintf("; replica r%d/%d not caught up: %+v", r.RangeID, replicaID, &pr)
Adding on to extra every tick loop seems excessive. I think you want to do something where you construct a newExtra string for the progress, and then after the loop if !done you set extra = newExtra.
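A minimal sketch of the suggested pattern (the progress type and the tick loop are simplified stand-ins for the real Raft progress check): rebuild newExtra on every pass and publish it into extra only once, after the loop, if we gave up waiting.

package main

import "fmt"

type progress struct {
	state string
}

func main() {
	followers := map[int]progress{1: {"StateReplicate"}, 2: {"StateProbe"}}

	var extra, newExtra string
	done := false
	for tick := 0; tick < 3 && !done; tick++ {
		newExtra = "" // rebuilt from scratch each iteration instead of appending to extra
		done = true
		for id, pr := range followers {
			if pr.state != "StateReplicate" {
				newExtra += fmt.Sprintf("; replica %d not caught up: %+v", id, pr)
				done = false
			}
		}
	}
	if !done {
		extra = newExtra // only published if we gave up waiting
	}
	fmt.Println("proceeding with split" + extra)
}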
Persists a recent finding I made experimenting with cockroachdb#32594. Release note: None
…snapshots Backports cockroachdb#32594. This didn't apply cleanly, but only because I never cherry-picked cockroachdb#32594 and cockroachdb#32594 refers to a variable introduced within. Fixes cockroachdb#32898. Fixes cockroachdb#32900. Fixes cockroachdb#32895. /cc @cockroachdb/release Release note (bug fix): resolve a cluster degradation scenario that could occur during IMPORT/RESTORE operations, manifested through a high number of pending Raft snapshots.
The admin split path must accommodate a scenario where a range with not-yet-replicated followers is being manually split multiple times (eg. IMPORT during TPCC test fixtures). This scenario results in a bunch of replicas that all need to be populated with snapshots. To avoid backing up the raft snapshot queue, a heuristic was put in place (cockroachdb#32594) to delay the admin split if there is another snapshot being applied already. As shown by investigation in a failing test, there is a mismatch between the configured max delay for this heuristic (20s) and the actual duration of the snapshot application - the latter is limited by the max bandwidth for snapshots, 2 MB/s resulting in 32s applications in the worst case. We (Tobias and I) suspect that the heuristic thus fails to wait long enough to have the protective effect it was designed to provide. The current patch increases this delay to exceed this snapshot application duration estimate to about 50s. Note that this scenario is not likely to occur now that cockroachdb#32782 has been merged (this reduces the need for raft snapshots during splits); however in release-2.1 where that patch was not applied, the scenario is more likely. Release note (bug fix): resolve a cluster degradation scenario that could occur during IMPORT/RESTORE operations, manifested through a high number of pending Raft snapshots.
33573: roachtest: add MinVersion to scrub tests r=lucy-zhang a=lucy-zhang The `scrub` tests have been passing on master, but failing on 2.1 because there was a bug (fixed in #32908). Release note: None 33582: storage: bump RaftDelaySplitToSuppressSnapshotTicks r=knz a=knz The admin split path must accommodate a scenario where a range with not-yet-replicated followers is being manually split multiple times (eg. IMPORT during TPCC test fixtures). This scenario results in a bunch of replicas that all need to be populated with snapshots. To avoid backing up the raft snapshot queue, a heuristic was put in place (#32594) to delay the admin split if there is another snapshot being applied already. As shown by investigation in a failing test, there is a mismatch between the configured max delay for this heuristic (20s) and the actual duration of the snapshot application - the latter is limited by the max bandwidth for snapshots, 2 MB/s resulting in 32s applications in the worst case. We (Tobias and I) suspect that the heuristic thus fails to wait long enough to have the protective effect it was designed to provide. The current patch increases this delay to exceed this snapshot application duration estimate to about 50s. Note that this scenario is not likely to occur now that #32782 has been merged (this reduces the need for raft snapshots during splits); however in release-2.1 where that patch was not applied, the scenario is more likely. Release note (bug fix): resolve a cluster degradation scenario that could occur during IMPORT/RESTORE operations, manifested through a high number of pending Raft snapshots. Co-authored-by: Lucy Zhang <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]>
The admin split path must accommodate a scenario where a range with not-yet-replicated followers is being manually split multiple times (eg. IMPORT during TPCC test fixtures). This scenario results in a bunch of replicas that all need to be populated with snapshots. To avoid backing up the raft snapshot queue, a heuristic was put in place (cockroachdb#32594) to delay the admin split if there is another snapshot being applied already. As shown by investigation in a failing test, there is a mismatch between the configured max delay for this heuristic (20s) and the actual duration of the snapshot application - the latter is limited by the max bandwidth for snapshots, 2 MB/s resulting in 32s applications in the worst case. We (Tobias and I) suspect that the heuristic thus fails to wait long enough to have the protective effect it was designed to provide. The current patch increases this delay to exceed this snapshot application duration estimate to about 50s. Release note (bug fix): resolve a cluster degradation scenario that could occur during IMPORT/RESTORE operations, manifested through a high number of pending Raft snapshots.
Bursts of splits (i.e. a sequence of splits for which each split splits the right-hand side of a prior split) can cause issues. This is because splitting a range in which a replica needs a snapshot results in two ranges in which a replica needs a snapshot where additionally there needs to be a sequencing between the two snapshots (one cannot apply a snapshot for the post-split replica until the pre-split replica has moved out of the way). The result of a long burst of splits such as occurring in RESTORE and IMPORT operations is then an overload of the snapshot queue with lots of wasted work, unavailable followers with operations hanging on them, and general mayhem in the logs. Since bulk operations also write a lot of data to the raft logs, log truncations then create an additional snapshot burden; in short, everything will be unhappy for a few hours and the cluster may effectively be unavailable. This isn't news to us and in fact was a big problem "back in 2018". When we first started to understand the issue, we introduced a mechanism that would delay splits (cockroachdb#32594) with the desired effect of ensuring that all followers had caught up to ~all of the previous splits. This helped, but didn't explain why we were seeing snapshots in the first place. Investigating more, we realized that snapshots were sometimes spuriously requested by an uninitialized replica on the right-hand side which was contacted by another member of the right-hand side that had already been initialized by the split executing on the left-hand side; this snapshot was almost always unnecessary since the local left-hand side would usually initialize the right-hand side moments later. To address this, in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth of requests to an uninitialized range, and the mechanism was improved in cockroachdb#32782 and will now only do this if a local neighboring replica is expected to perform the split soon. With all this in place, you would've expected us to have all bases covered, but it turns out that we are still running into issues prior to this PR. Concretely, whenever the aforementioned mechanism drops a message from the leader (a MsgApp), the leader will only contact the replica every second until it responds. It responds when it has been initialized via its left neighbor's splits and the leader reaches out again, i.e. an average of ~500ms after being initialized. However, by that time, it is itself already at the bottom of a long chain of splits, and the 500ms delay is delaying how long it takes for the rest of the chain to get initialized. Since the delay compounds on each link of the chain, the depth of the chain effectively determines the total delay experienced at the end. This would eventually exceed the patience of the mechanism that would suppress the snapshots, and snapshots would be requested. We would descend into madness similar to that experienced in the absence of the mechanism in the first place. The mechanism in cockroachdb#32594 could have helped here, but unfortunately it did not, as it routinely missed the fact that followers were not initialized yet. This is because during a split burst, the replica orchestrating the split was typically only created an instant before, and its raft group hadn't properly transitioned to leader status yet. This meant that in effect it wasn't delaying the splits at all. This commit adjusts the logic to delay splits to avoid this problem. While clamoring for leadership, the delay is upheld.
Once collapsed into a definite state, the existing logic pretty much did the right thing, as it waited for the right-hand side to be initialized. Release note (bug fix): Fixed a scenario in which a rapid sequence of splits could trigger a storm of Raft snapshots. This would be accompanied by log messages of the form "would have dropped incoming MsgApp, but allowing due to ..." and tended to occur as part of RESTORE/IMPORT operations.
64060: kvserver: fix delaying of splits with uninitialized followers r=erikgrinaker a=tbg Bursts of splits (i.e. a sequence of splits for which each split splits the right-hand side of a prior split) can cause issues. This is because splitting a range in which a replica needs a snapshot results in two ranges in which a replica needs a snapshot where additionally there needs to be a sequencing between the two snapshots (one cannot apply a snapshot for the post-split replica until the pre-split replica has moved out of the way). The result of a long burst of splits such as occurring in RESTORE and IMPORT operations is then an overload of the snapshot queue with lots of wasted work, unavailable followers with operations hanging on them, and general mayhem in the logs. Since bulk operations also write a lot of data to the raft logs, log truncations then create an additional snapshot burden; in short, everything will be unhappy for a few hours and the cluster may effectively be unavailable. This isn't news to us and in fact was a big problem "back in 2018". When we first started to understand the issue, we introduced a mechanism that would delay splits (#32594) with the desired effect of ensuring that all followers had caught up to ~all of the previous splits. This helped, but didn't explain why we were seeing snapshots in the first place. Investigating more, we realized that snapshots were sometimes spuriously requested by an uninitialized replica on the right-hand side which was contacted by another member of the right-hand side that had already been initialized by the split executing on the left-hand side; this snapshot was almost always unnecessary since the local left-hand side would usually initialize the right-hand side moments later. To address this, in #32594 we started unconditionally dropping the first ~seconds worth of requests to an uninitialized range, and the mechanism was improved in #32782 and will now only do this if a local neighboring replica is expected to perform the split soon. With all this in place, you would've expected us to have all bases covered, but it turns out that we are still running into issues prior to this PR. Concretely, whenever the aforementioned mechanism drops a message from the leader (a MsgApp), the leader will only contact the replica every second until it responds. It responds when it has been initialized via its left neighbor's splits and the leader reaches out again, i.e. an average of ~500ms after being initialized. However, by that time, it is itself already at the bottom of a long chain of splits, and the 500ms delay is delaying how long it takes for the rest of the chain to get initialized. Since the delay compounds on each link of the chain, the depth of the chain effectively determines the total delay experienced at the end. This would eventually exceed the patience of the mechanism that would suppress the snapshots, and snapshots would be requested. We would descend into madness similar to that experienced in the absence of the mechanism in the first place. The mechanism in #32594 could have helped here, but unfortunately it did not, as it routinely missed the fact that followers were not initialized yet. This is because during a split burst, the replica orchestrating the split was typically only created an instant before, and its raft group hadn't properly transitioned to leader status yet. This meant that in effect it wasn't delaying the splits at all. This commit adjusts the logic to delay splits to avoid this problem.
While clamoring for leadership, the delay is upheld. Once collapsed into a definite state, the existing logic pretty much did the right thing, as it waited for the right-hand side to be initialized. Closes #61396. cc @cockroachdb/kv Release note (bug fix): Fixed a scenario in which a rapid sequence of splits could trigger a storm of Raft snapshots. This would be accompanied by log messages of the form "would have dropped incoming MsgApp, but allowing due to ..." and tended to occur as part of RESTORE/IMPORT operations. Co-authored-by: Tobias Grieger <[email protected]>
This is unpolished, but I had used an earlier version of this with what at the
time looked like success. At this point I suspect that this is the best way to
suppress Raft snapshot growth in IMPORT/RESTORE.
(Definitely needs tests).
When a Range has followers that aren't replicating properly, splitting
that range results in a right-hand side with followers in a similar
state. Certain workloads (restore/import/presplit) can run large numbers
of splits against a given range, and this can result in a large number
of Raft snapshots that backs up the Raft snapshot queue.
Ideally we'd never have any ranges that require a snapshot, but over
the last weeks it has become clear that this is very difficult to
achieve since the knowledge required to decide whether a snapshot
can efficiently be prevented is distributed across multiple nodes
that don't share the necessary information.
This is a bit of a nuclear option to prevent the likely last big
culprit in large numbers of Raft snapshots in #31409.
With this change, we should expect to see Raft snapshots regularly when
a split/scatter phase of an import/restore is active, but never large
volumes at once.
Release note: None
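To tie the description together, a heavily simplified sketch of the delay mechanism (hypothetical helpers; the real logic in replica_command.go consults the store config and the Raft progress of each follower): poll for a bounded number of ticks, let the split through early once followers have caught up, and otherwise proceed anyway while noting the delay in the log message.

package main

import (
	"context"
	"fmt"
	"time"
)

// followersCaughtUp stands in for inspecting raft.Status().Progress on the
// left-hand side; here it simply flips to true after a short while.
func followersCaughtUp(start time.Time) bool {
	return time.Since(start) > 30*time.Millisecond
}

// maybeDelaySplit waits up to maxTicks tick intervals for followers to catch
// up and returns a human-readable note about any delay.
func maybeDelaySplit(ctx context.Context, maxTicks int, tick time.Duration) string {
	start := time.Now()
	for t := 0; t < maxTicks; t++ {
		if followersCaughtUp(start) {
			if t == 0 {
				return "" // no delay, nothing to report
			}
			return fmt.Sprintf("; delayed split for %s to avoid Raft snapshot", time.Since(start))
		}
		select {
		case <-time.After(tick):
		case <-ctx.Done():
			return "; delayed split; context canceled"
		}
	}
	return fmt.Sprintf("; delayed split for %s; proceeding anyway (a follower may need a Raft snapshot)", time.Since(start))
}

func main() {
	extra := maybeDelaySplit(context.Background(), 25, 10*time.Millisecond)
	fmt.Println("initiating a split of this range" + extra)
}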