Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: delay manual splits that would result in more snapshots #32594

Merged
merged 1 commit into from
Nov 30, 2018

Conversation

tbg
Copy link
Member

@tbg tbg commented Nov 26, 2018

This is unpolished, but I had used an earlier version of this with what at the
time looked like success. At this point I suspect that this is the best way to
suppress Raft snapshot growth in IMPORT/RESTORE.

(Definitely needs tests).


When a Range has followers that aren't replicating properly, splitting
that range results in a right-hand side with followers in a similar
state. Certain workloads (restore/import/presplit) can run large numbers
of splits against a given range, and this can result in a large number
of Raft snapshots that backs up the Raft snapshot queue.

Ideally we'd never have any ranges that require a snapshot, but over
the last weeks it has become clear that this is very difficult to
achieve since the knowledge required to decide whether a snapshot
can efficiently be prevented is distributed across multiple nodes
that don't share the necessary information.

This is a bit of a nuclear option to prevent the likely last big
culprit in large numbers of Raft snapshots in #31409.

With this change, we should expect to see Raft snapshots regularly when
a split/scatter phase of an import/restore is active, but never large
volumes at once.

Release note: None

@tbg tbg requested a review from a team November 26, 2018 16:12
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg requested a review from petermattis November 26, 2018 16:12
@tbg tbg force-pushed the fix/raft-delayed-snaps branch from 063f372 to 21da18d Compare November 26, 2018 16:16
Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_command.go, line 282 at r2 (raw file):

	}
	if tPostWait == (time.Time{}) {
		tPostWait = timeutil.Now()

I'm not understanding what is motivating this handling of tPostWait. I think you want a boolean indicating whether you ever had a time.After call, and if yes then output the delayed split for message. As it is, it appears you'll always output that message.


pkg/storage/replica_command.go, line 393 at r2 (raw file):

	log.Infof(ctx, "initiating a split of this range at key %s [r%d]%s",
		extra, splitKey, rightDesc.RangeID)

Is the argument reordering intentional? Seems like that messes with the log message strangely.

@tbg tbg force-pushed the fix/raft-delayed-snaps branch 2 times, most recently from 02d4795 to 3f1a1f4 Compare November 26, 2018 17:15
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The small fry missing from this change aside, assuming this addresses the problem, is this what we want? Does any alternative approach come to mind? I concede defeat on avoiding Raft snapshots. The RaftPostSplitSuppressSnapshotTicks mechanism makes snapshots much worse when it delays without the split trigger showing up, and yet it seems impossible to do much better. I tried (unpushed WIP) to move that detection to the receiver (i.e. the Raft leader of the uninitialized RHS) and this works very well unless the Raft leader of the RHS isn't colocated with that of the LHS. This is mostly true today (though not in the test that I added for this mechanism, which is now unfixably flaky) and I imagine that in the future we won't want splits to pin the RHS and LHS leaseholder together as is done today. (And even if that were somehow circumvented, the problem is still that a single Raft snapshot for whatever reason is multiplied by whatever number of splits hit the range in that state, unless there's a mechanism such as this one).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_command.go, line 282 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm not understanding what is motivating this handling of tPostWait. I think you want a boolean indicating whether you ever had a time.After call, and if yes then output the delayed split for message. As it is, it appears you'll always output that message.

I think this works as intended. Note that we initialize tPostWait := tPreWait and zero it out only when we're actually waiting (i.e. when ticks > 0, though that should really be ticks = 1). I agree that a bool is saner though, done.


pkg/storage/replica_command.go, line 393 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is the argument reordering intentional? Seems like that messes with the log message strangely.

No, and I just fixed this because the output looked really confusing.

@tbg tbg force-pushed the fix/raft-delayed-snaps branch from 3f1a1f4 to 0e46399 Compare November 26, 2018 17:52
Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other mitigation we might try is to speed up Raft snapshots when they are very small. For restore / import, the Raft snapshots should be mostly empty. For a range with real data that requires a snapshot, splitting will double the number of snapshots, but each snapshot should be half the size.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica_command.go, line 290 at r3 (raw file):

	}

	if elapsed := timeutil.Since(tPreWait); waited {

Nit: this should probably be if waited { extra += fmt.Sprintf("...", timeutil.Since(tPreWait)) }

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/replica_command.go, line 235 at r3 (raw file):

	// up. If we run out of time, we'll resume the split no matter what.
	_ = r.maybeDropMsgAppResp
	maxDelaySplitToAvoidSnapshotTicks := 5 + r.store.cfg.RaftPostSplitSuppressSnapshotTicks

I'm not sure this is long enough to wait. If the Raft snapshot queue is backed up it can take minutes to clear. So this just seems like it will wait for a bit, then add more work to the queue.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this but if it can put an end to the saga of raft snapshots I'm for it.

One more idea is whether we can tune the order in which splits happen. Would reversing the order of splits make us less likely to hit pathological cases? Or if both ascending and descending are problematic, maybe a pattern like binary search to avoid hitting the same ranges repeatedly?

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/replica_command.go, line 234 at r3 (raw file):

	// plus a bit of padding to give a snapshot a chance to catch the follower
	// up. If we run out of time, we'll resume the split no matter what.
	_ = r.maybeDropMsgAppResp

Is this just to provide a grep/guru target or did you mean to call this method? (if it's just a guru target say so in the comment)

@petermattis
Copy link
Collaborator

I don't love this but if it can put an end to the saga of raft snapshots I'm for it.

Note that the settings in this PR delay splits by up to 5s. In testing, @tbg and I have found this to be insufficient for getting rid of Raft snapshot backlog. I've been testing with settings that delay splits for up to 3.5m. The longest split I've seen delayed is 34.5s, though that comes with the benefit of no Raft snapshot backlog. Is that still palatable?

One thing that leaves me a bit dissatisfied is that I don't have a clear picture of what is going on here. This is my own failing: I haven't taken the time to trace through how these Raft snapshots are accumulating. I'm hoping to either do that, or wait until @tbg produces a clear writeup.

@bdarnell
Copy link
Contributor

The longest split I've seen delayed is 34.5s, though that comes with the benefit of no Raft snapshot backlog. Is that still palatable?

As long as that worst-case only happens either during large restore/imports or during network distress, I think it's bearable.

@tbg
Copy link
Member Author

tbg commented Nov 27, 2018

I second what @petermattis said, it's still not 100% clear what's going on. I think we might actually be seeing priority inversion at work here because preemptive snapshots routinely take 15+ seconds. Putting logging like in this PR in place has been helpful. I'd like to confirm this (or a similar theory) before merging this, and while we're there, the Raft snapshot queue (which very likely is the bottleneck here) needs to get better introspection.

@tbg tbg force-pushed the fix/raft-delayed-snaps branch 2 times, most recently from 4173ec2 to f4040c1 Compare November 27, 2018 11:58
Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_command.go, line 235 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm not sure this is long enough to wait. If the Raft snapshot queue is backed up it can take minutes to clear. So this just seems like it will wait for a bit, then add more work to the queue.

In my experimentation with this change, bumping this value significantly produces a benefit. In particular, without this change I see a number of messages about delaying a split for 5s only to let it through despite it possibly causing a Raft snapshot. If a bump the value I see splits delayed for longer (17.2s in one example), but no spike in Raft snapshots.


pkg/storage/replica_command.go, line 296 at r9 (raw file):

				return false, nil
			})
			r.raftMu.Unlock()

Don't you want to do this before the select?

Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No code updates in this round of comments. More tomorrow.

Using the backoff only for certain AdminSplit (those issued by backup/restore) is something we can possibly backport so I'm planning to make that change (after asserting that it fixes all the problems and making sure that long backoffs are traced back to a root cause via a log message).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_command.go, line 234 at r3 (raw file):

see the _ assignment below)

What should I change this to? (This helps navigate and also prevents naming rot)

_ = ... // guru assignment

?


pkg/storage/replica_command.go, line 235 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

In my experimentation with this change, bumping this value significantly produces a benefit. In particular, without this change I see a number of messages about delaying a split for 5s only to let it through despite it possibly causing a Raft snapshot. If a bump the value I see splits delayed for longer (17.2s in one example), but no spike in Raft snapshots.

I've seen the same. I think ~15s is sort of a magic number because that's how long the average snapshot will be blocked. A preemptive snapshot is usually ~30mb in this workload and they're rate limited to 2mb/s and can block a Raft snapshot to the same target. So a Raft snapshot will often have to wait ~15s.


pkg/storage/replica_command.go, line 296 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Don't you want to do this before the select?

Good point, will do.

Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'm also going to experiment with splitting the receiver-side snapshot semaphore into two, so that we always allow one Raft snapshot to be in flight, even if there's already a preemptive one).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@tbg
Copy link
Member Author

tbg commented Nov 28, 2018

For the record (I think I only mentioned this privately so far): testing this PR has discovered what I think is a Raft bug, where a follower isn't tracked as replicating after it has applied a snapshot that is up-to-date with the leader's index. This is mostly inconsequential, but would trip up the wait logic. I'll link to the upstream PR once it's out (and I still think I found a bug).

@tbg tbg force-pushed the fix/raft-delayed-snaps branch from 7b0f789 to 417df15 Compare November 28, 2018 12:49
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_command.go, line 234 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…
see the _ assignment below)

What should I change this to? (This helps navigate and also prevents naming rot)

_ = ... // guru assignment

?

Done.


pkg/storage/replica_command.go, line 290 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: this should probably be if waited { extra += fmt.Sprintf("...", timeutil.Since(tPreWait)) }

Done, I think.


pkg/storage/replica_command.go, line 296 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Good point, will do.

Done.

@tbg tbg force-pushed the fix/raft-delayed-snaps branch from 417df15 to 2eb4f12 Compare November 28, 2018 13:27
tbg added a commit to tbg/cockroach that referenced this pull request Nov 28, 2018
It now differentiates between the case in which the follower needs a
snapshot vs the one in which it is still being probed. We're not
expecting the probing case once the bug described here is fixed:

cockroachdb#32594 (comment)

Release note: None
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR needs lots of cleanup love, but I think the basic mechanism now works. I've run the first few minutes of restore2TB repeatedly and while I always see snapshots and sometimes see the backoff mechanism "give up" (14s in my current testing branch), the reason is always because a follower still needs a snapshot. I definitely saw the Raft bug pop up in almost every run until I had tuned all the Raft proposal knobs required (need to wake the leader), and it's been reliably gone since. I think we're getting somewhere here!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Progress! Let me know when you want me to give this another thorough look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_command.go, line 305 at r23 (raw file):

			if pr.State != raft.ProgressStateReplicate {
				extra += fmt.Sprintf("; replica r%d/%d not caught up: %+v", r.RangeID, replicaID, &pr)

Adding on to extra every tick loop seems excessive. I think you want to do something where you construct a newExtra string for the progress, and then after the loop if !done you set extra = newExtra.

tbg added a commit to tbg/cockroach that referenced this pull request Dec 11, 2018
It now differentiates between the case in which the follower needs a
snapshot vs the one in which it is still being probed. We're not
expecting the probing case once the bug described here is fixed:

cockroachdb#32594 (comment)

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Dec 11, 2018
Persists a recent finding I made experimenting with cockroachdb#32594.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Dec 11, 2018
…snapshots

Backports cockroachdb#32594. This didn't apply cleanly, but only because I never
cherry-picked cockroachdb#32594 and cockroachdb#32594 refers a variable introduced within.

Fixes cockroachdb#32898.
Fixes cockroachdb#32900.
Fixes cockroachdb#32895.

/cc @cockroachdb/release

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.
knz added a commit to knz/cockroach that referenced this pull request Jan 9, 2019
The admin split path must accommodate a scenario where a range with
not-yet-replicated followers is being manually split multiple
times (eg. IMPORT during TPCC test fixtures). This scenario results in
a bunch of replicas that all need to be populated with
snapshots. To avoid backing up the raft snapshot queue, a heuristic
was put in place (cockroachdb#32594) to delay the admin split if there is another
snapshot being applied already.

As shown by investigation in a failing test, there is a mismatch
between the configured max delay for this heuristic (20s) and the
actual duration of the snapshot application - the latter is limited by
the max bandwidth for snapshots, 2 MB/s resulting in 32s applications
in the worst case. We (Tobias and I) suspect that the heuristic thus
fails to wait long enough to have the protective effect it was
designed to provide.

The current patch increases this delay to exceed this snapshot
application duration estimate to about 50s.

Note that this scenario is not likely to occur now that cockroachdb#32782 has
been merged (this reduces the need for raft snapshots during splits);
however in release-2.1 where that patch was not applied, the scenario
is more likely.

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.
knz added a commit to knz/cockroach that referenced this pull request Jan 9, 2019
The admin split path must accommodate a scenario where a range with
not-yet-replicated followers is being manually split multiple
times (eg. IMPORT during TPCC test fixtures). This scenario results in
a bunch of replicas that all need to be populated with
snapshots. To avoid backing up the raft snapshot queue, a heuristic
was put in place (cockroachdb#32594) to delay the admin split if there is another
snapshot being applied already.

As shown by investigation in a failing test, there is a mismatch
between the configured max delay for this heuristic (20s) and the
actual duration of the snapshot application - the latter is limited by
the max bandwidth for snapshots, 2 MB/s resulting in 32s applications
in the worst case. We (Tobias and I) suspect that the heuristic thus
fails to wait long enough to have the protective effect it was
designed to provide.

The current patch increases this delay to exceed this snapshot
application duration estimate to about 50s.

Note that this scenario is not likely to occur now that cockroachdb#32782 has
been merged (this reduces the need for raft snapshots during splits);
however in release-2.1 where that patch was not applied, the scenario
is more likely.

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.
craig bot pushed a commit that referenced this pull request Jan 9, 2019
33573: roachtest: add MinVersion to scrub tests r=lucy-zhang a=lucy-zhang

The `scrub` tests have been passing on master, but failing on 2.1 because there
was a bug (fixed in #32908).

Release note: None

33582: storage: bump RaftDelaySplitToSuppressSnapshotTicks r=knz a=knz

The admin split path must accommodate a scenario where a range with
not-yet-replicated followers is being manually split multiple
times (eg. IMPORT during TPCC test fixtures). This scenario results in
a bunch of replicas that all need to be populated with
snapshots. To avoid backing up the raft snapshot queue, a heuristic
was put in place (#32594) to delay the admin split if there is another
snapshot being applied already.

As shown by investigation in a failing test, there is a mismatch
between the configured max delay for this heuristic (20s) and the
actual duration of the snapshot application - the latter is limited by
the max bandwidth for snapshots, 2 MB/s resulting in 32s applications
in the worst case. We (Tobias and I) suspect that the heuristic thus
fails to wait long enough to have the protective effect it was
designed to provide.

The current patch increases this delay to exceed this snapshot
application duration estimate to about 50s.

Note that this scenario is not likely to occur now that #32782 has
been merged (this reduces the need for raft snapshots during splits);
however in release-2.1 where that patch was not applied, the scenario
is more likely.

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.

Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
knz added a commit to knz/cockroach that referenced this pull request Jan 9, 2019
The admin split path must accommodate a scenario where a range with
not-yet-replicated followers is being manually split multiple
times (eg. IMPORT during TPCC test fixtures). This scenario results in
a bunch of replicas that all need to be populated with
snapshots. To avoid backing up the raft snapshot queue, a heuristic
was put in place (cockroachdb#32594) to delay the admin split if there is another
snapshot being applied already.

As shown by investigation in a failing test, there is a mismatch
between the configured max delay for this heuristic (20s) and the
actual duration of the snapshot application - the latter is limited by
the max bandwidth for snapshots, 2 MB/s resulting in 32s applications
in the worst case. We (Tobias and I) suspect that the heuristic thus
fails to wait long enough to have the protective effect it was
designed to provide.

The current patch increases this delay to exceed this snapshot
application duration estimate to about 50s.

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.
tbg added a commit to tbg/cockroach that referenced this pull request Apr 22, 2021
Bursts of splits (i.e. a sequence of splits for which each split splits
the right-hand side of a prior split) can cause issues. This is because
splitting a range in which a replica needs a snapshot results in two
ranges in which a replica needs a snapshot where additionally there
needs to be a sequencing between the two snapshots (one cannot apply
a snapshot for the post-split replica until the pre-split replica has
moved out of the way). The result of a long burst of splits such as
occurring in RESTORE and IMPORT operations is then an overload of the
snapshot queue with lots of wasted work, unavailable followers with
operations hanging on them, and general mayhem in the logs. Since
bulk operations also write a lot of data to the raft logs, log
truncations then create an additional snapshot burden; in short,
everything will be unhappy for a few hours and the cluster may
effectively be unavailable.

This isn't news to us and in fact was a big problem "back in 2018".
When we first started to understand the issue, we introduced a mechanism
that would delay splits (cockroachdb#32594) with the desired effect of ensuring
that, all followers had caught up to ~all of the previous splits.
This helped, but didn't explain why we were seeing snapshots in the
first place.

Investigating more, we realized that snapshots were sometimes spuriously
requested by an uninitialized replica on the right-hand side which was
contacted by another member of the right-hand side that had already been
initialized by the split executing on the left-hand side; this snapshot
was almost always unnecessary since the local left-hand side would
usually initialize the right-hand side moments later.  To address this,
in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth
of requests to an uninitialized range, and the mechanism was improved in
 cockroachdb#32782 and will now only do this if a local neighboring replica is
expected to perform the split soon.

With all this in place, you would've expected us to have all bases
covered but it turns out that we are still running into issues prior
to this PR.

Concretely, whenever the aforementioned mechanism drops a message from
the leader (a MsgApp), the leader will only contact the replica every
second until it responds. It responds when it has been initialized via
its left neighbor's splits and the leader reaches out again, i.e.  an
average of ~500ms after being initialized. However, by that time, it is
itself already at the bottom of a long chain of splits, and the 500ms
delay is delaying how long it takes for the rest of the chain to get
initialized.  Since the delay compounds on each link of the chain, the
depth of the chain effectively determines the total delay experienced at
the end. This would eventually exceed the patience of the mechanism that
would suppress the snapshots, and snapshots would be requested. We would
descend into madness similar to that experienced in the absence of the
mechanism in the first place.

The mechanism in cockroachdb#32594 could have helped here, but unfortunately it
did not, as it routinely missed the fact that followers were not
initialized yet. This is because during a split burst, the replica
orchestrating the split was typically only created an instant before,
and its raft group hadn't properly transitioned to leader status yet.
This meant that in effect it wasn't delaying the splits at all.

This commit adjusts the logic to delay splits to avoid this problem.
While clamoring for leadership, the delay is upheld. Once collapsed
into a definite state, the existing logic pretty much did the right
thing, as it waited for the right-hand side to be in initialized.

Release note (bug fix): Fixed a scenario in which a rapid sequence
of splits could trigger a storm of Raft snapshots. This would be
accompanied by log messages of the form "would have dropped incoming
MsgApp, but allowing due to ..." and tended to occur as part of
RESTORE/IMPORT operations.
tbg added a commit to tbg/cockroach that referenced this pull request Apr 23, 2021
Bursts of splits (i.e. a sequence of splits for which each split splits
the right-hand side of a prior split) can cause issues. This is because
splitting a range in which a replica needs a snapshot results in two
ranges in which a replica needs a snapshot where additionally there
needs to be a sequencing between the two snapshots (one cannot apply
a snapshot for the post-split replica until the pre-split replica has
moved out of the way). The result of a long burst of splits such as
occurring in RESTORE and IMPORT operations is then an overload of the
snapshot queue with lots of wasted work, unavailable followers with
operations hanging on them, and general mayhem in the logs. Since
bulk operations also write a lot of data to the raft logs, log
truncations then create an additional snapshot burden; in short,
everything will be unhappy for a few hours and the cluster may
effectively be unavailable.

This isn't news to us and in fact was a big problem "back in 2018".
When we first started to understand the issue, we introduced a mechanism
that would delay splits (cockroachdb#32594) with the desired effect of ensuring
that, all followers had caught up to ~all of the previous splits.
This helped, but didn't explain why we were seeing snapshots in the
first place.

Investigating more, we realized that snapshots were sometimes spuriously
requested by an uninitialized replica on the right-hand side which was
contacted by another member of the right-hand side that had already been
initialized by the split executing on the left-hand side; this snapshot
was almost always unnecessary since the local left-hand side would
usually initialize the right-hand side moments later.  To address this,
in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth
of requests to an uninitialized range, and the mechanism was improved in
 cockroachdb#32782 and will now only do this if a local neighboring replica is
expected to perform the split soon.

With all this in place, you would've expected us to have all bases
covered but it turns out that we are still running into issues prior
to this PR.

Concretely, whenever the aforementioned mechanism drops a message from
the leader (a MsgApp), the leader will only contact the replica every
second until it responds. It responds when it has been initialized via
its left neighbor's splits and the leader reaches out again, i.e.  an
average of ~500ms after being initialized. However, by that time, it is
itself already at the bottom of a long chain of splits, and the 500ms
delay is delaying how long it takes for the rest of the chain to get
initialized.  Since the delay compounds on each link of the chain, the
depth of the chain effectively determines the total delay experienced at
the end. This would eventually exceed the patience of the mechanism that
would suppress the snapshots, and snapshots would be requested. We would
descend into madness similar to that experienced in the absence of the
mechanism in the first place.

The mechanism in cockroachdb#32594 could have helped here, but unfortunately it
did not, as it routinely missed the fact that followers were not
initialized yet. This is because during a split burst, the replica
orchestrating the split was typically only created an instant before,
and its raft group hadn't properly transitioned to leader status yet.
This meant that in effect it wasn't delaying the splits at all.

This commit adjusts the logic to delay splits to avoid this problem.
While clamoring for leadership, the delay is upheld. Once collapsed
into a definite state, the existing logic pretty much did the right
thing, as it waited for the right-hand side to be in initialized.

Release note (bug fix): Fixed a scenario in which a rapid sequence
of splits could trigger a storm of Raft snapshots. This would be
accompanied by log messages of the form "would have dropped incoming
MsgApp, but allowing due to ..." and tended to occur as part of
RESTORE/IMPORT operations.
tbg added a commit to tbg/cockroach that referenced this pull request Apr 23, 2021
Bursts of splits (i.e. a sequence of splits for which each split splits
the right-hand side of a prior split) can cause issues. This is because
splitting a range in which a replica needs a snapshot results in two
ranges in which a replica needs a snapshot where additionally there
needs to be a sequencing between the two snapshots (one cannot apply
a snapshot for the post-split replica until the pre-split replica has
moved out of the way). The result of a long burst of splits such as
occurring in RESTORE and IMPORT operations is then an overload of the
snapshot queue with lots of wasted work, unavailable followers with
operations hanging on them, and general mayhem in the logs. Since
bulk operations also write a lot of data to the raft logs, log
truncations then create an additional snapshot burden; in short,
everything will be unhappy for a few hours and the cluster may
effectively be unavailable.

This isn't news to us and in fact was a big problem "back in 2018".
When we first started to understand the issue, we introduced a mechanism
that would delay splits (cockroachdb#32594) with the desired effect of ensuring
that, all followers had caught up to ~all of the previous splits.
This helped, but didn't explain why we were seeing snapshots in the
first place.

Investigating more, we realized that snapshots were sometimes spuriously
requested by an uninitialized replica on the right-hand side which was
contacted by another member of the right-hand side that had already been
initialized by the split executing on the left-hand side; this snapshot
was almost always unnecessary since the local left-hand side would
usually initialize the right-hand side moments later.  To address this,
in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth
of requests to an uninitialized range, and the mechanism was improved in
 cockroachdb#32782 and will now only do this if a local neighboring replica is
expected to perform the split soon.

With all this in place, you would've expected us to have all bases
covered but it turns out that we are still running into issues prior
to this PR.

Concretely, whenever the aforementioned mechanism drops a message from
the leader (a MsgApp), the leader will only contact the replica every
second until it responds. It responds when it has been initialized via
its left neighbor's splits and the leader reaches out again, i.e.  an
average of ~500ms after being initialized. However, by that time, it is
itself already at the bottom of a long chain of splits, and the 500ms
delay is delaying how long it takes for the rest of the chain to get
initialized.  Since the delay compounds on each link of the chain, the
depth of the chain effectively determines the total delay experienced at
the end. This would eventually exceed the patience of the mechanism that
would suppress the snapshots, and snapshots would be requested. We would
descend into madness similar to that experienced in the absence of the
mechanism in the first place.

The mechanism in cockroachdb#32594 could have helped here, but unfortunately it
did not, as it routinely missed the fact that followers were not
initialized yet. This is because during a split burst, the replica
orchestrating the split was typically only created an instant before,
and its raft group hadn't properly transitioned to leader status yet.
This meant that in effect it wasn't delaying the splits at all.

This commit adjusts the logic to delay splits to avoid this problem.
While clamoring for leadership, the delay is upheld. Once collapsed
into a definite state, the existing logic pretty much did the right
thing, as it waited for the right-hand side to be in initialized.

Release note (bug fix): Fixed a scenario in which a rapid sequence
of splits could trigger a storm of Raft snapshots. This would be
accompanied by log messages of the form "would have dropped incoming
MsgApp, but allowing due to ..." and tended to occur as part of
RESTORE/IMPORT operations.
tbg added a commit to tbg/cockroach that referenced this pull request Apr 24, 2021
Bursts of splits (i.e. a sequence of splits for which each split splits
the right-hand side of a prior split) can cause issues. This is because
splitting a range in which a replica needs a snapshot results in two
ranges in which a replica needs a snapshot where additionally there
needs to be a sequencing between the two snapshots (one cannot apply
a snapshot for the post-split replica until the pre-split replica has
moved out of the way). The result of a long burst of splits such as
occurring in RESTORE and IMPORT operations is then an overload of the
snapshot queue with lots of wasted work, unavailable followers with
operations hanging on them, and general mayhem in the logs. Since
bulk operations also write a lot of data to the raft logs, log
truncations then create an additional snapshot burden; in short,
everything will be unhappy for a few hours and the cluster may
effectively be unavailable.

This isn't news to us and in fact was a big problem "back in 2018".
When we first started to understand the issue, we introduced a mechanism
that would delay splits (cockroachdb#32594) with the desired effect of ensuring
that, all followers had caught up to ~all of the previous splits.
This helped, but didn't explain why we were seeing snapshots in the
first place.

Investigating more, we realized that snapshots were sometimes spuriously
requested by an uninitialized replica on the right-hand side which was
contacted by another member of the right-hand side that had already been
initialized by the split executing on the left-hand side; this snapshot
was almost always unnecessary since the local left-hand side would
usually initialize the right-hand side moments later.  To address this,
in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth
of requests to an uninitialized range, and the mechanism was improved in
 cockroachdb#32782 and will now only do this if a local neighboring replica is
expected to perform the split soon.

With all this in place, you would've expected us to have all bases
covered but it turns out that we are still running into issues prior
to this PR.

Concretely, whenever the aforementioned mechanism drops a message from
the leader (a MsgApp), the leader will only contact the replica every
second until it responds. It responds when it has been initialized via
its left neighbor's splits and the leader reaches out again, i.e.  an
average of ~500ms after being initialized. However, by that time, it is
itself already at the bottom of a long chain of splits, and the 500ms
delay is delaying how long it takes for the rest of the chain to get
initialized.  Since the delay compounds on each link of the chain, the
depth of the chain effectively determines the total delay experienced at
the end. This would eventually exceed the patience of the mechanism that
would suppress the snapshots, and snapshots would be requested. We would
descend into madness similar to that experienced in the absence of the
mechanism in the first place.

The mechanism in cockroachdb#32594 could have helped here, but unfortunately it
did not, as it routinely missed the fact that followers were not
initialized yet. This is because during a split burst, the replica
orchestrating the split was typically only created an instant before,
and its raft group hadn't properly transitioned to leader status yet.
This meant that in effect it wasn't delaying the splits at all.

This commit adjusts the logic to delay splits to avoid this problem.
While clamoring for leadership, the delay is upheld. Once collapsed
into a definite state, the existing logic pretty much did the right
thing, as it waited for the right-hand side to be in initialized.

Release note (bug fix): Fixed a scenario in which a rapid sequence
of splits could trigger a storm of Raft snapshots. This would be
accompanied by log messages of the form "would have dropped incoming
MsgApp, but allowing due to ..." and tended to occur as part of
RESTORE/IMPORT operations.
craig bot pushed a commit that referenced this pull request Apr 24, 2021
64060: kvserver: fix delaying of splits with uninitialized followers r=erikgrinaker a=tbg

Bursts of splits (i.e. a sequence of splits for which each split splits
the right-hand side of a prior split) can cause issues. This is because
splitting a range in which a replica needs a snapshot results in two
ranges in which a replica needs a snapshot where additionally there
needs to be a sequencing between the two snapshots (one cannot apply
a snapshot for the post-split replica until the pre-split replica has
moved out of the way). The result of a long burst of splits such as
occurring in RESTORE and IMPORT operations is then an overload of the
snapshot queue with lots of wasted work, unavailable followers with
operations hanging on them, and general mayhem in the logs. Since
bulk operations also write a lot of data to the raft logs, log
truncations then create an additional snapshot burden; in short,
everything will be unhappy for a few hours and the cluster may
effectively be unavailable.

This isn't news to us and in fact was a big problem "back in 2018".
When we first started to understand the issue, we introduced a mechanism
that would delay splits (#32594) with the desired effect of ensuring
that, all followers had caught up to ~all of the previous splits.
This helped, but didn't explain why we were seeing snapshots in the
first place.

Investigating more, we realized that snapshots were sometimes spuriously
requested by an uninitialized replica on the right-hand side which was
contacted by another member of the right-hand side that had already been
initialized by the split executing on the left-hand side; this snapshot
was almost always unnecessary since the local left-hand side would
usually initialize the right-hand side moments later.  To address this,
in #32594 we started unconditionally dropping the first ~seconds worth
of requests to an uninitialized range, and the mechanism was improved in
 #32782 and will now only do this if a local neighboring replica is
expected to perform the split soon.

With all this in place, you would've expected us to have all bases
covered but it turns out that we are still running into issues prior
to this PR.

Concretely, whenever the aforementioned mechanism drops a message from
the leader (a MsgApp), the leader will only contact the replica every
second until it responds. It responds when it has been initialized via
its left neighbor's splits and the leader reaches out again, i.e.  an
average of ~500ms after being initialized. However, by that time, it is
itself already at the bottom of a long chain of splits, and the 500ms
delay is delaying how long it takes for the rest of the chain to get
initialized.  Since the delay compounds on each link of the chain, the
depth of the chain effectively determines the total delay experienced at
the end. This would eventually exceed the patience of the mechanism that
would suppress the snapshots, and snapshots would be requested. We would
descend into madness similar to that experienced in the absence of the
mechanism in the first place.

The mechanism in #32594 could have helped here, but unfortunately it
did not, as it routinely missed the fact that followers were not
initialized yet. This is because during a split burst, the replica
orchestrating the split was typically only created an instant before,
and its raft group hadn't properly transitioned to leader status yet.
This meant that in effect it wasn't delaying the splits at all.

This commit adjusts the logic to delay splits to avoid this problem.
While clamoring for leadership, the delay is upheld. Once collapsed
into a definite state, the existing logic pretty much did the right
thing, as it waited for the right-hand side to be in initialized.

Closes #61396.

cc @cockroachdb/kv

Release note (bug fix): Fixed a scenario in which a rapid sequence
of splits could trigger a storm of Raft snapshots. This would be
accompanied by log messages of the form "would have dropped incoming
MsgApp, but allowing due to ..." and tended to occur as part of
RESTORE/IMPORT operations.


Co-authored-by: Tobias Grieger <[email protected]>
tbg added a commit to tbg/cockroach that referenced this pull request Apr 26, 2021
Bursts of splits (i.e. a sequence of splits for which each split splits
the right-hand side of a prior split) can cause issues. This is because
splitting a range in which a replica needs a snapshot results in two
ranges in which a replica needs a snapshot where additionally there
needs to be a sequencing between the two snapshots (one cannot apply
a snapshot for the post-split replica until the pre-split replica has
moved out of the way). The result of a long burst of splits such as
occurring in RESTORE and IMPORT operations is then an overload of the
snapshot queue with lots of wasted work, unavailable followers with
operations hanging on them, and general mayhem in the logs. Since
bulk operations also write a lot of data to the raft logs, log
truncations then create an additional snapshot burden; in short,
everything will be unhappy for a few hours and the cluster may
effectively be unavailable.

This isn't news to us and in fact was a big problem "back in 2018".
When we first started to understand the issue, we introduced a mechanism
that would delay splits (cockroachdb#32594) with the desired effect of ensuring
that, all followers had caught up to ~all of the previous splits.
This helped, but didn't explain why we were seeing snapshots in the
first place.

Investigating more, we realized that snapshots were sometimes spuriously
requested by an uninitialized replica on the right-hand side which was
contacted by another member of the right-hand side that had already been
initialized by the split executing on the left-hand side; this snapshot
was almost always unnecessary since the local left-hand side would
usually initialize the right-hand side moments later.  To address this,
in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth
of requests to an uninitialized range, and the mechanism was improved in
 cockroachdb#32782 and will now only do this if a local neighboring replica is
expected to perform the split soon.

With all this in place, you would've expected us to have all bases
covered but it turns out that we are still running into issues prior
to this PR.

Concretely, whenever the aforementioned mechanism drops a message from
the leader (a MsgApp), the leader will only contact the replica every
second until it responds. It responds when it has been initialized via
its left neighbor's splits and the leader reaches out again, i.e.  an
average of ~500ms after being initialized. However, by that time, it is
itself already at the bottom of a long chain of splits, and the 500ms
delay is delaying how long it takes for the rest of the chain to get
initialized.  Since the delay compounds on each link of the chain, the
depth of the chain effectively determines the total delay experienced at
the end. This would eventually exceed the patience of the mechanism that
would suppress the snapshots, and snapshots would be requested. We would
descend into madness similar to that experienced in the absence of the
mechanism in the first place.

The mechanism in cockroachdb#32594 could have helped here, but unfortunately it
did not, as it routinely missed the fact that followers were not
initialized yet. This is because during a split burst, the replica
orchestrating the split was typically only created an instant before,
and its raft group hadn't properly transitioned to leader status yet.
This meant that in effect it wasn't delaying the splits at all.

This commit adjusts the logic to delay splits to avoid this problem.
While clamoring for leadership, the delay is upheld. Once collapsed
into a definite state, the existing logic pretty much did the right
thing, as it waited for the right-hand side to be in initialized.

Release note (bug fix): Fixed a scenario in which a rapid sequence
of splits could trigger a storm of Raft snapshots. This would be
accompanied by log messages of the form "would have dropped incoming
MsgApp, but allowing due to ..." and tended to occur as part of
RESTORE/IMPORT operations.
tbg added a commit to tbg/cockroach that referenced this pull request May 20, 2021
Bursts of splits (i.e. a sequence of splits for which each split splits
the right-hand side of a prior split) can cause issues. This is because
splitting a range in which a replica needs a snapshot results in two
ranges in which a replica needs a snapshot where additionally there
needs to be a sequencing between the two snapshots (one cannot apply
a snapshot for the post-split replica until the pre-split replica has
moved out of the way). The result of a long burst of splits such as
occurring in RESTORE and IMPORT operations is then an overload of the
snapshot queue with lots of wasted work, unavailable followers with
operations hanging on them, and general mayhem in the logs. Since
bulk operations also write a lot of data to the raft logs, log
truncations then create an additional snapshot burden; in short,
everything will be unhappy for a few hours and the cluster may
effectively be unavailable.

This isn't news to us and in fact was a big problem "back in 2018".
When we first started to understand the issue, we introduced a mechanism
that would delay splits (cockroachdb#32594) with the desired effect of ensuring
that, all followers had caught up to ~all of the previous splits.
This helped, but didn't explain why we were seeing snapshots in the
first place.

Investigating more, we realized that snapshots were sometimes spuriously
requested by an uninitialized replica on the right-hand side which was
contacted by another member of the right-hand side that had already been
initialized by the split executing on the left-hand side; this snapshot
was almost always unnecessary since the local left-hand side would
usually initialize the right-hand side moments later.  To address this,
in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth
of requests to an uninitialized range, and the mechanism was improved in
 cockroachdb#32782 and will now only do this if a local neighboring replica is
expected to perform the split soon.

With all this in place, you would've expected us to have all bases
covered but it turns out that we are still running into issues prior
to this PR.

Concretely, whenever the aforementioned mechanism drops a message from
the leader (a MsgApp), the leader will only contact the replica every
second until it responds. It responds when it has been initialized via
its left neighbor's splits and the leader reaches out again, i.e.  an
average of ~500ms after being initialized. However, by that time, it is
itself already at the bottom of a long chain of splits, and the 500ms
delay is delaying how long it takes for the rest of the chain to get
initialized.  Since the delay compounds on each link of the chain, the
depth of the chain effectively determines the total delay experienced at
the end. This would eventually exceed the patience of the mechanism that
would suppress the snapshots, and snapshots would be requested. We would
descend into madness similar to that experienced in the absence of the
mechanism in the first place.

The mechanism in cockroachdb#32594 could have helped here, but unfortunately it
did not, as it routinely missed the fact that followers were not
initialized yet. This is because during a split burst, the replica
orchestrating the split was typically only created an instant before,
and its raft group hadn't properly transitioned to leader status yet.
This meant that in effect it wasn't delaying the splits at all.

This commit adjusts the logic to delay splits to avoid this problem.
While clamoring for leadership, the delay is upheld. Once collapsed
into a definite state, the existing logic pretty much did the right
thing, as it waited for the right-hand side to be in initialized.

Release note (bug fix): Fixed a scenario in which a rapid sequence
of splits could trigger a storm of Raft snapshots. This would be
accompanied by log messages of the form "would have dropped incoming
MsgApp, but allowing due to ..." and tended to occur as part of
RESTORE/IMPORT operations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants