storage: don't send historical Raft log with snapshots #35701

Merged
merged 1 commit on Mar 18, 2019

Conversation

tbg
Member

@tbg tbg commented Mar 13, 2019

Assume a leaseholder wants to send a (Raft or preemptive) snapshot to a
follower. Say the leaseholder's log ranges from index 100 to 200 and the
size (in bytes) of this log is 200 MB. All of the log is successfully
committed and applied, and is thus reflected in the snapshot data.

Prior to this change, we would still send the 200 MB of log entries along
with the snapshot, even though the snapshot itself already reflected them.

After this change, we won't send any log entries along with the snapshot,
as sanity suggests.

We were unable to make this change because up until recently, the Raft
truncated state (which dictates the first log index) was replicated and
consistency checked; this was changed in #34660.

Somewhere down the road (19.2?) we should localize all the log truncation
decisions and simplify all this further. I suspect that in doing so we can
avoid tracking the size of the Raft log in the first place; all we really
need for this is some mechanism that makes sure that an "idle" replica
truncates its logs. With unreplicated truncation, this becomes cheap enough
to "just do".

Release note (bug fix): Remove historical log entries from Raft snapshots.
These log entries could lead to failed snapshots with a message such as:

snapshot failed: aborting snapshot because raft log is too large
(25911051 bytes after processing 7 of 37 entries)

@tbg tbg requested a review from a team March 13, 2019 19:11
@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg requested a review from andreimatei March 13, 2019 19:12
@tbg
Member Author

tbg commented Mar 13, 2019

I need to look at this again with a clear head tomorrow, but wanted to push this out so that @andreimatei can do preliminary review.

@tbg
Member Author

tbg commented Mar 13, 2019

PS: I saw recent clearrange/checks=false runs with hundreds of snapshots failing due to size. I had thought we had fixed all the issues with the log size, but apparently there are always more. Hence my renewed will to do this in 19.1.

This will need a cluster version.

@tbg tbg force-pushed the fix/snap-no-past-log branch from 56cdd20 to 5d1124f on March 13, 2019 21:11
@tbg
Member Author

tbg commented Mar 13, 2019

This will need a cluster version.

I take that back; I did all the hard work in the PR unreplicating the truncated state.

Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)


pkg/storage/replica_command.go, line 899 at r2 (raw file):

	if !usesReplicatedTruncatedState && snap.State.TruncatedState.Index < snap.State.RaftAppliedIndex {
		// If we're not using a legacy (replicated) truncated state, we can

s/we can/we


pkg/storage/replica_command.go, line 906 at r2 (raw file):

		// again we're relying on some mechanisms (quota pool, Raft uncommitted
		// size limit) to make sure this doesn't grow without bounds.
		snap.State.TruncatedState = &roachpb.RaftTruncatedState{

So stupid question here so that you get real confidence in this review: how exactly does this patch actually cause the logs to not be included in the snapshot? Does anybody on the sender side look at this snap.State.TruncatedState? :)


pkg/storage/replica_raft.go, line 2338 at r2 (raw file):

			rResult.State.TruncatedState = nil
			rResult.RaftLogDelta = 0
			// We received a truncation that doesn't apply to us, so we know that

I believe this is correct, but the return value for handleTruncatedStateBelowRaft() really needs to be documented. Try to read that function without a tome of context (as I am :P ).

So to make sure I'm with you here, we don't expect this branch to be commonly taken, do we? This is taken if the leaseholder had more log entries than we did and attempted to truncate up to an index that's below any that we have in our log. How exactly would that happen?

@tbg tbg force-pushed the fix/snap-no-past-log branch from 5d1124f to ef8a2a4 on March 14, 2019 16:40
Member Author

@tbg tbg left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/storage/replica_command.go, line 899 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

s/we can/we

Done.


pkg/storage/replica_command.go, line 906 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

So stupid question here so that you get real confidence in this review: how exactly does this patch actually cause the logs to not be included in the snapshot? Does anybody on the sender side look at this snap.State.TruncatedState? :)

Yep, here:

// Iterate over the specified range of Raft entries and send them all out
// together.
firstIndex := header.State.TruncatedState.Index + 1
endIndex := snap.RaftSnap.Metadata.Index + 1
preallocSize := endIndex - firstIndex

I also added a comment to that effect.

I also just figured out that with this change we're effectively not sending any log entries at all, which is great! One less thing to worry about.
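
To spell out why that arithmetic yields zero entries once the truncated state is set to the applied index, here is a worked example with hypothetical literal values (not additional code from the patch):

```go
package main

import "fmt"

func main() {
	// Suppose the snapshot is taken at applied index 200 and, after this
	// patch, the header's truncated state is set to that same index.
	truncatedIndex := uint64(200) // header.State.TruncatedState.Index
	snapshotIndex := uint64(200)  // snap.RaftSnap.Metadata.Index

	firstIndex := truncatedIndex + 1 // 201
	endIndex := snapshotIndex + 1    // 201

	// The iteration over [firstIndex, endIndex) is empty, so zero log
	// entries accompany the snapshot.
	fmt.Println("entries to send:", endIndex-firstIndex) // 0
}
```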


pkg/storage/replica_raft.go, line 2338 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I believe this is correct, but the return value for handleTruncatedStateBelowRaft() really needs to be documented. Try to read that function without a tome of context (as I am :P ).

So to make sure I'm with you here, we don't expect this branch to be commonly taken, do we? This is taken if the leaseholder had more log entries than we did and attempted to truncate up to an index that's below any that we have in our log. How exactly would that happen?

You're right that this needed more words. I added some on handleTruncatedStateBelowRaft, WDYT?
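
For readers without the surrounding context, here is a hypothetical sketch of the decision being discussed (this is not the real handleTruncatedStateBelowRaft signature, only an illustration of when a replicated truncation is a local no-op):

```go
package main

import "fmt"

// applyTruncationLocally illustrates the branch under discussion: a replicated
// TruncateLog command only has local effect if it advances this replica's
// truncated state. If the local truncated state is already at or past the
// proposed index (for example because a snapshot moved it ahead), the command
// is a no-op here, and the proposal's raft-log size delta must be dropped as
// well, since it does not describe this replica's log.
func applyTruncationLocally(localTruncatedIndex, proposedTruncatedIndex uint64) bool {
	return proposedTruncatedIndex > localTruncatedIndex
}

func main() {
	// A follower that just applied a snapshot at index 200 already has its
	// truncated state at 200; a truncation proposed for index 150 does
	// nothing locally.
	fmt.Println(applyTruncationLocally(200, 150)) // false
	fmt.Println(applyTruncationLocally(100, 150)) // true
}
```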

Contributor

@andreimatei andreimatei left a comment


:lgtm:

gold Jerry, gold

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)


pkg/storage/replica_command.go, line 905 at r3 (raw file):

		// to do since it is not a replicated key (and thus not subject to
		// matching across replicas). The actual sending happens here:
		_ = (*kvBatchSnapshotStrategy)(nil).Send

lool I appreciate the innovative thinking, but this line is a bit much
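
For context, the line being joked about is a compile-time reference: taking the method value on a typed nil never calls the method, but it does break the build if the method is renamed or removed, so the comment's pointer to the sending code can't silently go stale. A self-contained sketch (the type and method here are stand-ins, not the real pkg/storage definitions):

```go
package main

import "fmt"

// kvBatchSnapshotStrategy is a stand-in for the real type in pkg/storage.
type kvBatchSnapshotStrategy struct{}

// Send stands in for the method the comment wants to point at.
func (s *kvBatchSnapshotStrategy) Send() error { return nil }

func whereTheSendingHappens() {
	// Taking the method value on a typed nil does not invoke Send, but the
	// compiler now checks that the referenced method still exists.
	_ = (*kvBatchSnapshotStrategy)(nil).Send
}

func main() {
	whereTheSendingHappens()
	fmt.Println("compile-time reference held")
}
```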


pkg/storage/replica_raft.go, line 2338 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You're right that this needed more words. I added some on handleTruncatedStateBelowRaft, WDYT?

sweet

@tbg
Member Author

tbg commented Mar 15, 2019

TFTR! Going to run a few relevant roachtests off of this branch before I merge.

Contributor

@bdarnell bdarnell left a comment


Reviewed 4 of 6 files at r2, 3 of 3 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)


pkg/storage/replica_raft.go, line 2349 at r3 (raw file):

			// Morally we would want to drop the command in checkForcedErrLocked,
			// but that may be difficult to achieve.
			log.Fatal(ctx, log.Safe(fmt.Sprintf("TruncatedState regressed:\nold: %+v\nnew: %+v", oldTruncatedState, rResult.State.TruncatedState)))

I was just about to add a test that relied on this assertion (when a TruncateLog operation is applied twice). I guess I'll have to find a different failure signal for this test, but the strictness here was useful to bring a problem in #35261 to light.


pkg/storage/replica_raftstorage.go, line 1002 at r3 (raw file):

	// wholesale replace r.mu.state.
	r.mu.state = s
	// Snapshots typically have less log entries than the leaseholder. The next

s/less/fewer/

@bdarnell
Contributor

For release notes @jseldess: This change is unavailable until the upgrade to 19.1 is finalized (the cluster version was introduced in a separate PR that didn't have a release note: the "unreplicated truncated state" change)

@tbg
Member Author

tbg commented Mar 18, 2019

Kicked off runs for clearrange/checks=false and import/tpcc/warehouses=1000/nodes=32. They both run larger imports that I've recently found to cause issues with the large Raft log and that I would now expect to highlight problems related to this change. The former also runs lots of merges which aggressively move replicas for good coverage. Will report back in a few hours.

Assume a leaseholder wants to send a (Raft or preemptive) snapshot to a follower.
Say the leaseholder's log ranges from index 100 to 200 and the size
(in bytes) of this log is 200 MB. All of the log is successfully
committed and applied, and is thus reflected in the snapshot data.

Prior to this change, we would still send the 200 MB of log entries along
with the snapshot, even though the snapshot itself already reflected
them.

After this change, we won't send any log entries along with the
snapshot, as sanity suggests.

We were unable to make this change because up until recently, the Raft
truncated state (which dictates the first log index) was replicated and
consistency checked; this was changed in cockroachdb#34660. The migration
introduced there makes it straightforward to omit a prefix of the
log in snapshots, as done in this commit.

Somewhere down the road (19.2?) we should localize all the log
truncation decisions and simplify all this further. I suspect
that in doing so we can avoid tracking the size of the Raft log
in the first place; all we really need for this is some mechanism
that makes sure that an "idle" replica truncates its logs. With
unreplicated truncation, this becomes cheap enough to "just do".

Release note (bug fix): Remove historical log entries from Raft snapshots.
These log entries could lead to failed snapshots with a message such as:

    snapshot failed: aborting snapshot because raft log is too large
    (25911051 bytes after processing 7 of 37 entries)
@tbg tbg force-pushed the fix/snap-no-past-log branch from ef8a2a4 to 56b79f1 on March 18, 2019 09:54
@tbg
Member Author

tbg commented Mar 18, 2019

import/tpcc/warehouses=1000/nodes=32 finished like a charm in 13 minutes (including setting up the machines, etc.). Looking at the test history, either this was a fairly lucky run or we've actually shaved a few minutes off the test with this change (16-18 minutes is more typical for this test).

@jseldess
Contributor

Thanks for the ping, @bdarnell. I already have it in my list of "performance improvements" here, though I don't think I'll specifically explain most of those.

@tbg
Member Author

tbg commented Mar 18, 2019

Ran another import/tpcc/warehouses=1000/nodes=32 in 13 minutes end-to-end. clearrange is also looking good.
TFTRs!

bors r=andreimatei,bdarnell

craig bot pushed a commit that referenced this pull request Mar 18, 2019
35701: storage: don't send historical Raft log with snapshots r=andreimatei,bdarnell a=tbg

Assume a leaseholder wants to send a (Raft or preemptive) snapshot to a
follower. Say the leaseholder's log ranges from index 100 to 200 and the
size (in bytes) of this log is 200 MB. All of the log is successfully
committed and applied, and is thus reflected in the snapshot data.

Prior to this change, we would still send the 200 MB of log entries along
with the snapshot, even though the snapshot itself already reflected them.

After this change, we won't send any log entries along with the snapshot,
as sanity suggests.

We were unable to make this change because up until recently, the Raft
truncated state (which dictates the first log index) was replicated and
consistency checked; this was changed in #34660.

Somewhere down the road (19.2?) we should localize all the log truncation
decisions and simplify all this further. I suspect that in doing so we can
avoid tracking the size of the Raft log in the first place; all we really
need for this is some mechanism that makes sure that an "idle" replica
truncates its logs. With unreplicated truncation, this becomes cheap enough
to "just do".

Release note (bug fix): Remove historical log entries from Raft snapshots.
These log entries could lead to failed snapshots with a message such as:

    snapshot failed: aborting snapshot because raft log is too large
    (25911051 bytes after processing 7 of 37 entries)

Co-authored-by: Tobias Schottdorf <[email protected]>
@craig
Contributor

craig bot commented Mar 18, 2019

Build succeeded

@craig craig bot merged commit 56b79f1 into cockroachdb:master Mar 18, 2019
@tbg tbg deleted the fix/snap-no-past-log branch March 18, 2019 15:32
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jan 30, 2020
Raft log entries are no longer sent along with snapshots as of v19.2
(more specifically, cockroachdb#35701); we can/should delete all code around the
sending/handling of log entries in snapshots before v20.1 is cut. This
simplifies a few snapshot-related protos and naturally obviates the need
to handle sideloaded proposals within snapshots.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jan 31, 2020
Raft log entries are no longer sent along with snapshots as of v19.2
(more specifically, cockroachdb#35701); we can/should delete all code around the
sending/handling of log entries in snapshots before v20.1 is cut. This
simplifies a few snapshot-related protos and naturally obviates the need
to handle sideloaded proposals within snapshots.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Nov 2, 2021
We used to have to send the raft log along with snapshots as a result of
(in the olden days) requiring all replicas to agree on the truncated
state. This hasn't been (generally) true as of v19.1 (cockroachdb#35701), though it
was still a possibility until v22.1 (cockroachdb#58088). This commit removes the
corresponding code.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Nov 2, 2021
We used to have to send the raft log along with snapshots as a result of
(in the olden days) requiring all replicas to agree on the truncated
state. This hasn't been (generally) true as of v19.1 (cockroachdb#35701), though it
was still a possibility until v22.1 (cockroachdb#58088). This commit removes the
corresponding code.

Release note: None
craig bot pushed a commit that referenced this pull request Nov 2, 2021
72310: kvserver: remove unused snapshot log entry code r=erikgrinaker a=tbg

We used to have to send the raft log along with snapshots as a result of
(in the olden days) requiring all replicas to agree on the truncated
state. This hasn't been (generally) true as of v19.1 (#35701), though it
was still a possibility until v22.1 (#58088). This commit removes the
corresponding code.

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 12, 2022
Fixes cockroachdb#81763.
Part of cockroachdb#81561.

### Background

When performing a lease transfer, the outgoing leaseholder revokes its
lease before proposing the lease transfer request, meaning that it
promises to stop using the previous lease to serve reads or writes. The
lease transfer request is then proposed and committed to the Raft log, at
which point the new lease officially becomes active. However, this new
lease is not usable until the incoming leaseholder applies the Raft entry
that contains the lease transfer and notices that it is now the
leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time
period when the outgoing leaseholder has revoked its previous lease but
the incoming leaseholder has not yet applied its new lease. During this
time period, a range is effectively unavailable for strong reads and
writes, because no replica will act as the leaseholder. Instead, requests
that require the lease will be redirected back and forth between the
outgoing leaseholder and the incoming leaseholder (the client backs off).
To minimize the disruption caused by lease transfers, we need to minimize
this time period.

We assume that if a lease transfer target is sufficiently caught up on
its log such that it will be able to apply the lease transfer through log
entry application then this unavailability window will be acceptable.
This may be a faulty assumption in cases with severe replication lag, but
we must balance any heuristic here that attempts to determine "too much
lag" with the possibility of starvation of lease transfers under
sustained write load and a resulting sustained replication lag. See cockroachdb#38065
and cockroachdb#42379, which removed such a heuristic. For now, we don't try to make
such a determination.

### Patch Details

However, with this change, we now draw a distinction between lease
transfer targets that will be able to apply the lease transfer through
log entry application and those that will require a Raft snapshot to
catch up and apply the lease transfer. Raft snapshots are more expensive
than Raft entry replication. They are also significantly more likely to
be delayed due to queueing behind other snapshot traffic in the system.
This potential for delay makes transferring a lease to a replica that
needs a snapshot very risky, as doing so has the effect of inducing
range unavailability until the snapshot completes, which could take
seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to
improve the responsiveness of snapshots that are needed to recover
availability. However, even in this world, it is not worth inducing
unavailability that can only be recovered through a Raft snapshot. It is
better to catch the desired lease target up on the log first and then
initiate the lease transfer once its log is connected to the leader's.
For this reason, unless we can guarantee that the lease transfer target
does not need a Raft snapshot, we don't let it through.

This commit adds protection against such risky lease transfers at two
levels. First, it includes hardened protection in the Replica proposal
buffer, immediately before handing the lease transfer proposal off to
`etcd/raft`. Second, it includes best-effort protection before a Replica
begins to initiate a lease transfer in `AdminTransferLease`, which all
lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease
transfer in the proposal buffer after we have revoked our current lease
is more disruptive than doing so earlier, before we have revoked our
current lease. Best-effort protection is also able to respond more
gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place
where the protection is airtight against race conditions because the
check is performed:
1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

### Remaining Work

With this change, there is a single known race which can lead to an
incoming leaseholder needing a snapshot. This is the case when a
leader/leaseholder transfers the lease and then quickly loses Raft
leadership to a peer that has a shorter log. Even if the older leader
could have caught the incoming leaseholder up on its log, the new leader
may not be able to because its log may not go back as far. Such a
scenario has been possible since we stopped ensuring that all replicas
have logs that start at the same index. For more details, see the
discussion about cockroachdb#35701 in cockroachdb#81561.

This race is theoretical — we have not seen it in practice. It's not
clear whether we will try to address it or rely on a mitigation like
the one described in cockroachdb#81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to
follower replicas that may require a Raft snapshot. This ensures that
lease transfers are never delayed behind snapshots, which could
previously create range unavailability until the snapshot completed.
Lease transfers now err on the side of caution and are only allowed when
the outgoing leaseholder can guarantee that the incoming leaseholder
does not need a snapshot.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 21, 2022
Fixes cockroachdb#81763.
Part of cockroachdb#81561.

### Background

When performing a lease transfer, the outgoing leaseholder revokes its
lease before proposing the lease transfer request, meaning that it
promises to stop using the previous lease to serve reads or writes. The
lease transfer request is then proposed and committed to the Raft log, at
which point the new lease officially becomes active. However, this new
lease is not usable until the incoming leaseholder applies the Raft entry
that contains the lease transfer and notices that it is now the
leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time
period when the outgoing leaseholder has revoked its previous lease but
the incoming leaseholder has not yet applied its new lease. During this
time period, a range is effectively unavailable for strong reads and
writes, because no replica will act as the leaseholder. Instead, requests
that require the lease will be redirected back and forth between the
outgoing leaseholder and the incoming leaseholder (the client backs off).
To minimize the disruption caused by lease transfers, we need to minimize
this time period.

We assume that if a lease transfer target is sufficiently caught up on
its log such that it will be able to apply the lease transfer through log
entry application then this unavailability window will be acceptable.
This may be a faulty assumption in cases with severe replication lag, but
we must balance any heuristic here that attempts to determine "too much
lag" with the possibility of starvation of lease transfers under
sustained write load and a resulting sustained replication lag. See cockroachdb#38065
and cockroachdb#42379, which removed such a heuristic. For now, we don't try to make
such a determination.

### Patch Details

However, with this change, we now draw a distinction between lease
transfer targets that will be able to apply the lease transfer through
log entry application and those that will require a Raft snapshot to
catch up and apply the lease transfer. Raft snapshots are more expensive
than Raft entry replication. They are also significantly more likely to
be delayed due to queueing behind other snapshot traffic in the system.
This potential for delay makes transferring a lease to a replica that
needs a snapshot very risky, as doing so has the effect of inducing
range unavailability until the snapshot completes, which could take
seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to
improve the responsiveness of snapshots that are needed to recover
availability. However, even in this world, it is not worth inducing
unavailability that can only be recovered through a Raft snapshot. It is
better to catch the desired lease target up on the log first and then
initiate the lease transfer once its log is connected to the leader's.
For this reason, unless we can guarantee that the lease transfer target
does not need a Raft snapshot, we don't let it through.

This commit adds protection against such risky lease transfers at two
levels. First, it includes hardened protection in the Replica proposal
buffer, immediately before handing the lease transfer proposal off to
`etcd/raft`. Second, it includes best-effort protection before a Replica
begins to initiate a lease transfer in `AdminTransferLease`, which all
lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease
transfer in the proposal buffer after we have revoked our current lease
is more disruptive than doing so earlier, before we have revoked our
current lease. Best-effort protection is also able to respond more
gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place
where the protection is airtight against race conditions because the
check is performed:
1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

### Remaining Work

With this change, there is a single known race which can lead to an
incoming leaseholder needing a snapshot. This is the case when a
leader/leaseholder transfers the lease and then quickly loses Raft
leadership to a peer that has a shorter log. Even if the older leader
could have caught the incoming leaseholder up on its log, the new leader
may not be able to because its log may not go back as far. Such a
scenario has been possible since we stopped ensuring that all replicas
have logs that start at the same index. For more details, see the
discussion about cockroachdb#35701 in cockroachdb#81561.

This race is theoretical — we have not seen it in practice. It's not
clear whether we will try to address it or rely on a mitigation like
the one described in cockroachdb#81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to
follower replicas that may require a Raft snapshot. This ensures that
lease transfers are never delayed behind snapshots, which could
previously create range unavailability until the snapshot completed.
Lease transfers now err on the side of caution and are only allowed when
the outgoing leaseholder can guarantee that the incoming leaseholder
does not need a snapshot.
craig bot pushed a commit that referenced this pull request Jun 22, 2022
82560: sql: removed redundant parameter in method to schedule sql stats compaction r=rafiss a=surahman

The `crdb_internal.schedule_sql_stats_compaction` SQL function does not require the `byte` string parameter and has thus been removed. Closes #78332.

Jira issue: [CRDB-14071](https://cockroachlabs.atlassian.net/browse/CRDB-14071)

`@Azhng`

82758: kv: don't allow lease transfers to replicas in need of snapshot r=nvanbenschoten a=nvanbenschoten

Fixes #81763.
Fixes #79385.
Part of #81561.

### Background

When performing a lease transfer, the outgoing leaseholder revokes its lease before proposing the lease transfer request, meaning that it promises to stop using the previous lease to serve reads or writes. The lease transfer request is then proposed and committed to the Raft log, at which point the new lease officially becomes active. However, this new lease is not usable until the incoming leaseholder applies the Raft entry that contains the lease transfer and notices that it is now the leaseholder for the range.

The effect of this handoff is that there exists a "power vacuum" time period when the outgoing leaseholder has revoked its previous lease but the incoming leaseholder has not yet applied its new lease. During this time period, a range is effectively unavailable for strong reads and writes, because no replica will act as the leaseholder. Instead, requests that require the lease will be redirected back and forth between the outgoing leaseholder and the incoming leaseholder (the client backs off). To minimize the disruption caused by lease transfers, we need to minimize this time period.

We assume that if a lease transfer target is sufficiently caught up on its log such that it will be able to apply the lease transfer through log entry application then this unavailability window will be acceptable. This may be a faulty assumption in cases with severe replication lag, but we must balance any heuristic here that attempts to determine "too much lag" with the possibility of starvation of lease transfers under sustained write load and a resulting sustained replication lag. See #38065 and #42379, which removed such a heuristic. For now, we don't try to make such a determination.

### Patch Details

However, with this change, we now draw a distinction between lease transfer targets that will be able to apply the lease transfer through log entry application and those that will require a Raft snapshot to catch up and apply the lease transfer. Raft snapshots are more expensive than Raft entry replication. They are also significantly more likely to be delayed due to queueing behind other snapshot traffic in the system. This potential for delay makes transferring a lease to a replica that needs a snapshot very risky, as doing so has the effect of inducing range unavailability until the snapshot completes, which could take seconds, minutes, or hours.

In the future, we will likely get better at prioritizing snapshots to improve the responsiveness of snapshots that are needed to recover availability. However, even in this world, it is not worth inducing unavailability that can only be recovered through a Raft snapshot. It is better to catch the desired lease target up on the log first and then initiate the lease transfer once its log is connected to the leader's. For this reason, unless we can guarantee that the lease transfer target does not need a Raft snapshot, we don't let it through. 

This commit adds protection against such risky lease transfers at two levels. First, it includes hardened protection in the Replica proposal buffer, immediately before handing the lease transfer proposal off to etcd/raft. Second, it includes best-effort protection before a Replica begins to initiate a lease transfer in `AdminTransferLease`, which all lease transfer operations flow through.

The change includes protection at two levels because rejecting a lease transfer in the proposal buffer after we have revoked our current lease is more disruptive than doing so earlier, before we have revoked our current lease. Best-effort protection is also able to respond more gracefully to invalid targets (e.g. they pick the next best target).

However, the check in the Replica proposal buffer is the only place where the protection is airtight against race conditions because the check is performed:
1. by the current Raft leader, else the proposal will fail
2. while holding latches that prevent interleaving log truncation

### Remaining Work

With this change, there is a single known race which can lead to an incoming leaseholder needing a snapshot. This is the case when a leader/leaseholder transfers the lease and then quickly loses Raft leadership to a peer that has a shorter log. Even if the older leader could have caught the incoming leaseholder up on its log, the new leader may not be able to because its log may not go back as far. Such a scenario has been possible since we stopped ensuring that all replicas have logs that start at the same index. For more details, see the discussion about #35701 in #81561.

This race is theoretical — we have not seen it in practice. It's not clear whether we will try to address it or rely on a mitigation like the one described in #81764 to limit its blast radius.

----

Release note (bug fix): Range lease transfers are no longer permitted to follower replicas that may require a Raft snapshot. This ensures that lease transfers are never delayed behind snapshots, which could previously create range unavailability until the snapshot completed. Lease transfers now err on the side of caution and are only allowed when the outgoing leaseholder can guarantee that the incoming leaseholder does not need a snapshot.

83109: asim: batch workload events r=kvoli a=kvoli

This patch introduces batching for load events. Previously, load events
were generated per-key and applied individually to the simulator state
by finding the range containing that key. This patch batches load events
on the same key, then applies the load events in ascending order, over
the range tree.

This results in a speedup of 5x on a 32 store, 32k replicas, 16k QPS
cluster.

Release note: None

Co-authored-by: Saad Ur Rahman <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>