storage: Improve handling of lease index retries #35261
Conversation
Could you tell the real story in the commit and PR description? I think it's something like this: #28876 (comment), but I wasn't able to piece it together from what's discussed there. It sounds like there are both NotLeaseholderErrors and illegal lease indexes in the mix, so there are probably two separate stories to tell, both connected to lease transfers. I think that once we initiate a lease transfer, we're not accepting new writes (they bounce immediately with a not-leaseholder error?), so I think you're saying that a lease transfer is initiated while a write is pending. What I think this means is that the lease transfer should enter the log after the pending write, and should not be able to interfere with it per se. But somehow it does due to the Raft leadership transfer - how exactly? Are they getting reordered in the logs? Isn't that still counter-intuitive?
I like the fifth commit except that I don't understand why the fourth commit is correct at all. If the leaseholder does change, I don't see that we can just pretend that a command eval'ed by the old leaseholder can simply be reproposed under a new lease index.
Reviewed 4 of 4 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 3 of 3 files at r4, 1 of 1 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)
pkg/storage/replica_raft.go, line 2067 at r4 (raw file):
// If we failed to apply at the right lease index, try again with a new one.
r.mu.Lock()
lease := *r.mu.state.Lease
Is this really correct? If this node was the lease holder at the time of the proposal but isn't any more, is there any chance we'd give the command a different lease now even though it didn't eval on that leaseholder, and just so happen to assign a LAI that actually lets the reproposal go through? That wouldn't be correct since the new leaseholder's ts cache isn't protecting us. ISTM that we can only repropose if the lease didn't change (or rather, we can only repropose with the same lease, and have to make sure that there won't be an infinite loop of that, i.e. NotLeaseHolder takes precedence over illegal lease index).
If this still does work, I'd like to see a lot more commentary than there is now.
By the way, if this is about lease index errors only (i.e. you're not expecting the lease to change) there's probably a bug in how they are assigned, but the code here suggests that it's
pkg/storage/replica_raft.go, line 2073 at r4 (raw file):
log.VEventf(ctx, 2, "retry: proposalIllegalLeaseIndex")
// Use a goroutine so this doesn't count as "marshaling a proto downstream of raft".
go proposeAfterIllegalLeaseIndex(ctx, r, proposal, *r.mu.state.Lease)
Were you planning on fixing this up? Morally you'd want to just do this synchronously, right? I could settle for running this as a task if there's no good way to work around the linter.
pkg/storage/replica_test.go, line 7420 at r4 (raw file):
ba,
)
if retry != proposalNoReevaluation {
The test should verify the internal reproposal.
Pipelined writes wait for commands to be evaluated, then detach before they apply. Retryable errors generated at evaluation time are handled correctly (by DistSender or above) even for pipelined writes, but pipelined writes have lost the ability to handle apply-time retry conditions in the executeWriteBatch loop (there used to be more of these, but illegal lease index is now the only condition retried at this level). NotLeaseHolderError is usually generated at evaluation time, so it's not a problem for pipelined writes (in some cases NotLeaseHolderError can be generated at apply time; when this happens we still get an async write failure and I don't see a way around that).
After a table creation, we do a split, and then we (maybe) transfer the lease and then transfer raft leadership. The problem arises when a write occurs between the lease transfer and the end of the raft leadership transfer. If it evaluates on the old leaseholder, it gets that error at evaluation time and is redirected by the DistSender. If it evaluates on the new leaseholder, it must be forwarded to the raft leader (almost certainly the old leaseholder). If it arrives at the old leaseholder at the same time that raft leadership is transferred to the new leaseholder, it's likely to get dropped or reordered. This will produce a lease index error when applied, but the new lease is still valid and the command can be retried successfully (lease index errors in other situations may result in NotLeaseHolderErrors when retried and ultimately async write failures).
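As a rough illustration of the retry behavior being discussed, here is a minimal, self-contained sketch (hypothetical names, not the actual executeWriteBatch code): a non-pipelined write waits for the apply result and can loop on an illegal lease index, while a pipelined write has already detached after evaluation, so any apply-time retry has to happen below raft or it surfaces as an async write failure.

```go
package main

import "fmt"

// applyResult stands in for the proposal's apply outcome.
type applyResult struct {
	illegalLeaseIndex bool
}

// propose simulates evaluating and proposing a batch; the first attempt is
// rigged to come back with an illegal lease index error at apply time.
func propose(attempt int) <-chan applyResult {
	ch := make(chan applyResult, 1)
	ch <- applyResult{illegalLeaseIndex: attempt == 0}
	return ch
}

// executeWriteBatchSketch mimics the above-raft retry loop: it only helps
// writes that wait for application. A pipelined write returns to the client
// before reading from ch, so it never reaches the retry.
func executeWriteBatchSketch() {
	for attempt := 0; ; attempt++ {
		ch := propose(attempt)
		res := <-ch // pipelined writes have already detached by this point
		if res.illegalLeaseIndex {
			fmt.Printf("attempt %d: illegal lease index, re-evaluating and reproposing\n", attempt)
			continue
		}
		fmt.Printf("attempt %d: applied\n", attempt)
		return
	}
}

func main() {
	executeWriteBatchSketch()
}
```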
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/storage/replica_raft.go, line 2067 at r4 (raw file):
Yeah, this part still needs a little work. It should validate not just that we still own the lease but that it's the same lease (check the sequence number). NotLeaseHolder does take precedence over illegal lease index, so there's not an infinite loop.
By the way, if this is about lease index errors only (i.e. you're not expecting the lease to change) there's probably a bug in how they are assigned, but the code here suggests that it's
Finish that sentence?
pkg/storage/replica_raft.go, line 2073 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Were you planning on fixing this up? Morally you'd want to just do this synchronously, right? I could settle for running this as a task if there's no good way to work around the linter.
I can't decide whether the async goroutine is the right thing to do (we don't want to block the raft-processing goroutine longer than necessary) or just a hack to placate the linter. Maybe we should just add a t.Helper()-style marker to tell the linter to ignore this.
pkg/storage/replica_raft.go, line 2064 at r5 (raw file):
_, pErr := r.proposeLocked(ctx, proposal, lease)
if pErr != nil {
	panic(pErr)
This part will also change (to call proposal.finishApplication instead of panicking).
pkg/storage/replica_test.go, line 7420 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The test should verify the internal reproposal.
That's what it does, with the "expected %d proposals" check below. Formerly, you'd get retries from executeWriteBatch but not from tryExecuteWriteBatch; now you get the retry either way.
Ah, so the sequence of events is:
- lease transfer from s1 to s2 initiated
- command evals under s2, under the new lease
- command gets proposed (forwarded to s1, which is still Raft leader), in particular it probably has a bad MaxLeaseIndex that doesn't take into account inflight commands at s1
- leadership transfers to s2 (why is this important?)
- command catches illegal lease applied index (but has the right lease)
I think you're saying that the leadership transfer can drop the command, but wouldn't that just lead to a reproposal that would succeed? I don't know how it would reorder commands; after all, the in-flight commands are either already in the new leader's log (in the right order) or they get reproposed (and we sort things so that the order matches). I'm probably still missing some detail.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)
pkg/storage/replica_raft.go, line 2067 at r4 (raw file):
Finish that sentence?
Sorry. I think I was asking the same question as in the PR-wide comment thread about why we see a reordering.
Yeah, this part still needs a little work. It should validate not just that we still own the lease but that it's the same lease (check the sequence number). NotLeaseHolder does take precedence over illegal lease index, so there's not an infinite loop.
Why doesn't it blindly repropose (btw we need a word for that - reproposing means not touching MaxLeaseIndex, but here we do -- reinsert)? If the lease doesn't match, it won't apply. This is less efficient, but we're not expecting to hit this case frequently anyway (?)
All we need to do, it seems, is check the sequence number, so maybe this is easy enough.
pkg/storage/replica_raft.go, line 2073 at r4 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
I can't decide whether the async goroutine is the right thing to do (we don't want to block the raft-processing goroutine longer than necessary) or just a hack to placate the linter. Maybe we should just add a t.Helper()-style marker to tell the linter to ignore this.
calling proposeLocked outside of the loop opens the door to further reorderings, though. Not saying that these would happen, but if we expect to never hit this path in the steady state maybe it's fine to do it synchronously. Raft leadership transfers are much more expensive than doing this on the Raft goroutine.
pkg/storage/replica_test.go, line 7420 at r4 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
That's what it does, with the "expected %d proposals" check below. Formerly, you'd get retries from executeWriteBatch but not from tryExecuteWriteBatch; now you get the retry either way.
Thanks, read that the wrong way.
The leadership transfer is important because during a transfer (a rigged raft election) proposals get dropped rather than forwarded. The two writes in our transaction could get reordered as follows:
- w1 evaluates on s2, gets lease index i1, forwards to raft leader s1
- s1 initiates leader transfer to s2. this puts it into a temporarily leaderless state so it drops the forwarded proposal
- s2 is elected raft leader
- w2 evaluates on s2, gets lease index i2, goes right in the log
- s2 refreshes proposals as a side effect of becoming leader, writes w1 to log with lease index i1
- s2 applies w2, then w1. w1 fails because of the out of order lease index
(note that I haven't actually verified that this is what is going on, but it's consistent with what I've observed).
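To make the failure mode concrete, here is a minimal, self-contained sketch of the lease-index check (illustrative types only, not the real apply-time code): each command must apply at a lease index greater than the replica's lease applied index, so once w2 (i2) applies, the refreshed w1 (i1 < i2) is rejected and needs a reproposal with a new MaxLeaseIndex.

```go
package main

import "fmt"

// replicaState tracks only the piece of state relevant here.
type replicaState struct {
	leaseAppliedIndex uint64
}

type cmd struct {
	name          string
	maxLeaseIndex uint64
}

// apply reports whether the command applied or hit the illegal-lease-index
// condition (lease index not greater than the last applied one).
func (s *replicaState) apply(c cmd) bool {
	if c.maxLeaseIndex <= s.leaseAppliedIndex {
		fmt.Printf("%s: illegal lease index %d (lease applied index is already %d)\n",
			c.name, c.maxLeaseIndex, s.leaseAppliedIndex)
		return false
	}
	s.leaseAppliedIndex = c.maxLeaseIndex
	fmt.Printf("%s: applied at lease index %d\n", c.name, c.maxLeaseIndex)
	return true
}

func main() {
	s := &replicaState{}
	w1 := cmd{name: "w1", maxLeaseIndex: 1} // proposal dropped during the leadership transfer
	w2 := cmd{name: "w2", maxLeaseIndex: 2} // goes straight into the log on the new leader
	s.apply(w2) // applies first because w1 only re-enters the log when proposals are refreshed
	s.apply(w1) // fails: out-of-order lease index, needs a reproposal with a new index
}
```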
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)
Makes sense. Thanks for the example, and please work it into the commit messages once you're wrapping up.
How close do you think this gets us to removing proposalReevaluationReason
and its handling in Replica.executeWriteBatch
entirely?
Reviewed 4 of 4 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 2 of 3 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)
pkg/storage/replica_raft.go, line 94 at r1 (raw file):
}
// evalAndPropose prepares the necessary pending command struct and initializes a
I like the rename. Let's mention that this method includes evaluation in the comment as well.
pkg/storage/replica_raft.go, line 2067 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Finish that sentence?
Sorry. I think I was asking the same question as in the PR-wide comment thread about why we see a reordering.
Yeah, this part still needs a little work. It should validate not just that we still own the lease but that it's the same lease (check the sequence number). NotLeaseHolder does take precedence over illegal lease index, so there's not an infinite loop.
Why doesn't it blindly repropose (btw we need a word for that - reproposing means not touching MaxLeaseIndex, but here we do -- reinsert)? If the lease doesn't match, it won't apply. This is less efficient, but we're not expecting to hit this case frequently anyway (?)
All we need to be doing it seems is to check the sequence number though, so maybe this is easy enough.
Yeah, this should be comparing lease sequence numbers so that we know that the new lease is either identical to the proposer's lease or an extension of the proposer's lease.
Do we also need to short-circuit this logic if raftCmd.DeprecatedProposerLease != nil like we do in checkForcedErrLocked? We don't have any guarantee that this entry was proposed with a lease sequence number.
pkg/storage/replica_raft.go, line 2073 at r4 (raw file):
log.VEventf(ctx, 2, "retry: proposalIllegalLeaseIndex")
// Use a goroutine so this doesn't count as "marshaling a proto downstream of raft".
go proposeAfterIllegalLeaseIndex(ctx, r, proposal, *r.mu.state.Lease)
The "marshaling a proto downstream of raft" lint is interesting. It's worth noting here that the entry will never be reproposed with a different proto version because only the leaseholder who originally proposed the entry will ever re-marshall it and repropose it again. That invariant helps alleviate some of the compatibility/migration concerns I have.
Force-pushed from 849f68b to 1e6484f
How close do you think this gets us to removing proposalReevaluationReason and its handling in Replica.executeWriteBatch entirely?
Added two more commits. tryExecuteWriteBatch is gone. The type and constants are still there, but the handling is localized in processRaftCommand.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/storage/replica_raft.go, line 2067 at r4 (raw file):
Why doesn't it blindly repropose
r.mu.state.Lease is the current lease, which might be owned by someone else. If it is, we might hit upon a lease index that they assigned by chance, so it could apply. I'm pretty sure we currently rely on the proposer to guarantee that it only proposes under leases it owns.
I think if the code were factored a little differently, though, we could blindly repropose with the same lease - exactly the same raft command but a new max lease index. But with the way the max lease index is currently assigned, we need a lease proto, so it seems easier to check whether the lease still matches.
Do we also need to short-circuit this logic if raftCmd.DeprecatedProposerLease != nil like we do in checkForcedErrLocked? We don't have any guarantee that this entry was proposed with a lease sequence number.
Not necessary here - we don't reach this point for commands read from the on-disk raft log, it's only for commands that were proposed by the current process, so there won't be any old deprecated fields in use.
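A minimal sketch of the condition being debated in this thread (illustrative types and names, not the actual signatures in pkg/storage, and reflecting the check at this revision rather than the final code): repropose under a new MaxLeaseIndex only if the replica still holds the same lease, i.e. the same sequence number, that the command was evaluated under; otherwise NotLeaseHolderError takes precedence, which also rules out an infinite retry loop.

```go
package main

import "fmt"

// Illustrative stand-ins for the lease and proposal fields discussed above.
type lease struct {
	sequence     int64
	ownerStoreID int
}

type proposal struct {
	proposerLeaseSequence int64
}

// canReproposeWithNewLeaseIndex returns true only when the current lease is
// owned by this store and has the same sequence number the command was
// originally proposed under, so a fresh MaxLeaseIndex is safe to assign.
func canReproposeWithNewLeaseIndex(cur lease, storeID int, p proposal) bool {
	return cur.ownerStoreID == storeID && cur.sequence == p.proposerLeaseSequence
}

func main() {
	cur := lease{sequence: 5, ownerStoreID: 2}
	fmt.Println(canReproposeWithNewLeaseIndex(cur, 2, proposal{proposerLeaseSequence: 5})) // true: same lease, safe to retry
	fmt.Println(canReproposeWithNewLeaseIndex(cur, 2, proposal{proposerLeaseSequence: 4})) // false: lease changed since evaluation
}
```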
pkg/storage/replica_raft.go, line 2073 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
calling proposeLocked outside of the loop opens the door to further reorderings, though. Not saying that these would happen, but if we expect to never hit this path in the steady state maybe it's fine to do it synchronously. Raft leadership transfers are much more expensive than doing this on the Raft goroutine.
Yeah, I'll inline it and whitelist it for the linter.
pkg/storage/replica_raft.go, line 2073 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The "marshaling a proto downstream of raft" lint is interesting. It's worth noting here that the entry will never be reproposed with a different proto version because only the leaseholder who originally proposed the entry will ever re-marshall it and repropose it again. That invariant helps alleviate some of the compatibility/migration concerns I have.
The "downstream of raft" linter isn't really a concern here. It's there because we don't want to marshal the same proto differently across the different replicas, but here we're already limiting ourselves to the node that proposed the command.
pkg/storage/replica_test.go, line 7420 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Thanks, read that the wrong way.
Done.
Reviewed 1 of 3 files at r9, 1 of 2 files at r10, 3 of 5 files at r11, 3 of 3 files at r12.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)
pkg/storage/replica_raft.go, line 2073 at r4 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
The "downstream of raft" linter isn't really a concern here. It's there because we don't want to marshal the same proto differently across the different replicas, but here we're already limiting ourselves to the node that proposed the command.
Right, that's what I'm saying. It will never be reproposed using a different proto definition so we don't have to worry about losing information. It's worth noting that somewhere.
pkg/storage/replica_raft.go, line 2060 at r12 (raw file):
defer r.mu.Unlock()
lease := *r.mu.state.Lease
if lease.OwnedBy(r.StoreID()) && lease.Sequence == proposal.command.ProposerLeaseSequence {
Is it possible for !lease.OwnedBy(r.StoreID())?
Reviewed 3 of 3 files at r6, 2 of 2 files at r7, 1 of 1 files at r8, 3 of 3 files at r9, 2 of 2 files at r10, 5 of 5 files at r11, 3 of 3 files at r12.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @nvanbenschoten)
pkg/storage/replica_raft.go, line 2067 at r4 (raw file):
Why doesn't it blindly repropose
Yeah, I wasn't saying "blindly stuff a new lease in" but "why change anything about the proposer lease". Your second sentence explains why: the code wants a lease (I assume for not much else than to pick the seqno). But I must still be missing something because it seems that the lease passed in is only ever accessed in insertProposalLocked, which cares about nothing but the seqno, so we could just change the signatures to take only that.
pkg/storage/replica_raft.go, line 2060 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is it possible for !lease.OwnedBy(r.StoreID())?
You're saying that this wouldn't happen because in this case the proposal would've caught a different error from the below-Raft check (not leaseholder)? That's probably right, though I'd be hesitant to rely on that for correctness. For example, if we moved this out of the Raft goroutine, it might stop being true because the lease can be any future lease as opposed to the one active when the cmd bounced.
pkg/storage/replica_write.go, line 197 at r11 (raw file):
}
if propResult.ProposalRetry != proposalNoReevaluation {
	log.Fatalf(ctx, "expected no reevaluation")
Do you want to print the actual value and log.Safe it?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @nvanbenschoten, and @tbg)
pkg/storage/replica_raft.go, line 2067 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Why doesn't it blindly repropose
Yeah, I wasn't saying "blindly stuff a new lease in" but "why change anything about the proposer lease". Your second sentence explains why: the code wants a lease (I assume for not much else than to pick the seqno). But I must still be missing something because it seems that the lease passed in is only ever accessed in insertProposalLocked, which cares about nothing but the seqno, so we could just change the signatures to take only that.
Good idea. I changed things around so that filling in the proposer fields is a prerequisite for insertProposalLocked instead of something it does by passing through arguments you give it. This means we don't have to think about it on the retry path.
pkg/storage/replica_write.go, line 197 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Do you want to print the actual value and log.Safe it?
This fatal is removed by a subsequent commit.
Please add the explicit examples and verbose explanation to the commit message of "storage: Move illegal lease index retries down into raft" and the PR description.
Also I wouldn't mind another close pass by @nvanbenschoten to make sure we're not introducing something undesirable here.
Reviewed 2 of 2 files at r13.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @bdarnell)
pkg/storage/replica_raft.go, line 2043 at r13 (raw file):
if proposedLocally {
	// If we failed to apply at the right lease index, try again with a new one.
Mention that this is important for pipelined writes.
pkg/storage/replica_raft.go, line 2057 at r13 (raw file):
// tryReproposeWithNewLeaseIndex is used by processRaftCommand to
// repropose commands that have gotten an illegal lease index error.
Clarify that this is only called when we know not only that there has been an illegal lease index error, but we also know that the command has never succeeded (that is, we observed the first time this command was applied, not a reproposal).
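For illustration, one possible shape of the requested comment, attached to a compiling stub (hypothetical wording and signature; the real function lives in pkg/storage/replica_raft.go):

```go
package main

type replica struct{}

type proposalData struct{} // stand-in; not the real ProposalData

// tryReproposeWithNewLeaseIndex is used by processRaftCommand to repropose
// commands that have gotten an illegal lease index error.
//
// Possible wording for the requested clarification: it must only be called
// when we have observed the first application attempt of the command, i.e.
// when we know the command has never applied successfully; reproposing a
// command that might already have applied risks applying it twice.
func (r *replica) tryReproposeWithNewLeaseIndex(p *proposalData) error {
	// Body intentionally elided; this stub exists only to carry the comment.
	return nil
}

func main() {}
```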
Force-pushed from bf17333 to c4eed18
After rebasing, the test is no longer flaky under stress as described in #28876 (comment). Assuming it's also not flaky on teamcity, I'm going to go ahead and merge this and handle the timestamp cache issue in a new PR. I've updated the commit message with more detail.
Force-pushed from 4a8bde8 to ac7f14c
Still flaky, but with a different failure:
I don't see any other reports of this failure, so it was probably introduced by these changes.
Force-pushed from 6cceb7a to a77f0f5
Force-pushed from d8197e3 to bb1a567
Since that "ready for one last look" I had to make a subtle tweak (re-inserting to the proposals map) to prevent another stress flake, and also give it some pretty generous timeouts. It's still passing (at least locally) with those more generous timeouts, so it's not deadlocking or anything, but I'm surprised it's needing so long. I'm not sure whether that's indicating a problem or if that's just the nature of our stress (and especially stressrace) tests these days. All in all, I don't feel great about merging this while we have other open release blockers being investigated. It might introduce instability or at least complicate those investigations. |
Force-pushed from bb1a567 to 26a87f5
OK, the flakiness appears to be resolved now (and this time it was all on the test side, not the code side).
Reviewed 9 of 9 files at r14, 2 of 2 files at r15, 1 of 1 files at r16, 3 of 3 files at r17, 2 of 2 files at r18, 5 of 5 files at r19, 3 of 3 files at r20, 2 of 2 files at r21, 1 of 1 files at r22, 3 of 3 files at r23.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell)
pkg/storage/replica_raft.go, line 2060 at r12 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You're saying that this wouldn't happen because in this case the proposal would've caught a different error from the below-Raft check (not leaseholder)? That's probably right, though I'd be hesitant to rely on that for correctness. For example, if we moved this out of the Raft goroutine, it might stop being true because the lease can be any future lease as opposed to the one active when the cmd bounced.
I think I'm saying that for us to hit this code path we know that proposedLocally == true, so lease.Sequence == proposal.command.ProposerLeaseSequence is the only other condition we should need to ensure that we own the current lease.
EDIT: it looks like Ben went ahead and removed the entire check, even the sequence part. That wasn't what I was suggesting, but I have to figure it's safe because we would get a different error if the lease had changed between proposing and attempting to repropose. @bdarnell mind confirming this and adding a comment?
pkg/storage/batcheval/cmd_lease_test.go, line 47 at r23 (raw file):
db := tc.ServerConn(0)
// More than 30 iterations is flaky under stressrace on teamcity.
What kind of flaky? Is it timing out?
Moving lease index reproposals below raft introduced a bug in which the same proposal could be applied more than once at different lease indexes. (This was originally caught via an assertion in the handling of TruncateLog below raft). This commit adds a test and fixes the bug. Release note: None
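As a hedged illustration of the bug this commit describes (reusing the toy lease-index model from earlier, not the real replica code): if a command that already applied is reproposed with a fresh, higher MaxLeaseIndex, it passes the lease-index check again and applies twice, which is what the new test guards against.

```go
package main

import "fmt"

type replicaState struct {
	leaseAppliedIndex uint64
	applied           map[string]int // times each command applied, for illustration only
}

type cmd struct {
	name          string
	maxLeaseIndex uint64
}

func (s *replicaState) apply(c cmd) {
	if c.maxLeaseIndex <= s.leaseAppliedIndex {
		fmt.Printf("%s: illegal lease index %d, would need a reproposal\n", c.name, c.maxLeaseIndex)
		return
	}
	s.leaseAppliedIndex = c.maxLeaseIndex
	s.applied[c.name]++
	fmt.Printf("%s: applied (total %d times)\n", c.name, s.applied[c.name])
}

func main() {
	s := &replicaState{applied: map[string]int{}}
	w := cmd{name: "w", maxLeaseIndex: 1}
	s.apply(w) // first application succeeds

	// Bug scenario: the proposal is also (incorrectly) retried with a new,
	// higher lease index even though it already applied.
	w.maxLeaseIndex = 2
	s.apply(w) // applies a second time: the double application described above
}
```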
Force-pushed from 26a87f5 to f7dc847
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell and @nvanbenschoten)
pkg/storage/replica_raft.go, line 2060 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think I'm saying that for us to hit this code path we know that proposedLocally == true, so lease.Sequence == proposal.command.ProposerLeaseSequence is the only other condition we should need to ensure that we own the current lease.
EDIT: it looks like Ben went ahead and removed the entire check, even the sequence part. That wasn't what I was suggesting, but I have to figure it's safe because we would get a different error if the lease had changed between proposing and attempting to repropose. @bdarnell mind confirming this and adding a comment?
Done
pkg/storage/batcheval/cmd_lease_test.go, line 47 at r23 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What kind of flaky? Is it timing out?
I understand the mystery now. The on-every-PR stress script runs with a timeout that is a function of the number of packages in which tests were added (!). When I added the second test, the first one got a shorter timeout and became flaky again. Also the iterations here turn out to be extremely expensive with the race detector on - more than a second each.
Reviewed 1 of 1 files at r24.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @bdarnell)
pkg/storage/replica_raft.go, line 2060 at r12 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done
Thanks.
Reviewed 9 of 9 files at r14, 2 of 2 files at r15, 1 of 1 files at r16, 3 of 3 files at r17, 2 of 2 files at r18, 5 of 5 files at r19, 3 of 3 files at r20, 2 of 2 files at r21, 1 of 1 files at r22, 2 of 3 files at r23, 1 of 1 files at r24.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale)
bors r=tbg,nvanbenschoten
35261: storage: Improve handling of lease index retries r=tbg,nvanbenschoten a=bdarnell

Pipelined writes wait for commands to be evaluated, then detach before they apply. Retryable errors generated at evaluation time are handled correctly (by DistSender or above) even for pipelined writes, but pipelined writes have lost the ability to handle apply-time retry conditions in the executeWriteBatch loop (there used to be more of these, but illegal lease index is now the only condition retried at this level). To remedy this, this commit moves reproposals due to illegal lease indexes below raft (but only on the proposing node).

In practice, this can happen to writes that race with a raft leadership transfer. This is observable immediately after table creation, since table creation performs a split and then may perform a lease transfer to balance load (and if the lease is transferred, raft leadership is transferred afterward). Specifically,

1. The lease is transferred from store s1 to s2. s1 is still raft leader.
2. Write w1 evaluates on store s2 and is assigned lease index i1. s2 forwards the proposal to s1.
3. s1 initiates a raft leader transfer to s2. This puts it into a temporarily leaderless state so it drops the forwarded proposal.
4. s2 is elected raft leader, completing the transfer.
5. A second write w2 evaluates on s2, is assigned lease index i2, and goes right in the raft log since s2 is both leaseholder and leader.
6. s2 refreshes proposals as a side effect of becoming leader, and writes w1 to the log with lease index i1.
7. s2 applies w2, then w1. w1 fails because of the out-of-order lease index.

If w1 was pipelined, the client stops listening after step 2, and won't learn of the failure until it tries to commit. At this point the commit returns an ASYNC_WRITE_FAILURE retry to the client.

Note that in some cases, NotLeaseHolderError can be generated at apply time. These errors cannot be retried (since the proposer has lost its lease), so they will unavoidably result in an ASYNC_WRITE_FAILURE error to the client. However, this is uncommon - most NotLeaseHolderErrors are generated at evaluation time, which is compatible with pipelined writes.

Fixes #28876

The fourth commit is the important part of this change. The fifth is optional; it's a simplification but it changes an under-tested error path. Assuming we like the fifth commit, I'd like to add a sixth that rips out proposalReevaluationReason and the executeWriteBatch/tryExecuteWriteBatch loop altogether.

Co-authored-by: Ben Darnell <[email protected]>
Build succeeded
…ading SSTs

Fixes cockroachdb#36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals` were immutable over the lifetime of a Raft proposal. This invariant was violated by cockroachdb#35261, which allowed a lease index error to trigger an immediate reproposal. This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`. Combined with aliasing between multiple Raft proposals due to reproposals due to ticks, this resulted in cases where a leaseholder's Raft logs could diverge from its followers and cause Raft groups to become inconsistent.

Release note: None
36939: storage: remove optimization to use in-memory RaftCommand when sideloading SSTs r=nvanbenschoten a=nvanbenschoten

Fixes #36861.

This optimization relied on the fact that `RaftCommands` in `Replica.mu.proposals` were immutable over the lifetime of a Raft proposal. This invariant was violated by #35261, which allowed a lease index error to trigger an immediate reproposal. This reproposal mutated the corresponding `RaftCommand` in `Replica.mu.proposals`. Combined with aliasing between multiple Raft proposals due to reproposals from ticks, this resulted in cases where a leaseholder's Raft logs could diverge from its followers and cause Raft groups to become inconsistent.

We can justify the new cost of unmarshalling `RaftCommands` in `maybeInlineSideloadedRaftCommand` on leaseholders because https://github.com/cockroachdb/cockroach/pull/36670/files#diff-8a33b5571aa9ab154d4f3c2a16724697R230 just landed. That change removes an equally sized allocation and memory copy from the function on both leaseholders and followers.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
This commit pulls the replica leaseholder check on the read and write evaluation paths above concurrency manager sequencing. This change makes semantic sense because the contents of the concurrency manager and its lockTable are a reflection of the state of the range, so they should only be exposed to requests if the request is evaluating on a leaseholder replica or has otherwise. Up to this point, this hasn't been a real issue because the lockTable was purely a best-effort representation of the state of a range.

There are a few strong reasons to make this change:
- it makes it easier to fix cockroachdb#46148, a current release blocker
- it allows the MaxTimestamp of transactions to be bound before sequencing in the lockTable. This has the effect of ensuring that all transaction pushes are to a timestamp less than or equal to the current node's HLC clock.
- it seems like the order we would want things in if/when we begin to address consistent read replicas (cockroachdb#39758), where the concurrency manager sequencing step becomes a distributed operation that depends on the state of the current lease.

One concern here is that the concurrency manager sequencing can take an arbitrarily long amount of time because it blocks on conflicting locks. It would be unfortunate if this led to an increased number of rejected proposals due to lease sequence mismatches. To address this, the commit makes two small changes that interact with each other:
1. it performs a best-effort check before evaluating and proposing a write that the lease it will be proposing under is still active. This check is similar to the best-effort context cancellation check we perform before proposing.
2. it adds a retry loop for NotLeaseHolderErrors hit by requests into executeBatchWithConcurrencyRetries. We've had a retry loop like this a few times in the past (most recently removed in cockroachdb#35261). Now that we're proactively checking the lease before proposing, it seems like a good idea to retry on the replica itself when possible.

I could be convinced out of one or both of these changes.
106502: batcheval: delete flaky integration test r=aliher1911 a=aliher1911

TestLeaseCommandLearnerReplica is fundamentally flaky: the commit can't retry all underlying replication failures, and the acceptable failure causes are not exposed at the higher levels where the test operates. What the test is supposed to verify is that failures to propose with an invalid lease don't cause pipelined writes to fail but are retried instead. More details are described in the original PR #35261 where this test was introduced. In the absence of direct control over that, the test tries to move leases between replicas, but this action can cause other failures which can't be filtered out. For example: `failed to repropose 0b9a11f4bef0e3d3 at idx 16 with new lease index: [NotLeaseHolderError] ‹reproposal failed due to closed timestamp›` and `stale proposal: command was proposed under lease #2 but is being applied under lease: ...` While the closed-timestamp failure could be mitigated by changing settings, the second one is opaque to the caller and hard to filter out.

Release note: None

Fixes #104716

Co-authored-by: Oleg Afanasyev <[email protected]>