kvserver: document reproposals #94633
Conversation
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @pavelkalinnikov, and @tbg)
pkg/kv/kvserver/replica.go
line 506 at r1 (raw file):
// is not superseded, see below) and attach the value to the entry.
// 2. for each entry:
//   - stage written and in-memory effects of the entry (some may apply as no-ops
indentation
pkg/kv/kvserver/replica.go
line 596 at r1 (raw file):
// [^1]: https://github.com/cockroachdb/cockroach/blob/59ce13b6052a99a0318e3dfe017908ff5630db30/pkg/kv/kvserver/replica_raft.go#L1224
// [^2]: https://github.com/cockroachdb/cockroach/blob/59ce13b6052a99a0318e3dfe017908ff5630db30/pkg/kv/kvserver/replica_application_result.go#L148
// [^3]: it's debatable how useful this is. It was introduced in
what is "this"?
Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov and @tbg)
pkg/kv/kvserver/replica.go
line 523 at r1 (raw file):
// We can always safely create an identical copy (i.e. (1)) because of the
// replay protection conferred by the MaxLeaseIndex - all but the first
// proposal will be rejected (i.e. apply as a no-op).
"all but the first proposal that successfully achieves consensus"
pkg/kv/kvserver/replica.go
line 531 at r1 (raw file):
with various different MaxLeaseIndex values
Is it possible for there to be more than two different MaxLeaseIndex values in the unapplied log at a time? And can they be interleaved?
pkg/kv/kvserver/replica.go
line 552 at r1 (raw file):
// where we assume that the `MaxLeaseIndex` 100 is invalid, i.e. when we see the
// first copy of the command being applied, we've already applied some command
// with a higher `MaxLeaseIndex`. In a world without mechanism (2), `N` would
"an equal or higher"
pkg/kv/kvserver/replica.go
line 553 at r1 (raw file):
// first copy of the command being applied, we've already applied some command
// with a higher `MaxLeaseIndex`. In a world without mechanism (2), `N` would
// be rejected, and would finalize the proposal (i.e. signal the client with
Should we define "finalize" up above in a broader discussion of when entries are added to the proposals map, when they're updated, and when they're removed?
pkg/kv/kvserver/replica.go
line 584 at r1 (raw file):
// then can (2) engage. In addition, an entry that doesn't pass this equality
// check must not signal the proposer and/or unlink from the proposals map (as a
// newer reproposal is likely in the log)[^4].
"as a newer reproposal is likely in the log which may succeed"
pkg/kv/kvserver/replica_application_result.go
line 259 at r1 (raw file):
p.command.MaxLeaseIndex = 0
p.encodedCommand = nil
pErr := r.propose(ctx, p, tok.Move(ctx))
`Replica.propose` doesn't assign the MaxLeaseIndex synchronously; it only does so when the proposal buffer is flushed. Are we ok with the command having a MaxLeaseIndex of 0 in the meantime? The need to reset the fields on error raises questions.
pkg/kv/kvserver/replica_application_state_machine.go
line 236 at r1 (raw file):
log.Fatalf(ctx, "command already applied: %+v; unexpected successful result", cmd)
}
// If any reproposals at a higher MaxLeaseIndex exist we know that they will
I don't understand this branch. It dates back to 0e81674, but I can't make sense of it — it sounds like it was just dead code. Glad you got rid of it and replaced it with an assertion.
pkg/kv/kvserver/replica_proposal.go
line 135 at r1 (raw file):
// rejected and should not be reproposed.
//
// Note that some commands (such as leases) don't use MaxLeaseIndex.
"lease requests"
Lease transfers do use MaxLeaseIndex.
Thanks for the reviews! I rebased and added a fix-up commit, so if reviewable gets confused you might just want to use GH review for the last commit instead.
Dismissed @andreimatei from 2 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pavelkalinnikov)
pkg/kv/kvserver/replica.go
line 531 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
with various different MaxLeaseIndex values
Is it possible for there to be more than two different MaxLeaseIndex values in the unapplied log at a time? And can they be interleaved?
I think in practice we won't see too much of that but in theory anything goes, because you always have to assume that all previous incarnations are still floating around out there as raft reproposals that could reach the leader at any time & in any order. Updated the comment with some detail and simplified it a bit too.
pkg/kv/kvserver/replica.go
line 596 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
what is "this"?
The below-raft reproposal mechanism, updated.
pkg/kv/kvserver/replica_application_result.go
line 259 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
`Replica.propose` doesn't assign the MaxLeaseIndex synchronously; it only does so when the proposal buffer is flushed. Are we ok with the command having a MaxLeaseIndex of 0 in the meantime? The need to reset the fields on error raises questions.
The command is not in the map at this point, so I am not only ok with this, but intentionally want this. I want `r.propose` to not have to care about whether it sees a proposal for the first time or fifth time.
pkg/kv/kvserver/replica_application_state_machine.go
line 236 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't understand this branch. It dates back to 0e81674, but I can't make sense of it — it sounds like it was just dead code. Glad you got rid of it and replaced it with an assertion.
Ack, thanks for the check. I had a similar conclusion. The logic is still quite complicated and I wish it were a little less subtle to know when a proposal is in which "state" but that's for another day.
Reviewed 2 of 2 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov)
pkg/kv/kvserver/replica_application_result.go
line 259 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The command is not in the map at this point, so I am not only ok with this, but intentionally want this. I want `r.propose` to not have to care about whether it sees a proposal for the first time or fifth time.
The part that feels fragile to me then is the need to reset the fields on error. I probably just don't understand that well enough yet, but why is an error here treated differently than an error when flushing the proposal buffer back into the `proposals` map?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov)
pkg/kv/kvserver/replica_application_result.go
line 259 at r1 (raw file):
In this method, the command is not in the proposals map. If `r.propose` returns an error, the command never made it into the proposals map. So here we just pretend that we never even called `r.propose`. The proposal will then fail up to the caller, and this is probably not a particularly well-tested code path anyway, but it will not be reproposed and so it shouldn't matter what the MLAI and prev encoded command was. But it felt right to restore the previous state, to avoid having to reason about yet one more code path at the caller.
I'm not defending this code, btw - this is the clearest I was able to make it without really stirring the pot. I wish we didn't have these below-raft reproposals and would "just" delegate to an above-raft goroutine pool to repropose on behalf of clients who used pipelined writes. I bet that would be tricky in its own right, but conceptually clearer.
but why is an error here treated differently than an error when flushing the proposal buffer back into the proposals map?
I might be wrong, but: if we catch an error flushing the buffer, we return it up:
return 0, firstErr
and up:
cockroach/pkg/kv/kvserver/replica_raft.go, lines 727 to 728 in b637f03:

err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
    numFlushed, err := r.mu.proposalBuf.FlushLockedWithRaftGroup(ctx, raftGroup)
and up:
cockroach/pkg/kv/kvserver/replica_raft.go, lines 756 to 758 in b637f03:

} else if err != nil {
    return stats, errors.Wrap(err, "checking raft group for Ready")
}
and up:
cockroach/pkg/kv/kvserver/replica_raft.go, lines 1110 to 1113 in b637f03:

default:
    log.FatalfDepth(ctx, 1, "%+v", err)
    panic("unreachable")
}
basically flushing the buffer "has to work". But in this method, we can run into the closed timestamp, so it has to be fallible and we need to return an error to the caller.
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
Reviewed 1 of 3 files at r4, 2 of 2 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @pavelkalinnikov)
pkg/kv/kvserver/replica_application_result.go
line 259 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
In this method, the command is not in the proposals map. If `r.propose` returns an error, the command never made it into the proposals map. So here we just pretend that we never even called `r.propose`. The proposal will then fail up to the caller, and this is probably not a particularly well-tested code path anyway, but it will not be reproposed and so it shouldn't matter what the MLAI and prev encoded command was. But it felt right to restore the previous state, to avoid having to reason about yet one more code path at the caller.

I'm not defending this code, btw - this is the clearest I was able to make it without really stirring the pot. I wish we didn't have these below-raft reproposals and would "just" delegate to an above-raft goroutine pool to repropose on behalf of clients who used pipelined writes. I bet that would be tricky in its own right, but conceptually clearer.
but why is an error here treated differently than an error when flushing the proposal buffer back into the proposals map?
I might be wrong, but: if we catch an error flushing the buffer, we return it up:

return 0, firstErr

and up:

cockroach/pkg/kv/kvserver/replica_raft.go, lines 727 to 728 in b637f03:

err := r.withRaftGroupLocked(true, func(raftGroup *raft.RawNode) (bool, error) {
    numFlushed, err := r.mu.proposalBuf.FlushLockedWithRaftGroup(ctx, raftGroup)

and up:

cockroach/pkg/kv/kvserver/replica_raft.go, lines 756 to 758 in b637f03:

} else if err != nil {
    return stats, errors.Wrap(err, "checking raft group for Ready")
}

and up:

cockroach/pkg/kv/kvserver/replica_raft.go, lines 1110 to 1113 in b637f03:

default:
    log.FatalfDepth(ctx, 1, "%+v", err)
    panic("unreachable")
}

basically flushing the buffer "has to work". But in this method, we can run into the closed timestamp, so it has to be fallible and we need to return an error to the caller.
Thanks for explaining. This makes sense. If you wanted to add another sentence to the comment stating that an error to propose causes the request to fail while a success means that it's back in the proposal pipeline, I'd be supportive. It's also fine if not.
Reproposals are a deep rabbit hole and an area in which past changes were all related to subtle bugs. Write it all up and in particular make some simplifications that ought to be possible if my understanding is correct:

- have proposals always enter `(*Replica).propose` without a MaxLeaseIndex or prior encoded command set, i.e. `propose` behaves the same for reproposals as for first proposals.
- assert that after a failed call to tryReproposeWithNewLeaseIndex, the command is not in the proposals map, i.e. check absence of a leak.
- replace code that should be impossible to reach (and had me confused for a good amount of time) with an assertion.
- add long comment on `r.mu.proposals`.

This commit also moves `tryReproposeWithNewLeaseIndex` off `(*Replica)`, which is possible due to recent changes[^1]. In doing so, I realized there was a (small) data race (now fixed): when returning a `NotLeaseholderError` from that method, we weren't acquiring `r.mu`. It may have looked as though we were holding it already since we're accessing `r.mu.propBuf`, however that field has special semantics - it wraps `r.mu` and acquires it when needed.

[^1]: The "below raft" test mentioned in the previous comment was changed in cockroachdb#93785 and no longer causes a false positive.

Epic: CRDB-220
Release note: None
bors r=nvanbenschoten
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @pavelkalinnikov)
pkg/kv/kvserver/replica_application_result.go
line 259 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Thanks for explaining. This makes sense. If you wanted to add another sentence to the comment stating that an error to propose causes the request to fail while a success means that it's back in the proposal pipeline, I'd be supportive. It's also fine if not.
Good idea, done.
Build succeeded.
In cockroachdb#94633, I introduced[^1] an assertion that attempted to catch cases in which we might otherwise accidentally end up applying a proposal twice. This assertion had a false positive, see the updated comment within. I was able to reproduce the failure within ~minutes via `./experiment.sh` in cockroachdb#97173 as of 33dcdef.

Closes cockroachdb#94633.

[^1]: https://github.com/cockroachdb/cockroach/pull/94633/files#diff-50e458584d176deae52b20a7c04461b3e4110795c8c9a307cf7ee6696ba6bc60R238

Epic: none
Release note: None
In cockroachdb#94633, I introduced[^1] an assertion that attempted to catch cases in which we might otherwise accidentally end up applying a proposal twice. This assertion had a false positive, see the updated comment within. I was able to reproduce the failure within ~minutes via `./experiment.sh` in cockroachdb#97173 as of 33dcdef.

Better testing of these cases would be desirable. Unfortunately, while there is an abstraction over command application (`apply.Task`), most of the logic worth testing lives in `(*replicaAppBatch)`, which is essentially a `*Replica` with more moving parts attached. This does not lend itself well to unit testing. I had a run[^2][^3][^4] earlier this year at making log application standalone, but then didn't have enough time to follow through. It would be desirable to do so at a later date, perhaps with the explicit goal of making interactions like the one discussed in this PR unit-testable.

No release note because unreleased (except perhaps in an alpha).

Closes cockroachdb#94633.

[^1]: https://github.com/cockroachdb/cockroach/pull/94633/files#diff-50e458584d176deae52b20a7c04461b3e4110795c8c9a307cf7ee6696ba6bc60R238
[^2]: cockroachdb#93239
[^3]: cockroachdb#93266
[^4]: cockroachdb#93309

Epic: none
Release note: None
In cockroachdb#94633, I introduced[^1] an assertion that attempted to catch cases in which we might otherwise accidentally end up applying a proposal twice. This assertion had a false positive. I was able to reproduce the failure within ~minutes via `./experiment.sh` in cockroachdb#97173 as of 33dcdef.

Better testing of these cases would be desirable. Unfortunately, while there is an abstraction over command application (`apply.Task`), most of the logic worth testing lives in `(*replicaAppBatch)`, which is essentially a `*Replica` with more moving parts attached. This does not lend itself well to unit testing. I had a run[^2][^3][^4] earlier this year at making log application standalone, but then didn't have enough time to follow through. It would be desirable to do so at a later date, perhaps with the explicit goal of making interactions like the one discussed in this PR unit-testable.

[^1]: https://github.com/cockroachdb/cockroach/pull/94633/files#diff-50e458584d176deae52b20a7c04461b3e4110795c8c9a307cf7ee6696ba6bc60R238
[^2]: cockroachdb#93239
[^3]: cockroachdb#93266
[^4]: cockroachdb#93309

This assertion was previously trying to assert too much at a distance and was not only incorrect, but additionally inscrutable. It was mixing up two assertions, the first one of which is sensible: if an entry is accepted, it must not be superseded by an inflight proposal. If this were violated, the superseded proposal could also apply, resulting in a failure of replay protection. This assertion is still around as a stand-alone assertion.

The other half of the assertion was more confused: if an entry is rejected, it claimed that the entry couldn't also be superseded. The thinking was that if a superseding log entry exists, maybe it could apply, and that would be bad since we just told the waiting client that their proposal got rejected. This reasoning is incorrect, as the following example shows.
Consider the following initial situation:

    [lease seq is 1]
    log idx 99:  unrelated cmd at LAI 10000, lease seq = 1
    log idx 100: cmd X at LAI 10000, lease seq = 1

And next:

- a new lease enters the log at idx 101 (lease seq = 2)
- an identical copy of idx 100 enters the log at idx 102
- we apply idx 100, leading to a superseding reproposal at idx 103

resulting in the log:

    [lease seq is 1]
    log idx 99:  unrelated cmd at LAI 10000, lease seq = 1
    log idx 100: cmd X at LAI 10000, lease seq = 1
    log idx 101: lease seq = 2
    log idx 102: (same as idx 100)
    log idx 103: cmd X at LAI = 20000, lease seq = 1

During application of idx 102, we get a *permanent* rejection and yet the entry is superseded (by the proposal at idx 103). This would erroneously trigger the assertion, even though this is a legal sequence of events with no detrimental outcomes: the superseding proposal will always have the same lease sequence as its superseded copies, so it will also fail.

I initially tried to only soften the assertion a *little bit*. Observing that the example above led to a *permanent* rejection, should we only require that a proposal (which in this assertion is always local) is not superseded if it got rejected due to its lease index (which implies that it passed the lease check)? It turns out that this is primarily an assertion on when superseded proposals are counted as "local" at this point in the code: if there were multiple copies of this rejected proposal in the current `appTask` (i.e. the current `CommittedEntries` slice handed to us for application by raft), then all copies are initially local, and a copy that successfully spawns a superseding proposal would be made non-local from that point on. On the face of it, all other copies in the same `appTask` would now hit the assertion (erroneously): they are local, they are rejected, so why don't they enter the branch?
The magic ingredient is that if an entry is superseded when we handle the lease index rejection, we also unlink the proposal from it, so those copies never enter this path since they are no longer local at that point. For example, if these are the log entries to apply (all at a valid lease seq):

    log idx 99:  unrelated cmd at LAI 10000
    log idx 100: cmd X at LAI 10000
    log idx 101: (identical copy of idx 100)

and idxs 99-101 are applied in one batch, then idx 100 would spawn a reproposal at a new lease applied index:

    log idx 99:  unrelated cmd at LAI 10000
    log idx 100: cmd X at LAI 10000          <- applied
    log idx 101: (identical copy of idx 100)
    log idx 102: cmd X at LAI 20000          <- not in current batch

When we apply 101, we observe an illegal lease index, but the proposal supersedes the entry, so we mark it as non-local and don't enter the branch that contains the assertion.

The above reasoning is very difficult to understand, and it happens too far removed from where the interesting state changes happen. Also, for testing purposes it is interesting to introduce "errors" in the lease applied index assignment to artificially exercise these reproposal mechanisms. In doing so, these assertions can trip because the lease applied index assigned to a reproposal might accidentally (or intentionally!) match the existing lease applied index, in which case copies of the command in the same batch now *don't* consider themselves superseded. The value of this testing outweighs the very limited benefit of this branch of the assertion. An argument could even be made that this branch of the assertion has negative benefit on its own due to its complexity. We are removing it in this commit and will instead work towards simplifying the mechanisms that played a role in explaining the assertion.

Closes cockroachdb#94633.
Closes cockroachdb#97347.

No release note because unreleased (except perhaps in an alpha).

Epic: none
Release note: None
97564: kvserver: narrow down 'finishing a proposal with outstanding reproposal' r=pavelkalinnikov a=tbg

Closes #97102. Closes #97347. Closes #97447. Closes #97612.

No release note because unreleased (except perhaps in an alpha).

Epic: none
Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
98481: kvserver: revert recent changes to reproposals r=pavelkalinnikov a=tbg

Reverts #97606, #97564, #94825, #94633.

- Revert "kvserver: disable assertion 'finished proposal inserted'"
- Revert "kvserver: narrow down 'finishing a proposal with outstanding reproposal'"
- Revert "kvserver: fill gaps in comment near tryReproposeWithNewLeaseIndex"
- Revert "kvserver: hoist early return out of tryReproposeWithNewLeaseIndex"
- Revert "fixup! kvserver: prevent finished proposal from being present in proposals map"
- Revert "kvserver: prevent finished proposal from being present in proposals map"
- Revert "kvserver: improve reproposal assertions and documentation"

Closes #97973.

Epic: CRDB-25287
Release note: None

98537: sql: check row level ttl change before truncating a table r=chengxiong-ruan a=chengxiong-ruan

Fixes: #93443

Release note (sql change): This commit fixed a bug where crdb panicked when a user tried to truncate a table that has an ongoing row level ttl change. We still don't support table truncates in this scenario, but a gentler unimplemented error is returned instead of a panic.

98575: cdc: use int64 for emitted bytes telemetry r=miretskiy a=jayshrivastava

Previously, the stored `emitted_bytes` field was an int32, which can hold a maximum value of 2.1GB. This value is too small because the logging period is 24h and changefeeds can emit much more than 2.1GB in 24h. This change updates the field to be an int64, which solves this problem.

Epic: None
Release note: None

98582: ci: allow-list `BUILD_VCS_NUMBER` env var in cloud unit tests r=jlinder a=rickystewart

This job was filing issues linking to the wrong commit.

Epic: none
Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Reproposals are a deep rabbit hole and an area in which past changes
were all related to subtle bugs. Write it all up and in particular make
some simplifications that ought to be possible if my understanding is
correct:
- `(*Replica).propose` is called without a `MaxLeaseIndex` or prior encoded command set, i.e. `propose` behaves the same for reproposals as for first proposals.
- assert that the command is not in the proposals map, i.e. check absence of a leak.
- […] (for a good amount of time) with an assertion.
- […] `r.mu.proposals`.

This commit also moves `tryReproposeWithNewLeaseIndex` off `(*Replica)`, which is possible due to recent changes[^1]. In doing so, I realized there was a (small) data race (now fixed): when returning a `NotLeaseholderError` from that method, we weren't acquiring `r.mu`. It may have looked as though we were holding it already since we're accessing `r.mu.propBuf`; however, that field has special semantics - it wraps `r.mu` and acquires it when needed.

Epic: CRDB-220
Release note: None
Footnotes

[^1]: The "below raft" test mentioned in the previous comment was changed in https://github.com/cockroachdb/cockroach/pull/93785 and no longer causes a false positive.
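The `r.mu`/`propBuf` pitfall described in the PR body can be sketched in isolation (a minimal, invented-name model; CockroachDB's real `propBuf` is far more involved): the buffer wraps the replica's mutex and acquires it itself, so code that just called into the buffer is *not* thereby holding `r.mu`, and must still lock explicitly before touching other mu-protected fields.

```go
package main

import (
	"fmt"
	"sync"
)

// propBuf wraps a shared mutex and locks it on demand (illustrative only).
type propBuf struct {
	mu  *sync.Mutex // the replica's mutex, acquired internally when needed
	buf []string
}

// Insert acquires the wrapped mutex itself, then releases it on return.
func (b *propBuf) Insert(cmd string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.buf = append(b.buf, cmd)
}

type replica struct {
	mu struct {
		sync.Mutex
		propBuf   propBuf
		proposals map[string]bool
	}
}

func main() {
	var r replica
	r.mu.propBuf.mu = &r.mu.Mutex // the buffer wraps the replica's mutex
	r.mu.proposals = map[string]bool{}

	// Calling into propBuf does NOT leave r.mu held afterwards...
	r.mu.propBuf.Insert("cmd-1")

	// ...so touching r.mu.proposals still requires an explicit lock, even
	// right next to a propBuf call (the data race described above).
	r.mu.Lock()
	r.mu.proposals["cmd-1"] = true
	r.mu.Unlock()

	fmt.Println(len(r.mu.propBuf.buf), len(r.mu.proposals)) // 1 1
}
```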