-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: prevent follower reads while a range is subsumed #50265
Conversation
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. I have added a few people who may be able to assist in reviewing:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
I feel uneasy about having In a future commit, I will add a test that forces a lease transfer during the merge critical state. |
e0b1668
to
1afc52b
Compare
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were there supposed to be one commits or two in here? I see what with a message looking like the squash of two.
Nit: there's a lot diffs created by the exporting of that Cfg
guy. Moving that change to a dedicated commit would make reviewing easier.
Nit: the release note is too technical; users don't understand this. Also the tense is wrong (as per https://wiki.crdb.io/wiki/spaces/CRDB/pages/186548364/Release+notes) - say "Fixed a bug".
Release note (bug fix): Fixes a bug that allowed follower reads
during the critical state of a merge transaction.
I'd say
Fixed a rare bug causing transaction serializability violations when using follower reads.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @andreimatei, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/closed_timestamp_test.go, line 710 at r1 (raw file):
// done. func setupClusterForClosedTSTestingWithCustomKnobs( ctx context.Context, t *testing.T, targetDuration time.Duration, knobs base.TestingKnobs,
might as well take a TestClusterArgs
.
pkg/kv/kvserver/closed_timestamp_test.go, line 801 at r1 (raw file):
repls []*kvserver.Replica, ) { defaultTestingKnobs := base.TestingKnobs{
Consider keeping a single setupClusterForClosedTSTesting()
function by extracting defaultTestingKnobs
into a aggressiveResolvedTSPushKnobs
that everybody passed in. That way it'll be easier for tests to compose these default knobs with other additional knobs.
pkg/kv/kvserver/replica.go, line 1330 at r1 (raw file):
// neighbor is in its critical phase and, if so, arranges to block all requests // until the merge completes. func (r *Replica) maybeWatchForMerge(ctx context.Context, untrack closedts.ReleaseFunc) error {
pls comment the new param
pkg/kv/kvserver/replica.go, line 1361 at r1 (raw file):
// Nothing more to do. r.mu.Unlock() return nil
don't we need to call untrack()
with LAI+1 in this case?
pkg/kv/kvserver/replica.go, line 1370 at r1 (raw file):
// arrived and now, which is rare but entirely legal. r.unquiesceLocked() // We prevent replicas from being able to serve follower reads on timestamps
I think this comment needs to spell things out more: say that we block follower reads by publishing a MLAI that's never going to come in the happy case, and we do that because the closed timestamps that are published from now on will not be reflected in the LHS's tscache after the merge.
pkg/kv/kvserver/replica.go, line 1372 at r1 (raw file):
// We prevent replicas from being able to serve follower reads on timestamps // that fall between the subsumption time (FreezeStart) and the timestamp at // which the merge transaction commits or aborts (i.e. when a merge is in its
s/when a merge/the timestamp window representing the range's critical state
Also I don't see the syntagm "critical state" used elsewhere around here. Maybe say something more descriptive like "subsumed state".
pkg/kv/kvserver/replica.go, line 1382 at r1 (raw file):
// has been subsumed are lease requests, which do not bump the LAI. // // See https://github.com/cockroachdb/cockroach/issues/44878 for discussion.
nit: I personally consider most such issue links to be just click bait. The comments should usually be sufficient for explaining what's going on; when they're not sufficient because we failed to write them properly, there's git blame.
pkg/kv/kvserver/replica_read.go, line 127 at r1 (raw file):
log.Event(ctx, "executing read-only batch") untrack := r.maybeTrackProposal(ctx, ba)
can this down, move next to res.Local.ReleaseFunc = untrack
?
pkg/kv/kvserver/replica_read.go, line 130 at r1 (raw file):
} br, res, pErr = evaluateBatch(ctx, kvserverbase.CmdIDKey(""), rw, rec, nil, ba, true /* readOnly */)
spurious diff
pkg/kv/kvserver/replica_read.go, line 150 at r1 (raw file):
return nil, res, pErr } res.Local.ReleaseFunc = untrack
Is this indirection really needed? Can't we simply untrack
here?
I see that there are some early returns in maybeWatchForMerge
that don't untrack with LAI+1, but I can't tell which ones are expected. Plus one of them seems wrong to me.
If the indirection is necessary, can we at least bind all the arguments to the closure we put in res.Local.ReleaseFunc
? They shouldn't change in between here and maybeWatchForMerge
, should they? The closure could take a single bool about whether we're watching for the merge or not (a false would translate to all zeros to untrack()
).
pkg/kv/kvserver/replica_read.go, line 154 at r1 (raw file):
} func (r *Replica) maybeTrackProposal(
This funky method has one caller. Inline it there and then I won't have to ask for a better name and comments.
pkg/kv/kvserver/replica_read.go, line 160 at r1 (raw file):
// We start tracking SubsumeRequests as part of our guarantee to never // broadcast a closed timestamp entry between when we evaluate the Subsume // request and the time that we run maybeWatchForMerge for the first time.
I don't quite understand the "for the first time" part. Consider axing it.
pkg/kv/kvserver/replica_read.go, line 162 at r1 (raw file):
// request and the time that we run maybeWatchForMerge for the first time. // See comment in maybeWatchForMerge() for details. _, untrack := r.store.Cfg.ClosedTimestamp.Tracker.Track(ctx)
I think there's something missing here - the other return value from Track
is a minimum timestamp for the (Subsume) request. I think we need to reject the Subsume
request if its timestamp is above it. I think the comment above should spell out the important of Subsume's ts.
pkg/kv/kvserver/batcheval/result/result.go, line 69 at r1 (raw file):
// Call maybeWatchForMerge. MaybeWatchForMerge bool // ReleaseFunc, when it's not the empty function, stores the callback returned
I think this needs more words; the read-only SubsumeRequest
doesn't have a LeaseAppliedIndex
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @andreimatei, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica.go, line 1361 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
don't we need to call
untrack()
with LAI+1 in this case?
We only need to release this tracked proposal the first time we run maybeWatchForMerge
(which should always be right after the subsume request evaluation unless there is a lease transfer). Any time we see that we've already started the mergeComplete
channel, we should already have released this proposal then. This is what the comment below (in maybeTrackProposal
) talks about, I'll make it clearer.
pkg/kv/kvserver/replica_read.go, line 127 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
can this down, move next to
res.Local.ReleaseFunc = untrack
?
Starting to track the proposal after evaluation of the subsume request seemed problematic for the given scenario:
t1: RHS leaseholder begins evaluating the subsume request (i.e. t1 is our FreezeStart
)
t2: The leaseholder's store broadcasts a closed timestamp that is above t1
t3: Now we begin tracking the proposal, but it's too late because we've already emitted a closed timestamp above FreezeStart
Does that make sense? Let me know if I'm missing something.
Big +1 on this. The main change here is small but subtle and difficult to prove the correctness of. It's a major burden for reviewers and future readers when such changes are mixed in with a refactor that touches 46 files.
Are we sure about this? I mentioned in #44878 (comment) that I'm not convinced this invariant is currently upheld. It relies on latching and that there are Requests that perform Raft proposals that do not acquire any latches that conflict with the Given that this is so critical for correctness, I think it would be worth adding assertions around raft proposal logic to ensure that we uphold this invariant. |
Sounds good, feel free to hold off on reviewing until I reorg it into multiple commits. I'll also add assertions around the invariant. |
1afc52b
to
ebd3fc3
Compare
ebd3fc3
to
ffe190b
Compare
f136ce0
to
fd1c9fa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reorganized a bunch of stuff and added a new test that forces a lease transfer on the RHS during the merge (& ensures that our closed timestamp guarantees are upheld).
With regards to ComputeChecksumRequest: It turns out that the ComputeChecksum
request does "acquire latches" even though it doesn't declare any keys. However, checkExecutionCanProceed
only cares about the latchGuard being non-nil on the leaseholder. It almost seems like an unintentional consequence. Should we be more explicit about this? Sorry for not investigating this properly earlier. I added an assertion that blows up if it sees a proposal that would bump the LAI while the range is subsumed, along with an accompanying test.
This is RFAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @andreimatei, and @nvanbenschoten)
pkg/kv/kvserver/closed_timestamp_test.go, line 710 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
might as well take a
TestClusterArgs
.
Done, let me know if I misunderstood you.
pkg/kv/kvserver/closed_timestamp_test.go, line 801 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Consider keeping a single
setupClusterForClosedTSTesting()
function by extractingdefaultTestingKnobs
into aaggressiveResolvedTSPushKnobs
that everybody passed in. That way it'll be easier for tests to compose these default knobs with other additional knobs.
Done
pkg/kv/kvserver/replica.go, line 1330 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
pls comment the new param
Removed the indirection.
pkg/kv/kvserver/replica.go, line 1370 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think this comment needs to spell things out more: say that we block follower reads by publishing a MLAI that's never going to come in the happy case, and we do that because the closed timestamps that are published from now on will not be reflected in the LHS's tscache after the merge.
Done, let me know if you still find it lacking.
pkg/kv/kvserver/replica.go, line 1372 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/when a merge/the timestamp window representing the range's critical state
Also I don't see the syntagm "critical state" used elsewhere around here. Maybe say something more descriptive like "subsumed state".
Done.
pkg/kv/kvserver/replica.go, line 1382 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: I personally consider most such issue links to be just click bait. The comments should usually be sufficient for explaining what's going on; when they're not sufficient because we failed to write them properly, there's git blame.
Removed.
pkg/kv/kvserver/replica_read.go, line 150 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Is this indirection really needed? Can't we simply
untrack
here?
I see that there are some early returns inmaybeWatchForMerge
that don't untrack with LAI+1, but I can't tell which ones are expected. Plus one of them seems wrong to me.If the indirection is necessary, can we at least bind all the arguments to the closure we put in
res.Local.ReleaseFunc
? They shouldn't change in between here andmaybeWatchForMerge
, should they? The closure could take a single bool about whether we're watching for the merge or not (a false would translate to all zeros tountrack()
).
You're correct that the indirection isn't really necessary since we don't need to untrack()
in the lease acquisition path (which is what the initial check in maybeWatchForMerge
is for)
pkg/kv/kvserver/replica_read.go, line 154 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
This funky method has one caller. Inline it there and then I won't have to ask for a better name and comments.
Done
pkg/kv/kvserver/replica_read.go, line 160 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I don't quite understand the "for the first time" part. Consider axing it.
Done.
pkg/kv/kvserver/replica_read.go, line 162 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think there's something missing here - the other return value from
Track
is a minimum timestamp for the (Subsume) request. I think we need to reject theSubsume
request if its timestamp is above it. I think the comment above should spell out the important of Subsume's ts.
Not applicable as discussed offline.
pkg/kv/kvserver/batcheval/result/result.go, line 69 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think this needs more words; the read-only
SubsumeRequest
doesn't have aLeaseAppliedIndex
.
Removed this, see above.
b89c7c9
to
5f0b494
Compare
The test that forces a lease transfer during the merge is flakey under contention because it relies on the merge txn not retrying 😞. Trying to fix that.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
nit: I imagine I'm probably driving you crazy with this this stuff, but I think r4 and r5 should be squashed with r2. The testing really belongs with the commit fixing the issue. And the assertion is tiny, and also tied to the main thing this patch is doing. TestStoreRangeMergeCheckConsistencyAfterSubsumption
is more random, that can be a separate commit indeed.
Btw, I also find it much better to rewrite all the commits every time I update a PR (like, rebase them all), so that reviewable shows the new commits as a contiguous block, instead of interspersing old commits and new commits.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @andreimatei, and @nvanbenschoten)
pkg/kv/kvserver/client_merge_test.go, line 1534 at r5 (raw file):
// 2. Once a merge is aborted, pending (and future) requests will be allowed to // be proposed. This is meant as a sanity check for the assertion at the // end of Replica.propose().
pls put more words here describing this assertion
pkg/kv/kvserver/client_merge_test.go, line 1599 at r5 (raw file):
case <-checkConsistencyResp: t.Fatalf("expected the consistency check to wait until the merge was complete") case <-time.After(1 * time.Second):
you probably want more than a second here and below; tests run on loaded machines, particularly under testrace.
Consider testutils.SucceedsSoon()
- that's the go to for waiting a while
pkg/kv/kvserver/closed_timestamp_test.go, line 469 at r4 (raw file):
t.Run("without lease transfer", func(t *testing.T) { ctx := context.Background() blockMergeTrigger := make(chan interface{}, 10) // headroom in case the merge txn retries
why interface{}
and not hlc.Timestamp
?
pkg/kv/kvserver/replica_raft.go, line 337 at r5 (raw file):
if maxLeaseIndex != 0 { r.mu.RLock() defer r.mu.RUnlock()
nit: I'd just put the RUnlock() below, without a defer. You want to hold this lock for as little as possible, and tie it with the mergeInProgressRLocked()
call. Like, if there's more code added to this method, you don't want it running under the lock.
If the log.Fatal() is hit, the lock won't be release anyway.
pkg/kv/kvserver/replica_read.go, line 127 at r1 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Starting to track the proposal after evaluation of the subsume request seemed problematic for the given scenario:
t1: RHS leaseholder begins evaluating the subsume request (i.e. t1 is ourFreezeStart
)
t2: The leaseholder's store broadcasts a closed timestamp that is above t1
t3: Now we begin tracking the proposal, but it's too late because we've already emitted a closed timestamp aboveFreezeStart
Does that make sense? Let me know if I'm missing something.
ack
pkg/kv/kvserver/replica_read.go, line 126 at r2 (raw file):
log.Event(ctx, "executing read-only batch") untrack := func(_ context.Context, _ ctpb.Epoch, _ roachpb.RangeID, _ ctpb.LAI) {}
I'd suggest simplifying this untrack()
to only take a bool - whether or not the range has been subsumed, and closing over all the other arguments right here.
d1fd97d
to
4d64895
Compare
In addition to what I just posted in #51294, I also ran into some clock sync related
With this warning message in the traces:
Here are the logs for these: The second one seems..really weird? |
FWIW at this point, I believe I’m not going to find anything in these stress runs that falls under this PR. The exact issue this PR fixes is simulated by the targeted unit test. It’s easy to sink a lot of time looking into these nemeses failures so we should probably let it find stuff in the nightly runs & track them separately anyway. |
Currently, RaftHeartbeatIntervalTicks is not configurable from a TestCluster and we disallow RaftElectionTimeoutTicks to be less than or equal to it. Thus, tests that use TestCluster and, for instance, sleep for a node's liveness to expire take longer than they really need to. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 23 files at r14, 21 of 21 files at r15, 1 of 1 files at r16.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, and @andreimatei)
pkg/kv/kvserver/replica_closedts.go, line 37 at r6 (raw file):
because it gets interpreted as the range getting quiesced
This doesn't quite capture the conclusion of that conversation. Mind updating it to be a little more clear?
pkg/kv/kvserver/replica_write.go, line 156 at r12 (raw file):
Subsume() cmd_subsume.go
Is this missing an "in"?
pkg/kv/kvserver/batcheval/cmd_compute_checksum.go, line 39 at r16 (raw file):
a rare a closed timestamp
Delete the second "a".
Before this commit, during a merge, the RHS leaseholder’s store could continue broadcasting (actionable) closed timestamp updates even after it had been subsumed. This allowed the followers to be able to serve follower reads past the subsumption time of RHS. Additionally, after the merge, if the LHS had a lower closed timestamp than the RHS, it could allow writes to the keyspan owned by RHS at timestamps lower than the RHS’s max closed timestamp. This commit fixes this bug by requiring that the followers catch up to a LeaseAppliedIndex that belongs to the entry succeeding the Subsume request. It also adds necessary assertions around the invariant that while a range is subsumed, nothing (except the merge txn being rolled back) can bump its lease applied index. Fixes cockroachdb#44878 Release note (bug fix): Fixed a rare bug that could cause actionable closed timestamps to effectively regress over a given keyspan. This could in turn lead to a serializability violation when using follower reads. This was due to ill-defined interactions between range merges and the closed timestamp subsystem.
This commit adds a "sanity check" test around the assertion at the end of Replica.propose() that ensures that the lease applied index of a subsumed range is never bumped. Release note: None
4d64895
to
1458427
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the extremely detailed reviews @nvanbenschoten and @andreimatei
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @andreimatei, and @nvanbenschoten)
pkg/kv/kvserver/replica_closedts.go, line 37 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
because it gets interpreted as the range getting quiesced
This doesn't quite capture the conclusion of that conversation. Mind updating it to be a little more clear?
Done.
pkg/kv/kvserver/replica_write.go, line 156 at r12 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Subsume() cmd_subsume.go
Is this missing an "in"?
Fixed.
pkg/kv/kvserver/batcheval/cmd_compute_checksum.go, line 39 at r16 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
a rare a closed timestamp
Delete the second "a".
Done.
bors r=nvanbenschoten,andreimatei |
We had merges disabled because of the bugs tracked in cockroachdb#44878, but those have since been fixed by cockroachdb#46085 and cockroachdb#50265. Release note: None
Build failed (retrying...): |
Build succeeded: |
51305: geo,geoindex: use bounding box for geography coverings that are bad r=sumeerbhola a=sumeerbhola This change uses a covererInterface implementation for geography that notices when a covering is using top-level cells of all faces and in that case uses the bounding box to compute the covering. Also changed the bounding box calculation for geography shapes to use only the points and not first construct s2.Regions. The latter causes marginally bad shapes to continue to have bad coverings since the bounding box also covers the whole earth. Release note: None 51882: roachpb: panic when comparing a Lease to a non-lease r=andreimatei a=andreimatei Release note: None 52146: sql: remove local execution of projectSetNode and implement ConstructProjectSet in the new factory r=yuzefovich a=yuzefovich Depends on #52108. **sql: remove local execution of projectSetNode** We have project set processor which is always planned for `projectSetNode`, so this commit removes the dead code of its local execution. Additionally, it removes some unused fields and cleans up cancellation check of the processor. Release note: None **sql: implement ConstructProjectSet in the new factory** Addresses: #47473. Release note: None 52320: kvserver: enable merges in kvnemesis r=aayushshah15 a=aayushshah15 We had merges disabled because of the bugs tracked in #44878, but those have since been fixed by #46085 and #50265. Release note: None Co-authored-by: sumeerbhola <[email protected]> Co-authored-by: Andrei Matei <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Aayush Shah <[email protected]>
Needed for cockroachdb#61137. This commit updates the manner through which lease transfers (through `LeaseTransferRequest`) and range merges (through `SubsumeRequest`) handle the "transfer of power" from their outgoing leaseholder to their incoming leaseholder. Specifically, it updates the handling of these requests in order to rationalize the interaction between their evaluation and the closing of timestamps through the closed timestamp side-transport. It is now clearer when and how these requests instruct the outgoing leaseholder to relinquish its ability to advance the closed timestamp and, as a result, now possible for the requests to query and operate on the maximum closed timestamp published by the outgoing leaseholder. For lease transfers, this commit begins by addressing an existing TODO to push the revocation of the outgoing lease out of `AdminTransferLease` and into the evaluation of `LeaseTransferRequest` through a new `RevokeLease` method on the `EvalContext`. Once a lease is revoked, the side-transport will no longer be able to advance the closed timestamp under it. This was made possible by cockroachdb#59086 and was suggested by @tbg during the code review. We generally like to keep replica state changes out of "admin" requests themselves, which are intended to coordinate changes through individual non-admin requests. Admin requests generally don't even need to evaluate on a leaseholder (though they try to), so having them perform any state changes is fragile. For range merges, this commit removes the `MaybeWatchForMerge` flag from the `LocalResult` returned by `SubsumeRequest` and replaces it with a `WatchForMerge` method on the `EvalContext`. This allows the `SubsumeRequest` to launch the range merge watcher goroutine during it evaluation, which the side-transport checks for to see whether a leaseholder can advance its closed timestamp. In doing so, the `SubsumeRequest` is able to pause closed timestamps when it wants and is also able to observe and return the maximum closed timestamp _after_ the closed timestamp has stopped advancing. This is a stark improvement over the approach with the original closed timestamp system, which required a herculean effort in cockroachdb#50265 to make correct. With these refactors complete, the closed timestamp side-transport should have a much easier and safer time checking whether a given leaseholder is able to advance its closed timestamp. Release justification: Necessary for the safety of new functionality.
Needed for cockroachdb#61137. This commit updates the manner through which lease transfers (through `LeaseTransferRequest`) and range merges (through `SubsumeRequest`) handle the "transfer of power" from their outgoing leaseholder to their incoming leaseholder. Specifically, it updates the handling of these requests in order to rationalize the interaction between their evaluation and the closing of timestamps through the closed timestamp side-transport. It is now clearer when and how these requests instruct the outgoing leaseholder to relinquish its ability to advance the closed timestamp and, as a result, now possible for the requests to query and operate on the maximum closed timestamp published by the outgoing leaseholder. For lease transfers, this commit begins by addressing an existing TODO to push the revocation of the outgoing lease out of `AdminTransferLease` and into the evaluation of `LeaseTransferRequest` through a new `RevokeLease` method on the `EvalContext`. Once a lease is revoked, the side-transport will no longer be able to advance the closed timestamp under it. This was made possible by cockroachdb#59086 and was suggested by @tbg during the code review. We generally like to keep replica state changes out of "admin" requests themselves, which are intended to coordinate changes through individual non-admin requests. Admin requests generally don't even need to evaluate on a leaseholder (though they try to), so having them perform any state changes is fragile. For range merges, this commit removes the `MaybeWatchForMerge` flag from the `LocalResult` returned by `SubsumeRequest` and replaces it with a `WatchForMerge` method on the `EvalContext`. This allows the `SubsumeRequest` to launch the range merge watcher goroutine during it evaluation, which the side-transport checks for to see whether a leaseholder can advance its closed timestamp. In doing so, the `SubsumeRequest` is able to pause closed timestamps when it wants and is also able to observe and return the maximum closed timestamp _after_ the closed timestamp has stopped advancing. This is a stark improvement over the approach with the original closed timestamp system, which required a herculean effort in cockroachdb#50265 to make correct. With these refactors complete, the closed timestamp side-transport should have a much easier and safer time checking whether a given leaseholder is able to advance its closed timestamp. Release justification: Necessary for the safety of new functionality.
61221: kv: sync lease transfers and range merges with closed timestamp side-transport r=nvanbenschoten a=nvanbenschoten Needed for the safety of #61137. This commit updates the manner through which lease transfers (through `LeaseTransferRequest`) and range merges (through `SubsumeRequest`) handle the "transfer of power" from their outgoing leaseholder to their incoming leaseholder. Specifically, it updates the handling of these requests in order to rationalize the interaction between their evaluation and the closing of timestamps through the closed timestamp side-transport. It is now clearer when and how these requests instruct the outgoing leaseholder to relinquish its ability to advance the closed timestamp and, as a result, now possible for the requests to query and operate on the maximum closed timestamp published by the outgoing leaseholder. For lease transfers, this commit begins by addressing an existing TODO to push the revocation of the outgoing lease out of `AdminTransferLease` and into the evaluation of `LeaseTransferRequest` through a new `RevokeLease` method on the `EvalContext`. Once a lease is revoked, the side-transport will no longer be able to advance the closed timestamp under it. This was made possible by #59086 and was suggested by @tbg during the code review. We generally like to keep replica state changes out of "admin" requests themselves, which are intended to coordinate changes through individual non-admin requests. Admin requests generally don't even need to evaluate on a leaseholder (though they try to), so having them perform any state changes is fragile. For range merges, this commit removes the `MaybeWatchForMerge` flag from the `LocalResult` returned by `SubsumeRequest` and replaces it with a `WatchForMerge` method on the `EvalContext`. This allows the `SubsumeRequest` to launch the range merge watcher goroutine during it evaluation, which the side-transport checks for to see whether a leaseholder can advance its closed timestamp. In doing so, the `SubsumeRequest` is able to pause closed timestamps when it wants and is also able to observe and return the maximum closed timestamp _after_ the closed timestamp has stopped advancing. This is a stark improvement over the approach with the original closed timestamp system, which required a herculean effort in #50265 to make correct. With these refactors complete, the closed timestamp side-transport should have a much easier and safer time checking whether a given leaseholder is able to advance its closed timestamp. Release justification: Necessary for the safety of new functionality. 61237: util/log: ensure that all channel logs are displayed with `-show-logs` r=tbg a=knz When `-show-logs` is specified, the `log.Scope` becomes a no-op and the default configuration in the `log` package is used. This is the only time ever when the default configuration is used. Prior to this patch, only the logging for the DEV channel would make its way to the standard error (and the test output) in that case. This was unfortunate, since the intent (as spelled out in a comment already) was to display everything. This patch fixes that. Release justification: non-production code changes Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Raphael 'kena' Poss <[email protected]>
Before this commit, during a merge, the RHS leaseholder’s store
could continue broadcasting (actionable) closed timestamp updates
even after it had been subsumed. This allowed the followers to be
able to serve follower reads past the subsumption time of RHS.
Additionally, after the merge, if the LHS had a lower closed timestamp
than the RHS, it could allow writes to the keyspan owned by RHS
at timestamps lower than the RHS’s max closed timestamp.
This commit fixes this bug by requiring that the followers catch up
to a LeaseAppliedIndex that belongs to the entry succeeding the
Subsume request.
Fixes #44878
Release note (bug fix): Fixed a rare bug that could cause actionable
closed timestamps to effectively regress over a given keyspan. This
could in turn lead to a serializability violation when using follower
reads. This was due to ill-defined interactions between range merges
and the closed timestamp subsystem.