-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: fix write below closedts bug #63672
kvserver: fix write below closedts bug #63672
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, 5 of 6 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/kv/kvserver/replica_closedts_test.go, line 417 at r4 (raw file):
} // Test that a lease proposal that gets rejected doesn't erroneously dictate the
Nice test! It's complex, but the way you wrote this makes it easy to follow.
pkg/kv/kvserver/replica_closedts_test.go, line 422 at r4 (raw file):
// // 1. A lease held by rep1 is getting close to its expiration. // 2. Rep1 begins the process of transferring its lease to rep2 with a start time of 100.
nit: wrap
pkg/kv/kvserver/replica_closedts_test.go, line 427 at r4 (raw file):
// lease acquisition is slow to propose. // 4. The lease transfer finally applies. Rep2 is the new leaseholder and bumps // its tscache to 100.
nit: spacing is off here.
pkg/kv/kvserver/replica_closedts_test.go, line 429 at r4 (raw file):
// its tscache to 100. // 5. Two writes start evaluating on rep2 under the new lease. They bump their // write timestamp to 100.
101, right? Or 100,1 if we want to concern ourselves with logical ticks.
pkg/kv/kvserver/replica_closedts_test.go, line 448 at r4 (raw file):
// The test simulates the scenario and verifies that we don't crash with a // closed timestamp violation assertion. We avoid the violation because, in step // 6, the lease proposal doesn't bump the assignedClosedTimestamp.
Consider referencing the issue as well, to provide more context for those who seek it.
pkg/kv/kvserver/replica_closedts_test.go, line 472 at r4 (raw file):
unblockWritesCh := make(chan struct{}) var writeKey1, writeKey2 atomic.Value // Initialize the atomics do they get bound to a specific type.
"so"
pkg/kv/kvserver/replica_closedts_test.go, line 472 at r4 (raw file):
unblockWritesCh := make(chan struct{}) var writeKey1, writeKey2 atomic.Value // Initialize the atomics do they get bound to a specific type.
That's a neat idea.
pkg/kv/kvserver/replica_closedts_test.go, line 483 at r4 (raw file):
return } if atomic.LoadInt64(&trappedLeaseAcquisition) == 1 {
Make this an atomic compare and swap to avoid a race between two lease acquisitions. Or use a sync.Once
.
pkg/kv/kvserver/replica_proposal_buf.go, line 736 at r3 (raw file):
// timestamps); in other words, assignedClosedTimestamp was not supposed to // have been incremented while requests with lower timestamps were // evaluating.
" and should have been used to bump the timestamp of any requests that began evaluating after it was bumped".
pkg/kv/kvserver/replica_proposal_buf.go, line 737 at r3 (raw file):
// have been incremented while requests with lower timestamps were // evaluating. if !b.assignedClosedTimestamp.LessEq(p.Request.WriteTimestamp()) &&
Aren't these first two conditions equal to p.Request.WriteTimestamp().Less(b.assignedClosedTimestamp)
?
pkg/kv/kvserver/replica_proposal_buf.go, line 739 at r3 (raw file):
if !b.assignedClosedTimestamp.LessEq(p.Request.WriteTimestamp()) && !b.assignedClosedTimestamp.IsEmpty() && p.Request.IsIntentWrite() { wts := p.Request.WriteTimestamp()
nit: pull this out before if you're going to assign it to a variable so that we can use it in the check.
pkg/kv/kvserver/replica_proposal_buf.go, line 633 at r4 (raw file):
} } if fn := b.p.testingFlushFilter(); fn != nil {
Do we need this? Don't we have the submitProposalFilter
right above?
pkg/kv/kvserver/replica_proposal_buf.go, line 752 at r4 (raw file):
// calling forwardClosedTimestampLocked). Bumping it to the lease start time // would (surprisingly?) be illegal: just because we're proposing a lease // starting at timestamp 100, doesn't mean we're sure to not evaluate
"to not be evaluating"?
pkg/kv/kvserver/replica_proposal_buf_test.go, line 204 at r3 (raw file):
} func (pc proposalCreator) newPutProposal(ts hlc.Timestamp) (*ProposalData, []byte) {
Do we need this change? We you planning on testing the new assertion?
pkg/kv/kvserver/replica_proposal_buf_test.go, line 768 at r4 (raw file):
expClosed hlc.Timestamp // If not nil, b.assignedClosedTimestamp is checked against this.
Is it possible to make this mandatory and check it for each subtest?
pkg/kv/kvserver/replica_range_lease.go, line 1056 at r4 (raw file):
} // TestingAcquireLease is redirectOnOrAcquireLease exposed for tests.
Can this live in helpers_test.go
?
pkg/kv/kvserver/replica_range_lease.go, line 1281 at r4 (raw file):
if log.ExpensiveLogEnabled(ctx, 2) { log.Infof(ctx, "extending lease %s at %s", st.Lease, st.Now) debug.PrintStack()
I don't think you meant to keep this.
pkg/kv/kvserver/testing_knobs.go, line 26 at r4 (raw file):
) // ProposalFLushFilter is the type of the TestingFlushFilter callback.
"FLush"
543e94e
to
3d88622
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/replica_closedts_test.go, line 422 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: wrap
done
pkg/kv/kvserver/replica_closedts_test.go, line 427 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: spacing is off here.
done
pkg/kv/kvserver/replica_closedts_test.go, line 429 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
101, right? Or 100,1 if we want to concern ourselves with logical ticks.
done
pkg/kv/kvserver/replica_closedts_test.go, line 448 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider referencing the issue as well, to provide more context for those who seek it.
I think the issue is easily discoverable from the commit message.
pkg/kv/kvserver/replica_closedts_test.go, line 472 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"so"
done
pkg/kv/kvserver/replica_closedts_test.go, line 483 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Make this an atomic compare and swap to avoid a race between two lease acquisitions. Or use a
sync.Once
.
done
pkg/kv/kvserver/replica_proposal_buf.go, line 736 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
" and should have been used to bump the timestamp of any requests that began evaluating after it was bumped".
done
pkg/kv/kvserver/replica_proposal_buf.go, line 737 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Aren't these first two conditions equal to
p.Request.WriteTimestamp().Less(b.assignedClosedTimestamp)
?
yes. I don't know what I was thinking when I wrote it like that.
pkg/kv/kvserver/replica_proposal_buf.go, line 739 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: pull this out before if you're going to assign it to a variable so that we can use it in the check.
well wts
escapes, that's why I put it inside the if
(right?)
But I've gotten rid of it, it wasn't worth a variable.
pkg/kv/kvserver/replica_proposal_buf.go, line 633 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need this? Don't we have the
submitProposalFilter
right above?
submitProposalFilter
as it stood wasn't good because it ran before a closed ts had been assigned to the command. But I've prepended a commit to make it run after, and switched to use it.
pkg/kv/kvserver/replica_proposal_buf.go, line 752 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"to not be evaluating"?
done
pkg/kv/kvserver/replica_proposal_buf_test.go, line 204 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need this change? We you planning on testing the new assertion?
well it's not that I particularly wanted to test the assertion, but that without this change the assertion fires for the test that now passes a timestamp (TestProposalBufferClosedTimestamp
). That's a unit test where proposals don't go through a tracker.
pkg/kv/kvserver/replica_proposal_buf_test.go, line 768 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is it possible to make this mandatory and check it for each subtest?
see now
pkg/kv/kvserver/replica_range_lease.go, line 1056 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can this live in
helpers_test.go
?
not exactly, because I was calling it from TestServer
, which is in a different package. But I've removed the TestServer
utility in favor of having the test reach down directly for the replica, and I've moved this to helpers_test.go
. I've got mixed feelings.
pkg/kv/kvserver/replica_range_lease.go, line 1281 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't think you meant to keep this.
removed
pkg/kv/kvserver/testing_knobs.go, line 26 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"FLush"
removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 8 files at r9.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/kv/kvserver/replica_closedts_test.go, line 448 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think the issue is easily discoverable from the commit message.
Until the test gets moved a few times and the git blame requires a few hops to find the commit message. But I don't feel strongly.
pkg/kv/kvserver/replica_proposal_buf_test.go, line 204 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
well it's not that I particularly wanted to test the assertion, but that without this change the assertion fires for the test that now passes a timestamp (
TestProposalBufferClosedTimestamp
). That's a unit test where proposals don't go through a tracker.
Oh, I was missing: pc.newPutProposal(now.ToTimestamp())
. Carry on.
3d88622
to
fea8d06
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
had to do some crap to make the test stable. Please look at the new commit called "improve test utilities for lease info".
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)
pkg/kv/kvserver/replica_closedts_test.go, line 448 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Until the test gets moved a few times and the git blame requires a few hops to find the commit message. But I don't feel strongly.
the thing is that I don't believe that issue to be useful to anyone. It's up to the comments on the test and, to a lesser degree, this commit message.
fea8d06
to
ee63438
Compare
Make it clear that an else case is, in fact, an else case. Release note: None
Clarifiy a <= vs < in a closedts comment. Release note: None
Release note: None
Make this testing filter run after a closed timestamp is assigned to the command to be filtered. I had to slightly change a test because, now, the filter doesn't run if the raft group is nil. Release note: None
ee63438
to
6b6871c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please look at the new commits called "wait when adding voters" and "improve test utilities for lease info".
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
The TestServer/TestCluster have facilities for querying a range's lease. This patch addresses two problems with this functionality: 1) If the node being queried has a lease proposal in flight, the respective functions would return it at the detriment of the current lease. This behavior is dubious, and indeed not what a test that I'm writing wants. This patch adds a function that returns both the current and the prospective next lease. 2) The functions let you ask for the state of a particular node to be queried. This was buggy, though, because the LeaseInfoRequest could be transparently forwarded to a different node in case the queried node doesn't have a replica or has a learner replica. That's confusing as hell. This patch optionally dissallows that, giving the caller confidence that they got the state from the node it was interested in. Release note: None
1436ad4
to
563e1ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 6 files at r14, 6 of 6 files at r15, 6 of 6 files at r20, 6 of 6 files at r21.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
This patch fixes a bug in our closed timestamp management. This bug was making it possible for a command to close a timestamp even though other requests writing at lower timestamps are currently evaluating. The problem was that we were assuming that, if a replica is proposing a new lease, there can't be any requests in flight and every future write evaluated on the range will wait for the new lease and the evaluate above the lease start time. Based on that reasoning, the proposal buffer was recording the lease start time as its assignedClosedTimestamp. This was matching what it does for every write, where assignedClosedTimestamp corresponds to the the closed timestamp carried by the command. It turns out that the replica's reasoning was wrong. It is, in fact, possible for writes to be evaluating on the range when the lease acquisition is proposed. And these evaluations might be done at timestamps below the would-be lease's start time. This happens when the replica has already received a lease through a lease transfer. The transfer must have applied after the previous lease expired and the replica decided to start acquiring a new one. This fixes one of the assertion failures seen in cockroachdb#62655. Release note (bug fix): A bug leading to crashes with the message "writing below closed ts" has been fixed.
563e1ab
to
467ae98
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Build succeeded: |
This patch fixes a bug in our closed timestamp management. This bug was
making it possible for a command to close a timestamp even though other
requests writing at lower timestamps are currently evaluating. The
problem was that we were assuming that, if a replica is proposing a new
lease, there can't be any requests in flight and every future write
evaluated on the range will wait for the new lease and the evaluate
above the lease start time. Based on that reasoning, the proposal buffer
was recording the lease start time as its assignedClosedTimestamp. This
was matching what it does for every write, where assignedClosedTimestamp
corresponds to the the closed timestamp carried by the command.
It turns out that the replica's reasoning was wrong. It is, in fact,
possible for writes to be evaluating on the range when the lease
acquisition is proposed. And these evaluations might be done at
timestamps below the would-be lease's start time. This happens when the
replica has already received a lease through a lease transfer. The
transfer must have applied after the previous lease expired and the
replica decided to start acquiring a new one.
This fixes one of the assertion failures seen in #62655.
Release note (bug fix): A bug leading to crashes with the message
"writing below closed ts" has been fixed.