rangefeed: fix premature checkpoint due to intent resolution race #117612
Conversation
Force-pushed from 4fc2d95 to e7ac988.
Nice! I think I prefer this over #114487 for the reasons discussed in the PR. I'm glad Barrier could be extended for this use case.
What do you think about it?
Reviewed 2 of 2 files at r1, 7 of 7 files at r2, 3 of 3 files at r3, 6 of 6 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvpb/api.proto
line 1323 at r4 (raw file):
but it did exist in the past
I don't even think we know that. The timestamp cache may have been ratcheted above the txn's min timestamp across the board.
pkg/kv/kvserver/replica_rangefeed.go
line 189 at r3 (raw file):
// Execute a Barrier on the leaseholder.
//
// TODO(erikgrinaker): handle Barrier requests split across ranges. This can
If a range splits, its rangefeed processor will be torn down and recreated, at which point the resolved timestamp will be recomputed with a lock table scan. So it doesn't feel like we necessarily need to handle this by using the LAIs returned from the Barrier request. But we also shouldn't crash. What happens now?
pkg/kv/kvserver/replica_rangefeed.go
line 193 at r3 (raw file):
// isUnsplittable on BarrierRequest, but we can maybe do something better
// (e.g. return per-range LAIs).
desc := tp.r.Desc()
Should we be passing the key span in from the rangefeed processor? We will tear the processor down on splits or merges, but I'm not sure that precludes any races.
pkg/kv/kvserver/replica_rangefeed.go
line 209 at r3 (raw file):
for tp.r.GetLeaseAppliedIndex() < lai {
	if ticker == nil {
		ticker = time.NewTicker(100 * time.Millisecond)
Do you think it's worth adding an exponential backoff here? 100ms for the first iteration feels a bit long.
pkg/kv/kvserver/replica_write.go
line 299 at r2 (raw file):
}
// Populate LeaseAppliedIndex for Barrier responses.
if ba.IsSingleBarrierRequest() && propResult.Err == nil {
Reminder to myself and others who get here: Barrier requests do go through Raft, despite having an empty WriteBatch, because of BatchRequest.RequiresConsensus.
pkg/kv/kvserver/intentresolver/intent_resolver.go
line 387 at r4 (raw file):
pushType kvpb.PushTxnType, skipIfInFlight bool,
) (_ map[uuid.UUID]*roachpb.Transaction, ambiguousAbort bool, _ *kvpb.Error) {
nit: consider renaming this to anyAmbiguousAbort or something along those lines.
pkg/kv/kvserver/rangefeed/resolved_timestamp.go
line 163 at r5 (raw file):
//
// TODO(erikgrinaker): check that this won't end up with crash loops.
rts.assertOpAboveRTS(op, t.Timestamp)
I can't recall why we didn't already have this assertion here. I don't think it was an intentional omission when we added the other assertions.
pkg/kv/kvserver/rangefeed/task.go
line 346 at r3 (raw file):
// intents in the rangefeed processor's queue. These updates may not yet have
// been applied to the resolved timestamp intent queue, but that's ok -- our
// MVCCAbortTxnOp will be enqueued and processed after them.
Could you mention that it's safe to publish a committed MVCCAbortTxnOp for a txn once all of that txn's local intents have been committed and all MVCCCommitIntentOp ops have been published? And then add a test for this?
pkg/kv/kvserver/rangefeed/task.go
line 352 at r3 (raw file):
//
// TODO(erikgrinaker): check how often we hit this, and consider coming up
// with a way to avoid it in the common case, since it incurs a write.
s/write/raft consensus round/
pkg/kv/kvserver/rangefeed/task.go
line 253 at r4 (raw file):
// prevent a restart for any transaction that has not been written
// over at a larger timestamp.
pushedTxns, ambiguousAbort, err := a.pusher.PushTxns(ctx, a.txns, a.ts)
Can we trust this ambiguousAbort value in mixed-version clusters?
pkg/kv/kvserver/batcheval/cmd_barrier_test.go
line 38 at r2 (raw file):
// TODO(erikgrinaker): add some actual testing here.
ts, lai, err := kvDB.Barrier(ctx, start, end)
Should this test be in batcheval? It's spinning up a full server, which we usually reserve for pkg/kv/kvclient or pkg/kv/kvserver.
Do we have any other testing of Barrier that we could extend?
What do you think about it?
I think it's better -- it's simpler and more robust. I'm not overly worried about the cost of these writes either, given the conditions under which it triggers and that it's at most one per range per incident, but I'll have a closer look at this.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/replica_rangefeed.go
line 189 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
If a range splits, its rangefeed processor will be torn down and recreated, at which point the resolved timestamp will be recomputed with a lock table scan. So it doesn't feel like we necessarily need to handle this by using the LAIs returned from the Barrier request. But we also shouldn't crash. What happens now?
As is, we'll use the LAI of the first range overlapping the keyspan, because there's no handling of this in BarrierResponse.combine():
For splits, that could actually be fine, because that LAI would be on the LHS after applying the split, so waiting for it would also mean waiting for the split to apply locally and tearing down the rangefeed. But this is mostly accidental, and we shouldn't rely on it. It would also break on a merge where the LHS LAI is lower than the RHS LAI -- that case requires us to check that the range ID matches our local replica, which I don't believe is currently exposed in the RPC response.
For our purposes here, it would be sufficient to mark Barrier as isUnsplittable and include the range ID in the response, and then just error out in txnPushAttempt until we tear down the rangefeed processor. But this doesn't generalize particularly well, because other callers would likely want to apply a barrier across a keyspan regardless of the range structure.
We could also return a slice of RangeBarrier containing the range ID, key span, evaluation timestamp, and LAI. However, that seems impractical when used with a large table across thousands of ranges, where we'd have to start using resume spans to avoid large responses.
I'm going to do the straightforward thing for now: make this conditional on a parameter BarrierRequest.WithLeaseAppliedIndex, which will make the request unsplittable.
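As a rough illustration of that approach (a sketch only: the request flag and response field follow the names in this thread, while the helper name and exact plumbing are assumptions, not the final implementation):

```go
// barrierWithLAI issues a Barrier over the span and also asks for the
// leaseholder's lease applied index. WithLeaseAppliedIndex makes the request
// unsplittable, so it fails (rather than silently combining LAIs) if the
// span no longer fits in a single range.
// Assumes the usual CockroachDB imports: kv, kvpb, roachpb, hlc.
func barrierWithLAI(
	ctx context.Context, db *kv.DB, start, end roachpb.Key,
) (hlc.Timestamp, kvpb.LeaseAppliedIndex, error) {
	b := &kv.Batch{}
	b.AddRawRequest(&kvpb.BarrierRequest{
		RequestHeader:         kvpb.RequestHeader{Key: start, EndKey: end},
		WithLeaseAppliedIndex: true, // the new parameter proposed above
	})
	if err := db.Run(ctx, b); err != nil {
		// e.g. a range mismatch error if the span straddles a split.
		return hlc.Timestamp{}, 0, err
	}
	resp := b.RawResponse().Responses[0].GetBarrier()
	return resp.Timestamp, resp.LeaseAppliedIndex, nil
}
```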
pkg/kv/kvserver/replica_rangefeed.go
line 193 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we be passing the key span in from the rangefeed processor? We will tear the processor down on splits or merges, but I'm not sure that precludes any races.
Yeah, that seems prudent.
pkg/kv/kvserver/replica_rangefeed.go
line 209 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do you think it's worth adding an exponential backoff here? 100ms for the first iteration feels a bit long.
Yeah, either that or we add a way to subscribe to LAI updates from the replica -- but that seems a bit overkill perhaps.
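For reference, a minimal backoff loop of the kind being discussed might look like this (stdlib only, illustrative; the real code would presumably use the repo's retry helpers and hook into stopper/context cancellation):

```go
// waitForLAI polls until the local replica has applied the given lease
// applied index, starting with a short sleep and doubling it up to a cap,
// rather than waking up on a fixed 100ms ticker.
func waitForLAI(ctx context.Context, getLAI func() uint64, lai uint64) error {
	backoff := 10 * time.Millisecond // assumed initial backoff
	const maxBackoff = time.Second
	for getLAI() < lai {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return nil
}
```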
pkg/kv/kvserver/rangefeed/resolved_timestamp.go
line 163 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I can't recall why we didn't already have this assertion here. I don't think it was an intentional omission when we added the other assertions.
Is this assertion relevant for MVCCUpdateIntentOp as well?
pkg/kv/kvserver/rangefeed/task.go
line 346 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you mention that it's safe to publish a committed MVCCAbortTxnOp for a txn once all of that txn's local intents have been committed and all MVCCCommitIntentOp ops have been published? And then add a test for this?
Yep, I'll verify and document this, with a test.
pkg/kv/kvserver/rangefeed/task.go
line 253 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can we trust this ambiguousAbort value in mixed-version clusters?
We can't, but in that case it degrades to the current behavior. That's worth documenting on the API.
pkg/kv/kvserver/batcheval/cmd_barrier_test.go
line 38 at r2 (raw file):
Should this test be in batcheval? It's spinning up a full server, which we usually reserve for pkg/kv/kvclient or pkg/kv/kvserver.
Several other batcheval tests do as well. I don't have an opinion on whether they should (beyond unit tests being better when possible), but it seems natural to test the request here, and we'll need to submit the proposal through Raft to fully do so.
Do we have any other testing of Barrier that we could extend?
Nope. I'll see if I can add some basic coverage in kvnemesis too.
Force-pushed from e7ac988 to fbdd612.
Force-pushed from c258859 to 94ce567.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
pkg/kv/kvserver/replica_rangefeed.go
line 189 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
As is, we'll use the LAI of the first range overlapping the keyspan, because there's no handling of this in BarrierResponse.combine():
For splits, that could actually be fine, because that LAI would be on the LHS after applying the split, so waiting for it would also mean waiting for the split to apply locally and tearing down the rangefeed. But this is mostly accidental, and we shouldn't rely on it. It would also break on a merge where the LHS LAI is lower than the RHS LAI -- that case requires us to check that the range ID matches our local replica, which I don't believe is currently exposed in the RPC response.
For our purposes here, it would be sufficient to mark Barrier as isUnsplittable and include the range ID in the response, and then just error out in txnPushAttempt until we tear down the rangefeed processor. But this doesn't generalize particularly well, because other callers would likely want to apply a barrier across a keyspan regardless of the range structure.
We could also return a slice of RangeBarrier containing the range ID, key span, evaluation timestamp, and LAI. However, that seems impractical when used with a large table across thousands of ranges, where we'd have to start using resume spans to avoid large responses.
I'm going to do the straightforward thing for now: make this conditional on a parameter BarrierRequest.WithLeaseAppliedIndex, which will make the request unsplittable.
I've updated the PR with this approach. It should be mostly good to go once I add some tests; I'd appreciate another look in the meantime.
Reviewed 18 of 18 files at r6, 6 of 12 files at r7, 4 of 7 files at r8, 6 of 6 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/replica_application_result.go
line 268 at r6 (raw file):
// Populate BarrierResponse if requested.
if pErr == nil && cmd.proposal.Local != nil && cmd.proposal.Local.PopulateBarrierResponse {
	cmd.proposal.Local.PopulateBarrierResponse = false // mark as handled
nit: want to push this into a DetachPopulateBarrierResponse method on LocalResult, which handles the cmd.proposal.Local == nil case?
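Presumably something along these lines (a sketch of the suggested helper; the PopulateBarrierResponse field is as quoted above, the rest is an assumption rather than the final code):

```go
// DetachPopulateBarrierResponse returns whether a BarrierResponse should be
// populated for this command, and marks it as handled. It is safe to call on
// a nil LocalResult, which lets the caller drop the explicit nil check.
func (lResult *LocalResult) DetachPopulateBarrierResponse() bool {
	if lResult == nil || !lResult.PopulateBarrierResponse {
		return false
	}
	lResult.PopulateBarrierResponse = false // mark as handled
	return true
}
```

The call site above would then reduce to something like `if pErr == nil && cmd.proposal.Local.DetachPopulateBarrierResponse() { ... }`, since a method with a nil-safe pointer receiver can be invoked on a nil Local.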
pkg/kv/kvserver/replica_rangefeed.go
line 200 at r9 (raw file):
}
if lai == 0 {
	if tp.r.store.ClusterSettings().Version.IsActive(ctx, clusterversion.V24_1Start) {
Should we check the active cluster version before issuing the Barrier request? As is, the cluster might have completed its upgrade after the Barrier request was sent.
pkg/kv/kvserver/batcheval/result/result.go
line 102 at r6 (raw file):
"GossipFirstRange:%t MaybeGossipSystemConfig:%t "+
	"MaybeGossipSystemConfigIfHaveFailure:%t MaybeAddToSplitQueue:%t "+
	"MaybeGossipNodeLiveness:%s PopulateBarrierResponse:%t",
nit: let's keep this in the same order as the struct definition and IsZero.
pkg/kv/kvserver/rangefeed/resolved_timestamp.go
line 163 at r5 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Is this assertion relevant for MVCCUpdateIntentOp as well?
I don't think so. In theory, an intent that the rangefeed knows will eventually be aborted could get updated from a timestamp below the resolved ts to another timestamp below the resolved ts.
pkg/kv/kvserver/rangefeed/resolved_timestamp.go
line 165 at r9 (raw file):
// This assertion can be violated in mixed-version clusters, so gate it
// on 24.1. See: https://github.com/cockroachdb/cockroach/issues/104309
if rts.settings.Version.IsActive(context.Background(), clusterversion.V24_1Start) {
To be extra careful, do we want to add an env var here so that we don't end up crash looping a cluster without any mechanism to get unstuck?
pkg/kv/kvserver/rangefeed/task.go
line 253 at r4 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
We can't, but in that case it degrades to the current behavior. That's worth documenting on the API.
I see. So we may operate on incorrect information, but that will cause this txnPushAttempt to degrade to the current behavior. That seems fine, but +1 on documenting -- other uses of the flag may not handle being lied to as gracefully.
pkg/kv/kvserver/rangefeed/task.go
line 363 at r9 (raw file):
// rangefeed processor shuts down. It needs to be, otherwise a stalled
// barrier can prevent node shutdown. Regardless, use a 1 minute backstop
// to prevent getting wedged.
Were you planning to address this TODO in this initial PR?
pkg/kv/kvserver/batcheval/cmd_barrier_test.go
line 38 at r2 (raw file):
I'll see if I can add some basic coverage in kvnemesis too.
👍 That's a good idea.
Force-pushed from 94ce567 to d105a98.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/replica_rangefeed.go
line 200 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we check the active cluster version before issuing the Barrier request? As is, the cluster might have completed its upgrade after the Barrier request was sent.
This seems exceedingly unlikely, since the remote server would have to have its binary upgraded and restarted while the response is in flight (a 24.1 binary will return the right response regardless of the cluster version). But I suppose it can't hurt. Done.
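For instance, the same gate could also be checked up front (a sketch; the placement and the fallback helper name are assumptions):

```go
// Don't issue a WithLeaseAppliedIndex barrier at all unless the whole
// cluster is known to be on 24.1; otherwise fall back to the old behavior
// immediately rather than inferring it from a zero LAI in the response.
if !tp.r.store.ClusterSettings().Version.IsActive(ctx, clusterversion.V24_1Start) {
	return tp.pushTxnsLegacy(ctx) // hypothetical pre-24.1 code path
}
```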
pkg/kv/kvserver/rangefeed/resolved_timestamp.go
line 165 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
To be extra careful, do we want to add an env var here so that we don't end up crash looping a cluster without any mechanism to get unstuck?
Yeah, may as well -- done. We could also consider disabling the assertion in release builds instead.
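For concreteness, an escape hatch of that shape might look roughly like this (the env var name and default are assumptions; envutil.EnvOrDefaultBool is the usual mechanism, and later in the thread a cluster setting is considered instead):

```go
// If the assertion ever starts crash looping nodes in the field, operators
// can disable it without a new binary.
var rtsAssertionsEnabled = envutil.EnvOrDefaultBool(
	"COCKROACH_RANGEFEED_RESOLVED_TIMESTAMP_ASSERTIONS", true) // hypothetical name

// ...
if rtsAssertionsEnabled &&
	rts.settings.Version.IsActive(context.Background(), clusterversion.V24_1Start) {
	rts.assertOpAboveRTS(op, t.Timestamp)
}
```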
pkg/kv/kvserver/rangefeed/task.go
line 253 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I see. So we may operate on incorrect information, but that will cause this txnPushAttempt to degrade to the current behavior. That seems fine, but +1 on documenting -- other uses of the flag may not handle being lied to as gracefully.
Added some more NBs on the IntentResolver and TxnPusher APIs.
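The gist of those notes is presumably something like the following (a paraphrase on a sketched interface method, not the exact wording or signature in the PR):

```go
// PushTxns pushes the given transactions to the given timestamp.
//
// The boolean return value indicates whether any of the returned ABORTED
// statuses may be ambiguous, i.e. the txn record may already have been GCed
// after a commit. NB: pre-24.1 leaseholders never set this flag, so in
// mixed-version clusters a false value does not prove the abort is real;
// callers must degrade to the old, conservative behavior rather than rely
// on it for correctness.
PushTxns(ctx context.Context, txns []*enginepb.TxnMeta, ts hlc.Timestamp) ([]*roachpb.Transaction, bool, error)
```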
pkg/kv/kvserver/rangefeed/task.go
line 363 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Were you planning to address this TODO in this initial PR?
I'll submit a separate PR for this tomorrow, and clean up some of the related rangefeed shutdown code.
Reviewed 2 of 15 files at r10, 6 of 7 files at r11, 5 of 5 files at r12, 3 of 3 files at r13, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)
117801: kvserver: acquire `Replica.mu` when returning reproposal error r=erikgrinaker a=erikgrinaker

Discovered while working on #117612. Looks like it's been there since #42939. This is a rare error, and these fields are unlikely to change while we're holding the raft mutex, so seems very unlikely to have caused any problems -- I could be convinced we shouldn't backport this, on the off-chance I've missed a potential deadlock.

Epic: none

Release note: None

117833: *: un-skip many tests that were skipped over the last month r=rail a=rickystewart

This partially reverts the following commits:

```
e615aec
99884f8
99f9760
75c91c2
5c64a22
c1778b7
381c96b
f9b20ae
8608499
eb71a83
337777f
```

These tests were skipped due to failures under RBE, typically under `deadlock` and `race`. With the addition of the `heavy` pool for these tests, we anticipate the decreased load will cause these tests to stop failing, therefore we un-skip the skipped tests.

Epic: CRDB-8308

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Force-pushed from d105a98 to 92009d7.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
pkg/kv/kvserver/rangefeed/task.go
line 363 at r9 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I'll submit a separate PR for this tomorrow, and clean up some of the related rangefeed shutdown code.
Force-pushed from f37ce33 to 3f4d373.
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from 3f4d373 to 381bd68.
Reviewed 1 of 4 files at r20, 1 of 5 files at r22, 3 of 36 files at r24, 27 of 30 files at r25, 3 of 6 files at r26, 3 of 3 files at r28, 3 of 3 files at r29, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/rangefeed/task.go
line 38 at r27 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I wonder if we should make this a cluster setting instead. While I don't expect this to cause problems, if it does, it's a lot less hassle for operators. Then again, adding cluster settings for every little thing (even if temporary) is going to snowball pretty fast. Wdyt?
A cluster setting seems reasonable here. There may be other reasons why we'd want the configurability of being able to disable this dynamically.
Force-pushed from 58a1834 to 6c8d455.
It was possible for rangefeeds to emit a premature checkpoint, before all writes below its timestamp had been emitted. This in turn would cause changefeeds to not emit these write events at all.

This could happen because `PushTxn` may return a false `ABORTED` status for a transaction that has in fact been committed, if the transaction record has been GCed (after resolving all intents). The timestamp cache does not retain sufficient information to disambiguate a committed transaction from an aborted one in this case, so it pessimistically assumes an abort (see `Replica.CanCreateTxnRecord` and `batcheval.SynthesizeTxnFromMeta`).

However, the rangefeed txn pusher trusted this `ABORTED` status, ignoring the pending txn intents and allowing the resolved timestamp to advance past them before emitting the committed intents. This can lead to the following scenario:

- A rangefeed is running on a lagging follower.
- A txn writes an intent, which is replicated to the follower.
- The closed timestamp advances past the intent.
- The txn commits and resolves the intent at the original write timestamp, then GCs its txn record. This is not yet applied on the follower.
- The rangefeed pushes the txn to advance its resolved timestamp.
- The txn is GCed, so the push returns ABORTED (it can't know whether the txn was committed or aborted after its record is GCed).
- The rangefeed disregards the "aborted" txn and advances the resolved timestamp, emitting a checkpoint.
- The follower applies the resolved intent and emits an event below the checkpoint, violating the checkpoint guarantee.
- The changefeed sees an event below its frontier and drops it, never emitting these events at all.

This patch fixes the bug by submitting a barrier command to the leaseholder which waits for all past and ongoing writes (including intent resolution) to complete and apply, and then waits for the local replica to apply the barrier as well. This ensures any committed intent resolution will be applied and emitted before the transaction is removed from resolved timestamp tracking.

Epic: none

Release note (bug fix): fixed a bug where a changefeed could omit events in rare cases, logging the error "cdc ux violation: detected timestamp ... that is less or equal to the local frontier". This can happen if a rangefeed runs on a follower replica that lags significantly behind the leaseholder, a transaction commits and removes its transaction record before its intent resolution is applied on the follower, the follower's closed timestamp has advanced past the transaction commit timestamp, and the rangefeed attempts to push the transaction to a new timestamp (at least 10 seconds after the transaction began). This may cause the rangefeed to prematurely emit a checkpoint before emitting writes at lower timestamps, which in turn may cause the changefeed to drop these events entirely, never emitting them.
Epic: none Release note: None
Force-pushed from 6c8d455 to 6d92cef.
TFTRs!

bors r+
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from aa5e353 to blathers/backport-release-23.1-117612: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.

error creating merge commit from aa5e353 to blathers/backport-release-23.2-117612: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
👍
This PR depends on the following changes, which should be backported together:

- `BarrierRequest` as `isAlone` #117787
- `BarrierRequest.WithLeaseAppliedIndex` #117967
- `Replica.WaitForLeaseAppliedIndex()` #117968
- `PushTxnResponse.AmbiguousAbort` #117969

It was possible for rangefeeds to emit a premature checkpoint, before all writes below its timestamp had been emitted. This in turn would cause changefeeds to not emit these write events at all.

This could happen because `PushTxn` may return a false `ABORTED` status for a transaction that has in fact been committed, if the transaction record has been GCed (after resolving all intents). The timestamp cache does not retain sufficient information to disambiguate a committed transaction from an aborted one in this case, so it pessimistically assumes an abort (see `Replica.CanCreateTxnRecord` and `batcheval.SynthesizeTxnFromMeta`).

However, the rangefeed txn pusher trusted this `ABORTED` status, ignoring the pending txn intents and allowing the resolved timestamp to advance past them before emitting the committed intents. This can lead to the following scenario:

- A rangefeed is running on a lagging follower.
- A txn writes an intent, which is replicated to the follower.
- The closed timestamp advances past the intent.
- The txn commits and resolves the intent at the original write timestamp, then GCs its txn record. This is not yet applied on the follower.
- The rangefeed pushes the txn to advance its resolved timestamp.
- The txn is GCed, so the push returns ABORTED (it can't know whether the txn was committed or aborted after its record is GCed).
- The rangefeed disregards the "aborted" txn and advances the resolved timestamp, emitting a checkpoint.
- The follower applies the resolved intent and emits an event below the checkpoint, violating the checkpoint guarantee.
- The changefeed sees an event below its frontier and drops it, never emitting these events at all.

This patch fixes the bug by submitting a barrier command to the leaseholder which waits for all past and ongoing writes (including intent resolution) to complete and apply, and then waits for the local replica to apply the barrier as well. This ensures any committed intent resolution will be applied and emitted before the transaction is removed from resolved timestamp tracking.
Resolves #104309.
Epic: none
Release note (bug fix): fixed a bug where a changefeed could omit events in rare cases, logging the error "cdc ux violation: detected timestamp ... that is less or equal to the local frontier". This can happen if a rangefeed runs on a follower replica that lags significantly behind the leaseholder, a transaction commits and removes its transaction record before its intent resolution is applied on the follower, the follower's closed timestamp has advanced past the transaction commit timestamp, and the rangefeed attempts to push the transaction to a new timestamp (at least 10 seconds after the transaction began). This may cause the rangefeed to prematurely emit a checkpoint before emitting writes at lower timestamps, which in turn may cause the changefeed to drop these events entirely, never emitting them.
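To summarize the mechanism described above in code form (a condensed, illustrative sketch; the helper names are drawn from this thread and are assumptions, not the verbatim implementation):

```go
// When a pushed txn comes back ABORTED but the abort may be ambiguous (its
// record could have been GCed after a commit), don't drop the txn from
// resolved timestamp tracking right away. First make sure every prior write
// on the range, including the txn's intent resolution, has applied locally.
func (a *txnPushAttempt) handleAmbiguousAbort(ctx context.Context, txn *roachpb.Transaction) error {
	// 1. Barrier on the leaseholder: waits for all past and ongoing writes on
	//    the range to complete and apply, and reports the resulting LAI.
	_, lai, err := a.barrierWithLAI(ctx, a.span) // hypothetical helper, see the sketch earlier in the thread
	if err != nil {
		return err
	}
	// 2. Wait for the local (possibly lagging follower) replica to catch up to
	//    that LAI, so any committed intent resolution has been applied and the
	//    corresponding MVCCCommitIntentOps have been published to the rangefeed.
	if err := a.waitForLeaseAppliedIndex(ctx, lai); err != nil { // hypothetical helper
		return err
	}
	// 3. Only now is it safe to publish an MVCCAbortTxnOp and let the resolved
	//    timestamp advance past the txn's earlier intents.
	return a.publishAbort(ctx, txn) // hypothetical helper
}
```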