-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: perform recovery on rollback of staging transaction record #75601
kv: perform recovery on rollback of staging transaction record #75601
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)
-- commits, line 22 at r1:
please expand this commit message for posterity, clarifying what we did prior to this patch. In fact, the commit message doesn't really say what the problem was.
So instead of trying to make the guarantee in all cases, I'd rather make
rollbacks safe in all cases and then later explore selectively opting in
to skipping txn recovery in specific cases where the client can
definitively guarantee that the it is rolling back a staging transaction
that is not implicitly committed. I don't expect this to be needed
immediately.
This sounds right to me.
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 282 at r1 (raw file):
"programming error: epoch regression: %d", h.Txn.Epoch) } if h.Txn.Epoch > reply.Txn.Epoch {
consider splitting the PENDING
vs STAGING
cases, perhaps with a fallthrough
, and have this code only in the STAGING
case
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 285 at r1 (raw file):
// If the EndTxn carries a newer epoch than a STAGING txn record, we do // not consider the record to be performing a parallel commit because we // know that the transaction restarted since entering the STAGING state.
maybe clarify even more in this comment that, in this case, we know that the txn is not "implicitly committed" - basically I think these two words should appear besides/instead of "performing a parallel commit".
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 327 at r1 (raw file):
reply.Txn.Status = roachpb.COMMITTED } else { // If the transaction is STAGING, we can only move it to ABORTED if it is
I think it's worth noting in this comment that, technically, we could still abort an implicitly committed txn given the current behavior of conflicting transactions which never see the effects of a STAGING
transactions. But, at the very least that would be confusing. And this muddying of the definition of an "implicit commit" can easily confuse random code.
pkg/kv/txn_external_test.go, line 57 at r1 (raw file):
// If txnStatus == STAGING, setting txnImplicitlyCommitted will make // the transaction qualify for the implicit commit condition. Otherwise, // the transaction will not qualify.
hint at why it won't qualify, so I don't need to read the explosion of filters below :)
pkg/kv/kvserver/replica_test.go, line 11344 at r1 (raw file):
{ // Case of a rollback after an unsuccessful implicit commit. // Because the rollback request uses the same epoch as the staging record,
i think the reference to the "same epoch" in here is superfluous. In the next test case, you can put some extra words to contrast it with this case.
pkg/kv/kvserver/replica_test.go, line 11360 at r1 (raw file):
return sendWrappedWithErr(etH, &et) }, expError: "found txn in indeterminate STAGING state",
I'm confused that this test expects this error. How come this error is not handled throughhandleIndeterminateCommitError()
?
Btw, consider taking the opportunity to put a comment on handleIndeterminateCommitError()
, saying that no error is returned if the txn is committed or aborted.
Also, that function currently makes references to "the blocked pusher", which is no longer general enough (if it ever was).
pkg/kv/kvserver/replica_test.go, line 11377 at r1 (raw file):
run: func(txn *roachpb.Transaction, now hlc.Timestamp) error { clone := txn.Clone() clone.Restart(-1, 0, now)
nit: consider commenting the arg names. Particularly since I thought one of them probably has something to do with the epoch.
pkg/kv/kvserver/batcheval/cmd_end_transaction_test.go, line 613 at r1 (raw file):
}, { // Standard case where a transaction is rolled back. The record
I don't think this case is "standard" any more. The comment probably needs more words, saying that you can't rollback.
pkg/kv/kvserver/batcheval/cmd_end_transaction_test.go, line 623 at r1 (raw file):
commit: false, // Expected result. expError: "found txn in indeterminate STAGING state",
can we improve this test to check for IndeterminateCommitError
- perhaps adding an expErr
that's typed?
ac7094e
to
a91d7aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)
Previously, andreimatei (Andrei Matei) wrote…
please expand this commit message for posterity, clarifying what we did prior to this patch. In fact, the commit message doesn't really say what the problem was.
Done.
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 282 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
consider splitting the
PENDING
vsSTAGING
cases, perhaps with afallthrough
, and have this code only in theSTAGING
case
Done.
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 285 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
maybe clarify even more in this comment that, in this case, we know that the txn is not "implicitly committed" - basically I think these two words should appear besides/instead of "performing a parallel commit".
Done.
pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 327 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think it's worth noting in this comment that, technically, we could still abort an implicitly committed txn given the current behavior of conflicting transactions which never see the effects of a
STAGING
transactions. But, at the very least that would be confusing. And this muddying of the definition of an "implicit commit" can easily confuse random code.
Done.
pkg/kv/txn_external_test.go, line 57 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
hint at why it won't qualify, so I don't need to read the explosion of filters below :)
Done.
pkg/kv/kvserver/replica_test.go, line 11344 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
i think the reference to the "same epoch" in here is superfluous. In the next test case, you can put some extra words to contrast it with this case.
Done.
pkg/kv/kvserver/replica_test.go, line 11360 at r1 (raw file):
I'm confused that this test expects this error. How come this error is not handled throughhandleIndeterminateCommitError() ?
This is because the test sets TestingKnobs.DontRecoverIndeterminateCommits
, along with a few other testing knobs that disable redirects.
Btw, consider taking the opportunity to put a comment on handleIndeterminateCommitError(), saying that no error is returned if the txn is committed or aborted.
Done.
Also, that function currently makes references to "the blocked pusher", which is no longer general enough (if it ever was).
Done.
pkg/kv/kvserver/batcheval/cmd_end_transaction_test.go, line 613 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I don't think this case is "standard" any more. The comment probably needs more words, saying that you can't rollback.
I think it's still standard as far as rollbacks go. Added more words.
pkg/kv/kvserver/batcheval/cmd_end_transaction_test.go, line 623 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
can we improve this test to check for
IndeterminateCommitError
- perhaps adding anexpErr
that's typed?
You don't like the use of testutils.IsError
?
bors r+ |
Build failed (retrying...): |
Build failed: |
a91d7aa
to
3fc3ff5
Compare
Fixes cockroachdb#48301. During a transaction rollback (i.e. an `EndTxn(abort)`), if the request arrives to find that the transaction record is STAGING, it can only move it to ABORTED if it is *not* already implicitly committed. On the commit path, the transaction coordinator is deliberate to only ever issue an `EndTxn(commit)` once the transaction has reached an implicit commit state. However, on the rollback path, the transaction coordinator does not make the opposite guarantee that it will never issue an `EndTxn(abort)` once the transaction has reached (or if it still could reach) an implicit commit state. As a result, on the rollback path, we don't trust the transaction's coordinator to be an authoritative source of truth about whether the transaction is implicitly committed. In other words, we don't consider this `EndTxn(abort)` to be a claim that the transaction is not implicitly committed. The transaction's coordinator may have just given up on the transaction before it heard the outcome of a commit attempt. So in this case, we now return an `IndeterminateCommitError` during evaluation to trigger the transaction recovery protocol and transition the transaction record to a finalized state (COMMITTED or ABORTED). Prior to this change, we were blindly trusting the `EndTxn(abort)` to be such an authoritative source of truth, so we were at risk of hazards where a transaction was implicitly committed but its coordinator did not know and issued a rollback. This was observed to cause workloads to hit "found ABORTED record for implicitly committed transaction" errors under sufficient network fault injection. This was one of the two alternatives described in cockroachdb#48301. The other option was to lock down the txn client to attempt to make the guarantee that rollbacks will only be performed if the client is certain that the transaction is not currently in and can no longer ever enter the implicit commit state. I was previously drawn to this other approach because it would avoid the need for transaction recovery during the rollback of a staging txn. However, I don't think it's possible to make such a guarantee in all cases due to the possibility of ambiguity. To eliminate this ambiguity, there are cases where a transaction's coordinator would need to query the result of writes, which turns out to be analogous to the transaction recovery protocol. So instead of trying to make the guarantee in all cases, I'd rather make rollbacks safe in all cases and then later explore selectively opting in to skipping txn recovery in specific cases where the client can definitively guarantee that the it is rolling back a staging transaction that is not implicitly committed. I don't expect this to be needed immediately. Release note (bug fix): fixed a rare race condition that could lead to client-visible errors that looked like "found ABORTED record for implicitly committed transaction". These errors were harmless in that they did not indicate data corruption, but they could be disruptive to clients.
3fc3ff5
to
c24ce09
Compare
bors r+ |
Build succeeded: |
Fixes #48301.
During a transaction rollback (i.e. an
EndTxn(abort)
), if the requestarrives to find that the transaction record is STAGING, it can only move
it to ABORTED if it is not already implicitly committed. On the commit
path, the transaction coordinator is deliberate to only ever issue an
EndTxn(commit)
once the transaction has reached an implicit commitstate. However, on the rollback path, the transaction coordinator does
not make the opposite guarantee that it will never issue an
EndTxn(abort)
once the transaction has reached (or if it still couldreach) an implicit commit state.
As a result, on the rollback path, we don't trust the transaction's
coordinator to be an authoritative source of truth about whether the
transaction is implicitly committed. In other words, we don't consider
this
EndTxn(abort)
to be a claim that the transaction is notimplicitly committed. The transaction's coordinator may have just given
up on the transaction before it heard the outcome of a commit attempt.
So in this case, we now return an
IndeterminateCommitError
duringevaluation to trigger the transaction recovery protocol and transition
the transaction record to a finalized state (COMMITTED or ABORTED).
This was one of the two alternatives described in #48301. The other
option was to lock down the txn client to attempt to make the guarantee
that rollbacks will only be performed if the client is certain that the
transaction is not currently in and can no longer ever enter the
implicit commit state. I was previously drawn to this other approach
because it would avoid the need for transaction recovery during the
rollback of a staging txn. However, I don't think it's possible to make
such a guarantee in all cases due to the possibility of ambiguity. To
eliminate this ambiguity, there are cases where a transaction's
coordinator would need to query the result of writes, which turns out to
be analogous to the transaction recovery protocol.
So instead of trying to make the guarantee in all cases, I'd rather make
rollbacks safe in all cases and then later explore selectively opting in
to skipping txn recovery in specific cases where the client can
definitively guarantee that the it is rolling back a staging transaction
that is not implicitly committed. I don't expect this to be needed
immediately.
Release note (bug fix): fixed a rare race condition that could lead to
client-visible errors that looked like "found ABORTED record for
implicitly committed transaction". These errors were harmless in that
they did not indicate data corruption, but they could be disruptive to
clients.