Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: perform recovery on rollback of staging transaction record #75601

Merged

Conversation

nvanbenschoten
Copy link
Member

Fixes #48301.

During a transaction rollback (i.e. an EndTxn(abort)), if the request
arrives to find that the transaction record is STAGING, it can only move
it to ABORTED if it is not already implicitly committed. On the commit
path, the transaction coordinator is deliberate to only ever issue an
EndTxn(commit) once the transaction has reached an implicit commit
state. However, on the rollback path, the transaction coordinator does
not make the opposite guarantee that it will never issue an
EndTxn(abort) once the transaction has reached (or if it still could
reach) an implicit commit state.

As a result, on the rollback path, we don't trust the transaction's
coordinator to be an authoritative source of truth about whether the
transaction is implicitly committed. In other words, we don't consider
this EndTxn(abort) to be a claim that the transaction is not
implicitly committed. The transaction's coordinator may have just given
up on the transaction before it heard the outcome of a commit attempt.
So in this case, we now return an IndeterminateCommitError during
evaluation to trigger the transaction recovery protocol and transition
the transaction record to a finalized state (COMMITTED or ABORTED).

This was one of the two alternatives described in #48301. The other
option was to lock down the txn client to attempt to make the guarantee
that rollbacks will only be performed if the client is certain that the
transaction is not currently in and can no longer ever enter the
implicit commit state. I was previously drawn to this other approach
because it would avoid the need for transaction recovery during the
rollback of a staging txn. However, I don't think it's possible to make
such a guarantee in all cases due to the possibility of ambiguity. To
eliminate this ambiguity, there are cases where a transaction's
coordinator would need to query the result of writes, which turns out to
be analogous to the transaction recovery protocol.

So instead of trying to make the guarantee in all cases, I'd rather make
rollbacks safe in all cases and then later explore selectively opting in
to skipping txn recovery in specific cases where the client can
definitively guarantee that the it is rolling back a staging transaction
that is not implicitly committed. I don't expect this to be needed
immediately.

Release note (bug fix): fixed a rare race condition that could lead to
client-visible errors that looked like "found ABORTED record for
implicitly committed transaction". These errors were harmless in that
they did not indicate data corruption, but they could be disruptive to
clients.

@nvanbenschoten nvanbenschoten requested a review from a team as a code owner January 27, 2022 05:09
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)


-- commits, line 22 at r1:
please expand this commit message for posterity, clarifying what we did prior to this patch. In fact, the commit message doesn't really say what the problem was.


-- commits, line 38 at r1:

So instead of trying to make the guarantee in all cases, I'd rather make
rollbacks safe in all cases and then later explore selectively opting in
to skipping txn recovery in specific cases where the client can
definitively guarantee that the it is rolling back a staging transaction
that is not implicitly committed. I don't expect this to be needed
immediately.

This sounds right to me.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 282 at r1 (raw file):

					"programming error: epoch regression: %d", h.Txn.Epoch)
			}
			if h.Txn.Epoch > reply.Txn.Epoch {

consider splitting the PENDING vs STAGING cases, perhaps with a fallthrough, and have this code only in the STAGING case


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 285 at r1 (raw file):

				// If the EndTxn carries a newer epoch than a STAGING txn record, we do
				// not consider the record to be performing a parallel commit because we
				// know that the transaction restarted since entering the STAGING state.

maybe clarify even more in this comment that, in this case, we know that the txn is not "implicitly committed" - basically I think these two words should appear besides/instead of "performing a parallel commit".


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 327 at r1 (raw file):

		reply.Txn.Status = roachpb.COMMITTED
	} else {
		// If the transaction is STAGING, we can only move it to ABORTED if it is

I think it's worth noting in this comment that, technically, we could still abort an implicitly committed txn given the current behavior of conflicting transactions which never see the effects of a STAGING transactions. But, at the very least that would be confusing. And this muddying of the definition of an "implicit commit" can easily confuse random code.


pkg/kv/txn_external_test.go, line 57 at r1 (raw file):

		// If txnStatus == STAGING, setting txnImplicitlyCommitted will make
		// the transaction qualify for the implicit commit condition. Otherwise,
		// the transaction will not qualify.

hint at why it won't qualify, so I don't need to read the explosion of filters below :)


pkg/kv/kvserver/replica_test.go, line 11344 at r1 (raw file):

		{
			// Case of a rollback after an unsuccessful implicit commit.
			// Because the rollback request uses the same epoch as the staging record,

i think the reference to the "same epoch" in here is superfluous. In the next test case, you can put some extra words to contrast it with this case.


pkg/kv/kvserver/replica_test.go, line 11360 at r1 (raw file):

				return sendWrappedWithErr(etH, &et)
			},
			expError: "found txn in indeterminate STAGING state",

I'm confused that this test expects this error. How come this error is not handled throughhandleIndeterminateCommitError() ?

Btw, consider taking the opportunity to put a comment on handleIndeterminateCommitError(), saying that no error is returned if the txn is committed or aborted.
Also, that function currently makes references to "the blocked pusher", which is no longer general enough (if it ever was).


pkg/kv/kvserver/replica_test.go, line 11377 at r1 (raw file):

			run: func(txn *roachpb.Transaction, now hlc.Timestamp) error {
				clone := txn.Clone()
				clone.Restart(-1, 0, now)

nit: consider commenting the arg names. Particularly since I thought one of them probably has something to do with the epoch.


pkg/kv/kvserver/batcheval/cmd_end_transaction_test.go, line 613 at r1 (raw file):

		},
		{
			// Standard case where a transaction is rolled back. The record

I don't think this case is "standard" any more. The comment probably needs more words, saying that you can't rollback.


pkg/kv/kvserver/batcheval/cmd_end_transaction_test.go, line 623 at r1 (raw file):

			commit:    false,
			// Expected result.
			expError: "found txn in indeterminate STAGING state",

can we improve this test to check for IndeterminateCommitError - perhaps adding an expErr that's typed?

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/recoverOnRollback branch from ac7094e to a91d7aa Compare February 3, 2022 04:54
Copy link
Member Author

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)


-- commits, line 22 at r1:

Previously, andreimatei (Andrei Matei) wrote…

please expand this commit message for posterity, clarifying what we did prior to this patch. In fact, the commit message doesn't really say what the problem was.

Done.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 282 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider splitting the PENDING vs STAGING cases, perhaps with a fallthrough, and have this code only in the STAGING case

Done.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 285 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

maybe clarify even more in this comment that, in this case, we know that the txn is not "implicitly committed" - basically I think these two words should appear besides/instead of "performing a parallel commit".

Done.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go, line 327 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think it's worth noting in this comment that, technically, we could still abort an implicitly committed txn given the current behavior of conflicting transactions which never see the effects of a STAGING transactions. But, at the very least that would be confusing. And this muddying of the definition of an "implicit commit" can easily confuse random code.

Done.


pkg/kv/txn_external_test.go, line 57 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

hint at why it won't qualify, so I don't need to read the explosion of filters below :)

Done.


pkg/kv/kvserver/replica_test.go, line 11344 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

i think the reference to the "same epoch" in here is superfluous. In the next test case, you can put some extra words to contrast it with this case.

Done.


pkg/kv/kvserver/replica_test.go, line 11360 at r1 (raw file):

I'm confused that this test expects this error. How come this error is not handled throughhandleIndeterminateCommitError() ?

This is because the test sets TestingKnobs.DontRecoverIndeterminateCommits, along with a few other testing knobs that disable redirects.

Btw, consider taking the opportunity to put a comment on handleIndeterminateCommitError(), saying that no error is returned if the txn is committed or aborted.

Done.

Also, that function currently makes references to "the blocked pusher", which is no longer general enough (if it ever was).

Done.


pkg/kv/kvserver/batcheval/cmd_end_transaction_test.go, line 613 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't think this case is "standard" any more. The comment probably needs more words, saying that you can't rollback.

I think it's still standard as far as rollbacks go. Added more words.


pkg/kv/kvserver/batcheval/cmd_end_transaction_test.go, line 623 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

can we improve this test to check for IndeterminateCommitError - perhaps adding an expErr that's typed?

You don't like the use of testutils.IsError?

@nvanbenschoten
Copy link
Member Author

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 3, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 4, 2022

Build failed:

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/recoverOnRollback branch from a91d7aa to 3fc3ff5 Compare February 4, 2022 23:51
Fixes cockroachdb#48301.

During a transaction rollback (i.e. an `EndTxn(abort)`), if the request
arrives to find that the transaction record is STAGING, it can only move
it to ABORTED if it is *not* already implicitly committed. On the commit
path, the transaction coordinator is deliberate to only ever issue an
`EndTxn(commit)` once the transaction has reached an implicit commit
state. However, on the rollback path, the transaction coordinator does
not make the opposite guarantee that it will never issue an
`EndTxn(abort)` once the transaction has reached (or if it still could
reach) an implicit commit state.

As a result, on the rollback path, we don't trust the transaction's
coordinator to be an authoritative source of truth about whether the
transaction is implicitly committed. In other words, we don't consider
this `EndTxn(abort)` to be a claim that the transaction is not
implicitly committed. The transaction's coordinator may have just given
up on the transaction before it heard the outcome of a commit attempt.
So in this case, we now return an `IndeterminateCommitError` during
evaluation to trigger the transaction recovery protocol and transition
the transaction record to a finalized state (COMMITTED or ABORTED).

Prior to this change, we were blindly trusting the `EndTxn(abort)` to
be such an authoritative source of truth, so we were at risk of hazards
where a transaction was implicitly committed but its coordinator did not
know and issued a rollback. This was observed to cause workloads to hit
"found ABORTED record for implicitly committed transaction" errors under
sufficient network fault injection.

This was one of the two alternatives described in cockroachdb#48301. The other
option was to lock down the txn client to attempt to make the guarantee
that rollbacks will only be performed if the client is certain that the
transaction is not currently in and can no longer ever enter the
implicit commit state. I was previously drawn to this other approach
because it would avoid the need for transaction recovery during the
rollback of a staging txn. However, I don't think it's possible to make
such a guarantee in all cases due to the possibility of ambiguity. To
eliminate this ambiguity, there are cases where a transaction's
coordinator would need to query the result of writes, which turns out to
be analogous to the transaction recovery protocol.

So instead of trying to make the guarantee in all cases, I'd rather make
rollbacks safe in all cases and then later explore selectively opting in
to skipping txn recovery in specific cases where the client can
definitively guarantee that the it is rolling back a staging transaction
that is not implicitly committed. I don't expect this to be needed
immediately.

Release note (bug fix): fixed a rare race condition that could lead to
client-visible errors that looked like "found ABORTED record for
implicitly committed transaction". These errors were harmless in that
they did not indicate data corruption, but they could be disruptive to
clients.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/recoverOnRollback branch from 3fc3ff5 to c24ce09 Compare February 5, 2022 19:10
@nvanbenschoten
Copy link
Member Author

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 6, 2022

Build succeeded:

@craig craig bot merged commit be9516e into cockroachdb:master Feb 6, 2022
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/recoverOnRollback branch February 8, 2022 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kv: rolling back an implicitly committed txn should not be permitted
3 participants