Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: rolling back an implicitly committed txn should not be permitted #48301

Closed
andreimatei opened this issue May 1, 2020 · 3 comments · Fixed by #75601
Closed

kv: rolling back an implicitly committed txn should not be permitted #48301

andreimatei opened this issue May 1, 2020 · 3 comments · Fixed by #75601
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@andreimatei
Copy link
Contributor

A transaction is implicitly committed if it has a STAGING txn record, and all its in-flight writes were successfully written (at copacetic timestamps). We like to think of a transaction that satisfies this implicit-commit condition as committed. Surprisingly, it turns out that that coordinator can rollback an implicitly-committed txn. If it sends a rollback, nothing will stand in its way and the transaction will be cleaned up.
This is definitely not what we've intended. I believe that currently this behavior cannot result in very bad outcomes (in particular, atomicity violations), since the txn's writes only become visible to others once the txn recovery protocol successfully moves the txn record from the STAGING to the COMMITTED state. The recovery protocol will error out with an assertion failure if it loses a races to a rollback.

Even though currently atomicity is preserved and I guess the only side-effect of this rollback is that assertion error being encountered by concurrent recoverers, we don't like it that resolving implicitly-committed intents need to wait for the replication of the COMMITTED transaction. What we'd want would be to allow both the coordinator and other pushers to resolve intents as soon at the implicit commit condition is proved to be satisfied - so for example we'd like to run the RecoverTxn request in an AsyncConsensus batch. We cannot currently do that because we'd also need to ensure that multiple RecoverTxns all succeed even if some intents have been resolved. In order for that to be the case, we need to store the txn id in committed values (talked about around #26915). Anyway, that's the forward-looking plan, and this rollback behavior is a major fly in the ointment because, at that point, it would result in atomicity failures.

We should fix it by preventing transitions from STAGING->ABORTED if the implicit commit condition is satisfied. This would probably entail having an EndTxn(Rollback) that encounters a STAGING record kick off txn recovery (unless the client knows that it will fail). And maybe also the client shouldn't send a rollback in the first place if the status is ambiguous; the client could kick off recovery itself.

@andreimatei andreimatei added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label May 1, 2020
@andreimatei
Copy link
Contributor Author

#48297 adds a test reproducing this behavior

andreimatei added a commit to andreimatei/cockroach that referenced this issue May 1, 2020
This patch improves the assertion that requests don't get errors
informing them that their transaction has been committed from underneath
them. There is a semi-legitimate case that's hitting that assertion: a
rollback sent after a commit returned an ambiguous result to the client.
This happens easily when the commit's ctx times out - the client gets an
ambiguous error and is likely to send a rollback attempt afterwards.
This rollback attempt might find a transaction record with the COMMITTED
status, in which case the assertion was hit.
This patch makes the assertion white-list this case. We were already
white-listing this specific case in other places. For other, unexpected
cases, the assertion (re-)becomes a Fatalf (from an Errorf), since an
unepectedly-committed transaction is no joke.

This patch also adds a test for the case that was triggering the
assertion. Also, it adds tests for another two most troubling cases of
rollback-after-ambiguous-commit that we handle incorrectly. Those
behaviors are not fixed; the test just reproduces and highlights them.

Touches cockroachdb#48301
Touches cockroachdb#48302

Release note: None
@nvanbenschoten
Copy link
Member

@andreimatei and I have been discussing this in private over the past few days, but it's worth keeping the conversation public.

I believe that currently this behavior cannot result in very bad outcomes (in particular, atomicity violations), since the txn's writes only become visible to others once the txn recovery protocol successfully moves the txn record from the STAGING to the COMMITTED state.

This is also my understanding because we currently have asymmetric visibility into the state of a transaction commit between the transaction coordinator itself and onlookers. The coordinator itself will consider a transaction committed when it reaches the implicit commit state. An onlooker (pusher) will only consider it committed after it has recovered the transaction record and moved it into an explicit commit state. Similarly, a transaction coordinator will only be able to begin resolving its intents once its transaction record is in an explicit commit state. So this protects us from any data corruption due to atomicity violations.

However, I do think it's possible that this could lead to this confusing (soft) assertion:

case roachpb.PENDING, roachpb.ABORTED:
// Once implicitly committed, the transaction should never move back
// to the PENDING status and it should never be ABORTED.
//
// In order for the second statement to be true, we need to ensure
// that transaction records that are GCed after being COMMITTED are
// never re-written as ABORTED. We used to allow this to happen when
// PushTxn requests found missing transaction records because it was
// harmless, but we now use the timestamp cache to avoid
// needing to ever do so. If this ever becomes possible again, we'll
// need to relax this check.
return result.Result{}, roachpb.NewTransactionStatusError(fmt.Sprintf(
"programming error: found %s record for implicitly committed transaction: %v",
reply.RecoveredTxn.Status, reply.RecoveredTxn,
))

We don't seem to have any reports of that, but it's possible.

We should fix it by preventing transitions from STAGING->ABORTED if the implicit commit condition is satisfied. This would probably entail having an EndTxn(Rollback) that encounters a STAGING record kick off txn recovery (unless the client knows that it will fail). And maybe also the client shouldn't send a rollback in the first place if the status is ambiguous; the client could kick off recovery itself.

This is pretty much my thinking, although I'd make it more general so we can also fix #48302 at the same time. Let's update the txnCommitter to keep track of when a transaction has been committed (implicitly, explicitly, or ambiguously). This should be easy to do here:

// Determine next steps based on the status of the transaction.

We can then reject any other requests that pass through it.

@nvanbenschoten
Copy link
Member

And just to be explicit here, I think any solution needs to incorporate cooperation from the client. In fact, I'd prefer solving this in the client entirely. We're not operating in a byzantine model where we need to protect against clients going crazy and violating atomicity intentionally. So the simplest fix is to block out incorrect rollbacks right in the TxnCoordSender (or its interceptor stack).

However, for the sake of argument, if we didn't want to solve this in the client, I think the solution here would be to force any STAGING -> ABORTED transition to go through the txn recovery procedure. That solution seems incorrect to me though. If we do that, then why aren't we forcing STAGING->COMMITTED transitions to do the same thing? Couldn't the client have gotten that wrong as well?

andreimatei added a commit to andreimatei/cockroach that referenced this issue May 4, 2020
This patch improves the assertion that requests don't get errors
informing them that their transaction has been committed from underneath
them. There is a semi-legitimate case that's hitting that assertion: a
rollback sent after a commit returned an ambiguous result to the client.
This happens easily when the commit's ctx times out - the client gets an
ambiguous error and is likely to send a rollback attempt afterwards.
This rollback attempt might find a transaction record with the COMMITTED
status, in which case the assertion was hit.
This patch makes the assertion white-list this case. We were already
white-listing this specific case in other places. For other, unexpected
cases, the assertion (re-)becomes a Fatalf (from an Errorf), since an
unepectedly-committed transaction is no joke.

This patch also adds a test for the case that was triggering the
assertion. Also, it adds tests for another two most troubling cases of
rollback-after-ambiguous-commit that we handle incorrectly. Those
behaviors are not fixed; the test just reproduces and highlights them.

Touches cockroachdb#48301
Touches cockroachdb#48302

Release note: None
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 27, 2022
Fixes cockroachdb#48301.

During a transaction rollback (i.e. an `EndTxn(abort)`), if the request
arrives to find that the transaction record is STAGING, it can only move
it to ABORTED if it is *not* already implicitly committed. On the commit
path, the transaction coordinator is deliberate to only ever issue an
`EndTxn(commit)` once the transaction has reached an implicit commit
state. However, on the rollback path, the transaction coordinator does
not make the opposite guarantee that it will never issue an
`EndTxn(abort)` once the transaction has reached (or if it still could
reach) an implicit commit state.

As a result, on the rollback path, we don't trust the transaction's
coordinator to be an authoritative source of truth about whether the
transaction is implicitly committed. In other words, we don't consider
this `EndTxn(abort)` to be a claim that the transaction is not
implicitly committed. The transaction's coordinator may have just given
up on the transaction before it heard the outcome of a commit attempt.
So in this case, we now return an `IndeterminateCommitError` during
evaluation to trigger the transaction recovery protocol and transition
the transaction record to a finalized state (COMMITTED or ABORTED).

This was one of the two alternatives described in cockroachdb#48301. The other
option was to lock down the txn client to attempt to make the guarantee
that rollbacks will only be performed if the client is certain that the
transaction is not currently in and can no longer ever enter the
implicit commit state. I was previously drawn to this other approach
because it would avoid the need for transaction recovery during the
rollback of a staging txn. However, I don't think it's possible to make
such a guarantee in all cases due to the possibility of ambiguity. To
eliminate this ambiguity, there are cases where a transaction's
coordinator would need to query the result of writes, which turns out to
be analogous to the transaction recovery protocol.

So instead of trying to make the guarantee in all cases, I'd rather make
rollbacks safe in all cases and then later explore selectively opting in
to skipping txn recovery in specific cases where the client can
definitively guarantee that the it is rolling back a staging transaction
that is not implicitly committed. I don't expect this to be needed
immediately.

Release note (bug fix): fixed a rare race condition that could lead to
client-visible errors that looked like "found ABORTED record for
implicitly committed transaction". These errors were harmless in that
they did not indicate data corruption, but they could be disruptive to
clients.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 3, 2022
Fixes cockroachdb#48301.

During a transaction rollback (i.e. an `EndTxn(abort)`), if the request
arrives to find that the transaction record is STAGING, it can only move
it to ABORTED if it is *not* already implicitly committed. On the commit
path, the transaction coordinator is deliberate to only ever issue an
`EndTxn(commit)` once the transaction has reached an implicit commit
state. However, on the rollback path, the transaction coordinator does
not make the opposite guarantee that it will never issue an
`EndTxn(abort)` once the transaction has reached (or if it still could
reach) an implicit commit state.

As a result, on the rollback path, we don't trust the transaction's
coordinator to be an authoritative source of truth about whether the
transaction is implicitly committed. In other words, we don't consider
this `EndTxn(abort)` to be a claim that the transaction is not
implicitly committed. The transaction's coordinator may have just given
up on the transaction before it heard the outcome of a commit attempt.
So in this case, we now return an `IndeterminateCommitError` during
evaluation to trigger the transaction recovery protocol and transition
the transaction record to a finalized state (COMMITTED or ABORTED).

Prior to this change, we were blindly trusting the `EndTxn(abort)` to
be such an authoritative source of truth, so we were at risk of hazards
where a transaction was implicitly committed but its coordinator did not
know and issued a rollback. This was observed to cause workloads to hit
"found ABORTED record for implicitly committed transaction" errors under
sufficient network fault injection.

This was one of the two alternatives described in cockroachdb#48301. The other
option was to lock down the txn client to attempt to make the guarantee
that rollbacks will only be performed if the client is certain that the
transaction is not currently in and can no longer ever enter the
implicit commit state. I was previously drawn to this other approach
because it would avoid the need for transaction recovery during the
rollback of a staging txn. However, I don't think it's possible to make
such a guarantee in all cases due to the possibility of ambiguity. To
eliminate this ambiguity, there are cases where a transaction's
coordinator would need to query the result of writes, which turns out to
be analogous to the transaction recovery protocol.

So instead of trying to make the guarantee in all cases, I'd rather make
rollbacks safe in all cases and then later explore selectively opting in
to skipping txn recovery in specific cases where the client can
definitively guarantee that the it is rolling back a staging transaction
that is not implicitly committed. I don't expect this to be needed
immediately.

Release note (bug fix): fixed a rare race condition that could lead to
client-visible errors that looked like "found ABORTED record for
implicitly committed transaction". These errors were harmless in that
they did not indicate data corruption, but they could be disruptive to
clients.
craig bot pushed a commit that referenced this issue Feb 6, 2022
75601: kv: perform recovery on rollback of staging transaction record r=nvanbenschoten a=nvanbenschoten

Fixes #48301.

During a transaction rollback (i.e. an `EndTxn(abort)`), if the request
arrives to find that the transaction record is STAGING, it can only move
it to ABORTED if it is *not* already implicitly committed. On the commit
path, the transaction coordinator is deliberate to only ever issue an
`EndTxn(commit)` once the transaction has reached an implicit commit
state. However, on the rollback path, the transaction coordinator does
not make the opposite guarantee that it will never issue an
`EndTxn(abort)` once the transaction has reached (or if it still could
reach) an implicit commit state.

As a result, on the rollback path, we don't trust the transaction's
coordinator to be an authoritative source of truth about whether the
transaction is implicitly committed. In other words, we don't consider
this `EndTxn(abort)` to be a claim that the transaction is not
implicitly committed. The transaction's coordinator may have just given
up on the transaction before it heard the outcome of a commit attempt.
So in this case, we now return an `IndeterminateCommitError` during
evaluation to trigger the transaction recovery protocol and transition
the transaction record to a finalized state (COMMITTED or ABORTED).

This was one of the two alternatives described in #48301. The other
option was to lock down the txn client to attempt to make the guarantee
that rollbacks will only be performed if the client is certain that the
transaction is not currently in and can no longer ever enter the
implicit commit state. I was previously drawn to this other approach
because it would avoid the need for transaction recovery during the
rollback of a staging txn. However, I don't think it's possible to make
such a guarantee in all cases due to the possibility of ambiguity. To
eliminate this ambiguity, there are cases where a transaction's
coordinator would need to query the result of writes, which turns out to
be analogous to the transaction recovery protocol.

So instead of trying to make the guarantee in all cases, I'd rather make
rollbacks safe in all cases and then later explore selectively opting in
to skipping txn recovery in specific cases where the client can
definitively guarantee that the it is rolling back a staging transaction
that is not implicitly committed. I don't expect this to be needed
immediately.

Release note (bug fix): fixed a rare race condition that could lead to
client-visible errors that looked like "found ABORTED record for
implicitly committed transaction". These errors were harmless in that
they did not indicate data corruption, but they could be disruptive to
clients.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in c24ce09 Feb 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants