-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: rolling back an implicitly committed txn should not be permitted #48301
Comments
#48297 adds a test reproducing this behavior |
This patch improves the assertion that requests don't get errors informing them that their transaction has been committed from underneath them. There is a semi-legitimate case that's hitting that assertion: a rollback sent after a commit returned an ambiguous result to the client. This happens easily when the commit's ctx times out - the client gets an ambiguous error and is likely to send a rollback attempt afterwards. This rollback attempt might find a transaction record with the COMMITTED status, in which case the assertion was hit. This patch makes the assertion white-list this case. We were already white-listing this specific case in other places. For other, unexpected cases, the assertion (re-)becomes a Fatalf (from an Errorf), since an unepectedly-committed transaction is no joke. This patch also adds a test for the case that was triggering the assertion. Also, it adds tests for another two most troubling cases of rollback-after-ambiguous-commit that we handle incorrectly. Those behaviors are not fixed; the test just reproduces and highlights them. Touches cockroachdb#48301 Touches cockroachdb#48302 Release note: None
@andreimatei and I have been discussing this in private over the past few days, but it's worth keeping the conversation public.
This is also my understanding because we currently have asymmetric visibility into the state of a transaction commit between the transaction coordinator itself and onlookers. The coordinator itself will consider a transaction committed when it reaches the implicit commit state. An onlooker (pusher) will only consider it committed after it has recovered the transaction record and moved it into an explicit commit state. Similarly, a transaction coordinator will only be able to begin resolving its intents once its transaction record is in an explicit commit state. So this protects us from any data corruption due to atomicity violations. However, I do think it's possible that this could lead to this confusing (soft) assertion: cockroach/pkg/kv/kvserver/batcheval/cmd_recover_txn.go Lines 106 to 120 in 96f8f33
We don't seem to have any reports of that, but it's possible.
This is pretty much my thinking, although I'd make it more general so we can also fix #48302 at the same time. Let's update the
We can then reject any other requests that pass through it. |
And just to be explicit here, I think any solution needs to incorporate cooperation from the client. In fact, I'd prefer solving this in the client entirely. We're not operating in a byzantine model where we need to protect against clients going crazy and violating atomicity intentionally. So the simplest fix is to block out incorrect rollbacks right in the However, for the sake of argument, if we didn't want to solve this in the client, I think the solution here would be to force any |
This patch improves the assertion that requests don't get errors informing them that their transaction has been committed from underneath them. There is a semi-legitimate case that's hitting that assertion: a rollback sent after a commit returned an ambiguous result to the client. This happens easily when the commit's ctx times out - the client gets an ambiguous error and is likely to send a rollback attempt afterwards. This rollback attempt might find a transaction record with the COMMITTED status, in which case the assertion was hit. This patch makes the assertion white-list this case. We were already white-listing this specific case in other places. For other, unexpected cases, the assertion (re-)becomes a Fatalf (from an Errorf), since an unepectedly-committed transaction is no joke. This patch also adds a test for the case that was triggering the assertion. Also, it adds tests for another two most troubling cases of rollback-after-ambiguous-commit that we handle incorrectly. Those behaviors are not fixed; the test just reproduces and highlights them. Touches cockroachdb#48301 Touches cockroachdb#48302 Release note: None
Fixes cockroachdb#48301. During a transaction rollback (i.e. an `EndTxn(abort)`), if the request arrives to find that the transaction record is STAGING, it can only move it to ABORTED if it is *not* already implicitly committed. On the commit path, the transaction coordinator is deliberate to only ever issue an `EndTxn(commit)` once the transaction has reached an implicit commit state. However, on the rollback path, the transaction coordinator does not make the opposite guarantee that it will never issue an `EndTxn(abort)` once the transaction has reached (or if it still could reach) an implicit commit state. As a result, on the rollback path, we don't trust the transaction's coordinator to be an authoritative source of truth about whether the transaction is implicitly committed. In other words, we don't consider this `EndTxn(abort)` to be a claim that the transaction is not implicitly committed. The transaction's coordinator may have just given up on the transaction before it heard the outcome of a commit attempt. So in this case, we now return an `IndeterminateCommitError` during evaluation to trigger the transaction recovery protocol and transition the transaction record to a finalized state (COMMITTED or ABORTED). This was one of the two alternatives described in cockroachdb#48301. The other option was to lock down the txn client to attempt to make the guarantee that rollbacks will only be performed if the client is certain that the transaction is not currently in and can no longer ever enter the implicit commit state. I was previously drawn to this other approach because it would avoid the need for transaction recovery during the rollback of a staging txn. However, I don't think it's possible to make such a guarantee in all cases due to the possibility of ambiguity. To eliminate this ambiguity, there are cases where a transaction's coordinator would need to query the result of writes, which turns out to be analogous to the transaction recovery protocol. So instead of trying to make the guarantee in all cases, I'd rather make rollbacks safe in all cases and then later explore selectively opting in to skipping txn recovery in specific cases where the client can definitively guarantee that the it is rolling back a staging transaction that is not implicitly committed. I don't expect this to be needed immediately. Release note (bug fix): fixed a rare race condition that could lead to client-visible errors that looked like "found ABORTED record for implicitly committed transaction". These errors were harmless in that they did not indicate data corruption, but they could be disruptive to clients.
Fixes cockroachdb#48301. During a transaction rollback (i.e. an `EndTxn(abort)`), if the request arrives to find that the transaction record is STAGING, it can only move it to ABORTED if it is *not* already implicitly committed. On the commit path, the transaction coordinator is deliberate to only ever issue an `EndTxn(commit)` once the transaction has reached an implicit commit state. However, on the rollback path, the transaction coordinator does not make the opposite guarantee that it will never issue an `EndTxn(abort)` once the transaction has reached (or if it still could reach) an implicit commit state. As a result, on the rollback path, we don't trust the transaction's coordinator to be an authoritative source of truth about whether the transaction is implicitly committed. In other words, we don't consider this `EndTxn(abort)` to be a claim that the transaction is not implicitly committed. The transaction's coordinator may have just given up on the transaction before it heard the outcome of a commit attempt. So in this case, we now return an `IndeterminateCommitError` during evaluation to trigger the transaction recovery protocol and transition the transaction record to a finalized state (COMMITTED or ABORTED). Prior to this change, we were blindly trusting the `EndTxn(abort)` to be such an authoritative source of truth, so we were at risk of hazards where a transaction was implicitly committed but its coordinator did not know and issued a rollback. This was observed to cause workloads to hit "found ABORTED record for implicitly committed transaction" errors under sufficient network fault injection. This was one of the two alternatives described in cockroachdb#48301. The other option was to lock down the txn client to attempt to make the guarantee that rollbacks will only be performed if the client is certain that the transaction is not currently in and can no longer ever enter the implicit commit state. I was previously drawn to this other approach because it would avoid the need for transaction recovery during the rollback of a staging txn. However, I don't think it's possible to make such a guarantee in all cases due to the possibility of ambiguity. To eliminate this ambiguity, there are cases where a transaction's coordinator would need to query the result of writes, which turns out to be analogous to the transaction recovery protocol. So instead of trying to make the guarantee in all cases, I'd rather make rollbacks safe in all cases and then later explore selectively opting in to skipping txn recovery in specific cases where the client can definitively guarantee that the it is rolling back a staging transaction that is not implicitly committed. I don't expect this to be needed immediately. Release note (bug fix): fixed a rare race condition that could lead to client-visible errors that looked like "found ABORTED record for implicitly committed transaction". These errors were harmless in that they did not indicate data corruption, but they could be disruptive to clients.
75601: kv: perform recovery on rollback of staging transaction record r=nvanbenschoten a=nvanbenschoten Fixes #48301. During a transaction rollback (i.e. an `EndTxn(abort)`), if the request arrives to find that the transaction record is STAGING, it can only move it to ABORTED if it is *not* already implicitly committed. On the commit path, the transaction coordinator is deliberate to only ever issue an `EndTxn(commit)` once the transaction has reached an implicit commit state. However, on the rollback path, the transaction coordinator does not make the opposite guarantee that it will never issue an `EndTxn(abort)` once the transaction has reached (or if it still could reach) an implicit commit state. As a result, on the rollback path, we don't trust the transaction's coordinator to be an authoritative source of truth about whether the transaction is implicitly committed. In other words, we don't consider this `EndTxn(abort)` to be a claim that the transaction is not implicitly committed. The transaction's coordinator may have just given up on the transaction before it heard the outcome of a commit attempt. So in this case, we now return an `IndeterminateCommitError` during evaluation to trigger the transaction recovery protocol and transition the transaction record to a finalized state (COMMITTED or ABORTED). This was one of the two alternatives described in #48301. The other option was to lock down the txn client to attempt to make the guarantee that rollbacks will only be performed if the client is certain that the transaction is not currently in and can no longer ever enter the implicit commit state. I was previously drawn to this other approach because it would avoid the need for transaction recovery during the rollback of a staging txn. However, I don't think it's possible to make such a guarantee in all cases due to the possibility of ambiguity. To eliminate this ambiguity, there are cases where a transaction's coordinator would need to query the result of writes, which turns out to be analogous to the transaction recovery protocol. So instead of trying to make the guarantee in all cases, I'd rather make rollbacks safe in all cases and then later explore selectively opting in to skipping txn recovery in specific cases where the client can definitively guarantee that the it is rolling back a staging transaction that is not implicitly committed. I don't expect this to be needed immediately. Release note (bug fix): fixed a rare race condition that could lead to client-visible errors that looked like "found ABORTED record for implicitly committed transaction". These errors were harmless in that they did not indicate data corruption, but they could be disruptive to clients. Co-authored-by: Nathan VanBenschoten <[email protected]>
A transaction is implicitly committed if it has a STAGING txn record, and all its in-flight writes were successfully written (at copacetic timestamps). We like to think of a transaction that satisfies this implicit-commit condition as committed. Surprisingly, it turns out that that coordinator can rollback an implicitly-committed txn. If it sends a rollback, nothing will stand in its way and the transaction will be cleaned up.
This is definitely not what we've intended. I believe that currently this behavior cannot result in very bad outcomes (in particular, atomicity violations), since the txn's writes only become visible to others once the txn recovery protocol successfully moves the txn record from the
STAGING
to theCOMMITTED
state. The recovery protocol will error out with an assertion failure if it loses a races to a rollback.Even though currently atomicity is preserved and I guess the only side-effect of this rollback is that assertion error being encountered by concurrent recoverers, we don't like it that resolving implicitly-committed intents need to wait for the replication of the
COMMITTED
transaction. What we'd want would be to allow both the coordinator and other pushers to resolve intents as soon at the implicit commit condition is proved to be satisfied - so for example we'd like to run theRecoverTxn
request in anAsyncConsensus
batch. We cannot currently do that because we'd also need to ensure that multipleRecoverTxns
all succeed even if some intents have been resolved. In order for that to be the case, we need to store the txn id in committed values (talked about around #26915). Anyway, that's the forward-looking plan, and this rollback behavior is a major fly in the ointment because, at that point, it would result in atomicity failures.We should fix it by preventing transitions from STAGING->ABORTED if the implicit commit condition is satisfied. This would probably entail having an
EndTxn(Rollback)
that encounters aSTAGING
record kick off txn recovery (unless the client knows that it will fail). And maybe also the client shouldn't send a rollback in the first place if the status is ambiguous; the client could kick off recovery itself.The text was updated successfully, but these errors were encountered: