storage: 1PC transactions can be applied twice #10023

bdarnell · 2016-10-17T09:03:48Z

When a 1PC transaction is retried due to a network failure, the second attempt may return an error even if the first attempt succeeded (and the successful result was masked by the network failure). The most likely error to be seen in this case is WriteTooOldError (caused by the transaction seeing its own past write, which by now has been committed, resolved, and scrubbed of its transaction ID), which is a transactionRestartError. Upon restart, the transaction gets a new timestamp and may perform different writes, which may succeed on a subsequent attempt, leading to the same statement applying twice.

This is the more insidious cousin of #6053 and #7604. Those issues are about the failure leaking back to the client; this one is about the fact that if the operation is retried in some situations, it could succeed.

Here is one concrete case in which the error can occur:

A SQL table has a primary key whose default value is auto-generated (e.g. the default rowid column) and has no secondary indexes
An insert to this table is performed as a single auto-committed statement (allowing the 1PC optimization)
The network fails after the write is proposed. DistSender gets an RPC error and retries.
The initial write succeeds, but there is no one listening so its response goes nowhere. It's a 1PC transaction so it cleans up its intent synchronously.
The retried write fails with a WriteTooOldError
This error is raised to client.Txn, which is running in AutoRetry mode, so it retries this transactionRestartError. (note that AutoRetry is not the problem here - if the error continued on to the client, the client's proper response to a WriteTooOldError is to retry)
This runs the SQL transaction closure, which generates a new row id, avoids the conflict, and commits.

The response cache (removed in #3077 because it was too expensive and retries of non-transactional requests were deemed less important. The 1PC optimization makes its requests effectively non-transactional at the KV layer) handled this by remembering the response so the retry would get the same response.

A few possible solutions that don't require removing the 1PC optimization or bringing back the response cache:

Distinguish between RPC errors that definitely won't succeed (e.g. connection refused) and those where the outcome is uncertain. When a 1PC transaction gets the latter error, return it to the client as an ambiguous result (ambiguous results are always possible in the postgres protocol if a network failure occurs during a commit)
Remember all committed transaction ids for a time in something like the abort cache. This would be lighter than the full response cache and would only allow us to tell that this scenario has occurred and return a distinct error code for it, but not recover what exactly happened on the earlier attempt.
At the SQL layer, perform a separate query to figure out what happened. This requires specific knowledge of the behavior of each type of query that can benefit from the 1PC optimization.

The text was updated successfully, but these errors were encountered:

petermattis · 2016-10-17T12:42:50Z

Distinguish between RPC errors that definitely won't succeed (e.g. connection refused) and those where the outcome is uncertain.

This seems error prone as the code evolves.

At the SQL layer, perform a separate query to figure out what happened.

Ditto.

Remember all committed transaction ids for a time in something like the abort cache. This would be lighter than the full response cache and would only allow us to tell that this scenario has occurred and return a distinct error code for it, but not recover what exactly happened on the earlier attempt.

1PC transactions are only allowed for write-only operations, correct? So what do we need to recover other than whether the transaction committed or not?

I'm not seeing what about this problem is specific to 1PC transactions. If a KV batch contains a Put and an EndTransaction and we cleanup the transaction record, replay of the request (Put + EndTransaction) can fail with a WriteTooOldError and get propagated back up to the SQL layer. Is there something that protects us in this scenario?

bdarnell · 2016-10-17T13:29:34Z

Distinguish between RPC errors that definitely won't succeed (e.g. connection refused) and those where the outcome is uncertain.

This seems error prone as the code evolves.

This one doesn't seem that bad to me. We just need two kinds of RPC errors instead of one, and use one for getting an rpc connection and one for an error sending the request. The question here is whether that is precise enough for us or does the GRPC abstraction prevent us from seeing the distinction we want.

At the SQL layer, perform a separate query to figure out what happened.

Ditto.

Yep, this seems very tricky and fragile.

Remember all committed transaction ids for a time in something like the abort cache. This would be lighter than the full response cache and would only allow us to tell that this scenario has occurred and return a distinct error code for it, but not recover what exactly happened on the earlier attempt.

1PC transactions are only allowed for write-only operations, correct? So what do we need to recover other than whether the transaction committed or not?

It's legal at the KV layer to include a Get in a 1PC transaction, although we don't currently do it. We do use ConditionalPut, and we need to distinguish ConditionFailedError from other kinds of errors. (ConditionFailedError returns the actual value encountered, although again we don't currently use this)

I'm not seeing what about this problem is specific to 1PC transactions. If a KV batch contains a Put and an EndTransaction and we cleanup the transaction record, replay of the request (Put + EndTransaction) can fail with a WriteTooOldError and get propagated back up to the SQL layer. Is there something that protects us in this scenario?

AutoRetry is unique to 1PC transactions, although if the retry happens at the application level we could still have a problem. However, we accumulate WriteTooOldErrors in transactional requests until the EndTransaction, and a replayed EndTransaction results in a TransactionStatusError (which is not retryable). I'm not completely sure whether we have a problem for multi-phase transactions or not.

petermattis · 2016-10-17T13:59:18Z

This one doesn't seem that bad to me. We just need two kinds of RPC errors instead of one, and use one for getting an rpc connection and one for an error sending the request. The question here is whether that is precise enough for us or does the GRPC abstraction prevent us from seeing the distinction we want.

How would this be used? Would you propagate the error all the way back to the client in the case of an unknown transaction disposition?

(ConditionFailedError returns the actual value encountered, although again we don't currently use this)

I thought we used this in formatting the error when writing to a unique index, but I can't find any code in sql-land which needs the actual value.

I'm not completely sure whether we have a problem for multi-phase transactions or not.

I think a first step is to try and write some tests to replicate this problem. The 1PC issue seems obvious and easily testable. And we could see if multi-phase transactions have a related problem.

bdarnell · 2016-10-17T14:05:41Z

How would this be used? Would you propagate the error all the way back to the client in the case of an unknown transaction disposition?

Yeah, I think so. If a request that does not contain EndTransaction ends with an unknown disposition, the internal/client.Txn can simply retry. If this happens with an EndTransaction (including 1PC), we have to report it all the way back to the end client.

petermattis · 2016-10-17T14:09:25Z

If a request that does not contain EndTransaction ends with an unknown disposition, the internal/client.Txn can simply retry.

Ok, maybe this isn't as fragile as I was imagining. @tamird Do we get a good signal from GRPC for when an RPC failed because the remote is down (and definitely didn't receive the RPC) vs other errors?

bdarnell · 2016-10-17T14:12:49Z

We definitely get that signal from our own circuit breaker. We might have to err on the side of caution for the request that trips the breaker and consider it uncertain if we can't tell for sure what grpc's error means.

petermattis · 2016-10-17T14:20:10Z

Cc @cockroachdb/stability

aphyr · 2016-10-17T14:47:41Z

How would this be used? Would you propagate the error all the way back to the client in the case of an unknown transaction disposition?

From the perspective of a client implementer and user, I think it's a great idea to classify every error in your client as known-failed vs indeterminate--maybe via a type (KnownFailure vs MaybeFailure) or a function (indeterminate? err). This is something every application developer has to worry about, so it makes sense to propagate all the way from internals to users. :)

andreimatei · 2016-10-17T15:48:11Z

AutoRetry is unique to 1PC transactions, although if the retry happens at the application level we could still have a problem. However, we accumulate WriteTooOldErrors in transactional requests until the EndTransaction, and a replayed EndTransaction results in a TransactionStatusError (which is not retryable). I'm not completely sure whether we have a problem for multi-phase transactions or not.

Note that TxnExecOptions.AutoRetry is not unique to 1PC txns - we also use it to retry a SQL txn consisting of a whole batch of statement when we know that the client logic is not conditional on reads.
But I think what you said after is spot on: the EndTransaction is key - it allows us to observe the state of the transaction (and a missing state is also a non-retryable error). Anything but 1PC txns will have EndTransactions. So why do you think we might have a problem?

@aphyr, I think the philosophy is that, from a client's perspective, network errors on a COMMIT (or a batch of statements containing the COMMIT) are always ambiguous. And so are they for "implicit transactions" - statements outside of a BEGIN... COMMIT. Except in the case of this bug, of course, we fail to return such an error to the client.
I think we discussed at some point having an explicit "ambiguous commit" error specifically for our RELEASE and COMMIT statements, but I think we dropped that idea.

spencerkimball · 2016-10-17T17:16:03Z

I assume this happened in practice with Jepsen testing? @bdarnell, I wouldn't expect a WriteTooOldError for the case you outline. The successful EndTransaction call puts a marker in the timestamp cache which will result in the replayed transaction erroring out immediately with TransactionReplayError on the receipt of the next BeginTransaction using the same transaction ID. This is not a retryable error.

petermattis · 2016-10-17T17:20:11Z

We strip the begin/end txn requests in Replica.executeWriteBatch for 1PC transactions before executing them. Does that affect what gets added to the timestamp cache?

tbg · 2016-10-17T17:24:49Z

The timestamp cache is populated at a higher level so it would see the EndTransaction, and while the next retry does not execute {Begin,End}Transaction, it does check the timestamp cache for its key and should bump the timestamp. Except, isn't it still the same transaction?

A quick unit test could really answer what happens here without any guessing.

petermattis · 2016-10-17T17:30:46Z

Agreed about writing a test. Spencer is on it.

Spencer pointed out another issue. Replica.addWriteCmd handles calling Replica.endCmds which adds to the timestamp cache. But Replica.addWriteCmd will return if the client RPC was cancelled (i.e. network error). Seems like we should be adding to the timestamp cache in Replica.processRaftCommand when we're executing on the lease holder (or when sendToClient is true).

spencerkimball · 2016-10-17T17:33:05Z

I see what's happening. The timestamp cache isn't being updated after success within raft in the event that the caller context is cancelled. This is a major correctness issue. The addition of context cancellations introduced a bug here, but it should be easy to fix. What this means is that Ben's case will now result in a TransactionReplayError in the common case of the transaction returning on replay to the same replica.

However, if the leadership changes, then the replay won't be able to return the TransactionReplayError and will instead return TransactionRetryError. So we still have a problem, it's just going to be significantly less likely to occur.

aphyr · 2016-10-17T19:15:47Z

I assume this happened in practice with Jepsen testing?

Yep! Here are two example failure cases, which caused single-statement inserts of unique values into a table to result in duplicate values.

20161015T043421.000Z.zip
20161015T015705.000Z.zip

spencerkimball · 2016-10-17T22:50:04Z

@tschottdorf my analysis was not correct. The tryAbandon() method that we call does the right thing with the timestamp cache for the case of just the client going away and the context being canceled.

So we're back to the more fundamental problem and unknown solution.

tbg · 2016-10-18T07:23:39Z

Do you have the unit test?

spencerkimball · 2016-10-18T13:02:32Z

I abandoned the one I was working on because it was too close to the Store that it was beginning to feel overly scripted. I think we actually want to create a unittest which uses GRPC and multiple replicas.

I spent a lot of time last night mulling this over and built a "txn cache" option in the code to measure performance implications. When running block writer on my local machine, performance seems to suffer by ~4%.

tbg · 2016-10-18T13:31:41Z

I abandoned the one I was working on because it was too close to the Store that it was beginning to feel overly scripted. I think we actually want to create a unittest which uses GRPC and multiple replicas.

I don't see what's wrong with a test that's not more complicated than it needs to be, at least to get a basic idea.

spencerkimball · 2016-10-18T14:13:29Z

I think the test needs to work using the distributed sender.

tbg · 2016-10-18T14:16:53Z

Why? There's certainly a need to validate that whatever the fix is works end-to-end, but imo the meat is understanding what happens on the Replica, not orchestrating a complex test across most of the stack right away.

Introduce `AmbiguousCommitError` in the event that a batch with an `EndTransaction` request is sent but the response is not available. Fixes #6053, #7604, and #10023

tamird · 2016-10-26T20:43:35Z

Closed by #10207.

spencerkimball mentioned this issue Oct 22, 2016

Return error if result of a txn commit is unknown #10154

Merged

spencerkimball added a commit that referenced this issue Oct 23, 2016

Return error if result of a txn commit is unknown

b67bbc5

Introduce `AmbiguousCommitError` in the event that a batch with an `EndTransaction` request is sent but the response is not available. Fixes #6053, #7604, and #10023

spencerkimball added a commit that referenced this issue Oct 24, 2016

Return error if result of a txn commit is unknown

bfbd9de

Introduce `AmbiguousCommitError` in the event that a batch with an `EndTransaction` request is sent but the response is not available. Fixes #6053, #7604, and #10023

tamird mentioned this issue Oct 25, 2016

Re-merge ambiguous commit error PR with locking fixes #10207

Merged

tamird closed this as completed Oct 26, 2016

This was referenced Aug 8, 2017

rfc: how to upgrade clusters with enhanced view descriptors #17502

Open

sql: small view descriptor clean-ups #17475

Closed

sql: make view descriptors depend on table/columns by IDs #17501

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: 1PC transactions can be applied twice #10023

storage: 1PC transactions can be applied twice #10023

bdarnell commented Oct 17, 2016 •

edited by knz

Loading

petermattis commented Oct 17, 2016

bdarnell commented Oct 17, 2016

petermattis commented Oct 17, 2016

bdarnell commented Oct 17, 2016

petermattis commented Oct 17, 2016

bdarnell commented Oct 17, 2016

petermattis commented Oct 17, 2016

aphyr commented Oct 17, 2016

andreimatei commented Oct 17, 2016

spencerkimball commented Oct 17, 2016

petermattis commented Oct 17, 2016

tbg commented Oct 17, 2016

petermattis commented Oct 17, 2016

spencerkimball commented Oct 17, 2016

aphyr commented Oct 17, 2016 •

edited

Loading

spencerkimball commented Oct 17, 2016

tbg commented Oct 18, 2016

spencerkimball commented Oct 18, 2016

tbg commented Oct 18, 2016

spencerkimball commented Oct 18, 2016

tbg commented Oct 18, 2016

tamird commented Oct 26, 2016

storage: 1PC transactions can be applied twice #10023

storage: 1PC transactions can be applied twice #10023

Comments

bdarnell commented Oct 17, 2016 • edited by knz Loading

petermattis commented Oct 17, 2016

bdarnell commented Oct 17, 2016

petermattis commented Oct 17, 2016

bdarnell commented Oct 17, 2016

petermattis commented Oct 17, 2016

bdarnell commented Oct 17, 2016

petermattis commented Oct 17, 2016

aphyr commented Oct 17, 2016

andreimatei commented Oct 17, 2016

spencerkimball commented Oct 17, 2016

petermattis commented Oct 17, 2016

tbg commented Oct 17, 2016

petermattis commented Oct 17, 2016

spencerkimball commented Oct 17, 2016

aphyr commented Oct 17, 2016 • edited Loading

spencerkimball commented Oct 17, 2016

tbg commented Oct 18, 2016

spencerkimball commented Oct 18, 2016

tbg commented Oct 18, 2016

spencerkimball commented Oct 18, 2016

tbg commented Oct 18, 2016

tamird commented Oct 26, 2016

bdarnell commented Oct 17, 2016 •

edited by knz

Loading

aphyr commented Oct 17, 2016 •

edited

Loading