
Add function for automatic transaction retry #58

Merged: 11 commits from txn-retry into master on Dec 6, 2023

Conversation

@smklein (Collaborator) commented on Nov 11, 2023

Attempts to implement https://www.cockroachlabs.com/docs/v23.1/advanced-client-side-transaction-retries

This functionality includes some cockroach-specific calls, so it exists behind a new cockroach feature flag.
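
For a sense of the API's shape, a hypothetical usage fragment follows. The method name `transaction_async_with_retry` and the "transaction closure plus async `retry` callback" shape come from this PR, but the exact signature and the `insert_rows` helper are assumptions, not copied from the crate:

// Hypothetical usage fragment (signatures assumed): run the closure as a
// transaction; if it fails, `retry` is consulted asynchronously to decide
// whether to attempt the transaction again.
let result = conn
    .transaction_async_with_retry(
        |conn| async move { insert_rows(&conn).await },  // `insert_rows`: made-up helper
        || async { true /* e.g. back off briefly, then keep retrying */ },
    )
    .await;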

@smklein marked this pull request as ready for review on November 14, 2023.
@smklein changed the title from "WIP: Automated transaction retry" to "Add function for automatic transaction retry" on Nov 14, 2023.

@smklein (Collaborator, Author) commented on Nov 16, 2023

Putting this up for review as I finish up usage in oxidecomputer/omicron#4487

There are some transactions which I am not converting, but I believe the shape of things largely makes sense there as an integration PR.

@bnaecker left a comment

Nice work on this, lots of subtle details to keep track of! I have a few questions and suggestions, mostly around clarity and documentation. Overall the functionality looks good, and the tests are great! Thanks for pushing on this!

@smklein (Collaborator, Author) left a comment

Thanks for the quick feedback!

@davepacheco (Collaborator) left a comment

I don't think my feedback here is particularly useful.

fn retryable_error(err: &DieselError) -> bool {
    use diesel::result::DatabaseErrorKind::SerializationFailure;
    match err {
        DieselError::DatabaseError(SerializationFailure, boxed_error_information) => {
Collaborator:

It looks like SerializationFailure is defined to be synonymous with code 40001, which is the condition that the CockroachDB docs say to look for. I think we shouldn't have to inspect the message? (Not sure we shouldn't. But I'm not sure what we should do if we find a different message here.)

@smklein (Collaborator, Author):

I'm happy to just ignore the error message, and rely on the error code. Not sure what to do if these are inconsistent -- I think that's more a question for "how much do we trust our database", because them being out-of-sync seems like undocumented territory.
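
As a concrete illustration of relying on the error code alone (an editorial sketch, not necessarily the code that landed in this PR):

use diesel::result::{DatabaseErrorKind, Error as DieselError};

// Sketch of "rely on the error code alone": treat any SerializationFailure
// (SQLSTATE 40001) as retryable and ignore the message entirely.
fn retryable_error(err: &DieselError) -> bool {
    matches!(
        err,
        DieselError::DatabaseError(DatabaseErrorKind::SerializationFailure, _)
    )
}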


/// Issues a function `f` as a transaction.
///
/// If it fails, asynchronously calls `retry` to decide whether to retry.
Collaborator:

It seems like we want to document here that this must not be used from within a transaction? Or is there some way to ensure that at compile-time?

@smklein (Collaborator, Author):

I'm not sure how to ensure this at compile-time -- whether or not we're in a transaction is information attached to each connection, but it does not change the underlying "type" of the connection. If we wanted to embed this info here, I think we'd need to:

  1. Create a new "connection" type that prevents directly accessing all transaction-based methods
  2. Intercept all transaction methods, consuming self, and creating a new connection type (indicating we're in a transaction)
  3. On completion of the transaction, consume self again and reduce the transaction level

That said, we do check for it at run-time, and I have a test that catches this case. I'll add documentation indicating that a runtime error would be returned here.
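
For illustration, the typestate idea sketched in the list above might look something like the following (hypothetical; not part of this PR or of async-bb8-diesel):

// Hypothetical sketch of the typestate idea; names are made up.
struct Connection;        // connection that is not in a transaction
struct TransactionConn;   // connection inside a transaction: no retry methods

impl Connection {
    /// Runs `f` inside a transaction. The closure only ever sees a
    /// `TransactionConn`, which deliberately lacks any "retryable
    /// transaction" methods, so nesting one becomes a compile error.
    fn transaction<T>(self, f: impl FnOnce(&mut TransactionConn) -> T) -> (Connection, T) {
        let mut txn = TransactionConn;
        let result = f(&mut txn);
        // COMMIT / ROLLBACK would happen here, and the caller gets back the
        // "not in a transaction" connection type.
        (Connection, result)
    }
}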

        continue;
    }

    Self::commit_transaction(&conn).await?;
Collaborator:

Suggested change:
-    Self::commit_transaction(&conn).await?;
+    // Commit the top-level transaction, too.
+    Self::commit_transaction(&conn).await?;

The CockroachDB docs don't address what happens if this fails. I gather from the docs that by releasing the savepoint, we've already durably committed everything we've done, so even if the COMMIT fails, the operation has happened and I guess we could ignore the error altogether. (It also seems fine to not ignore the error. The only errors I could see here are communication-level ones -- e.g., a broken TCP connection. In those cases the caller will have to assume the operation may have happened.)

@smklein (Collaborator, Author):

(added the suggested comment)

Yeah, from reading https://www.cockroachlabs.com/docs/v23.1/commit-transaction , it seems like COMMIT "clears the connection to allow new transactions to begin" in the context of client-side transaction retries.

This is a really weird case -- I agree that the last RELEASE is the thing documented to actually make the transaction durable. I'm truly unsure of whether or not it makes more sense to:

  1. Propagate this error back to a user, and let them figure it out (even though the transaction has probably been made durable?). This at least causes the error to be visible.
  2. Drop the error, hoping that leaves the connection in a "known defunct" state, so it can't be used for future transactions. This admittedly would let "whoever just called last" go on to their next operation, but it feels bizarre to me.

I kinda lean towards (1) -- I think this is the case for any DB operation (not just transactions). A request could be sent, processed by the DB, and the response could be lost. If the caller sees a TCP error (or any other "unhandled" error), they may or may not have completed the operation, and so need to accept that the state of the DB is "unknown" and in need of reconciliation.

Err(user_error) => {
    // The user-level operation failed: ROLLBACK.
    if let Err(first_rollback_err) = Self::rollback_transaction(&conn).await {
        // If we fail while rolling back, prioritize returning
Collaborator:

What state are we in at this point? It seems like we failed a ROLLBACK TO SAVEPOINT at that point. The CockroachDB docs don't say anything about errors or failures here, and I'm not sure if we're still "in" the nested transaction or what.

Might we wind up returning from this function while still in the outermost transaction?
edit: having looked at the other thread, I see that AnsiTransactionManager tries to deal with stuff like this:
https://docs.diesel.rs/master/src/diesel/connection/transaction_manager.rs.html#400-401

Is that behavior a problem for us? e.g., is it possible that we try to do the rollback to savepoint, Diesel winds up handling that, it fails, and then it also rolls back the BEGIN, so we've lost track of what depth we're at?


This is more of a general comment and I'm not sure what we can do about this: it feels like there's some critical state here (like "transaction depth" and "are we moving forward or backward") along with some guarantees made by various Diesel functions and async-bb8-diesel functions, plus invariants maintained by the loop, etc. But I don't really understand what they all are. I don't think this is an async-bb8-diesel problem per se. But the net result is that as a reader I find it pretty hard to understand what's true at each point and have confidence in the code's correctness.

It might help [me?] to draw a state diagram. With that in hand, I wonder if an implementation could use this (maybe with something like typestates) to be more clearly, structurally correct. This sounds like a lot of work, though, and if we're all satisfied that this is correct as is, it may not be worth it.

@smklein (Collaborator, Author):

> What state are we in at this point? It seems like we failed a ROLLBACK TO SAVEPOINT at that point. The CockroachDB docs don't say anything about errors or failures here, and I'm not sure if we're still "in" the nested transaction or what.

It depends on how the rollback_transaction method has failed.

So, to answer your original question, "what state are we in", the answer is "it depends".

  • The transaction depth after seeing this error will always be 1 from Diesel's perspective.
  • We will always have tried to issue ROLLBACK after getting into this error case.

In some situations, the transaction manager will be "broken", and we'll expect all further Diesel operations to return a BrokenTransactionManager error: https://docs.diesel.rs/master/diesel/result/enum.Error.html#variant.BrokenTransactionManager

In other situations, the ROLLBACK error could have come from CockroachDB directly.
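
For reference, that "broken" state surfaces to callers as a dedicated Diesel error variant; a minimal check might look like this (an illustrative sketch, not code from this PR):

use diesel::result::Error as DieselError;

// Illustrative only: the "broken" state mentioned above is what subsequent
// operations report once the transaction manager has been poisoned.
fn transaction_manager_is_broken(err: &DieselError) -> bool {
    matches!(err, DieselError::BrokenTransactionManager)
}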

> Might we wind up returning from this function while still in the outermost transaction? edit: having looked at the other thread, I see that AnsiTransactionManager tries to deal with stuff like this: https://docs.diesel.rs/master/src/diesel/connection/transaction_manager.rs.html#400-401

The Diesel interface tries really hard to ensure that the "transaction depth" is consistent in the success/failure cases -- basically, regardless of whether "rollback" or "commit" succeeds or fails, there are two outcomes:

  1. The transaction "depth" has decreased by one.
  2. The connection is permanently labelled as "broken" for unrecoverable errors (if you look in the Diesel source code, this is like a call to set_in_error). Once this happens, basically all subsequent diesel operations return a "broken transaction manager" error.

> Is that behavior a problem for us? e.g., is it possible that we try to do the rollback to savepoint, Diesel winds up handling that, it fails, and then it also rolls back the BEGIN, so we've lost track of what depth we're at?

I don't think so - I think this would violate the aforementioned principle of "only change the transaction depth by at most one level".

> This is more of a general comment and I'm not sure what we can do about this: it feels like there's some critical state here (like "transaction depth" and "are we moving forward or backward") along with some guarantees made by various Diesel functions and async-bb8-diesel functions, plus invariants maintained by the loop, etc. But I don't really understand what they all are. I don't think this is an async-bb8-diesel problem per se. But the net result is that as a reader I find it pretty hard to understand what's true at each point and have confidence in the code's correctness.
>
> It might help [me?] to draw a state diagram. With that in hand, I wonder if an implementation could use this (maybe with something like typestates) to be more clearly, structurally correct. This sounds like a lot of work, though, and if we're all satisfied that this is correct as is, it may not be worth it.

I agree that typestates could help here, but I'm not sure how much value they have in async-bb8-diesel alone -- it kinda sounds like the property you want extends into Diesel's implementation of transactions too? I have a fear of managing the "transaction depth" inside async-bb8-diesel via typestates, while a client using the API could just skip the async-bb8-diesel level and send requests directly through Diesel -- which would violate our higher-level expectations.

I started to write this up in https://docs.google.com/drawings/d/1JDU5-LsohXP7kEj_9ZFJUz64XyUXNJhZTu4J02sUYV4/edit?usp=sharing -- just trying to explain what Diesel is doing -- but go figure, it's complicated. It wasn't written as a state machine, so this is basically an expansion of code logic into state machine form -- if we want something that looks like a cleaner state machine, we'll need to re-write Diesel's TransactionManager.

Collaborator:

I took a swing at making a state machine diagram for the CockroachDB client-side retry behavior. I intended to finish it and compare it with the code, but haven't been able to prioritize it. I figured I'd drop it here for posterity in case it's useful in the future. (View source to see the underlying mermaid source.)

flowchart TD
    %% node definitions
    idle["connection idle"]
    begin["Execute `BEGIN`"]
    savepoint_start["Execute `SAVEPOINT`"]
    user_stuff["(consumer executes SQL queries)"]
    savepoint_release["Execute `RELEASE SAVEPOINT`"]
    commit["Execute `COMMIT`"]
    retry_decision{"Should retry?"}
    rollback_savepoint["Issue `ROLLBACK TO SAVEPOINT`"]
    rollback_all["Issue `ROLLBACK` (whole txn)"]
    dead["connection dead"]

    %% happy path
    idle -->|consumer calls `transaction_async_with_retry`| begin
    begin -->|ok| savepoint_start
    savepoint_start -->|ok| user_stuff
    user_stuff -->|ok| savepoint_release
    savepoint_release -->|ok| commit
    commit -->|ok| idle

    %% retry path
    user_stuff --> |fail| retry_decision
    retry_decision --> |yes| rollback_savepoint
    rollback_savepoint --> |ok| user_stuff
    savepoint_release --> |fail| retry_decision

    %% failure path
    retry_decision --> |no| rollback_all
    rollback_all --> |ok| idle
    commit --> |fail| rollback_all
    rollback_savepoint --> |fail| rollback_all

    %% other failures
    begin --> |fail| idle
    savepoint_start -->|fail| rollback_all

    %% really bad path
    rollback_all --> |fail| dead
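Rendered as code, the happy and retry paths of that diagram look roughly like the following. This is an illustrative, synchronous sketch only: `exec` stands in for "something that runs one SQL statement" (not a Diesel or async-bb8-diesel API), `user_queries` is the caller's transaction body, `retryable` is the serialization-failure check discussed earlier, and the BEGIN/COMMIT failure edges are simplified away.

type SqlError = String;

fn client_side_retry<E, F>(
    mut exec: E,
    mut user_queries: F,
    retryable: impl Fn(&SqlError) -> bool,
) -> Result<(), SqlError>
where
    E: FnMut(&str) -> Result<(), SqlError>,
    F: FnMut(&mut E) -> Result<(), SqlError>,
{
    exec("BEGIN")?;
    exec("SAVEPOINT cockroach_restart")?;
    loop {
        // Run the caller's queries, then try to release the savepoint; a
        // retryable failure in either step sends us around the loop again.
        let mut attempt = user_queries(&mut exec);
        if attempt.is_ok() {
            attempt = exec("RELEASE SAVEPOINT cockroach_restart");
        }
        match attempt {
            Ok(()) => break,
            Err(err) if retryable(&err) => {
                // Retry path: roll back to the savepoint and go around again.
                if let Err(rollback_err) = exec("ROLLBACK TO SAVEPOINT cockroach_restart") {
                    let _ = exec("ROLLBACK");
                    return Err(rollback_err);
                }
            }
            Err(err) => {
                // Non-retryable: roll back the whole transaction and bail out.
                let _ = exec("ROLLBACK");
                return Err(err);
            }
        }
    }
    exec("COMMIT")
}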

@smklein (Collaborator, Author) commented on Dec 5, 2023

Appreciate the comments - do y'all have any other feedback for this PR? I'm told that this should be part of the release candidate, which is planned to be created tomorrow.

@smklein merged commit ed7ab5e into master on Dec 6, 2023 (4 checks passed).
@smklein deleted the txn-retry branch on December 6, 2023.
smklein added a commit to oxidecomputer/omicron that referenced this pull request on Dec 6, 2023:
Integrates automatic transaction retry into Nexus for most transactions.

Additionally, this PR provides a "RetryHelper" object to help standardize how transaction retry is performed. Currently, after a short randomized wait (up to an upper bound), we retry unconditionally, emitting each attempt to Oximeter for further analysis.

- [x] Depends on oxidecomputer/async-bb8-diesel#58
- [x] As noted in oxidecomputer/async-bb8-diesel#58, this will require customizing CRDB session variables to work correctly. (Edit: this is done on each transaction.)


Part of oxidecomputer/customer-support#46 
Part of #3814
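
The retry pacing described in that commit message might look roughly like the following. This is a hypothetical sketch assuming rand and tokio as dependencies; the bound and helper name are made up, and the real RetryHelper also reports attempts to Oximeter.

use rand::Rng;
use std::time::Duration;

// Illustration only, not the actual RetryHelper: each failed attempt waits a
// random duration, capped at an upper bound, before retrying unconditionally.
const MAX_RETRY_WAIT_MS: u64 = 50; // assumed upper bound, for illustration

async fn wait_before_retry() {
    let wait_ms = rand::thread_rng().gen_range(0..=MAX_RETRY_WAIT_MS);
    tokio::time::sleep(Duration::from_millis(wait_ms)).await;
    // (The real helper also emits each attempt to Oximeter here.)
}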