Add function for automatic transaction retry #58
Conversation
Putting this up for review as I finish up usage in oxidecomputer/omicron#4487. There are some transactions which I am not converting, but I believe the shape of things largely makes sense there as an integration PR.
Nice work on this, lots of subtle details to keep track of! I have a few questions and suggestions, mostly around clarity and documentation. Overall the functionality looks good, and the tests are great! Thanks for pushing on this!
Thanks for the quick feedback!
I don't think my feedback here is particularly useful.
src/async_traits.rs (Outdated)
fn retryable_error(err: &DieselError) -> bool {
    use diesel::result::DatabaseErrorKind::SerializationFailure;
    match err {
        DieselError::DatabaseError(SerializationFailure, boxed_error_information) => {
It looks like `SerializationFailure` is defined to be synonymous with code 40001, which is the condition that the CockroachDB docs say to look for. I think we shouldn't have to inspect the message? (Not sure we shouldn't. But I'm not sure what we should do if we find a different message here.)
I'm happy to just ignore the error message, and rely on the error code. Not sure what to do if these are inconsistent -- I think that's more a question for "how much do we trust our database", because them being out-of-sync seems like undocumented territory.
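For what it's worth, a minimal sketch of the "rely on the error kind alone" approach, reusing the names from the snippet above (illustrative only -- not necessarily the exact code this PR ends up with):

```rust
use diesel::result::{DatabaseErrorKind, Error as DieselError};

// Treat any serialization failure (SQLSTATE 40001) as retryable, without
// inspecting the error message at all.
fn retryable_error(err: &DieselError) -> bool {
    matches!(
        err,
        DieselError::DatabaseError(DatabaseErrorKind::SerializationFailure, _)
    )
}
```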
/// Issues a function `f` as a transaction.
///
/// If it fails, asynchronously calls `retry` to decide whether to retry.
It seems like we want to document here that this must not be used from within a transaction? Or is there some way to ensure that at compile-time?
I'm not sure how to ensure this at compile-time -- whether or not we're in a transaction is information attached to each connection, but it does not change the underlying "type" of the connection. If we wanted to embed this info here, I think we'd need to:
- Create a new "connection" type that prevents directly accessing all transaction-based methods
- Intercept all transaction methods, consuming `self`, and creating a new connection type (indicating we're in a transaction)
- On completion of the transaction, consume `self` again and reduce the transaction level
That said, we do check for it at run-time, and I have a test that catches this case. I'll add documentation indicating that a runtime error would be returned here.
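To make the typestate idea above concrete, here's a rough sketch -- `IdleConn` and `TxnConn` are invented names, and this is not something the PR implements:

```rust
use diesel::result::Error as DieselError;

// Hypothetical typestate sketch: whether we're inside a transaction is encoded
// in the type, so starting a retryable transaction from within another
// transaction becomes a compile error instead of a runtime one.
struct IdleConn;
struct TxnConn;

impl IdleConn {
    // Starting a transaction consumes the idle connection, so its
    // transaction-starting methods can no longer be called.
    async fn begin(self) -> Result<TxnConn, DieselError> {
        // A real implementation would issue BEGIN here.
        Ok(TxnConn)
    }
}

impl TxnConn {
    // Finishing the transaction consumes `self` again and hands back the
    // idle-state connection, restoring access to the retryable helper.
    async fn commit(self) -> Result<IdleConn, DieselError> {
        // A real implementation would issue COMMIT here.
        Ok(IdleConn)
    }
}
```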
    continue;
}

Self::commit_transaction(&conn).await?;
Suggested change:
// Commit the top-level transaction, too.
Self::commit_transaction(&conn).await?;
The CockroachDB docs don't address what happens if this fails. I gather from the docs that by releasing the savepoint, we've already durably committed everything we've done, so even if the COMMIT fails, the operation has happened and I guess we could ignore the error altogether. (It also seems fine to not ignore the error. The only errors I could see here are communication-level ones -- e.g., a broken TCP connection. In those cases the caller will have to assume the operation may have happened.)
(added the suggested comment)
Yeah, from reading https://www.cockroachlabs.com/docs/v23.1/commit-transaction, it seems like `COMMIT` "clears the connection to allow new transactions to begin" in the context of client-side transaction retries.
This is a really weird case -- I agree that the last `RELEASE` is the thing documented to actually make the transaction durable. I'm truly unsure of whether or not it makes more sense to:
1. Propagate this error back to a user, and let them figure it out (even though the transaction has probably been made durable?). This at least causes the error to be visible.
2. Drop the error, hoping that leaves the connection in a "known defunct" state, so it can't be used for future transactions. This admittedly would let "whoever just called last" go on to their next operation, but it feels bizarre to me.
I kinda lean towards (1) -- I think this is the case for any DB operation (not just transactions). A request could be sent, processed by the DB, and the response could be lost. If the caller sees a TCP error (or any other "unhandled" error), they may or may not have completed the operation, and so need to accept that the state of the DB is "unknown" and in need of reconciliation.
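For illustration, here's what option (1) looks like from the caller's side (the handler name and types here are hypothetical):

```rust
use diesel::result::Error as DieselError;

// Under option (1), an error from the retrying transaction helper is ambiguous:
// the final COMMIT reply may have been lost even though the preceding RELEASE
// already made the work durable. The caller should reconcile rather than assume
// the transaction was rolled back.
async fn handle_transaction_outcome(result: Result<(), DieselError>) {
    match result {
        Ok(()) => {
            // Definitely committed.
        }
        Err(err) => {
            // May or may not have committed; re-read database state before acting.
            eprintln!("transaction outcome unknown: {err}");
        }
    }
}
```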
Err(user_error) => {
    // The user-level operation failed: ROLLBACK.
    if let Err(first_rollback_err) = Self::rollback_transaction(&conn).await {
        // If we fail while rolling back, prioritize returning
What state are we in at this point? It seems like we failed a `ROLLBACK TO SAVEPOINT` at that point. The CockroachDB docs don't say anything about errors or failures here, and I'm not sure if we're still "in" the nested transaction or what.
Might we wind up returning from this function while still in the outermost transaction?
edit: having looked at the other thread, I see that `AnsiTransactionManager` tries to deal with stuff like this: https://docs.diesel.rs/master/src/diesel/connection/transaction_manager.rs.html#400-401
Is that behavior a problem for us? e.g., is it possible that we try to do the `rollback to savepoint`, but diesel winds up doing that, it fails, and then it also rolls back the `BEGIN` and we've lost track of what depth we're at?
This is more of a general comment and I'm not sure what we can do about this: it feels like there's some critical state here (like "transaction depth" and "are we moving forward or backward") along with some guarantees made by various Diesel functions and async-bb8-functions, plus invariants maintained by the loop, etc. But I don't really understand what they all are. I don't think this is an async-bb8-diesel problem per se. But the net result is that as a reader I find it pretty hard to understand what's true at each point and have confidence in the code's correctness.
It might help [me?] to draw a state diagram. With that in hand, I wonder if an implementation could use this (maybe with something like typestates) to be more clearly, structurally correct. This sounds like a lot of work though and if we're all satisfied that this is correct as is it may not be worth it.
What state are we in at this point? It seems like we failed a `ROLLBACK TO SAVEPOINT` at that point. The CockroachDB docs don't say anything about errors or failures here, and I'm not sure if we're still "in" the nested transaction or what.

It depends on how the `rollback_transaction` method has failed.
- Any time `batch_execute` is called -- which happens for any of these SQL operations (BEGIN, SAVEPOINT, ROLLBACK, COMMIT, etc.) -- Diesel calls `update_transaction_manager_status` using the returned error code, which can possibly update the transaction state: https://docs.diesel.rs/master/src/diesel/pg/connection/mod.rs.html#135
- The implementation of this function (https://docs.diesel.rs/master/src/diesel/pg/connection/mod.rs.html#232) basically matches on the observed transaction state from libpq and the error, and can update Diesel's tracking in a handful of ways (including: doing nothing, indicating that the transaction manager is "broken", or indicating that an error happened and we're expecting to roll all the way back to the top-level transaction).
- This is one of the ways we can call the function `set_requires_rollback_maybe_up_to_top_level` (https://docs.diesel.rs/master/src/diesel/connection/transaction_manager.rs.html#147-151), which allows SAVEPOINTs to be released without throwing subsequent errors until we've finished rolling back to the top level.
So, to answer your original question, "what state are we in", the answer is "it depends".
- The transaction depth after seeing this error will always be `1` from Diesel's perspective.
- We will always have tried to issue `ROLLBACK` after getting into this error case.

In some situations, the transaction manager will be "broken" and we'll expect all further Diesel operations to return: https://docs.diesel.rs/master/diesel/result/enum.Error.html#variant.BrokenTransactionManager

In other situations, the `ROLLBACK` error could have come from CockroachDB directly.
Might we wind up returning from this function while still in the outermost transaction? edit: having looked at the other thread, I see that `AnsiTransactionManager` tries to deal with stuff like this: https://docs.diesel.rs/master/src/diesel/connection/transaction_manager.rs.html#400-401
The Diesel interface tries really hard to ensure that the "transaction depth" is consistent in the success/failure cases -- basically, regardless of whether "rollback" or "commit" succeeds or fails, there are two outcomes:
- The transaction "depth" has decreased by one.
- The connection is permanently labelled as "broken" for unrecoverable errors (if you look in the Diesel source code, this is like a call to `set_in_error`). Once this happens, basically all subsequent Diesel operations return a "broken transaction manager" error.
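(For reference, that "broken" state is visible to callers as a specific Diesel error variant -- a minimal sketch, using only the variant linked above:)

```rust
use diesel::result::Error as DieselError;

// Once Diesel has marked the transaction manager as broken (the `set_in_error`
// path mentioned above), subsequent operations fail with this variant, so a
// caller can recognize a defunct connection.
fn connection_is_defunct(err: &DieselError) -> bool {
    matches!(err, DieselError::BrokenTransactionManager)
}
```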
Is that behavior a problem for us? e.g., is it possible that we try to do the `rollback to savepoint`, but diesel winds up doing that, it fails, and then it also rolls back the `BEGIN` and we've lost track of what depth we're at?
I don't think so - I think this would violate the aforementioned principle of "only change the transaction depth by at most one level".
This is more of a general comment and I'm not sure what we can do about this: it feels like there's some critical state here (like "transaction depth" and "are we moving forward or backward") along with some guarantees made by various Diesel functions and async-bb8-functions, plus invariants maintained by the loop, etc. But I don't really understand what they all are. I don't think this is an async-bb8-diesel problem per se. But the net result is that as a reader I find it pretty hard to understand what's true at each point and have confidence in the code's correctness.
It might help [me?] to draw a state diagram. With that in hand, I wonder if an implementation could use this (maybe with something like typestates) to be more clearly, structurally correct. This sounds like a lot of work though and if we're all satisfied that this is correct as is it may not be worth it.
I agree that typestates could help here, but I'm not sure how much value they have in `async-bb8-diesel` alone -- it kinda sounds like the property you want extends into Diesel's implementation of transactions too? I have a fear of managing the "transaction depth" inside `async-bb8-diesel` via typestates, while a client using the API could just skip the `async-bb8-diesel` level and send requests directly through Diesel -- which would violate our higher-level expectations.
I started to write this up in https://docs.google.com/drawings/d/1JDU5-LsohXP7kEj_9ZFJUz64XyUXNJhZTu4J02sUYV4/edit?usp=sharing -- just trying to explain what Diesel is doing -- but go figure, it's complicated. It wasn't written as a state machine, so this is basically an expansion of code logic into state machine form -- if we want something that looks like a cleaner state machine, we'll need to re-write Diesel's TransactionManager.
I took a swing at making a state machine diagram for the CockroachDB client-side retry behavior. I intended to finish it and compare it with the code, but haven't been able to prioritize it. I figured I'd drop it here for posterity in case it's useful in the future. (View source to see the underlying mermaid source.)
flowchart TD
%% node definitions
idle["connection idle"]
begin["Execute `BEGIN`"]
savepoint_start["Execute `SAVEPOINT`"]
user_stuff["(consumer executes SQL queries)"]
savepoint_release["Execute `RELEASE SAVEPOINT`"]
commit["Execute `COMMIT`"]
retry_decision{"Should retry?"}
rollback_savepoint["Issue `ROLLBACK TO SAVEPOINT`"]
rollback_all["Issue `ROLLBACK` (whole txn)"]
dead["connection dead"]
%% happy path
idle -->|consumer calls `transaction_async_with_retry`| begin
begin -->|ok| savepoint_start
savepoint_start -->|ok| user_stuff
user_stuff -->|ok| savepoint_release
savepoint_release -->|ok| commit
commit -->|ok| idle
%% retry path
user_stuff --> |fail| retry_decision
retry_decision --> |yes| rollback_savepoint
rollback_savepoint --> |ok| user_stuff
savepoint_release --> |fail| retry_decision
%% failure path
retry_decision --> |no| rollback_all
rollback_all --> |ok| idle
commit --> |fail| rollback_all
rollback_savepoint --> |fail| rollback_all
%% other failures
begin --> |fail| idle
savepoint_start -->|fail| rollback_all
%% really bad path
rollback_all --> |fail| dead
Appreciate the comments - do y'all have any other feedback for this PR? I'm told that this should be part of the release candidate, which is planned to be created tomorrow.
Integrates automatic transaction retry into Nexus for most transactions. Additionally, this PR provides a "RetryHelper" object to help standardize how transaction retry is performed. Currently, after a short randomized wait (up to an upper bound), we retry unconditionally, emitting each attempt to Oximeter for further analysis.
- [x] Depends on oxidecomputer/async-bb8-diesel#58
- [x] As noted in oxidecomputer/async-bb8-diesel#58, this will require customizing CRDB session variables to work correctly. (Edit: this is done on each transaction)

Part of oxidecomputer/customer-support#46
Part of #3814
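As a rough sketch of the "short randomized wait, then retry unconditionally" policy described above (the constant and function name are illustrative, and this assumes the `rand` and `tokio` crates; the real RetryHelper also emits each attempt to Oximeter):

```rust
use std::time::Duration;

// Hypothetical upper bound on the randomized wait between attempts.
const MAX_RETRY_WAIT: Duration = Duration::from_millis(250);

// Sleep for a random interval in [0, MAX_RETRY_WAIT], then ask for another attempt.
async fn retry_after_random_wait() -> bool {
    let wait_ms = rand::random::<u64>() % (MAX_RETRY_WAIT.as_millis() as u64 + 1);
    tokio::time::sleep(Duration::from_millis(wait_ms)).await;
    true
}
```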
Attempts to implement https://www.cockroachlabs.com/docs/v23.1/advanced-client-side-transaction-retries
This functionality includes some cockroach-specific calls, so it exists behind a new `cockroach` feature flag.
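For illustration, gating code behind that flag looks something like the following (the module name is made up; only the `cockroach` feature name comes from this PR):

```rust
// Only compiled when the crate is built with `--features cockroach`.
#[cfg(feature = "cockroach")]
mod cockroach_retry_support {
    // CockroachDB-specific pieces (savepoint handling, the retry loop, etc.)
    // would live here.
}
```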