
Add function for automatic transaction retry #58

Merged: 11 commits from txn-retry into master on Dec 6, 2023

Conversation

@smklein (Collaborator) commented on Nov 11, 2023

Attempts to implement https://www.cockroachlabs.com/docs/v23.1/advanced-client-side-transaction-retries

This functionality includes some cockroach-specific calls, so it exists behind a new cockroach feature flag.
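
For a sense of the API's shape, a hypothetical usage fragment follows. The method name `transaction_async_with_retry` and the "transaction closure plus async `retry` callback" shape come from this PR, but the exact signature and the `insert_rows` helper are assumptions, not copied from the crate:

// Hypothetical usage fragment (signatures assumed): run the closure as a
// transaction; if it fails, `retry` is consulted asynchronously to decide
// whether to attempt the transaction again.
let result = conn
    .transaction_async_with_retry(
        |conn| async move { insert_rows(&conn).await },  // `insert_rows`: made-up helper
        || async { true /* e.g. back off briefly, then keep retrying */ },
    )
    .await;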

@smklein marked this pull request as ready for review on November 14, 2023.
@smklein changed the title from "WIP: Automated transaction retry" to "Add function for automatic transaction retry" on Nov 14, 2023.

@smklein (Collaborator, Author) commented on Nov 16, 2023

Putting this up for review as I finish up usage in oxidecomputer/omicron#4487

There are some transactions which I am not converting, but I believe the shape of things largely makes sense there as an integration PR.

@bnaecker left a comment

Nice work on this, lots of subtle details to keep track of! I have a few questions and suggestions, mostly around clarity and documentation. Overall the functionality looks good, and the tests are great! Thanks for pushing on this!

@smklein (Collaborator, Author) left a comment

Thanks for the quick feedback!

@davepacheco (Collaborator) left a comment

I don't think my feedback here is particularly useful.

fn retryable_error(err: &DieselError) -> bool {
    use diesel::result::DatabaseErrorKind::SerializationFailure;
    match err {
        DieselError::DatabaseError(SerializationFailure, boxed_error_information) => {
Collaborator:

It looks like SerializationFailure is defined to be synonymous with code 40001, which is the condition that the CockroachDB docs say to look for. I think we shouldn't have to inspect the message? (Not sure we shouldn't. But I'm not sure what we should do if we find a different message here.)

@smklein (Collaborator, Author):

I'm happy to just ignore the error message, and rely on the error code. Not sure what to do if these are inconsistent -- I think that's more a question for "how much do we trust our database", because them being out-of-sync seems like undocumented territory.
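
As a concrete illustration of relying on the error code alone (an editorial sketch, not necessarily the code that landed in this PR):

use diesel::result::{DatabaseErrorKind, Error as DieselError};

// Sketch of "rely on the error code alone": treat any SerializationFailure
// (SQLSTATE 40001) as retryable and ignore the message entirely.
fn retryable_error(err: &DieselError) -> bool {
    matches!(
        err,
        DieselError::DatabaseError(DatabaseErrorKind::SerializationFailure, _)
    )
}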


/// Issues a function `f` as a transaction.
///
/// If it fails, asynchronously calls `retry` to decide whether to retry.
Collaborator:

It seems like we want to document here that this must not be used from within a transaction? Or is there some way to ensure that at compile-time?

@smklein (Collaborator, Author):

I'm not sure how to ensure this at compile-time -- whether or not we're in a transaction is information attached to each connection, but it does not change the underlying "type" of the connection. If we wanted to embed this info here, I think we'd need to:

  1. Create a new "connection" type that prevents directly accessing all transaction-based methods
  2. Intercept all transaction methods, consuming self, and creating a new connection type (indicating we're in a transaction)
  3. On completion of the transaction, consume self again and reduce the transaction level

That said, we do check for it at run-time, and I have a test that catches this case. I'll add documentation indicating that a runtime error would be returned here.
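
For illustration, the typestate idea sketched in the list above might look something like the following (hypothetical; not part of this PR or of async-bb8-diesel):

// Hypothetical sketch of the typestate idea; names are made up.
struct Connection;        // connection that is not in a transaction
struct TransactionConn;   // connection inside a transaction: no retry methods

impl Connection {
    /// Runs `f` inside a transaction. The closure only ever sees a
    /// `TransactionConn`, which deliberately lacks any "retryable
    /// transaction" methods, so nesting one becomes a compile error.
    fn transaction<T>(self, f: impl FnOnce(&mut TransactionConn) -> T) -> (Connection, T) {
        let mut txn = TransactionConn;
        let result = f(&mut txn);
        // COMMIT / ROLLBACK would happen here, and the caller gets back the
        // "not in a transaction" connection type.
        (Connection, result)
    }
}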

        continue;
    }

    Self::commit_transaction(&conn).await?;
Collaborator:

Suggested change:
-    Self::commit_transaction(&conn).await?;
+    // Commit the top-level transaction, too.
+    Self::commit_transaction(&conn).await?;

The CockroachDB docs don't address what happens if this fails. I gather from the docs that by releasing the savepoint, we've already durably committed everything we've done, so even if the COMMIT fails, the operation has happened and I guess we could ignore the error altogether. (It also seems fine to not ignore the error. The only errors I could see here are communication-level ones -- e.g., a broken TCP connection. In those cases the caller will have to assume the operation may have happened.)

@smklein (Collaborator, Author):

(added the suggested comment)

Yeah, from reading https://www.cockroachlabs.com/docs/v23.1/commit-transaction , it seems like COMMIT "clears the connection to allow new transactions to begin" in the context of client-side transaction retries.

This is a really weird case -- I agree that the last RELEASE is the thing documented to actually make the transaction durable. I'm truly unsure of whether or not it makes more sense to:

  1. Propagate this error back to a user, and let them figure it out (even though the transaction has probably been made durable?). This at least causes the error to be visible.
  2. Drop the error, hoping that leaves the connection in a "known defunct" state, so it can't be used for future transactions. This admittedly would let "whoever just called last" go on to their next operation, but it feels bizarre to me.

I kinda lean towards (1) -- I think this is the case for any DB operation (not just transactions). A request could be sent, processed by the DB, and the response could be lost. If the caller sees a TCP error (or any other "unhandled" error), they may or may not have completed the operation, and so need to accept that the state of the DB is "unknown" and in need of reconciliation.

Err(user_error) => {
    // The user-level operation failed: ROLLBACK.
    if let Err(first_rollback_err) = Self::rollback_transaction(&conn).await {
        // If we fail while rolling back, prioritize returning
Collaborator:

What state are we in at this point? It seems like we failed a ROLLBACK TO SAVEPOINT at that point. The CockroachDB docs don't say anything about errors or failures here, and I'm not sure if we're still "in" the nested transaction or what.

Might we wind up returning from this function while still in the outermost transaction?
edit: having looked at the other thread, I see that AnsiTransactionManager tries to deal with stuff like this:
https://docs.diesel.rs/master/src/diesel/connection/transaction_manager.rs.html#400-401

Is that behavior a problem for us? e.g., is it possible that we try to do the rollback to savepoint, Diesel winds up handling that, it fails, and then it also rolls back the BEGIN, so we've lost track of what depth we're at?


This is more of a general comment and I'm not sure what we can do about this: it feels like there's some critical state here (like "transaction depth" and "are we moving forward or backward") along with some guarantees made by various Diesel functions and async-bb8-diesel functions, plus invariants maintained by the loop, etc. But I don't really understand what they all are. I don't think this is an async-bb8-diesel problem per se. But the net result is that as a reader I find it pretty hard to understand what's true at each point and have confidence in the code's correctness.

It might help [me?] to draw a state diagram. With that in hand, I wonder if an implementation could use this (maybe with something like typestates) to be more clearly, structurally correct. This sounds like a lot of work, though, and if we're all satisfied that this is correct as is, it may not be worth it.

@smklein (Collaborator, Author):

> What state are we in at this point? It seems like we failed a ROLLBACK TO SAVEPOINT at that point. The CockroachDB docs don't say anything about errors or failures here, and I'm not sure if we're still "in" the nested transaction or what.

It depends on how the rollback_transaction method has failed.

So, to answer your original question, "what state are we in", the answer is "it depends".

  • The transaction depth after seeing this error will always be 1 from Diesel's perspective.
  • We will always have tried to issue ROLLBACK after getting into this error case.

In some situations, the transaction manager will be "broken", and we'll expect all further Diesel operations to return a BrokenTransactionManager error: https://docs.diesel.rs/master/diesel/result/enum.Error.html#variant.BrokenTransactionManager

In other situations, the ROLLBACK error could have come from CockroachDB directly.
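
For reference, that "broken" state surfaces to callers as a dedicated Diesel error variant; a minimal check might look like this (an illustrative sketch, not code from this PR):

use diesel::result::Error as DieselError;

// Illustrative only: the "broken" state mentioned above is what subsequent
// operations report once the transaction manager has been poisoned.
fn transaction_manager_is_broken(err: &DieselError) -> bool {
    matches!(err, DieselError::BrokenTransactionManager)
}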

> Might we wind up returning from this function while still in the outermost transaction? edit: having looked at the other thread, I see that AnsiTransactionManager tries to deal with stuff like this: https://docs.diesel.rs/master/src/diesel/connection/transaction_manager.rs.html#400-401

The Diesel interface tries really hard to ensure that the "transaction depth" is consistent in the success/failure cases -- basically, regardless of whether "rollback" or "commit" succeeds or fails, there are two outcomes:

  1. The transaction "depth" has decreased by one.
  2. The connection is permanently labelled as "broken" for unrecoverable errors (if you look in the Diesel source code, this is like a call to set_in_error). Once this happens, basically all subsequent diesel operations return a "broken transaction manager" error.

> Is that behavior a problem for us? e.g., is it possible that we try to do the rollback to savepoint, Diesel winds up handling that, it fails, and then it also rolls back the BEGIN, so we've lost track of what depth we're at?

I don't think so - I think this would violate the aforementioned principle of "only change the transaction depth by at most one level".

> This is more of a general comment and I'm not sure what we can do about this: it feels like there's some critical state here (like "transaction depth" and "are we moving forward or backward") along with some guarantees made by various Diesel functions and async-bb8-diesel functions, plus invariants maintained by the loop, etc. But I don't really understand what they all are. I don't think this is an async-bb8-diesel problem per se. But the net result is that as a reader I find it pretty hard to understand what's true at each point and have confidence in the code's correctness.
>
> It might help [me?] to draw a state diagram. With that in hand, I wonder if an implementation could use this (maybe with something like typestates) to be more clearly, structurally correct. This sounds like a lot of work, though, and if we're all satisfied that this is correct as is, it may not be worth it.

I agree that typestates could help here, but I'm not sure how much value they have in async-bb8-diesel alone -- it kinda sounds like the property you want extends into Diesel's implementation of transactions too? I have a fear of managing the "transaction depth" inside async-bb8-diesel via typestates, while a client using the API could just skip the async-bb8-diesel level and send requests directly through Diesel -- which would violate our higher-level expectations.

I started to write this up in https://docs.google.com/drawings/d/1JDU5-LsohXP7kEj_9ZFJUz64XyUXNJhZTu4J02sUYV4/edit?usp=sharing -- just trying to explain what Diesel is doing -- but go figure, it's complicated. It wasn't written as a state machine, so this is basically an expansion of code logic into state machine form -- if we want something that looks like a cleaner state machine, we'll need to re-write Diesel's TransactionManager.

Collaborator:

I took a swing at making a state machine diagram for the CockroachDB client-side retry behavior. I intended to finish it and compare it with the code, but haven't been able to prioritize it. I figured I'd drop it here for posterity in case it's useful in the future. (View source to see the underlying mermaid source.)

flowchart TD
    %% node definitions
    idle["connection idle"]
    begin["Execute `BEGIN`"]
    savepoint_start["Execute `SAVEPOINT`"]
    user_stuff["(consumer executes SQL queries)"]
    savepoint_release["Execute `RELEASE SAVEPOINT`"]
    commit["Execute `COMMIT`"]
    retry_decision{"Should retry?"}
    rollback_savepoint["Issue `ROLLBACK TO SAVEPOINT`"]
    rollback_all["Issue `ROLLBACK` (whole txn)"]
    dead["connection dead"]

    %% happy path
    idle -->|consumer calls `transaction_async_with_retry`| begin
    begin -->|ok| savepoint_start
    savepoint_start -->|ok| user_stuff
    user_stuff -->|ok| savepoint_release
    savepoint_release -->|ok| commit
    commit -->|ok| idle

    %% retry path
    user_stuff --> |fail| retry_decision
    retry_decision --> |yes| rollback_savepoint
    rollback_savepoint --> |ok| user_stuff
    savepoint_release --> |fail| retry_decision

    %% failure path
    retry_decision --> |no| rollback_all
    rollback_all --> |ok| idle
    commit --> |fail| rollback_all
    rollback_savepoint --> |fail| rollback_all

    %% other failures
    begin --> |fail| idle
    savepoint_start -->|fail| rollback_all

    %% really bad path
    rollback_all --> |fail| dead
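Rendered as code, the happy and retry paths of that diagram look roughly like the following. This is an illustrative, synchronous sketch only: `exec` stands in for "something that runs one SQL statement" (not a Diesel or async-bb8-diesel API), `user_queries` is the caller's transaction body, `retryable` is the serialization-failure check discussed earlier, and the BEGIN/COMMIT failure edges are simplified away.

type SqlError = String;

fn client_side_retry<E, F>(
    mut exec: E,
    mut user_queries: F,
    retryable: impl Fn(&SqlError) -> bool,
) -> Result<(), SqlError>
where
    E: FnMut(&str) -> Result<(), SqlError>,
    F: FnMut(&mut E) -> Result<(), SqlError>,
{
    exec("BEGIN")?;
    exec("SAVEPOINT cockroach_restart")?;
    loop {
        // Run the caller's queries, then try to release the savepoint; a
        // retryable failure in either step sends us around the loop again.
        let mut attempt = user_queries(&mut exec);
        if attempt.is_ok() {
            attempt = exec("RELEASE SAVEPOINT cockroach_restart");
        }
        match attempt {
            Ok(()) => break,
            Err(err) if retryable(&err) => {
                // Retry path: roll back to the savepoint and go around again.
                if let Err(rollback_err) = exec("ROLLBACK TO SAVEPOINT cockroach_restart") {
                    let _ = exec("ROLLBACK");
                    return Err(rollback_err);
                }
            }
            Err(err) => {
                // Non-retryable: roll back the whole transaction and bail out.
                let _ = exec("ROLLBACK");
                return Err(err);
            }
        }
    }
    exec("COMMIT")
}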

@smklein (Collaborator, Author) commented on Dec 5, 2023

Appreciate the comments - do y'all have any other feedback for this PR? I'm told that this should be part of the release candidate, which is planned to be created tomorrow.

@smklein merged commit ed7ab5e into master on Dec 6, 2023 (4 checks passed).
@smklein deleted the txn-retry branch on December 6, 2023.
smklein added a commit to oxidecomputer/omicron that referenced this pull request on Dec 6, 2023:
Integrates automatic transaction retry into Nexus for most transactions.

Additionally, this PR provides a "RetryHelper" object to help standardize how transaction retry is performed. Currently, after a short randomized wait (up to an upper bound), we retry unconditionally, emitting each attempt to Oximeter for further analysis.

- [x] Depends on oxidecomputer/async-bb8-diesel#58
- [x] As noted in oxidecomputer/async-bb8-diesel#58, this will require customizing CRDB session variables to work correctly. (Edit: this is done on each transaction.)


Part of oxidecomputer/customer-support#46 
Part of #3814
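
The retry pacing described in that commit message might look roughly like the following. This is a hypothetical sketch assuming rand and tokio as dependencies; the bound and helper name are made up, and the real RetryHelper also reports attempts to Oximeter.

use rand::Rng;
use std::time::Duration;

// Illustration only, not the actual RetryHelper: each failed attempt waits a
// random duration, capped at an upper bound, before retrying unconditionally.
const MAX_RETRY_WAIT_MS: u64 = 50; // assumed upper bound, for illustration

async fn wait_before_retry() {
    let wait_ms = rand::thread_rng().gen_range(0..=MAX_RETRY_WAIT_MS);
    tokio::time::sleep(Duration::from_millis(wait_ms)).await;
    // (The real helper also emits each attempt to Oximeter here.)
}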