Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nexus-db-queries] make saga_update more resilient, record_event idempotent #6113

Conversation

sunshowers
Copy link
Contributor

@sunshowers sunshowers commented Jul 17, 2024

See discussion in #2416 and #6090 (comment).

A summary of the changes here:

  1. Made saga_create_event idempotent. Previously, creating another event that duplicated the one which already existed would fail -- now it succeeds. These events are meant to be an append-only idempotent log, so this is okay. Also added a test for this.
  2. saga_update_state was already idempotent -- added a test which made sure of this. Also added a comment about how idempotence may not be enough in the future.
  3. Added a retry loop around saga state updates, similar to the one around recording events.

Created using spr 1.3.6-beta.1
@sunshowers sunshowers requested review from davepacheco and hawkw July 17, 2024 22:18
Created using spr 1.3.6-beta.1
@sunshowers sunshowers requested a review from smklein July 17, 2024 22:24
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! I had a couple tiny suggestions, but I don't see any issue with the change overall.

Comment on lines +53 to +57
// Consider the situation where a saga event gets recorded and
// committed, but there's a network reset which makes the client
// (us) believe that the event wasn't recorded. If we retry the
// event, we want to not fail with a conflict.
.on_conflict((dsl::saga_id, dsl::node_id, dsl::event_type))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this makes sense to me --- I appreciate the comment!

"call_count" => call_count,
"total_duration" => ?total_duration,
);
} else if total_duration > Duration::from_secs(20) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpicky, unimportant: maybe make the duration a constant?

Comment on lines 77 to 96
// About conflicts
// ---------------
//
// This function is meant to be called in a loop, so that in the event
// of network flakiness, the operation is retried until successful.
// Currently, if the value of `saga_state` in the database is the same
// as the value we're trying to set it to, the update will be a no-op.
// That is okay, because at any time only one SEC will update the saga.
// (For now, we're implementing saga adoption only in cases where the
// original SEC/Nexus has been expunged.)
//
// However, in the future, it may be possible for multiple SECs to try
// and update the same saga, and overwrite each other's state. For
// example, one SEC might try and update the state to Running while the
// other one updates it to Done. That case would have to be carefully
// considered and tested here, probably using the (currently unused)
// `current_adopt_generation` field to enable optimistic locking.
//
// To reiterate, we are *not* considering the case where several SECs
// try to update the same saga. That will be a future enhancement.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take it or leave it: this kind of feels like it ought to be API documentation on the method, rather than an internal implementation comment? IIUC it is describing what the caller is expected to do with the function, which feels like API docs to me...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable.

Created using spr 1.3.6-beta.1
@sunshowers
Copy link
Contributor Author

Thanks! I'll let one of Sean and Dave also review this just to be sure we've gotten everything we can think of here.

@hawkw
Copy link
Member

hawkw commented Jul 17, 2024

FWIW, it might be worth seeking out a review from someone more qualified with this stuff than I am; I don't actually know this code at all --- I'm not totally sure why #6090 was assigned to me in the first place, it might've been a mistake?

@hawkw
Copy link
Member

hawkw commented Jul 17, 2024

Thanks! I'll let one of Sean and Dave also review this just to be sure we've gotten everything we can think of here.

Ah, yup, looks like you replied while I was writing my prior comment. Waiting for a Sean or Dave review seems like a good idea!

Created using spr 1.3.6-beta.1
@sunshowers sunshowers enabled auto-merge (squash) July 18, 2024 20:39
Created using spr 1.3.6-beta.1
@sunshowers sunshowers merged commit 708c288 into main Jul 18, 2024
22 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/nexus-db-queries-make-saga_update-more-resilient-record_event-idempotent branch July 18, 2024 22:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants