Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split starting running snapshots into two steps #3097

Merged
merged 1 commit into from
May 15, 2023

Conversation

jmpesp
Copy link
Contributor

@jmpesp jmpesp commented May 12, 2023

When Saga nodes fail, the corresponding undo function is not executed. If a saga node can partially succeed before failing, the undo function has to be split into a separate saga step in order to properly unwind the created resources. This wasn't being done for running snapshots, so this commit corrects that and adds a test for it.

Fixes #3085

When Saga nodes fail, the corresponding undo function is not executed.
If a saga node can partially succeed before failing, the undo function
has to be split into a separate saga step in order to properly unwind
the created resources. This wasn't being done for running snapshots, so
this commit corrects that and adds a test for it.

Fixes oxidecomputer#3085
@jmpesp jmpesp requested a review from davepacheco May 12, 2023 15:21
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by the pattern of having a node in the DAG with no action, just an undo, followed by a node with just an action, no undo. Usually, the undo action is supposed to go with the corresponding action. I gather doing that here didn't do the right thing because the action is neither succeeding nor failing, but partially completing. Would that be better split it into multiple actions that themselves would either succeed or fail? But maybe that's not possible here because the number of actions is runtime-dependent (depending on the number of allocated regions)?

I don't know that it causes a correctness problem here.

@jmpesp
Copy link
Contributor Author

jmpesp commented May 15, 2023

Usually, the undo action is supposed to go with the corresponding action. I gather doing that here didn't do the right thing because the action is neither succeeding nor failing, but partially completing. Would that be better split it into multiple actions that themselves would either succeed or fail? But maybe that's not possible here because the number of actions is runtime-dependent (depending on the number of allocated regions)?

Yeah, having the undo action go with the corresponding action is only valid when the saga node's action is atomic. If the saga node can partially succeed then this two step process is required to ensure that the unwind is executed. Of course, the unwind may also partially succeed, which is another problem entirely.

It would be better to split it into multiple actions depending on what is actually required (each with only one idempotent call to a corresponding crucible agent) but you're correct in that this is runtime-dependent, and the code that builds the saga DAG does not have access to the database and cannot make complicated runtime decisions. get_allocated_regions could return any number of regions. It's true that we currently do not stitch multiple subregions together or have complicated topologies, but in the future we probably will.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks for that explanation. I wonder if in the future we might be able to fetch those regions when we create the DAG (which I think is usually a time we could hit the database, since it's usually happening in some API call), store them as a "constant" node or something like that, and then reference it later. Anyway, that seems beyond the scope of this change.

@jmpesp jmpesp merged commit b417a6c into oxidecomputer:main May 15, 2023
@jmpesp jmpesp deleted the split_into_two_steps branch May 17, 2023 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nexus unwrap in saga_exec.rs
2 participants