Delete failed regions when the saga unwinds (oxidecomputer#7245)
One of the common sharp edges of sagas is that the compensating action
of a node does _not_ run if the forward action fails. Said another way,
for this node:

    EXAMPLE -> "output" {
      + forward_action
      - forward_action_undo }

If `forward_action` fails, `forward_action_undo` is never executed.
Forward actions are therefore required to be atomic, in that they either
fully apply or don't apply at all.

A saga node that ensures multiple regions exist cannot be atomic because
it can partially fail (for example: what if only 2 out of 3 ensures
succeed?). For the compensating action to run in that case, it must exist
as a separate, earlier node that has a no-op forward action:

    EXAMPLE_UNDO -> "not_used" {
      + noop
      - forward_action_undo }

    EXAMPLE -> "output" {
      + forward_action }
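
The difference between the two layouts can be sketched with a toy saga
runner. This is a minimal illustration, not steno's API: `Node`,
`run_saga`, and the log strings are made up for the example, and the only
behavior modeled is that unwinding skips the node whose forward action
failed.

```rust
// Toy saga runner for illustration only; this is NOT steno's API.
type Action = fn(&mut Vec<String>) -> Result<(), String>;

struct Node {
    name: &'static str,
    forward: Action,
    undo: Option<Action>,
}

/// Runs nodes in order. On failure, unwinds only the nodes whose forward
/// action completed, in reverse order. The failed node's own undo is skipped.
fn run_saga(nodes: &[Node], log: &mut Vec<String>) -> Result<(), String> {
    for (i, node) in nodes.iter().enumerate() {
        if let Err(e) = (node.forward)(log) {
            for done in nodes[..i].iter().rev() {
                if let Some(undo) = done.undo {
                    // A real undo can fail too; the toy ignores that.
                    let _ = undo(log);
                }
            }
            return Err(format!("{} failed: {}", node.name, e));
        }
    }
    Ok(())
}

fn noop(_log: &mut Vec<String>) -> Result<(), String> {
    Ok(())
}

// Hypothetical non-atomic forward action: does part of its work, then fails.
fn forward_action(log: &mut Vec<String>) -> Result<(), String> {
    log.push("partial work".to_string());
    Err("ensure failed".to_string())
}

fn forward_action_undo(log: &mut Vec<String>) -> Result<(), String> {
    log.push("cleaned up".to_string());
    Ok(())
}

/// One node carrying both the forward action and its undo.
fn combined_layout() -> Vec<String> {
    let mut log = Vec::new();
    let nodes = [Node {
        name: "EXAMPLE",
        forward: forward_action,
        undo: Some(forward_action_undo as Action),
    }];
    let _ = run_saga(&nodes, &mut log);
    log
}

/// The undo lives in an earlier node with a no-op forward action.
fn split_layout() -> Vec<String> {
    let mut log = Vec::new();
    let nodes = [
        Node {
            name: "EXAMPLE_UNDO",
            forward: noop,
            undo: Some(forward_action_undo as Action),
        },
        Node { name: "EXAMPLE", forward: forward_action, undo: None },
    ];
    let _ = run_saga(&nodes, &mut log);
    log
}

fn main() {
    println!("combined: {:?}", combined_layout());
    println!("split:    {:?}", split_layout());
}
```

In the combined layout the partial work is left behind; in the split
layout the earlier node's undo runs during unwind and cleans it up.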

The region snapshot replacement start saga will only ever ensure that a
single region exists, so it is tempting to use a single node that
combines the forward and compensating actions. That does not work: the
Crucible agent's region ensure is not atomic in all cases. If the region
fails to create, it enters the `failed` state but is not deleted. Nexus
must clean these up.
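
Why a single-region ensure still needs cleanup can be shown with a toy
model of the agent. Everything here is hypothetical (`ToyAgent`,
`RegionState`, the `disk_ok` flag); it is not the Crucible agent's API and
only models the behavior described above: a failed create leaves a
`failed` region record behind until something deletes it.

```rust
// Toy model only; names and states are illustrative assumptions, not the
// Crucible agent's real types or endpoints.
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum RegionState {
    Created,
    Failed,
}

#[derive(Default)]
struct ToyAgent {
    regions: HashMap<u32, RegionState>,
}

impl ToyAgent {
    /// Non-atomic ensure: on failure the region stays behind as `Failed`.
    fn ensure_region(&mut self, id: u32, disk_ok: bool) -> Result<(), String> {
        if disk_ok {
            self.regions.insert(id, RegionState::Created);
            Ok(())
        } else {
            self.regions.insert(id, RegionState::Failed);
            Err(format!("region {id} failed to create"))
        }
    }

    /// Idempotent delete: fine to call whether or not the region exists.
    fn delete_region(&mut self, id: u32) {
        self.regions.remove(&id);
    }
}

/// Returns the state left behind by the failed ensure, and how many regions
/// remain after the cleanup that a separate undo node would perform.
fn unwind_after_failed_ensure() -> (Option<RegionState>, usize) {
    let mut agent = ToyAgent::default();
    let result = agent.ensure_region(7, false);
    let leftover = agent.regions.get(&7).copied();
    if result.is_err() {
        // Stands in for the compensating action run during saga unwind.
        agent.delete_region(7);
    }
    (leftover, agent.regions.len())
}

fn main() {
    let (leftover, remaining) = unwind_after_failed_ensure();
    println!("left behind by failed ensure: {:?}", leftover);
    println!("regions remaining after cleanup: {}", remaining);
}
```

Because the delete is idempotent in this sketch, the same cleanup is safe
whether the ensure failed before or after the region record was written.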

Fixes an issue that Angela saw where failed regions were taking up disk
space in rack2 (oxidecomputer#7209). This commit also includes an omdb command for
finding these orphaned regions and optionally cleaning them up.
jmpesp authored Dec 17, 2024
1 parent bb0b172 commit 4a4d6ca
Showing 4 changed files with 583 additions and 30 deletions.