Commit
Delete failed regions when the saga unwinds (oxidecomputer#7245)
One of the common sharp edges of sagas is that the compensating action of a node does _not_ run if the forward action fails. Said another way, for this node:

    EXAMPLE -> "output" {
      + forward_action
      - forward_action_undo
    }

If `forward_action` fails, `forward_action_undo` is never executed. Forward actions are therefore required to be atomic, in that they either fully apply or don't apply at all.

Sagas with nodes that ensure multiple regions exist cannot be atomic because they can partially fail (for example: what if only 2 out of 3 ensures succeed?). In order for the compensating action to be run, it must exist as a separate node that has a no-op forward action:

    EXAMPLE_UNDO -> "not_used" {
      + noop
      - forward_action_undo
    }
    EXAMPLE -> "output" {
      + forward_action
    }

The region snapshot replacement start saga will only ever ensure that a single region exists, so one might think they could get away with a single node that combines the forward and compensating action - you'd be mistaken! The Crucible agent's region ensure is not atomic in all cases: if the region fails to create, it enters the `failed` state, but is not deleted. Nexus must clean these up.

Fixes an issue that Angela saw where failed regions were taking up disk space in rack2 (oxidecomputer#7209). This commit also includes an omdb command for finding these orphaned regions and optionally cleaning them up.
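The unwind behavior described above can be illustrated with a toy model. This is a minimal sketch, not the saga framework's real API: `Node`, `run_saga`, `ensure_region`, `delete_regions`, and the counter standing in for the Crucible agent's state are all hypothetical names invented for illustration. It only assumes what the message states: undo actions run for nodes whose forward action completed, and a failed ensure leaves a region behind in the `failed` state.

```rust
use std::cell::RefCell;

type ActionResult = Result<(), String>;

/// Toy saga node: a forward action plus an optional compensating action.
/// (Hypothetical type for illustration only.)
struct Node<'a> {
    name: &'static str,
    forward: Box<dyn Fn() -> ActionResult + 'a>,
    undo: Option<Box<dyn Fn() + 'a>>,
}

/// Run nodes in order. If a forward action fails, unwind by running the
/// undo actions of the nodes that already completed, in reverse order.
/// The failing node's own undo action is NOT run.
fn run_saga(nodes: &[Node]) -> ActionResult {
    for (i, node) in nodes.iter().enumerate() {
        if let Err(e) = (node.forward)() {
            for done in nodes[..i].iter().rev() {
                if let Some(undo) = &done.undo {
                    undo();
                }
            }
            return Err(format!("{} failed: {}", node.name, e));
        }
    }
    Ok(())
}

/// Stand-in for a non-atomic region ensure: it fails, but still leaves a
/// region behind in the `failed` state (modeled here as a counter).
fn ensure_region(failed_regions: &RefCell<u32>) -> ActionResult {
    *failed_regions.borrow_mut() += 1;
    Err("region entered the failed state".to_string())
}

/// Stand-in for the cleanup that the compensating action performs.
fn delete_regions(failed_regions: &RefCell<u32>) {
    *failed_regions.borrow_mut() = 0;
}

/// Single node combining forward and undo: returns the number of failed
/// regions left over after the saga unwinds.
fn leftover_combined() -> u32 {
    let failed_regions = RefCell::new(0u32);
    let nodes = vec![Node {
        name: "EXAMPLE",
        forward: Box::new(|| ensure_region(&failed_regions)),
        undo: Some(Box::new(|| delete_regions(&failed_regions))),
    }];
    let _ = run_saga(&nodes);
    let left = *failed_regions.borrow();
    left
}

/// Undo split into a preceding node with a no-op forward action: returns
/// the number of failed regions left over after the saga unwinds.
fn leftover_split() -> u32 {
    let failed_regions = RefCell::new(0u32);
    let nodes = vec![
        Node {
            name: "EXAMPLE_UNDO",
            forward: Box::new(|| -> ActionResult { Ok(()) }),
            undo: Some(Box::new(|| delete_regions(&failed_regions))),
        },
        Node {
            name: "EXAMPLE",
            forward: Box::new(|| ensure_region(&failed_regions)),
            undo: None,
        },
    ];
    let _ = run_saga(&nodes);
    let left = *failed_regions.borrow();
    left
}
```

In this model, `leftover_combined()` leaves the failed region in place because the failing node's undo never runs, while `leftover_split()` cleans it up because the no-op node completed before the failure and is therefore unwound.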