Define a saga for instance start #3873
Conversation
Create a saga that starts instances. This has some immediate benefits:

- It's no longer possible to leak an instance registration during start; previously this could happen if Nexus crashed while handling a start call.
- The saga synchronizes properly with concurrent attempts to delete an instance; the existing start routine may not be handling this correctly (it can look up an instance and decide it's OK to start, then start talking to sled agent about it while a deletion saga runs concurrently and deletes the instance).
- The saga establishes networking state (Dendrite NAT entries, OPTE V2P mappings) for a newly started instance if it wasn't previously established. This is a stopgap measure to ensure that this state exists when restarting an instance after a cluster is restarted. It should eventually be replaced by a step that triggers the appropriate networking RPW(s).

This saga can be used, at least in theory, as a subsaga of the instance create saga to replace that saga's logic for starting a newly-created instance. This work isn't done in this PR, though. (The change isn't trivial because the new start saga expects a prior instance record as a parameter, and the create saga can't construct *a priori* the instance record it intends to insert into CRDB.)
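The crash-safety claim above comes from the saga pattern itself: each step has a forward action and a compensating undo, and a failure unwinds completed steps in reverse order. A toy sketch of that pattern (node contents and types here are purely illustrative, not the PR's actual Steno saga nodes):

```rust
// Illustrative saga runner: each node has a forward action and a
// compensating undo. These are NOT the PR's real saga nodes.
struct Node {
    /// Forward action: perform one step of the start sequence.
    action: fn(&mut Vec<&'static str>) -> Result<(), ()>,
    /// Compensating action: undo the step if a later node fails.
    undo: fn(&mut Vec<&'static str>),
}

fn run_saga(nodes: &[Node], log: &mut Vec<&'static str>) -> Result<(), ()> {
    let mut completed = 0;
    for node in nodes {
        if (node.action)(log).is_err() {
            // Unwind only the nodes that completed, in reverse order,
            // so a mid-saga failure can't leak a partial registration.
            for done in nodes[..completed].iter().rev() {
                (done.undo)(log);
            }
            return Err(());
        }
        completed += 1;
    }
    Ok(())
}

fn main() {
    let mut log = Vec::new();
    let nodes = [
        Node {
            action: |l| { l.push("do: mark instance Starting"); Ok(()) },
            undo: |l| l.push("undo: revert instance to Stopped"),
        },
        Node {
            // Simulate a failure while setting up networking state.
            action: |l| { l.push("do: set up NAT / V2P (fails)"); Err(()) },
            undo: |l| l.push("undo: tear down NAT / V2P"),
        },
    ];
    assert!(run_saga(&nodes, &mut log).is_err());
    // Only the completed first node is undone; the failed node is
    // assumed to have cleaned up after itself.
    assert_eq!(
        log,
        vec![
            "do: mark instance Starting",
            "do: set up NAT / V2P (fails)",
            "undo: revert instance to Stopped",
        ]
    );
}
```

This is why a Nexus crash mid-start no longer leaks state: either the saga completes, or the saga executor (Steno, in omicron's case) drives the undo actions.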
Looks great, thanks for these tests!
```rust
match db_instance.runtime_state.state.0 {
    InstanceState::Starting | InstanceState::Running => {
        return Ok(db_instance)
    }
    InstanceState::Stopped => {}
    _ => {
        return Err(Error::conflict(&format!(
            "instance is in state {} but must be {} to be started",
            db_instance.runtime_state.state.0,
            InstanceState::Stopped
        )))
    }
}
```
Any concern of TOCTTOU here? Kinda seems like this state checking could/should be part of the saga?
We should be covered here by the record's state generation: if the state changes, the saga will fail to transition from Stopped to Starting (because its generation number will be outdated) and will bail. This still isn't a perfect scheme, since it's possible for the saga to hit the "oops the state changed" condition even if the state has come to rest at Stopped. To completely fix all this I think we need to fix another bug first (bet you can guess which one...).
I'll add a comment here about the possible TOCTTOU and how we avoid it. There's another comment in the saga explaining the other remaining problems with the synchronization scheme in this PR.
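The generation-number guard described above is the classic optimistic-concurrency defense against TOCTTOU. A toy sketch of the scheme (field and function names are hypothetical, not Nexus's actual schema; in the real system the check is a conditional CRDB update):

```rust
// Toy model of a generation-guarded state transition. Every successful
// write bumps `gen`, so an update conditioned on a stale generation is
// rejected and the saga bails instead of clobbering newer state.
#[derive(Clone, Copy, PartialEq, Debug)]
enum InstanceState {
    Stopped,
    Starting,
}

struct InstanceRecord {
    state: InstanceState,
    gen: u64,
}

/// Attempt the Stopped -> Starting transition, conditioned on the
/// generation number observed at the initial lookup.
fn try_start(rec: &mut InstanceRecord, observed_gen: u64) -> Result<(), String> {
    if rec.gen != observed_gen {
        return Err(format!(
            "state changed since lookup (gen {} != {})",
            rec.gen, observed_gen
        ));
    }
    rec.state = InstanceState::Starting;
    rec.gen += 1;
    Ok(())
}

fn main() {
    let mut rec = InstanceRecord { state: InstanceState::Stopped, gen: 4 };
    // Normal case: the generation still matches, so the transition wins.
    assert!(try_start(&mut rec, 4).is_ok());
    assert_eq!(rec.state, InstanceState::Starting);
    assert_eq!(rec.gen, 5);
    // TOCTTOU case: a caller still holding the old generation is rejected.
    assert!(try_start(&mut rec, 4).is_err());
}
```

As the comment above notes, this rejects stale writers but can also spuriously fail a start whose instance merely passed through other states before coming to rest at Stopped again.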
```rust
InstanceState::Starting | InstanceState::Running => {
    return Ok(db_instance)
}
InstanceState::Stopped => {}
```
The saga itself seems fine accepting instances in the Creating state -- is that not acceptable here?
I think we have to be careful about that here. There's a small window in the create saga where the instance record is in CRDB (and has the Creating state) while the objects that support it (external IPs, attached disks, etc.) are still being set up, and we don't want to allow a start request to go through while we're in that state.
The saga accepts Creating as a prior state so that it can be used as a subsaga from within the instance create saga (even though that's not hooked up yet). I could buy, though, that a better approach is to have the create saga fully create the instance, then unconditionally move it to Stopped, and then have it invoke the start saga if needed. That would make everything very consistent at the cost of having a Creating instance (where the create request has `start: true`) briefly go to Stopped before entering Starting. That seems to me like a smallish price to pay (and arguably it's strictly better because it's more accurate). I'll mull this over, but WDYT of this approach?
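The proposed uniform path can be sketched as two tiny state functions (the helpers and states here are a hypothetical model of the proposal, not Nexus code):

```rust
// Toy model of the proposed flow: create always lands at Stopped, and a
// `start: true` request then reuses the ordinary start path. Names are
// illustrative only.
#[derive(Clone, Copy, PartialEq, Debug)]
enum InstanceState {
    Creating,
    Stopped,
    Starting,
}

/// Finish the create saga: once IPs, disks, etc. are attached, always
/// move Creating -> Stopped, even when the caller asked to auto-start.
fn finish_create(state: InstanceState) -> InstanceState {
    assert_eq!(state, InstanceState::Creating);
    InstanceState::Stopped
}

/// The start saga only ever begins from Stopped.
fn maybe_start(state: InstanceState, start_requested: bool) -> InstanceState {
    if start_requested && state == InstanceState::Stopped {
        InstanceState::Starting
    } else {
        state
    }
}

fn main() {
    let created = finish_create(InstanceState::Creating);
    assert_eq!(created, InstanceState::Stopped);
    // A create with start: true briefly visits Stopped, then starts.
    assert_eq!(maybe_start(created, true), InstanceState::Starting);
    // A create with start: false simply rests at Stopped.
    assert_eq!(maybe_start(created, false), InstanceState::Stopped);
}
```

The cost is exactly the brief Creating -> Stopped -> Starting detour described above; the benefit is that every start, whether at creation or later, goes through one code path.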
I like that approach. I was considering other options, but they all kinda seem like they incur an equivalent CRDB write to make the instance transition from "being constructed, not visible" to "constructed enough that we can correctly start it", which is basically what you're also proposing with the intermediate stopped state.
Yeah, having slept on it, I'm comfortable saying the value of having a uniform start path outweighs the benefit of going directly from Creating to Starting without very briefly passing through Stopped first. Filed #3883 to track the remaining work here.
Thanks as always for the review! I've added some comments in d59b4cf to cover the topics discussed above. I'll address the remaining open question about whether the instance create saga should move instances to Stopped before starting them in whatever follow-on PR connects the create saga to the start saga.
Finding boundary switches in the instance start saga requires fleet query access. Use the Nexus instance allocation context to get it instead of the saga initiator's operation context. (This was introduced in #3873; it wasn't previously a problem because the instance create saga set up instance NAT state, and that saga received a boundary switch list as a parameter, which parameter was generated by using the instance allocation context. #4194 made this worse by making instance create use the start saga to start instances instead of using its own saga nodes.) Update the instance-in-silo integration test to make sure that instances created by a silo collaborator actually start. This is unfortunately not very elegant. The `instance_simulate` function family normally uses `OpContext::for_tests` to get an operation context, but that context is associated with a user that isn't in the test silo. To get around this, add some simulation interfaces that take an explicit `OpContext` and then generate one corresponding to the silo user for the test case in question. It seems like it'd be nicer to give helper routines like `instance_simulate` access to a context that is omnipotent across all silos, but I wasn't sure how best to do this. I'm definitely open to suggestions here. Tested via cargo tests. Fixes #4272.
Tested via assorted new cargo tests and by launching a dev cluster with the changes, stopping an instance, restarting it, and verifying that the instance restarted correctly and that Nexus logs contained the expected log lines.
Fixes #2824. Fixes #3813.