
Manage OPTE V2P mappings #2536

Merged: 20 commits merged into oxidecomputer:main from the nexus_manage_v2p branch on Apr 6, 2023
Conversation

@jmpesp (Contributor) commented Mar 11, 2023:

During instance creation, create the appropriate OPTE V2P mappings on other sleds so that instances there can connect to the new instance.

During instance deletion, remove the corresponding mappings.

During instance migration, both create and remove the corresponding mappings.

This is one part of #2290. Another commit will be required when oxidecomputer/opte#332 is addressed.

@jmpesp requested a review from bnaecker on March 11, 2023.
@gjcolombo (Contributor):

I haven't looked at the PR in any detail yet, but I did want to make a comment about one thing before it gets to be Monday and I forget to mention it:

Quoting the PR description: "During instance migration, both create and remove the corresponding mappings."

I don't think Nexus should do this during the migration saga. I think it's better to do it once the saga is finished and Nexus is sure the instance has migrated, since that minimizes the amount of time the instance will be running with the wrong mappings in place.

For background, the general sequence of events in a migration is

  1. saga runs, starts the target instance, and tells it to start migrating
  2. migration source and target run through the "pre-pause" parts of the migration protocol (negotiating a version, checking compatibility, transferring any state that can be transferred while the source is running)
  3. source pauses & finishes sending state to target
  4. target resumes
  5. Nexus learns migration has succeeded & does any post-migration actions

If Nexus deletes the source's V2P mappings and establishes the target's mappings in step 1, then (I assume) connectivity to the source is degraded from step 1 to step 4. If Nexus waits until after the migration is done, connectivity is interrupted in steps 3 through 5. In most cases, I expect step 2 will take much longer than step 5, because step 2 will involve transferring a lot of dirty pages (in the hopes that Propolis can then avoid transferring them while the VM is paused in step 3), whereas step 5 should just involve some (hopefully comparatively speedy) RPCs to the affected sleds. So, to minimize the time where the instance is running with the wrong V2P mappings, Nexus should change mappings after migration finishes.

The other thing to keep in mind here is that successful completion of the migration saga doesn't imply a successful migration--Nexus doesn't know whether the instance will end up running on the source or the target until the saga has succeeded and the involved Propolises take control of its outcome. I suspect we'll need some more general logic in Nexus that watches incoming instance state changes and can say "oops, this instance's location isn't what I thought it was, let me regenerate the V2P mappings for it." I'm happy to try to add that as part of the migration work I have in my queue.

RFD 361 section 6.1 discusses all this in a little more detail.

Sorry for the drive-by comment here--this area's been top-of-mind for me for the last few weeks so I wanted to mention all this. I'd be glad to help take a look at the code changes in more detail next week.
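As an illustration of the "regenerate the mappings when an instance turns up somewhere unexpected" idea above, a minimal sketch follows; the types and method names here are hypothetical stand-ins, not actual Nexus code.

use uuid::Uuid;

// Hypothetical stand-ins for Nexus state and the V2P refresh operation.
struct Nexus;

impl Nexus {
    fn recorded_sled_for(&self, _instance_id: Uuid) -> Uuid {
        // Look up the sled Nexus currently believes hosts this instance.
        todo!()
    }

    fn regenerate_v2p_mappings(&self, _instance_id: Uuid) {
        // Recompute and push V2P mappings for all of the instance's NICs.
        todo!()
    }
}

// When a runtime-state update arrives, compare the sled that reported it
// with the sled Nexus last recorded for the instance; if they differ, the
// instance moved behind our back and its mappings should be rebuilt.
fn on_instance_state_update(nexus: &Nexus, instance_id: Uuid, reporting_sled: Uuid) {
    if nexus.recorded_sled_for(instance_id) != reporting_sled {
        nexus.regenerate_v2p_mappings(instance_id);
    }
}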

@gjcolombo (Contributor) left a review:

I took a closer look at the Nexus and sled agent parts of the change and left a couple extra questions about the sagas and synchronization between attempts to program mappings. Some of these are more "file an issue and we'll get to it later" sorts of things than actual merge blockers, but I'd like to get aligned on how we're going to program mappings in the instance sagas.

@@ -106,6 +106,9 @@ declare_saga_actions! {
INSTANCE_ENSURE -> "instance_ensure" {
+ sic_instance_ensure
}
V2P_ENSURE -> "v2p_ensure" {
@gjcolombo (Contributor):

Should this go before the instance ensure step so that the mappings exist by the time the instance starts to run? (I suspect this step is also going to be easier to undo than sic_instance_ensure.)

@gjcolombo (Contributor):

I can't leave comments on an arbitrary part of the file (ugh) so leaving this here because it's closely related: if this ordering changes and V2P_ENSURE gets an undo step, we should also update the tests in this module to check that no V2P mappings exist after a saga is torn down (there are a bunch of functions down there that check this sort of thing for other side effects).

@bnaecker (Collaborator):

Agreed, this should come before the instance-ensure call. I also think it makes sense to add a TODO here linking to the relevant OPTE issue about adding deletion of these mappings.

@jmpesp (Contributor, author):

done in 48754ea

@@ -144,6 +144,13 @@ async fn sim_instance_migrate(
let (instance_id, old_runtime) =
sagactx.lookup::<(Uuid, InstanceRuntimeState)>("migrate_instance")?;

// Remove existing V2P mappings for source instance
@gjcolombo (Contributor):

If we do end up doing this as part of the saga (see my big comment on the review at large) I'd be inclined to put this in a separate saga step before this one (similar to what happens in the instance create saga).

@bnaecker (Collaborator):

Agreed, this step and the one adding the new mappings on the destination sled seem like they should be separate saga actions.

(Additional resolved review threads on nexus/src/app/sled.rs.)
@bnaecker (Collaborator) left a review:

Thanks a lot for getting this going! It's really exciting to see more integration with OPTE! I've got a few comments about the general approach. I think we should at least reconsider propagating the mappings to all sleds, regardless of whether they need them. It doesn't seem like a heavy additional lift, and I think it's closer to the intent we're after.

Thanks for all the work @jmpesp!

Comment on lines 288 to 291
#[endpoint {
method = PUT,
path = "/set-v2p",
}]
@bnaecker (Collaborator):

I might structure these endpoints a bit differently. The mapping we're applying is for a specific instance, telling this sled agent (OPTE, really) where it can find it out in the world. I think it makes sense to have endpoints that are PUT /v2p/{instance_id} with the body, and then DELETE /v2p/{instance_id}

Now, that does mean that the body type here isn't quite right. We have potentially multiple NICs per guest, and what OPTE really needs is the mapping from the virtual address of each of those, to the hosting sled. The VNI and the underlay IP are all the same, so I might structure it something like this:

pub struct VirtualAddress {
    pub ip: IpAddr,
    pub mac: MacAddr,
}

pub struct SetVirtualNetworkInterfaceHost {
    pub vni: Vni,
    pub physical_host_ip: Ipv6Addr,
    pub virtual_addresses: Vec<VirtualAddress>,
}

Internally in the simulated sled agent, I would store those as a map from instance ID to the data. That efficiency isn't very relevant, since it's only on the simulated agent, but I think the endpoint structure makes more sense regardless.

As an aside, if we go this route, the type name SetVirtualNetworkInterfaceHost isn't quite right. It would be something like VirtualToPhysicalMapping, which I think is a bit more clear in any case.

@jmpesp (Contributor, author) commented Mar 13, 2023:

Strictly speaking, the mapping we're applying is for a specific network interface, not an instance. For example, a PUT /v1/network-interfaces/{interface} would incur PUT /v2p/{interface} and DELETE /v2p/{interface} calls, but not affect the instance placement. I've done this in ffb4767 (edit: where "this" means changing the endpoints, not doing anything with PUT for network-interfaces).

@bnaecker (Collaborator):

Yes, you're right, the mappings tell OPTE where to find a virtual address in the physical world. And all NICs for an instance are on the same host and share the same VNI, so I was only hoping to simplify the interfaces. If you feel it doesn't do that, that's cool too.

I don't quite follow the bit about incurring some additional calls, could you expand on that?

@jmpesp (Contributor, author):

Sorry, I only mean that "moving" a NIC around (if possible) would incur calls to interface specific endpoints, not instance specific endpoints.
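For reference, here is a rough dropshot sketch of interface-scoped endpoints along the lines settled on above. The context type, handler names, and the exact body fields are illustrative stand-ins rather than the real sled agent API.

use dropshot::{
    endpoint, HttpError, HttpResponseDeleted, HttpResponseUpdatedNoContent,
    Path, RequestContext, TypedBody,
};
use schemars::JsonSchema;
use serde::Deserialize;
use uuid::Uuid;

// Stand-in for the sled agent's server context.
struct SledAgentCtx;

// Body shaped roughly like the SetVirtualNetworkInterfaceHost proposed
// above, narrowed to one interface and given the derives dropshot needs.
#[derive(Deserialize, JsonSchema)]
struct SetVirtualNetworkInterfaceHost {
    vni: u32,
    physical_host_ip: std::net::Ipv6Addr,
    virtual_ip: std::net::IpAddr,
    virtual_mac: String,
}

#[derive(Deserialize, JsonSchema)]
struct V2pPathParam {
    interface_id: Uuid,
}

/// Create or update the V2P mapping for one virtual interface.
#[endpoint {
    method = PUT,
    path = "/v2p/{interface_id}",
}]
async fn set_v2p(
    _rqctx: RequestContext<SledAgentCtx>,
    _path: Path<V2pPathParam>,
    _body: TypedBody<SetVirtualNetworkInterfaceHost>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    // Hand the mapping to OPTE via the port manager here.
    Ok(HttpResponseUpdatedNoContent())
}

/// Remove the V2P mapping for one virtual interface.
#[endpoint {
    method = DELETE,
    path = "/v2p/{interface_id}",
}]
async fn del_v2p(
    _rqctx: RequestContext<SledAgentCtx>,
    _path: Path<V2pPathParam>,
) -> Result<HttpResponseDeleted, HttpError> {
    // Remove the corresponding OPTE entry here.
    Ok(HttpResponseDeleted())
}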

(Resolved review thread on illumos-utils/src/opte/illumos/port_manager.rs.)

Comment on lines +245 to +246
// For every sled that isn't the sled this instance was allocated to, create
// a virtual to physical mapping for each of this instance's NICs.
@bnaecker (Collaborator):

I'm not sure about this approach. I get the point of a quick solution, but a more appropriate one doesn't seem like significantly more work. We're creating a bunch of irrelevant mappings. And more importantly, we've explicitly decided to defer VPC peering in the MVP. I don't think we want to send mappings for instances in another VPC. You rightly point out that OPTE will drop such traffic, but I don't think we should bake that knowledge in here. I think it makes sense to try to send the mappings we explicitly want to allow.

As an alternative, what about including a function call here, something like Nexus::sleds_for_vpc_mapping(). That would currently select all sleds that are in the same VPC as the provided instance. In the future, we can update that to make more fine-grained decisions, taking into account peering for example. But we have an interface for selecting the "right" sleds, encapsulating what "right" means.

I believe the actual SQL query to derive this information would be pretty simple. It should look basically like this:

SELECT
    active_server_id
FROM
    (
        SELECT
            instance_id
        FROM
            network_interface
        WHERE
            vpc_id = $VPC_ID
            AND time_deleted IS NULL
        LIMIT
            1
    )
        AS inst
    INNER JOIN instance ON inst.instance_id = instance.id

That selects the ID of all instances from the NIC table sharing the same VPC. (You have that from the call to derive_network_interface_info().) Then we join with the instance table on the primary key, and select the sled ID. It's not too much additional work to join on the sled table and pull the IP directly, too.

This is definitely no worse than we have now, in terms of atomicity, and I think it allows us to more faithfully represent the intent and evolve as we need finer-grained data.

@bnaecker (Collaborator):

Oh, I should have added a big caveat! Diesel can be an absolute nightmare sometimes. The SQL query here is pretty easy, but it involves a subquery. That can sometimes be very hard to express in Diesel. I think it should be straightforward here, but if we find that it can't be expressed easily in Diesel's DSL, then we should seriously reconsider. Hand-rolled queries are definitely much more work.
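For a sense of the shape, here is a rough Diesel sketch of that query, expressing the subquery with eq_any. The table! definitions are simplified stand-ins; the real omicron schema (and its async Diesel wrappers) differs.

use diesel::pg::PgConnection;
use diesel::prelude::*;
use uuid::Uuid;

// Simplified stand-in table definitions; the real schema lives in the Nexus
// db-model crate and may use different names and types.
diesel::table! {
    network_interface (id) {
        id -> Uuid,
        instance_id -> Uuid,
        vpc_id -> Uuid,
        time_deleted -> Nullable<Timestamptz>,
    }
}

diesel::table! {
    instance (id) {
        id -> Uuid,
        active_server_id -> Uuid,
    }
}

diesel::allow_tables_to_appear_in_same_query!(instance, network_interface);

// Return the IDs of all sleds hosting an instance with a live NIC in `vpc`.
fn sleds_for_vpc_mapping(
    conn: &mut PgConnection,
    vpc: Uuid,
) -> Result<Vec<Uuid>, diesel::result::Error> {
    // Subquery: instances that have a non-deleted NIC in this VPC.
    let instances_in_vpc = network_interface::table
        .filter(network_interface::vpc_id.eq(vpc))
        .filter(network_interface::time_deleted.is_null())
        .select(network_interface::instance_id);

    // Filter the instance table by that subquery and pull each one's sled.
    instance::table
        .filter(instance::id.eq_any(instances_in_vpc))
        .select(instance::active_server_id)
        .distinct()
        .load(conn)
}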

@jmpesp (Contributor, author):

It's not that it's impossible to do, but I'm anxious about TOCTOU (like Greg mentioned in another comment). Without something always running in the background that can apply corrections, we run the risk of these mappings not being correct.

@bnaecker (Collaborator):

I agree there is a risk of these mappings being incorrect, for example if an instance is moved before they can be applied to all sleds. I don't see how propagating them to sleds that don't need it solves that. As an example, if we propagate a mapping to some sled, but the instance moves, that sled now has the wrong mapping in any case, even if that's only for some period of time.

I had imagined we'd solve this with the mechanisms put forward in RFD 373 on reliable persistent workflows, such as generation numbers. We could compute the set of mappings at some point, along with a generation number. If a sled receives a mapping that is prior to its existing generation, it may simply ignore it. It still seems like we want to compute the right set of data at each generation.
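As a rough illustration of that generation-number idea (all types here are made up for the sketch; RFD 373 covers the real design space):

use std::collections::HashMap;
use uuid::Uuid;

// Stand-in for whatever data OPTE needs for one virtual interface.
struct V2pMapping {
    vni: u32,
    physical_host_ip: std::net::Ipv6Addr,
}

// Per-sled view of the mappings, tagged with the generation Nexus computed.
struct SledV2pState {
    generation: u64,
    mappings: HashMap<Uuid, V2pMapping>, // keyed by interface ID
}

impl SledV2pState {
    /// Apply a full mapping set computed by Nexus at `generation`,
    /// ignoring anything older than what this sled has already applied.
    fn apply(&mut self, generation: u64, mappings: HashMap<Uuid, V2pMapping>) {
        if generation <= self.generation {
            return; // stale update: keep the newer state we already have
        }
        self.generation = generation;
        self.mappings = mappings;
    }
}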

@jmpesp (Contributor, author):

Yeah, I'm not sure what I was thinking.

@jmpesp (Contributor, author):

@bnaecker: @gjcolombo and I discussed the broadcast method, and are convinced at this point that it seems safe. See the comment chain at #2536 (comment).

After a re-read, I think I grok the generation number approach outlined in RFD 373. At this point, though, I think I'd like to create a follow-on issue for that work, given that we're thinking the broadcast method is safe. Thoughts?

.into();

let mut last_sled_id: Option<Uuid> = None;
loop {
@bnaecker (Collaborator):

Maybe a nit, but it seems like this loop could be parallelized and then collected, via something like FuturesUnordered.

@jmpesp (Contributor, author):

I wasn't smart enough to do this with FuturesUnordered, so I did it with tokio::spawn instead: 117f8b3
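For reference, the FuturesUnordered version looks roughly like this; the per-sled call is a stand-in for the real sled agent client request.

use futures::stream::{FuturesUnordered, StreamExt};
use uuid::Uuid;

// Stand-in for the request the loop body makes to one sled agent.
async fn set_v2p_on_sled(_sled_id: Uuid) -> Result<(), String> {
    // ... PUT the mapping to this sled's agent here ...
    Ok(())
}

// Issue the requests to all sleds concurrently instead of awaiting each one
// in turn, and fail on the first error.
async fn push_to_all_sleds(sled_ids: Vec<Uuid>) -> Result<(), String> {
    let mut requests: FuturesUnordered<_> =
        sled_ids.into_iter().map(set_v2p_on_sled).collect();
    while let Some(result) = requests.next().await {
        result?;
    }
    Ok(())
}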

Comment on lines 1026 to 1027
// TODO-idempotent if this action fails half way through, unwind is not
// called!
@bnaecker (Collaborator):

You make a note about this elsewhere, but I'm wondering about unwinding this saga. The way we've structured this, we fail the entire saga node if the request to update mappings on any one sled fails, but without unwinding the mappings we've already sent.

One option would be to put a node before this one whose forward action is a no-op and whose undo deletes all mappings for the instance. That way we always unwind them, regardless of whether we succeed partway through the following node. I get that there's no actual way to delete these mappings until we fix OPTE#332, but it would structure the saga in a way that alleviates the fact that the sic_v2p_ensure action really takes many individual actions.

@jmpesp (Contributor, author):

done in 48754ea
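For context, a conceptual sketch of that pattern in the same declare_saga_actions! style as the hunk quoted earlier; the node and function names are guesses at the shape, not the contents of 48754ea. The forward action is a no-op and the undo deletes all of the instance's mappings, so a partial failure in the following node still unwinds through the cleanup.

V2P_ENSURE_UNDO -> "v2p_ensure_undo" {
    + sic_noop
    - sic_v2p_ensure_undo
}
V2P_ENSURE -> "v2p_ensure" {
    + sic_v2p_ensure
}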

let sleds_page =
self.sleds_list(&self.opctx_alloc, &pagparams).await?;

for sled in &sleds_page {
@bnaecker (Collaborator):

Same note about parallelization.

@jmpesp (Contributor, author):

see 117f8b3

.fetch_for(authz::Action::Read)
.await?;

let instance_nics = self
@bnaecker (Collaborator) commented Mar 11, 2023:

So, if we change the endpoint in the sled agent to be something like DELETE /v2p/{instance_id}, we could remove this call entirely. We only need the instance ID to delete that instance's mappings on each sled.

@jmpesp (Contributor, author):

I don't grok this - won't the OPTE ioctl require all fields (it doesn't know about instance or NIC id) to delete the mapping?

@bnaecker (Collaborator):

Yeah I think you're right. I might have been conflating this with the simulated sled agent, where things are stored that way. It might be possible to send just the VNI + MAC, but I don't see that being a huge win.

for sled_agent in &sled_agents {
let v2p_mappings = sled_agent.v2p_mappings.lock().await;
if sled_agent.id != db_instance.runtime().sled_id {
assert!(!v2p_mappings.is_empty());
@bnaecker (Collaborator):

Could we test the contents of the mappings? I don't see any other location where we ensure we've really provided the right data to OPTE.

@jmpesp (Contributor, author):

Good call, done in 210adc9

@jmpesp (Contributor, author) commented Apr 4, 2023:

@luqmana: dealt with the conflicts; let me know if there are further changes you'd like to see.

@luqmana self-requested a review on April 4, 2023.
@luqmana (Contributor) left a review:

Thanks for plumbing this, James! Looks mostly good, but I did have a few questions.

(Resolved review threads on illumos-utils/src/opte/illumos/port_manager.rs, nexus/src/app/sagas/instance_create.rs, nexus/src/app/sagas/instance_migrate.rs, nexus/src/app/sled.rs, and nexus/tests/integration_tests/instances.rs.)
@luqmana (Contributor) left a review:

Thanks, James. This definitely improves the current state of things in allowing cross-sled communication. There are still a couple of TODOs left, but those can be tracked in follow-up issues.

@jmpesp merged commit bea1af3 into oxidecomputer:main on Apr 6, 2023.
@jmpesp deleted the nexus_manage_v2p branch on April 6, 2023.