
Manage OPTE V2P mappings #2536

Merged: 20 commits merged into oxidecomputer:main from the nexus_manage_v2p branch on Apr 6, 2023
Conversation

@jmpesp (Contributor) commented Mar 11, 2023:

During instance creation, create the appropriate OPTE V2P mappings on other sleds so that instances there can connect to the new instance.

During instance deletion, remove the corresponding mappings.

During instance migration, both create and remove the corresponding mappings.

This is one part of #2290. Another commit will be required when oxidecomputer/opte#332 is addressed.

@jmpesp requested a review from bnaecker on March 11, 2023.
@gjcolombo (Contributor):

I haven't looked at the PR in any detail yet, but I did want to make a comment about one thing before it gets to be Monday and I forget to mention it:

Quoting the PR description: "During instance migration, both create and remove the corresponding mappings."

I don't think Nexus should do this during the migration saga. I think it's better to do it once the saga is finished and Nexus is sure the instance has migrated, since that minimizes the amount of time the instance will be running with the wrong mappings in place.

For background, the general sequence of events in a migration is

  1. saga runs, starts the target instance, and tells it to start migrating
  2. migration source and target run through the "pre-pause" parts of the migration protocol (negotiating a version, checking compatibility, transferring any state that can be transferred while the source is running)
  3. source pauses & finishes sending state to target
  4. target resumes
  5. Nexus learns migration has succeeded & does any post-migration actions

If Nexus deletes the source's V2P mappings and establishes the target's mappings in step 1, then (I assume) connectivity to the source is degraded from step 1 to step 4. If Nexus waits until after the migration is done, connectivity is interrupted in steps 3 through 5. In most cases, I expect step 2 will take much longer than step 5, because step 2 will involve transferring a lot of dirty pages (in the hopes that Propolis can then avoid transferring them while the VM is paused in step 3), whereas step 5 should just involve some (hopefully comparatively speedy) RPCs to the affected sleds. So, to minimize the time where the instance is running with the wrong V2P mappings, Nexus should change mappings after migration finishes.

The other thing to keep in mind here is that successful completion of the migration saga doesn't imply a successful migration--Nexus doesn't know whether the instance will end up running on the source or the target until the saga has succeeded and the involved Propolises take control of its outcome. I suspect we'll need some more general logic in Nexus that watches incoming instance state changes and can say "oops, this instance's location isn't what I thought it was, let me regenerate the V2P mappings for it." I'm happy to try to add that as part of the migration work I have in my queue.

RFD 361 section 6.1 discusses all this in a little more detail.

Sorry for the drive-by comment here--this area's been top-of-mind for me for the last few weeks so I wanted to mention all this. I'd be glad to help take a look at the code changes in more detail next week.
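As an illustration of the "regenerate the mappings when an instance turns up somewhere unexpected" idea above, a minimal sketch follows; the types and method names here are hypothetical stand-ins, not actual Nexus code.

use uuid::Uuid;

// Hypothetical stand-ins for Nexus state and the V2P refresh operation.
struct Nexus;

impl Nexus {
    fn recorded_sled_for(&self, _instance_id: Uuid) -> Uuid {
        // Look up the sled Nexus currently believes hosts this instance.
        todo!()
    }

    fn regenerate_v2p_mappings(&self, _instance_id: Uuid) {
        // Recompute and push V2P mappings for all of the instance's NICs.
        todo!()
    }
}

// When a runtime-state update arrives, compare the sled that reported it
// with the sled Nexus last recorded for the instance; if they differ, the
// instance moved behind our back and its mappings should be rebuilt.
fn on_instance_state_update(nexus: &Nexus, instance_id: Uuid, reporting_sled: Uuid) {
    if nexus.recorded_sled_for(instance_id) != reporting_sled {
        nexus.regenerate_v2p_mappings(instance_id);
    }
}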

@gjcolombo (Contributor) left a review:

I took a closer look at the Nexus and sled agent parts of the change and left a couple extra questions about the sagas and synchronization between attempts to program mappings. Some of these are more "file an issue and we'll get to it later" sorts of things than actual merge blockers, but I'd like to get aligned on how we're going to program mappings in the instance sagas.

@@ -106,6 +106,9 @@ declare_saga_actions! {
INSTANCE_ENSURE -> "instance_ensure" {
+ sic_instance_ensure
}
V2P_ENSURE -> "v2p_ensure" {
@gjcolombo (Contributor):

Should this go before the instance ensure step so that the mappings exist by the time the instance starts to run? (I suspect this step is also going to be easier to undo than sic_instance_ensure.)

@gjcolombo (Contributor):

I can't leave comments on an arbitrary part of the file (ugh) so leaving this here because it's closely related: if this ordering changes and V2P_ENSURE gets an undo step, we should also update the tests in this module to check that no V2P mappings exist after a saga is torn down (there are a bunch of functions down there that check this sort of thing for other side effects).

@bnaecker (Collaborator):

Agreed, this should come before the instance-ensure call. I also think it makes sense to add a TODO here linking to the relevant OPTE issue about adding deletion of these mappings.

@jmpesp (Contributor, author):

done in 48754ea

@@ -144,6 +144,13 @@ async fn sim_instance_migrate(
let (instance_id, old_runtime) =
sagactx.lookup::<(Uuid, InstanceRuntimeState)>("migrate_instance")?;

// Remove existing V2P mappings for source instance
@gjcolombo (Contributor):

If we do end up doing this as part of the saga (see my big comment on the review at large) I'd be inclined to put this in a separate saga step before this one (similar to what happens in the instance create saga).

@bnaecker (Collaborator):

Agreed, this step and the one adding the new mappings on the destination sled seem like they should be separate saga actions.

(Additional resolved review threads on nexus/src/app/sled.rs.)
@bnaecker (Collaborator) left a review:

Thanks a lot for getting this going! It's really exciting to see more integration with OPTE! I've got a few comments about the general approach. I think we should at least reconsider propagating the mappings to all sleds, regardless of whether they need them. It doesn't seem like a heavy additional lift, and I think it's closer to the intent we're after.

Thanks for all the work @jmpesp!

Comment on lines 288 to 291
#[endpoint {
method = PUT,
path = "/set-v2p",
}]
@bnaecker (Collaborator):

I might structure these endpoints a bit differently. The mapping we're applying is for a specific instance, telling this sled agent (OPTE, really) where it can find it out in the world. I think it makes sense to have endpoints that are PUT /v2p/{instance_id} with the body, and then DELETE /v2p/{instance_id}

Now, that does mean that the body type here isn't quite right. We have potentially multiple NICs per guest, and what OPTE really needs is the mapping from the virtual address of each of those, to the hosting sled. The VNI and the underlay IP are all the same, so I might structure it something like this:

pub struct VirtualAddress {
    pub ip: IpAddr,
    pub mac: MacAddr,
}

pub struct SetVirtualNetworkInterfaceHost {
    pub vni: Vni,
    pub physical_host_ip: Ipv6Addr,
    pub virtual_addresses: Vec<VirtualAddress>,
}

Internally in the simulated sled agent, I would store those as a map from instance ID to the data. That efficiency isn't very relevant, since it's only on the simulated agent, but I think the endpoint structure makes more sense regardless.

As an aside, if we go this route, the type name SetVirtualNetworkInterfaceHost isn't quite right. It would be something like VirtualToPhysicalMapping, which I think is a bit more clear in any case.

@jmpesp (Contributor, author) commented Mar 13, 2023:

Strictly speaking, the mapping we're applying is for a specific network interface, not an instance. For example, a PUT /v1/network-interfaces/{interface} would incur PUT /v2p/{interface} and DELETE /v2p/{interface} calls, but not affect the instance placement. I've done this in ffb4767 (edit: where "this" means changing the endpoints, not doing anything with PUT for network-interfaces).

@bnaecker (Collaborator):

Yes, you're right, the mappings tell OPTE where to find a virtual address in the physical world. And all NICs for an instance are on the same host and share the same VNI, so I was only hoping to simplify the interfaces. If you feel it doesn't do that, that's cool too.

I don't quite follow the bit about incurring some additional calls, could you expand on that?

@jmpesp (Contributor, author):

Sorry, I only mean that "moving" a NIC around (if possible) would incur calls to interface specific endpoints, not instance specific endpoints.
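For reference, here is a rough dropshot sketch of interface-scoped endpoints along the lines settled on above. The context type, handler names, and the exact body fields are illustrative stand-ins rather than the real sled agent API.

use dropshot::{
    endpoint, HttpError, HttpResponseDeleted, HttpResponseUpdatedNoContent,
    Path, RequestContext, TypedBody,
};
use schemars::JsonSchema;
use serde::Deserialize;
use uuid::Uuid;

// Stand-in for the sled agent's server context.
struct SledAgentCtx;

// Body shaped roughly like the SetVirtualNetworkInterfaceHost proposed
// above, narrowed to one interface and given the derives dropshot needs.
#[derive(Deserialize, JsonSchema)]
struct SetVirtualNetworkInterfaceHost {
    vni: u32,
    physical_host_ip: std::net::Ipv6Addr,
    virtual_ip: std::net::IpAddr,
    virtual_mac: String,
}

#[derive(Deserialize, JsonSchema)]
struct V2pPathParam {
    interface_id: Uuid,
}

/// Create or update the V2P mapping for one virtual interface.
#[endpoint {
    method = PUT,
    path = "/v2p/{interface_id}",
}]
async fn set_v2p(
    _rqctx: RequestContext<SledAgentCtx>,
    _path: Path<V2pPathParam>,
    _body: TypedBody<SetVirtualNetworkInterfaceHost>,
) -> Result<HttpResponseUpdatedNoContent, HttpError> {
    // Hand the mapping to OPTE via the port manager here.
    Ok(HttpResponseUpdatedNoContent())
}

/// Remove the V2P mapping for one virtual interface.
#[endpoint {
    method = DELETE,
    path = "/v2p/{interface_id}",
}]
async fn del_v2p(
    _rqctx: RequestContext<SledAgentCtx>,
    _path: Path<V2pPathParam>,
) -> Result<HttpResponseDeleted, HttpError> {
    // Remove the corresponding OPTE entry here.
    Ok(HttpResponseDeleted())
}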

(Resolved review thread on illumos-utils/src/opte/illumos/port_manager.rs.)

Comment on lines +245 to +246
// For every sled that isn't the sled this instance was allocated to, create
// a virtual to physical mapping for each of this instance's NICs.
@bnaecker (Collaborator):

I'm not sure about this approach. I get the point of a quick solution, but a more appropriate one doesn't seem like significantly more work. We're creating a bunch of irrelevant mappings. And more importantly, we've explicitly decided to defer VPC peering in the MVP. I don't think we want to send mappings for instances in another VPC. You rightly point out that OPTE will drop such traffic, but I don't think we should bake that knowledge in here. I think it makes sense to try to send the mappings we explicitly want to allow.

As an alternative, what about including a function call here, something like Nexus::sleds_for_vpc_mapping(). That would currently select all sleds that are in the same VPC as the provided instance. In the future, we can update that to make more fine-grained decisions, taking into account peering for example. But we have an interface for selecting the "right" sleds, encapsulating what "right" means.

I believe the actual SQL query to derive this information would be pretty simple. It should look basically like this:

SELECT
    active_server_id
FROM
    (
        SELECT
            instance_id
        FROM
            network_interface
        WHERE
            vpc_id = $VPC_ID
            AND time_deleted IS NULL
        LIMIT
            1
    )
        AS inst
    INNER JOIN instance ON inst.instance_id = instance.id

That selects the ID of all instances from the NIC table sharing the same VPC. (You have that from the call to derive_network_interface_info().) Then we join with the instance table on the primary key, and select the sled ID. It's not too much additional work to join on the sled table and pull the IP directly, too.

This is definitely no worse than we have now, in terms of atomicity, and I think it allows us to more faithfully represent the intent and evolve as we need finer-grained data.

@bnaecker (Collaborator):

Oh, I should have added a big caveat! Diesel can be an absolute nightmare sometimes. The SQL query here is pretty easy, but it involves a subquery. That can sometimes be very hard to express in Diesel. I think it should be straightforward here, but if we find that it can't be expressed easily in Diesel's DSL, then we should seriously reconsider. Hand-rolled queries are definitely much more work.
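For a sense of the shape, here is a rough Diesel sketch of that query, expressing the subquery with eq_any. The table! definitions are simplified stand-ins; the real omicron schema (and its async Diesel wrappers) differs.

use diesel::pg::PgConnection;
use diesel::prelude::*;
use uuid::Uuid;

// Simplified stand-in table definitions; the real schema lives in the Nexus
// db-model crate and may use different names and types.
diesel::table! {
    network_interface (id) {
        id -> Uuid,
        instance_id -> Uuid,
        vpc_id -> Uuid,
        time_deleted -> Nullable<Timestamptz>,
    }
}

diesel::table! {
    instance (id) {
        id -> Uuid,
        active_server_id -> Uuid,
    }
}

diesel::allow_tables_to_appear_in_same_query!(instance, network_interface);

// Return the IDs of all sleds hosting an instance with a live NIC in `vpc`.
fn sleds_for_vpc_mapping(
    conn: &mut PgConnection,
    vpc: Uuid,
) -> Result<Vec<Uuid>, diesel::result::Error> {
    // Subquery: instances that have a non-deleted NIC in this VPC.
    let instances_in_vpc = network_interface::table
        .filter(network_interface::vpc_id.eq(vpc))
        .filter(network_interface::time_deleted.is_null())
        .select(network_interface::instance_id);

    // Filter the instance table by that subquery and pull each one's sled.
    instance::table
        .filter(instance::id.eq_any(instances_in_vpc))
        .select(instance::active_server_id)
        .distinct()
        .load(conn)
}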

@jmpesp (Contributor, author):

It's not that it's impossible to do, but I'm anxious about TOCTOU (like Greg mentioned in another comment). Without something always running in the background that can apply corrections, we run the risk of these mappings not being correct.

@bnaecker (Collaborator):

I agree there is a risk of these mappings being incorrect, for example if an instance is moved before they can be applied to all sleds. I don't see how propagating them to sleds that don't need it solves that. As an example, if we propagate a mapping to some sled, but the instance moves, that sled now has the wrong mapping in any case, even if that's only for some period of time.

I had imagined we'd solve this with the mechanisms put forward in RFD 373 on reliable persistent workflows, such as generation numbers. We could compute the set of mappings at some point, along with a generation number. If a sled receives a mapping that is prior to its existing generation, it may simply ignore it. It still seems like we want to compute the right set of data at each generation.
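As a rough illustration of that generation-number idea (all types here are made up for the sketch; RFD 373 covers the real design space):

use std::collections::HashMap;
use uuid::Uuid;

// Stand-in for whatever data OPTE needs for one virtual interface.
struct V2pMapping {
    vni: u32,
    physical_host_ip: std::net::Ipv6Addr,
}

// Per-sled view of the mappings, tagged with the generation Nexus computed.
struct SledV2pState {
    generation: u64,
    mappings: HashMap<Uuid, V2pMapping>, // keyed by interface ID
}

impl SledV2pState {
    /// Apply a full mapping set computed by Nexus at `generation`,
    /// ignoring anything older than what this sled has already applied.
    fn apply(&mut self, generation: u64, mappings: HashMap<Uuid, V2pMapping>) {
        if generation <= self.generation {
            return; // stale update: keep the newer state we already have
        }
        self.generation = generation;
        self.mappings = mappings;
    }
}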

@jmpesp (Contributor, author):

Yeah, I'm not sure what I was thinking.

@jmpesp (Contributor, author):

@bnaecker: @gjcolombo and I discussed the broadcast method, and are convinced at this point that it seems safe. See the comment chain at #2536 (comment).

After a re-read, I think I grok the generation number approach outlined in RFD 373. At this point, though, I think I'd like to create a follow-on issue for that work, given that we're thinking the broadcast method is safe. Thoughts?

.into();

let mut last_sled_id: Option<Uuid> = None;
loop {
@bnaecker (Collaborator):

Maybe a nit, but it seems like this loop could be parallelized and then collected, via something like FuturesUnordered.

@jmpesp (Contributor, author):

I wasn't smart enough to do this with FuturesUnordered, so I did it with tokio::spawn instead: 117f8b3
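For reference, the FuturesUnordered version looks roughly like this; the per-sled call is a stand-in for the real sled agent client request.

use futures::stream::{FuturesUnordered, StreamExt};
use uuid::Uuid;

// Stand-in for the request the loop body makes to one sled agent.
async fn set_v2p_on_sled(_sled_id: Uuid) -> Result<(), String> {
    // ... PUT the mapping to this sled's agent here ...
    Ok(())
}

// Issue the requests to all sleds concurrently instead of awaiting each one
// in turn, and fail on the first error.
async fn push_to_all_sleds(sled_ids: Vec<Uuid>) -> Result<(), String> {
    let mut requests: FuturesUnordered<_> =
        sled_ids.into_iter().map(set_v2p_on_sled).collect();
    while let Some(result) = requests.next().await {
        result?;
    }
    Ok(())
}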

Comment on lines 1026 to 1027
// TODO-idempotent if this action fails half way through, unwind is not
// called!
@bnaecker (Collaborator):

You make a note about this elsewhere, but I'm wondering about unwinding this saga. The way we've structured this, we fail the entire saga node if the request to update mappings on any one sled fails, but without unwinding the mappings we've already sent.

One option would be to put a node before this one whose forward action is a no-op and whose undo deletes all mappings for the instance. That way we always unwind them, regardless of whether we succeed partway through the following node. I get that there's no actual way to delete these mappings until we fix OPTE#332, but it would structure the saga in a way that alleviates the fact that the sic_v2p_ensure action really takes many individual actions.

@jmpesp (Contributor, author):

done in 48754ea
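For context, a conceptual sketch of that pattern in the same declare_saga_actions! style as the hunk quoted earlier; the node and function names are guesses at the shape, not the contents of 48754ea. The forward action is a no-op and the undo deletes all of the instance's mappings, so a partial failure in the following node still unwinds through the cleanup.

V2P_ENSURE_UNDO -> "v2p_ensure_undo" {
    + sic_noop
    - sic_v2p_ensure_undo
}
V2P_ENSURE -> "v2p_ensure" {
    + sic_v2p_ensure
}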

let sleds_page =
self.sleds_list(&self.opctx_alloc, &pagparams).await?;

for sled in &sleds_page {
@bnaecker (Collaborator):

Same note about parallelization.

@jmpesp (Contributor, author):

see 117f8b3

.fetch_for(authz::Action::Read)
.await?;

let instance_nics = self
@bnaecker (Collaborator) commented Mar 11, 2023:

So, if we change the endpoint in the sled agent to be something like DELETE /v2p/{instance_id}, we could remove this call entirely. We only need the instance ID to delete that instance's mappings on each sled.

@jmpesp (Contributor, author):

I don't grok this - won't the OPTE ioctl require all fields (it doesn't know about instance or NIC id) to delete the mapping?

@bnaecker (Collaborator):

Yeah I think you're right. I might have been conflating this with the simulated sled agent, where things are stored that way. It might be possible to send just the VNI + MAC, but I don't see that being a huge win.

for sled_agent in &sled_agents {
let v2p_mappings = sled_agent.v2p_mappings.lock().await;
if sled_agent.id != db_instance.runtime().sled_id {
assert!(!v2p_mappings.is_empty());
@bnaecker (Collaborator):

Could we test the contents of the mappings? I don't see any other location where we ensure we've really provided the right data to OPTE.

@jmpesp (Contributor, author):

Good call, done in 210adc9

@jmpesp (Contributor, author) commented Apr 4, 2023:

@luqmana: dealt with the conflicts; let me know if there are further changes you'd like to see.

@luqmana self-requested a review on April 4, 2023.
@luqmana (Contributor) left a review:

Thanks for plumbing this, James! Looks mostly good, but I did have a few questions.

(Resolved review threads on illumos-utils/src/opte/illumos/port_manager.rs, nexus/src/app/sagas/instance_create.rs, nexus/src/app/sagas/instance_migrate.rs, nexus/src/app/sled.rs, and nexus/tests/integration_tests/instances.rs.)
@luqmana (Contributor) left a review:

Thanks, James. This definitely improves the current state of things in allowing cross-sled communication. There are still a couple of TODOs left, but those can be tracked in follow-up issues.

@jmpesp merged commit bea1af3 into oxidecomputer:main on Apr 6, 2023.
@jmpesp deleted the nexus_manage_v2p branch on April 6, 2023.