Manage OPTE V2P mappings #2536
Conversation
During instance creation, create the appropriate OPTE V2P mappings on other sleds so that instances there can connect to the new instance. During instance deletion, remove the corresponding mappings. During instance migration, both create and remove the corresponding mappings. This is one part of oxidecomputer#2290. Another commit will be required when oxidecomputer/opte#332 is addressed.
I haven't looked at the PR in any detail yet, but I did want to make a comment about one thing before it gets to be Monday and I forget to mention it:
I don't think Nexus should do this during the migration saga. I think it's better to do it once the saga is finished and Nexus is sure the instance has migrated, since that minimizes the amount of time the instance will be running with the wrong mappings in place. For background, the general sequence of events in a migration is roughly:
1. The migration saga sets up the target Propolis and kicks off the migration.
2. The source transfers the guest's memory to the target while the VM continues to run.
3. The VM is paused and the remaining state is transferred.
4. The VM resumes on the target.
5. Nexus updates the V2P mappings on the affected sleds.
If Nexus deletes the source's V2P mappings and establishes the target's mappings in step 1, then (I assume) connectivity to the source is degraded from step 1 to step 4. If Nexus waits until after the migration is done, connectivity is interrupted in steps 3 through 5. In most cases, I expect step 2 will take much longer than step 5, because step 2 will involve transferring a lot of dirty pages (in the hopes that Propolis can then avoid transferring them while the VM is paused in step 3), whereas step 5 should just involve some (hopefully comparatively speedy) RPCs to the affected sleds. So, to minimize the time where the instance is running with the wrong V2P mappings, Nexus should change mappings after migration finishes.
The other thing to keep in mind here is that successful completion of the migration saga doesn't imply a successful migration--Nexus doesn't know whether the instance will end up running on the source or the target until the saga has succeeded and the involved Propolises take control of its outcome. I suspect we'll need some more general logic in Nexus that watches incoming instance state changes and can say "oops, this instance's location isn't what I thought it was, let me regenerate the V2P mappings for it." I'm happy to try to add that as part of the migration work I have in my queue. RFD 361 section 6.1 discusses all this in a little more detail.
Sorry for the drive-by comment here--this area's been top-of-mind for me for the last few weeks so I wanted to mention all this. I'd be glad to help take a look at the code changes in more detail next week.
I took a closer look at the Nexus and sled agent parts of the change and left a couple extra questions about the sagas and synchronization between attempts to program mappings. Some of these are more "file an issue and we'll get to it later" sorts of things than actual merge blockers, but I'd like to get aligned on how we're going to program mappings in the instance sagas.
@@ -106,6 +106,9 @@ declare_saga_actions! {
    INSTANCE_ENSURE -> "instance_ensure" {
        + sic_instance_ensure
    }
    V2P_ENSURE -> "v2p_ensure" {
Should this go before the instance ensure step so that the mappings exist by the time the instance starts to run? (I suspect this step is also going to be easier to undo than sic_instance_ensure.)
I can't leave comments on an arbitrary part of the file (ugh) so leaving this here because it's closely related: if this ordering changes and V2P_ENSURE gets an undo step, we should also update the tests in this module to check that no V2P mappings exist after a saga is torn down (there are a bunch of functions down there that check this sort of thing for other side effects).
Agreed, this should come before the instance-ensure call. I also think it makes sense to add a TODO here linking to the relevant OPTE issue about adding deletion of these mappings.
done in 48754ea
@@ -144,6 +144,13 @@ async fn sim_instance_migrate(
    let (instance_id, old_runtime) =
        sagactx.lookup::<(Uuid, InstanceRuntimeState)>("migrate_instance")?;

    // Remove existing V2P mappings for source instance
If we do end up doing this as part of the saga (see my big comment on the review at large) I'd be inclined to put this in a separate saga step before this one (similar to what happens in the instance create saga).
Agreed, this step and the one adding the new mappings on the destination sled seem like they should be separate saga actions.
Thanks a lot for getting this going! It's really exciting to see more integration with OPTE! I've got a few comments about the general approach. I think we should at least reconsider propagating the mappings to all sleds, regardless of whether they need them. It doesn't seem like a heavy additional lift to scope them, and I think it's closer to the intent we're after.
Thanks for all the work @jmpesp!
sled-agent/src/http_entrypoints.rs
#[endpoint {
    method = PUT,
    path = "/set-v2p",
}]
I might structure these endpoints a bit differently. The mapping we're applying is for a specific instance, telling this sled agent (OPTE, really) where it can find it out in the world. I think it makes sense to have endpoints that are PUT /v2p/{instance_id} with the body, and then DELETE /v2p/{instance_id}.
Now, that does mean that the body type here isn't quite right. We have potentially multiple NICs per guest, and what OPTE really needs is the mapping from the virtual address of each of those to the hosting sled. The VNI and the underlay IP are all the same, so I might structure it something like this:
pub struct VirtualAddress {
    pub ip: IpAddr,
    pub mac: MacAddr,
}

pub struct SetVirtualNetworkInterfaceHost {
    pub vni: Vni,
    pub physical_host_ip: Ipv6Addr,
    pub virtual_addresses: Vec<VirtualAddress>,
}
Internally in the simulated sled agent, I would store those as a map from instance ID to the data. That efficiency isn't very relevant, since it's only on the simulated agent, but I think the endpoint structure makes more sense regardless.
As an aside, if we go this route, the type name SetVirtualNetworkInterfaceHost isn't quite right. It would be something like VirtualToPhysicalMapping, which I think is a bit more clear in any case.
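As a rough illustration of storing those as a map keyed by instance ID in the simulated sled agent, something like the sketch below could work. The SimV2pState name is made up for the example and it reuses the SetVirtualNetworkInterfaceHost body type sketched above; this is not the actual omicron code.

use std::collections::HashMap;
use uuid::Uuid;

// In-memory V2P bookkeeping for the simulated sled agent, keyed by
// instance ID.
#[derive(Default)]
pub struct SimV2pState {
    mappings: HashMap<Uuid, SetVirtualNetworkInterfaceHost>,
}

impl SimV2pState {
    // PUT /v2p/{instance_id}: idempotently replace whatever was stored.
    pub fn set(
        &mut self,
        instance_id: Uuid,
        mapping: SetVirtualNetworkInterfaceHost,
    ) {
        self.mappings.insert(instance_id, mapping);
    }

    // DELETE /v2p/{instance_id}: drop all mappings for the instance.
    pub fn delete(&mut self, instance_id: &Uuid) {
        self.mappings.remove(instance_id);
    }
}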
Strictly speaking the mapping we're applying is for a specific network interface, not instance. For example, during a PUT /v1/network-interfaces/{interface}, this would incur PUT /v2p/{interface} and DELETE /v2p/{interface} calls, but not affect the instance placement. I've done this in ffb4767 (edit: where "this" means changing the endpoints, not doing anything with PUT for network-interfaces).
Yes, you're right, the mappings tell OPTE where to find a virtual address in the physical world. And all NICs for an instance are on the same host and share the same VNI, so I was only hoping to simplify the interfaces. If you feel it doesn't do that, that's cool too.
I don't quite follow the bit about incurring some additional calls, could you expand on that?
Sorry, I only mean that "moving" a NIC around (if possible) would incur calls to interface specific endpoints, not instance specific endpoints.
// For every sled that isn't the sled this instance was allocated to, create
// a virtual to physical mapping for each of this instance's NICs.
I'm not sure about this approach. I get the point of a quick solution, but a more appropriate one doesn't seem like significantly more work. We're creating a bunch of irrelevant mappings. And more importantly, we've explicitly decided to defer VPC peering in the MVP. I don't think we want to send mappings for instances in another VPC. You rightly point out that OPTE will drop such traffic, but I don't think we should bake that knowledge in here. I think it makes sense to try to send the mappings we explicitly want to allow.
As an alternative, what about including a function call here, something like Nexus::sleds_for_vpc_mapping()? That would currently select all sleds that are in the same VPC as the provided instance. In the future, we can update that to make more fine-grained decisions, taking into account peering for example. But we have an interface for selecting the "right" sleds, encapsulating what "right" means.
I believe the actual SQL query to derive this information would be pretty simple. It should look basically like this:
SELECT
    active_server_id
FROM
    (
        SELECT
            instance_id
        FROM
            network_interface
        WHERE
            vpc_id = $VPC_ID
            AND time_deleted IS NULL
        LIMIT
            1
    ) AS inst
    INNER JOIN instance ON inst.instance_id = instance.id
That selects the ID of all instances from the NIC table sharing the same VPC. (You have that from the call to derive_network_interface_info().) Then we join with the instance table on the primary key, and select the sled ID. It's not too much additional work to join on the sled table and pull the IP directly, too.
This is definitely no worse than we have now, in terms of atomicity, and I think it allows us to more faithfully represent the intent and evolve as we need finer-grained data.
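To make the intent of that query concrete, here is a small self-contained sketch of the same selection done over in-memory data; the types, field names, and the sleds_for_vpc_mapping name are stand-ins for illustration, not omicron's actual schema or API.

use std::collections::HashSet;
use uuid::Uuid;

// Stand-ins for rows of the network_interface and instance tables.
struct Nic {
    instance_id: Uuid,
    vpc_id: Uuid,
}

struct Instance {
    id: Uuid,
    active_server_id: Uuid,
}

// Select the sleds hosting at least one instance with a NIC in the given
// VPC, mirroring the subquery plus join in the SQL above.
// (The time_deleted filtering from the SQL is omitted for brevity.)
fn sleds_for_vpc_mapping(
    vpc_id: Uuid,
    nics: &[Nic],
    instances: &[Instance],
) -> HashSet<Uuid> {
    // SELECT instance_id FROM network_interface WHERE vpc_id = $VPC_ID ...
    let in_vpc: HashSet<Uuid> = nics
        .iter()
        .filter(|nic| nic.vpc_id == vpc_id)
        .map(|nic| nic.instance_id)
        .collect();
    // ... INNER JOIN instance ON inst.instance_id = instance.id
    instances
        .iter()
        .filter(|instance| in_vpc.contains(&instance.id))
        .map(|instance| instance.active_server_id)
        .collect()
}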
Oh, I should have added a big caveat! Diesel can be an absolute nightmare sometimes. The SQL query here is pretty easy, but it involves a subquery. That can sometimes be very hard to express in Diesel. I think it should be straightforward here, but if we find that it can't be expressed easily in Diesel's DSL, then we should seriously reconsider. Hand-rolled queries are definitely much more work.
It's not that it's impossible to do, but I'm anxious about TOCTOU (like Greg mentioned in another comment). Without something always running in the background that can apply corrections, we run the risk of these mappings not being correct.
I agree there is a risk of these mappings being incorrect, for example if an instance is moved before they can be applied to all sleds. I don't see how propagating them to sleds that don't need it solves that. As an example, if we propagate a mapping to some sled, but the instance moves, that sled now has the wrong mapping in any case, even if that's only for some period of time.
I had imagined we'd solve this with the mechanisms put forward in RFD 373 on reliable persistent workflows, such as generation numbers. We could compute the set of mappings at some point in time, along with a generation number. If a sled receives a mapping that is prior to its existing generation, it may simply ignore it. It still seems like we want to compute the right set of data at each generation.
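A very rough sketch of that generation-number check on the sled side, using hypothetical types; this is not an existing API, just an illustration of the "ignore anything older than what you already have" rule.

// Hypothetical shape of a versioned V2P update and the sled-side state.
struct V2pMappingUpdate {
    generation: u64,
    // ... the actual mapping payload would live here ...
}

struct SledV2pState {
    generation: u64,
}

impl SledV2pState {
    // Apply an update only if it is at least as new as the current state.
    fn maybe_apply(&mut self, update: &V2pMappingUpdate) -> bool {
        if update.generation < self.generation {
            // Stale update computed against an older view; ignore it.
            return false;
        }
        self.generation = update.generation;
        // ... program OPTE with the new mappings here ...
        true
    }
}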
Yeah, I'm not sure what I was thinking.
@bnaecker: @gjcolombo and I discussed the broadcast method, and are convinced at this point that it seems safe. See the comment chain at #2536 (comment).
After a re-read, I think I grok the generation number approach outlined in RFD 373. At this point, though, I think I'd like to create a follow-on issue for that work, given that we're thinking the broadcast method is safe. Thoughts?
.into();

let mut last_sled_id: Option<Uuid> = None;
loop {
Maybe a nit, but it seems like this loop could be parallelized and then collected, via something like FuturesUnordered.
I wasn't smart enough to do this with FuturesUnordered, so I did it with tokio::spawn instead: 117f8b3
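For reference, a minimal sketch of the FuturesUnordered approach, assuming a placeholder per-sled call; set_v2p_mapping and the error type here are stubs for illustration, not the real saga helper.

use futures::stream::{FuturesUnordered, StreamExt};
use uuid::Uuid;

// Stand-in for the real per-sled request made by the saga action.
async fn set_v2p_mapping(_sled_id: Uuid) -> Result<(), String> {
    Ok(())
}

async fn set_v2p_on_all_sleds(sled_ids: Vec<Uuid>) -> Result<(), String> {
    // One future per sled, all polled concurrently.
    let mut requests: FuturesUnordered<_> =
        sled_ids.into_iter().map(set_v2p_mapping).collect();
    // Surface the first error, if any; otherwise wait for all to finish.
    while let Some(result) = requests.next().await {
        result?;
    }
    Ok(())
}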
// TODO-idempotent if this action fails half way through, unwind is not
// called!
You make a note about this elsewhere, but I'm wondering about unwinding this saga. The way we've structured this, we fail the entire saga node if the request to update mappings on any one sled fails, but without unwinding the mappings we've already sent.
One option would be to put a no-op forward action before this one, whose undo step deletes all mappings for the instance. That way we always unwind them, regardless of whether we succeed partway through the following node. I get that there's no actual way to delete these mappings until we fix OPTE#332, but it would structure the saga in a way that alleviates the fact that the sic_v2p_ensure action really takes many individual actions.
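In the declare_saga_actions! style from the diff above, that might look roughly like the following; the node and function names are made up for illustration.

// No-op forward action whose undo step deletes all V2P mappings for the
// instance, so partial progress in V2P_ENSURE is cleaned up on unwind.
V2P_ENSURE_UNDO -> "v2p_ensure_undo" {
    + sic_noop
    - sic_v2p_ensure_undo
}
V2P_ENSURE -> "v2p_ensure" {
    + sic_v2p_ensure
}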
done in 48754ea
let sleds_page =
    self.sleds_list(&self.opctx_alloc, &pagparams).await?;

for sled in &sleds_page {
Same note about parallelization.
see 117f8b3
.fetch_for(authz::Action::Read)
.await?;

let instance_nics = self
So, if we change the endpoint in the sled agent to be something like DELETE /v2p/{instance_id}, we could remove this call entirely. We only need the instance ID to delete that instance's mappings on each sled.
I don't grok this - won't the OPTE ioctl require all fields (it doesn't know about instance or NIC id) to delete the mapping?
Yeah I think you're right. I might have been conflating this with the simulated sled agent, where things are stored that way. It might be possible to send just the VNI + MAC, but I don't see that being a huge win.
for sled_agent in &sled_agents {
    let v2p_mappings = sled_agent.v2p_mappings.lock().await;
    if sled_agent.id != db_instance.runtime().sled_id {
        assert!(!v2p_mappings.is_empty());
Could we test the contents of the mappings? I don't see any other location that we ensure we've really provided the right data to OPTE.
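For example, the check could look roughly like the sketch below. This assumes v2p_mappings holds the SetVirtualNetworkInterfaceHost values discussed earlier and that the expected VNI and underlay IP are available in the test; it is an illustration, not necessarily what the actual change does.

for sled_agent in &sled_agents {
    let v2p_mappings = sled_agent.v2p_mappings.lock().await;
    if sled_agent.id != db_instance.runtime().sled_id {
        assert!(!v2p_mappings.is_empty());
        // Check the data actually handed to OPTE, not just that something
        // was stored.
        for mapping in v2p_mappings.iter() {
            assert_eq!(mapping.vni, expected_vni);
            assert_eq!(mapping.physical_host_ip, expected_sled_ip);
        }
    }
}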
Good call, done in 210adc9 and cargo fmt
@luqmana dealt with the conflicts, let me know if there are further changes you'd like to see
Thanks for plumbing this, James! Looks mostly good but I did have a few questions.
Thanks, James. This definitely improves the current state of things in allowing cross-sled communication. Still a couple of TODOs left, but those can be tracked in follow-up issues.