
RPW for OPTE v2p Mappings #5568

Merged: 30 commits merged into main from issue-5214-v2p-mapping-rpw on May 22, 2024

Conversation

@internet-diglett (Contributor) commented Apr 19, 2024

TODO

  • Extend db view to include probe v2p mappings
  • Update sagas to trigger rpw activation instead of directly configuring v2p mappings
  • Test that the delete functionality cleans up v2p mappings

Possible Optimizations

  • Expose an internal API that a sled-agent can call to preemptively refresh its v2p mappings? (This would let individual sleds repopulate their v2p entries more quickly after a restart, as long as Nexus is running.)

Related

Resolves #5214
Resolves #4259
Resolves #3107

@davepacheco (Collaborator) commented:

Ooh, will this also resolve #4259? We had talked about using an RPW for the v2p mappings in that ticket as well.

@internet-diglett (Author) replied:

> Ooh, will this also resolve #4259? We had talked about using an RPW for the v2p mappings in that ticket as well.

@davepacheco Why yes, yes it will! I love sweeping up multiple bugs / issues into a single fix!

@internet-diglett force-pushed the issue-5214-v2p-mapping-rpw branch from ea311b8 to ca42e1e on April 19, 2024 at 21:57
@internet-diglett (Author) commented:

Quick tests show that our initial implementation is working.

Desired State

VPC 14487919
----------------------------------------------------------------------

IPv4 mappings
----------------------------------------------------------------------
VPC_IP      VPC MAC ADDR       UNDERLAY IP
172.30.0.5  A8:40:25:FD:DA:2B  fd00:1122:3344:102::1
172.30.0.6  A8:40:25:FE:89:1F  fd00:1122:3344:103::1
172.30.0.7  A8:40:25:F8:CA:88  fd00:1122:3344:101::1
172.30.0.8  A8:40:25:F6:66:01  fd00:1122:3344:104::1

Push the v2p state away from the desired state

root@g3:~# /opt/oxide/opte/bin/opteadm set-v2p 172.30.0.7 A8:40:25:FE:89:1F fd00:1122:3344:103::1 14487919

v2p state is now incorrect

VPC 14487919
----------------------------------------------------------------------

IPv4 mappings
----------------------------------------------------------------------
VPC_IP      VPC MAC ADDR       UNDERLAY IP
172.30.0.5  A8:40:25:FD:DA:2B  fd00:1122:3344:102::1
172.30.0.6  A8:40:25:FE:89:1F  fd00:1122:3344:103::1
172.30.0.7  A8:40:25:FE:89:1F  fd00:1122:3344:103::1
172.30.0.8  A8:40:25:F6:66:01  fd00:1122:3344:104::1

rpw corrects the drift

17:47:32.713Z INFO e7ef9388-d6e1-4f1c-8a8b-0c1a0987e391 (ServerContext): v2p mappings to add
    background_task = v2p_manager
    file = nexus/src/app/background/v2p_mappings.rs:147
    mappings = [SetVirtualNetworkInterfaceHost { physical_host_ip: fd00:1122:3344:101::1, virtual_ip: 172.30.0.7, virtual_mac: MacAddr(MacAddr6([168, 64, 37, 248, 202, 136])), vni: Vni(14487919) }]
    sled = g3

v2p state is now correct again

VPC 14487919
----------------------------------------------------------------------

IPv4 mappings
----------------------------------------------------------------------
VPC_IP      VPC MAC ADDR       UNDERLAY IP
172.30.0.5  A8:40:25:FD:DA:2B  fd00:1122:3344:102::1
172.30.0.6  A8:40:25:FE:89:1F  fd00:1122:3344:103::1
172.30.0.7  A8:40:25:F8:CA:88  fd00:1122:3344:101::1
172.30.0.8  A8:40:25:F6:66:01  fd00:1122:3344:104::1
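The reconciliation step shown above boils down to diffing the desired state (derived from the database) against what a sled reports and re-applying any entries that differ. A minimal sketch of that diff, using a simplified stand-in type rather than the actual Omicron/OPTE structures (the real message type is `SetVirtualNetworkInterfaceHost`, with proper IP, MAC, and Vni types):

```rust
use std::collections::HashMap;

// Simplified stand-in for a v2p mapping; the actual Omicron types differ.
#[derive(Clone, Debug, PartialEq)]
struct V2pMapping {
    virtual_ip: String,
    virtual_mac: String,
    physical_host_ip: String,
    vni: u32,
}

// Compare desired state against the state a sled currently reports,
// keyed by virtual IP, and return the mappings that must be (re)applied
// to converge. Entries that already match are skipped.
fn mappings_to_set(desired: &[V2pMapping], actual: &[V2pMapping]) -> Vec<V2pMapping> {
    let current: HashMap<&str, &V2pMapping> =
        actual.iter().map(|m| (m.virtual_ip.as_str(), m)).collect();
    desired
        .iter()
        .filter(|d| current.get(d.virtual_ip.as_str()).map_or(true, |c| *c != *d))
        .cloned()
        .collect()
}
```

With the drifted state from the walkthrough (172.30.0.7 pointing at the wrong MAC and underlay IP), this diff would flag exactly that one entry for correction, which is what the RPW log above shows.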

@internet-diglett (Author) commented:

Going to shift gears and finish up the delete implementation on the OPTE side of things.

@internet-diglett marked this pull request as ready for review on April 29, 2024 at 23:45
@jmpesp (Contributor) left a review:

🚀

illumos-utils/src/opte/port_manager.rs (outdated, resolved)
illumos-utils/src/opte/port_manager.rs (outdated, resolved)
nexus/db-queries/src/db/datastore/network_interface.rs (outdated, resolved)
use omicron_common::api::external::ListResultVec;

impl DataStore {
    pub async fn v2p_mappings(
Contributor:

does this also need a paginated query?

@internet-diglett (Author):

Well... I admittedly tend to use DataPageParams::max_page() in RPWs whenever a db query requires pagination parameters. That returns up to u32::MAX (4_294_967_295) entries, so I'd say the question is kind of moot here, unless we want to standardize on a lower limit for db queries in background tasks (and honestly, it wouldn't shock me if we decided that was a good idea!).

Contributor:

Under @davepacheco's direction we've started using this pattern in reconfigurator RPWs:

opctx.check_complex_operations_allowed()?;

let mut zpools = Vec::new();
let mut paginator = Paginator::new(SQL_BATCH_SIZE);
while let Some(p) = paginator.next() {
    let batch = self
        .zpool_list_all_external(opctx, &p.current_pagparams())
        .await?;
    paginator = p.found_batch(&batch, &|(z, _)| z.id());
    zpools.extend(batch);
}

check_complex_operations_allowed() fails if the opctx is associated with an external API client. We do still list all of the items, but in paginated batches, to avoid a giant result set coming back from CockroachDB all at once.

@internet-diglett (Author):

I have updated the query to use this pattern.

nexus/src/app/background/v2p_mappings.rs (outdated, resolved)
nexus/src/app/background/v2p_mappings.rs (outdated, resolved)
nexus/src/app/background/v2p_mappings.rs (resolved)
Comment on lines 87 to 89
// - it means that delete calls are required as well as set calls,
// meaning that now the ordering of those matters (this may also
// necessitate a generation number for V2P mappings)
Contributor:

I don't think generation numbers are required with the approach of using a periodic background task to correct drift... what do you think?

@internet-diglett (Author):

I think it mostly depends on whether it is acceptable to send / read the entire configuration during a reconciliation loop. For something like NAT, the total configuration can be quite large, so we employ generation numbers to send only the required updates instead of the full set. For v2p mappings, the individual messages are much smaller: my earlier napkin math showed that with a VM per core and a vNIC per VM, a full set of updates for a single sled would be somewhere in the ballpark of 213 KB, which seems reasonable.
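To make that napkin math concrete, here is one combination of assumed, illustrative figures that lands near the quoted ballpark (the comment does not spell out the per-mapping size, so the 1,664-byte figure below is an assumption, not a measured Omicron number): 128 vNICs per sled (one VM per core, one vNIC per VM) at roughly 1.6 KB per serialized mapping is about 213 KB.

```rust
// Back-of-envelope sizing for a full (non-incremental) v2p push to a
// single sled. Both inputs are assumed, illustrative figures, not
// measured Omicron numbers.
fn full_update_bytes(vnics_per_sled: usize, bytes_per_mapping: usize) -> usize {
    vnics_per_sled * bytes_per_mapping
}
```

Even with generous per-mapping overhead, a full-sled update stays well under a megabyte, which is why skipping generation numbers and always sending the complete set is defensible here where it would not be for NAT.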

Contributor:

👍

@internet-diglett (Author) commented:

All outstanding comments have been resolved; going ahead and setting this to auto-merge unless anyone objects. I'd like to get this soaking on Dogfood sooner rather than later for the upcoming release :)

@internet-diglett (Author) commented:

*sigh* Looks like a test is flaky:

--- STDERR:              omicron-nexus::test_all integration_tests::instances::test_instance_v2p_mappings ---
log file: /tmp/test_all-0075ccc75846a7de-test_instance_v2p_mappings.15117.0.log
note: configured to log to "/tmp/test_all-0075ccc75846a7de-test_instance_v2p_mappings.15117.0.log"
DB URL: postgresql://root@[::1]:35454/omicron?sslmode=disable
DB address: [::1]:35454
log file: /tmp/test_all-0075ccc75846a7de-test_instance_v2p_mappings.15117.2.log
note: configured to log to "/tmp/test_all-0075ccc75846a7de-test_instance_v2p_mappings.15117.2.log"
log file: /tmp/test_all-0075ccc75846a7de-test_instance_v2p_mappings.15117.3.log
note: configured to log to "/tmp/test_all-0075ccc75846a7de-test_instance_v2p_mappings.15117.3.log"
thread 'integration_tests::instances::test_instance_v2p_mappings' panicked at nexus/tests/integration_tests/instances.rs:4561:10:
v2p mappings should be empty: PermanentError("v2p mappings are still present")

@internet-diglett disabled auto-merge on May 18, 2024 at 06:43
@internet-diglett enabled auto-merge (squash) on May 21, 2024 at 19:02
@internet-diglett (Author) commented:

Fixed the flaky test. I was just using the tooling wrong 😛

@internet-diglett disabled auto-merge on May 22, 2024 at 15:57
@internet-diglett enabled auto-merge (squash) on May 22, 2024 at 17:49
@internet-diglett merged commit 2082942 into main on May 22, 2024
22 checks passed
@internet-diglett deleted the issue-5214-v2p-mapping-rpw branch on May 22, 2024 at 18:52
@internet-diglett added this to the 9 milestone on May 23, 2024
hawkw added commits that referenced this pull request between May 24 and Aug 9, 2024