
[sled-agent] Allocate VNICs over etherstubs, fix inter-zone routing #1066

Merged 18 commits into main from etherstub on Jun 1, 2022

Conversation

@smklein (Collaborator) commented May 13, 2022:

Fixes #987

Goals of this PR:

  • Be able to ping "sled addresses" (within the Sled's /64) from the GZ or non-GZ
  • Be able to ping "AZ-wide" services (like the internal DNS service) from either the GZ or the non-GZ

Implementation details of this PR:

  • Switches all VNIC allocation to occur over an "etherstub" device, called stub0.
  • Allocate all GZ addresses (bootstrap, Sled, addrconf) over an "etherstub"-allocated VNIC, called underlay0.

@@ -122,7 +123,7 @@ impl AddressRequest {
     pub fn new_static(ip: IpAddr, prefix: Option<u8>) -> Self {
         let prefix = prefix.unwrap_or_else(|| match ip {
             IpAddr::V4(_) => 24,
-            IpAddr::V6(_) => 64,
+            IpAddr::V6(_) => AZ_PREFIX,
@smklein (Collaborator, Author) commented:

This ended up being a major aspect of this patch - without it, I could ping all /64 addresses between GZ / non-GZ zones, but not the DNS addresses.

However, by opening it up to the AZ prefix, I can also communicate between arbitrary "sled-local" services and the internal-dns server, which resides outside the sled's /64.
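
The prefix relationship at issue here can be checked with a short sketch (Python's `ipaddress` module; the concrete prefixes are the ones used throughout this thread):

```python
import ipaddress

# AZ-wide prefix (/48) vs. this sled's subnet (/64)
az = ipaddress.ip_network("fd00:1122:3344::/48")
sled = ipaddress.ip_network("fd00:1122:3344:101::/64")
dns = ipaddress.ip_address("fd00:1122:3344:1::1")  # internal DNS, outside the sled's /64

print(dns in sled)  # False: a /64 on-link route can't reach it
print(dns in az)    # True: the /48 covers it
```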

Collaborator commented:

I can see how we need that, but I'm not entirely sure it's how we want to solve the problem of routing to the DNS server. IIUC, you're saying that the sled agent's VNICs for Oxide services (sled agent, nexus, propolis, etc.) are now things like fd00:1122:3344:101::/48. What does the sled's /64 prefix mean in this setup? I think @rcgoodfellow or @rmustacc should probably weigh in here, since it seems to me to kinda be skirting the real meaning of that prefix.

I think one option would be to add a separate route which specifies the DNS server's address / prefix. I believe DDM will ultimately be manipulating the OS's routing tables so that's actually true. But that may not be enough, in that traffic from the VNIC also needs a route pointing it to an interface for the DNS address. I don't know enough to be sure here.

@smklein (Collaborator, Author) replied:

For context, this is what my global zone looks like:

// Note, basically all `/64`
$ ipadm
...
underlay0/linklocal addrconf ok         fe80::8:20ff:fea6:3b8/10
underlay0/bootstrap6 static ok          fdb0:18c0:4d0c:f4e5::1/64
underlay0/sled6   static   ok           fd00:1122:3344:101::1/64
underlay0/internaldns static ok         fd00:1122:3344:1::2/64

Meanwhile, in Nexus (non-global zone):

// Note, this is where the `/48` shows up - it's the AZ_PREFIX.
# ipadm
...
oxControlService1/linklocal addrconf ok fe80::8:20ff:fe35:d2a5/10
oxControlService1/omicron6 static ok    fd00:1122:3344:101::3/48

This /48 specifically alters the routing within the non-global zone - netstat -rn -f inet6 in Nexus shows the following:

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If   
--------------------------- --------------------------- ----- --- ------- ----- 
::1                         ::1                         UH      2       0 lo0   
fd00:1122:3344::/48         fd00:1122:3344:101::3       U       6    2259 oxControlService1 
fe80::/10                   fe80::8:20ff:fe35:d2a5      U       2       0 oxControlService1

Having all traffic destined for the AZ routed through the interface is the piece I really care about here.

Do you think it would be preferable to:

  • Continue allocating addresses within non-global zones as /64
  • Call route to manually add this path to the AZ subnet?
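
Either way, the effect on route lookup is the same; here's a hypothetical sketch of the longest-prefix match (addresses from this PR, lookup logic heavily simplified):

```python
import ipaddress

def lookup(dst, table):
    """Longest-prefix match over (network, interface) pairs; None if no route."""
    matches = [(net, ifname) for net, ifname in table if dst in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)

# With only the /64 on-link route, DNS traffic has no matching route...
table = [(ipaddress.ip_network("fd00:1122:3344:101::/64"), "oxControlService1")]
dns = ipaddress.ip_address("fd00:1122:3344:1::1")
print(lookup(dns, table))  # None

# ...but either the /48 address allocation or an explicit `route add` yields:
table.append((ipaddress.ip_network("fd00:1122:3344::/48"), "oxControlService1"))
net, ifname = lookup(dns, table)
print(net, ifname)  # fd00:1122:3344::/48 oxControlService1
```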

Collaborator replied:

Thanks for those details. I think that is what I expected, that we'd have a route entry that says "anything in fd00:1122:3344::/48 should go out the VNIC oxControlService1". The part I'm wondering about is, that would imply that the netstack would expect that it could use that interface for any traffic from the Nexus zone for any other sled. As I write this, I realize that may be fine. If Nexus is trying to reach another service on the same sled, that packet will go out the VNIC, to the etherstub, and then presumably to the other zone's VNIC. If it's trying to reach something off the sled, I'm less sure of what'll happen there. It looks like it'll still go to the zone VNIC, the etherstub, and then to whatever route you have in the GZ that matches that (if one exists).

I think this is probably fine. It also seems to be working for a single machine, and it's easy enough to update this if we find it doesn't work for multiple machines.

@smklein (Collaborator, Author) replied:

Do you think it would be preferable to:

  • Continue allocating addresses within non-global zones as /64
  • Call route to manually add this path to the AZ subnet?

I've verified that this method works too - I'm seeing routing between zones by using:

route add -inet6 <address>/48 <address> -interface

When setting up a non-GZ address

@smklein (Collaborator, Author) commented May 18, 2022:

Isn't the point of the default route that "if the destination address doesn't match the other rules, it should use this gateway"?

I tried your suggestion, but this doesn't seem to be working for me:

root@oxz_nexus:~# pfexec route add -inet6 fd00:1122:3344:1::1 fd00:1122:3344:101::3
add host fd00:1122:3344:1::1: gateway fd00:1122:3344:101::3: Network is unreachable

// Also does not work with the `/48` in the destination
root@oxz_nexus:~# pfexec route add -inet6 fd00:1122:3344:1::1/48 -inet6 fd00:1122:3344:101::3
add net fd00:1122:3344:1::1/48: gateway fd00:1122:3344:101::3: Network is unreachable

@smklein (Collaborator, Author) commented:

Chatting out-of-band with @bnaecker a bit: By issuing the following in the GZ:

routeadm -e ipv6-forwarding
routeadm -u

I'm seeing the routing make the extra hop, from NGZ -> GZ (and now) -> NGZ

Collaborator replied:

That's true, I explained that very poorly. I was trying to point out that this command:

root@oxz_nexus:~# pfexec route add -inet6 default -inet6 fd00:1122:3344:101::1

isn't what I'd expect. In particular, that says for any traffic without a more specific route, send it to the gateway fd00:1122:3344:101::1. But that's not a gateway that the nexus zone has! The netstat -rn output shows the gateway we need as fd00:1122:3344:101::3.

But in any case, Robert pointed out that these routing tables are necessary but not sufficient to get this all to work. Specifically, we need to tell the GZ to actually act as a router, forwarding packets between different networks. That is, we've provided rules (assuming we can figure out how to express them 😆 ) for the routing daemon to use when forwarding packets, but it'll only do so if it's explicitly told it should.

I believe this can be accomplished with the command routeadm -e ipv6-forwarding -u, which enables route forwarding and restarts the SMF service(s) necessary to make that apply to the running system. IIUC, at that point, when the GZ networking stack receives a packet from the nexus zone, with an IP address of the (non-global) DNS zone, it'll attempt to forward that, by consulting the routing table.

I'm hypothesizing, but it seems like we need two routes then:

  • A route that tells the nexus zone to use its VNIC's address as the gateway for DNS traffic
  • A route that tells the GZ how to reach the DNS zone's addresses

The former could be a default route, or a more constrained one listing the prefix for the DNS server. It seems like either should work, as long as the gateway is the IP address of the VNIC in the nexus zone, in this case fd00:1122:3344:101::3.

The latter can be accomplished by adding a route table that directs all the DNS traffic to the GZ's VNIC, I think. My understanding is that this would go onto the GZ VNIC, to the etherstub, and then forwarded to the non-global DNS zone VNIC.

I was initially confused as to why the "virtual switch" that man dladm describes under the create-etherstub command doesn't transparently do this. All the traffic is within that etherstub, and I'd have expected neighbor discovery and thus routing to be done automatically. So why do we need this?

The key is that the DNS addresses are in a different subnet. The etherstub will transparently create routes between all the other non-global zones, but once you're trying to reach an address in a different subnet, that has to involve routing. This explains why the -interface flag worked initially, too. That's effectively telling the etherstub that the other subnet can actually be routed to through the same L2 domain, even though it's on a different L3 subnet.

Robert pointed out that we may actually want a separate etherstub for the DNS zone. That'd more closely model the actual network we're emulating. In particular, we're trying to say that the GZ and all the non-DNS service zones are one little subnet, in the sled's /64. The DNS service is explicitly in a separate /64, for route summarization and the fact that it really is supposed to be a rack-wide or AZ-wide service.

To be clear, we should not add an additional etherstub in this PR. I think that's where we want to go longer-term, but we can defer it for sure.

So, summarizing everything: when Nexus wants to send a packet to the DNS server, it'll first go to the etherstub. The etherstub will not explicitly have a gateway for that, since it's in another subnet. It'll deliver it to the GZ. At that point, the IP stack in the GZ will take the packet and also note that it doesn't have that address. It'll instead consult the routing tables (assuming forwarding is enabled), and note that it can send that...back to the etherstub! That'll then go to the DNS zone.
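
The two-hop path described above can be sketched as two successive table lookups (a toy model of the behavior, not the actual netstack; the tables mirror the ones shown in this thread):

```python
import ipaddress

def best_route(dst, table):
    # Longest-prefix match; table maps network -> next-hop description
    nets = [n for n in table if dst in n]
    return table[max(nets, key=lambda n: n.prefixlen)]

# Nexus zone: only the broad /48 covers the DNS address.
nexus = {ipaddress.ip_network("fd00:1122:3344::/48"): "oxControlService1 -> etherstub -> GZ"}
# GZ (with ipv6-forwarding enabled): a more specific /64 points back at the etherstub.
gz = {
    ipaddress.ip_network("fd00:1122:3344:101::/64"): "underlay0 (sled subnet)",
    ipaddress.ip_network("fd00:1122:3344:1::/64"): "underlay0 -> etherstub -> DNS zone",
}

dns = ipaddress.ip_address("fd00:1122:3344:1::1")
hop1 = best_route(dns, nexus)   # packet leaves the nexus zone via the etherstub
hop2 = best_route(dns, gz)      # GZ forwards it back onto the etherstub
print(hop1)  # oxControlService1 -> etherstub -> GZ
print(hop2)  # underlay0 -> etherstub -> DNS zone
```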

Contributor commented:

I believe the following will work, which summarizes the above conversation in part, and also makes a few simplifications.

(diagram: proposed zone / etherstub network layout)

I've tested this out on a fresh VM by creating the zones and doing all the plumbing and things appear to work. Here is what the setup looks like live. It does still require routeadm -e ipv6-forwarding -u in the GZ.

GZ

root@sled# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
vioif0/v6         dhcp     ok           10.47.0.83/24
lo0/v6            static   ok           ::1/128
vnic0/v6          addrconf ok           fe80::8:20ff:fefc:a943/10
vnic0/omicron     static   ok           fd00:1122:3344:101::1/64
vnic0/dns         static   ok           fd00:1122:3344:1::2/64
root@sled# netstat -nr -f inet6

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      2      20 lo0
fd00:1122:3344:1::/64       fd00:1122:3344:1::2         U       3      16 vnic0
fd00:1122:3344:101::/64     fd00:1122:3344:101::1       U       3      11 vnic0
fe80::/10                   fe80::8:20ff:fefc:a943      U       2       0 vnic0

Ping the Omicron zone

root@han:/opt/cargo-bay# ping fd00:1122:3344:101::3
fd00:1122:3344:101::3 is alive

Ping the DNS zone

root@han:/opt/cargo-bay# ping fd00:1122:3344:1::1
fd00:1122:3344:1::1 is alive

DNS Zone

root@dns:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
vnic1/v6          addrconf ok           fe80::8:20ff:fe6a:2b/10
vnic1/underlay    static   ok           fd00:1122:3344:1::1/64
root@dns:~# netstat -nr -f inet6

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      2       0 lo0
fd00:1122:3344:1::/64       fd00:1122:3344:1::1         U       3       7 vnic1
fd00:1122:3344::/48         fd00:1122:3344:1::2         UG      2       3
fe80::/10                   fe80::8:20ff:fe6a:2b        U       2       0 vnic1

Ping the Omicron zone

root@dns:~# ping fd00:1122:3344:101::3
ICMPv6 redirect from gateway fe80::8:20ff:fefc:a943
 to fd00:1122:3344:101::3 for fd00:1122:3344:101::3
fd00:1122:3344:101::3 is alive

Omicron Zone

root@omicron:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
vnic2/v6          addrconf ok           fe80::8:20ff:fe54:569d/10
vnic2/underlay    static   ok           fd00:1122:3344:101::3/64
root@omicron:~# netstat -nr -f inet6
Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      2       0 lo0
fd00:1122:3344:101::/64     fd00:1122:3344:101::3       U       3       4 vnic2
fd00:1122:3344::/48         fd00:1122:3344:101::1       UG      2       5
fe80::/10                   fe80::8:20ff:fe54:569d      U       2       0 vnic2

Ping the DNS zone

root@omicron:~# ping fd00:1122:3344:0001::1
ICMPv6 redirect from gateway fe80::8:20ff:fefc:a943
 to fd00:1122:3344:1::1 for fd00:1122:3344:1::1
fd00:1122:3344:0001::1 is alive

@smklein (Collaborator, Author) commented:

As of 44dc885, I am automatically adding these routes within the Sled Agent, and have confirmed connectivity between all zones and the GZ.

@smklein changed the title from "WIP etherstub VNIC allocation" to "[sled-agent] Allocate VNICs over etherstubs, fix inter-zone routing" May 13, 2022
@smklein smklein marked this pull request as ready for review May 13, 2022 20:57
@smklein smklein requested review from bnaecker and rcgoodfellow May 13, 2022 20:57
@smklein (Collaborator, Author) commented May 13, 2022:

Underlay routing looks happy now!

 $ DNS_ADDRESS="fd00:1122:3344:1::1"
 $ NEXUS_ADDRESS="fd00:1122:3344:101::3"
 $ SLED_ADDRESS="fd00:1122:3344:101::1"
 
// Ping addresses from Global Zone
 $ ping $DNS_ADDRESS && ping $NEXUS_ADDRESS && ping $SLED_ADDRESS 
fd00:1122:3344:1::1 is alive
fd00:1122:3344:101::3 is alive
fd00:1122:3344:101::1 is alive

// Ping addresses from Nexus Zone
 $ pfexec zlogin oxz_nexus ping $DNS_ADDRESS && ping $NEXUS_ADDRESS && ping $SLED_ADDRESS
fd00:1122:3344:1::1 is alive
fd00:1122:3344:101::3 is alive
fd00:1122:3344:101::1 is alive

// Ping addresses from Internal DNS zone
 $ pfexec zlogin oxz_internal-dns ping $DNS_ADDRESS && ping $NEXUS_ADDRESS && ping $SLED_ADDRESS
fd00:1122:3344:1::1 is alive
fd00:1122:3344:101::3 is alive
fd00:1122:3344:101::1 is alive

@bnaecker (Collaborator) left a review:

Seems OK to me at this point. I'd love some confirmation from others, but I don't think that should block integration, since these changes are likely straightforward to modify.

@smklein smklein enabled auto-merge (squash) June 1, 2022 00:50
@smklein smklein merged commit 813a859 into main Jun 1, 2022
@smklein smklein deleted the etherstub branch June 1, 2022 01:45
jgallagher added a commit that referenced this pull request Jun 3, 2022
leftwo pushed a commit that referenced this pull request Jan 10, 2024
Propolis changes since the last update:
Gripe when using non-raw block device
Update zerocopy dependency
nvme: Wire up GetFeatures command
Make Viona more robust in the face of errors
bump softnpu (#577)
Modernize 16550 UART

Crucible changes since the last update:
Don't check ROP if the scrub is done (#1093)
Allow crutest cli to be quiet on generic test (#1070)
Offload write encryption (#1066)
Simplify handling of BlockReq at program exit (#1085)
Update Rust crate byte-unit to v5 (#1054)
Remove unused fields in match statements, downstairs edition (#1084)
Remove unused fields in match statements and consolidate (#1083)
Add logger to Guest (#1082)
Drive hash / decrypt tests from Upstairs::apply
Wait to reconnect if auto_promote is false
Change guest work id from u64 -> GuestWorkId
remove BlockOp::Commit (#1072)
Various clippy fixes (#1071)
Don't panic if tasks are destroyed out of order
Update Rust crate reedline to 0.27.1 (#1074)
Update Rust crate async-trait to 0.1.75 (#1073)
Buffer should destructure to Vec when single-referenced
Don't fail to make unencrypted regions (#1067)
Fix shadowing in downstairs (#1063)
Single-task refactoring (#1058)
Update Rust crate tokio to 1.35 (#1052)
Update Rust crate openapiv3 to 2.0.0 (#1050)
Update Rust crate libc to 0.2.151 (#1049)
Update Rust crate rusqlite to 0.30 (#1035)
leftwo added a commit that referenced this pull request Jan 11, 2024

Successfully merging this pull request may close these issues.

[sled-agent] Allocate VNICs from a per-sled etherstub device, rather than using the physical link
4 participants