
[sled-agent] Allocate VNICs over etherstubs, fix inter-zone routing #1066

Merged 18 commits into main from etherstub on Jun 1, 2022

Conversation

@smklein (Collaborator) commented May 13, 2022:

Fixes #987

Goals of this PR:

  • Be able to ping "sled addresses" (within the Sled's /64) from the GZ or non-GZ
  • Be able to ping "AZ-wide" services (like the internal DNS service) from either the GZ or the non-GZ

Implementation details of this PR:

  • Switches all VNIC allocation to occur over an "etherstub" device, called stub0.
  • Allocate all GZ addresses (bootstrap, Sled, addrconf) over an "etherstub"-allocated VNIC, called underlay0.

@@ -122,7 +123,7 @@ impl AddressRequest {
     pub fn new_static(ip: IpAddr, prefix: Option<u8>) -> Self {
         let prefix = prefix.unwrap_or_else(|| match ip {
             IpAddr::V4(_) => 24,
-            IpAddr::V6(_) => 64,
+            IpAddr::V6(_) => AZ_PREFIX,
@smklein (Collaborator, Author) commented:

This ended up being a major aspect of this patch - without it, I could ping all /64 addresses between GZ / non-GZ zones, but not the DNS addresses.

However, by opening it up to the AZ prefix, I can also communicate between arbitrary "sled-local" services and the internal-dns server, which resides outside the sled's /64.
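
The prefix relationship at issue here can be checked with a short sketch (Python's `ipaddress` module; the concrete prefixes are the ones used throughout this thread):

```python
import ipaddress

# AZ-wide prefix (/48) vs. this sled's subnet (/64)
az = ipaddress.ip_network("fd00:1122:3344::/48")
sled = ipaddress.ip_network("fd00:1122:3344:101::/64")
dns = ipaddress.ip_address("fd00:1122:3344:1::1")  # internal DNS, outside the sled's /64

print(dns in sled)  # False: a /64 on-link route can't reach it
print(dns in az)    # True: the /48 covers it
```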

Collaborator commented:

I can see how we need that, but I'm not entirely sure it's how we want to solve the problem of routing to the DNS server. IIUC, you're saying that the sled agent's VNICs for Oxide services (sled agent, nexus, propolis, etc.) are now things like fd00:1122:3344:101::/48. What does the sled's /64 prefix mean in this setup? I think @rcgoodfellow or @rmustacc should probably weigh in here, since it seems to me to kinda be skirting the real meaning of that prefix.

I think one option would be to add a separate route which specifies the DNS server's address / prefix. I believe DDM will ultimately be manipulating the OS's routing tables so that's actually true. But that may not be enough, in that traffic from the VNIC also needs a route pointing it to an interface for the DNS address. I don't know enough to be sure here.

@smklein (Collaborator, Author) replied:

For context, this is what my global zone looks like:

// Note, basically all `/64`
$ ipadm
...
underlay0/linklocal addrconf ok         fe80::8:20ff:fea6:3b8/10
underlay0/bootstrap6 static ok          fdb0:18c0:4d0c:f4e5::1/64
underlay0/sled6   static   ok           fd00:1122:3344:101::1/64
underlay0/internaldns static ok         fd00:1122:3344:1::2/64

Meanwhile, in Nexus (non-global zone):

// Note, this is where the `/48` shows up - it's the AZ_PREFIX.
# ipadm
...
oxControlService1/linklocal addrconf ok fe80::8:20ff:fe35:d2a5/10
oxControlService1/omicron6 static ok    fd00:1122:3344:101::3/48

This /48 specifically alters the routing within the non-global zone - netstat -rn -f inet6 in Nexus shows the following:

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If   
--------------------------- --------------------------- ----- --- ------- ----- 
::1                         ::1                         UH      2       0 lo0   
fd00:1122:3344::/48         fd00:1122:3344:101::3       U       6    2259 oxControlService1 
fe80::/10                   fe80::8:20ff:fe35:d2a5      U       2       0 oxControlService1

Having all traffic destined for the AZ routed through the interface is the piece I really care about here.

Do you think it would be preferable to:

  • Continue allocating addresses within non-global zones as /64
  • Call route to manually add this path to the AZ subnet?
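
Either way, the effect on route lookup is the same; here's a hypothetical sketch of the longest-prefix match (addresses from this PR, lookup logic heavily simplified):

```python
import ipaddress

def lookup(dst, table):
    """Longest-prefix match over (network, interface) pairs; None if no route."""
    matches = [(net, ifname) for net, ifname in table if dst in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)

# With only the /64 on-link route, DNS traffic has no matching route...
table = [(ipaddress.ip_network("fd00:1122:3344:101::/64"), "oxControlService1")]
dns = ipaddress.ip_address("fd00:1122:3344:1::1")
print(lookup(dns, table))  # None

# ...but either the /48 address allocation or an explicit `route add` yields:
table.append((ipaddress.ip_network("fd00:1122:3344::/48"), "oxControlService1"))
net, ifname = lookup(dns, table)
print(net, ifname)  # fd00:1122:3344::/48 oxControlService1
```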

Collaborator replied:

Thanks for those details. I think that is what I expected, that we'd have a route entry that says "anything in fd00:1122:3344::/48 should go out the VNIC oxControlService1". The part I'm wondering about is, that would imply that the netstack would expect that it could use that interface for any traffic from the Nexus zone for any other sled. As I write this, I realize that may be fine. If Nexus is trying to reach another service on the same sled, that packet will go out the VNIC, to the etherstub, and then presumably to the other zone's VNIC. If it's trying to reach something off the sled, I'm less sure of what'll happen there. It looks like it'll still go to the zone VNIC, the etherstub, and then to whatever route you have in the GZ that matches that (if one exists).

I think this is probably fine. It also seems to be working for a single machine, and it's easy enough to update this if we find it doesn't work for multiple machines.

@smklein (Collaborator, Author) replied:

Do you think it would be preferable to:

  • Continue allocating addresses within non-global zones as /64
  • Call route to manually add this path to the AZ subnet?

I've verified that this method works too - I'm seeing routing between zones by using:

route add -inet6 <address>/48 <address> -interface

When setting up a non-GZ address

@smklein (Collaborator, Author) commented May 18, 2022:

Isn't the point of the default route that "if the destination address doesn't match the other rules, it should use this gateway"?

I tried your suggestion, but this doesn't seem to be working for me:

root@oxz_nexus:~# pfexec route add -inet6 fd00:1122:3344:1::1 fd00:1122:3344:101::3
add host fd00:1122:3344:1::1: gateway fd00:1122:3344:101::3: Network is unreachable

// Also does not work with the `/48` in the destination
root@oxz_nexus:~# pfexec route add -inet6 fd00:1122:3344:1::1/48 -inet6 fd00:1122:3344:101::3
add net fd00:1122:3344:1::1/48: gateway fd00:1122:3344:101::3: Network is unreachable

@smklein (Collaborator, Author) commented:

Chatting out-of-band with @bnaecker a bit: By issuing the following in the GZ:

routeadm -e ipv6-forwarding
routeadm -u

I'm seeing the routing make the extra hop, from NGZ -> GZ (and now) -> NGZ

Collaborator replied:

That's true, I explained that very poorly. I was trying to point out that this command:

root@oxz_nexus:~# pfexec route add -inet6 default -inet6 fd00:1122:3344:101::1

isn't what I'd expect. In particular, that says for any traffic without a more specific route, send it to the gateway fd00:1122:3344:101::1. But that's not a gateway that the nexus zone has! The netstat -rn output shows the gateway we need as fd00:1122:3344:101::3.

But in any case, Robert pointed out that these routing tables are necessary but not sufficient to get this all to work. Specifically, we need to tell the GZ to actually act as a router, forwarding packets between different networks. That is, we've provided rules (assuming we can figure out how to express them 😆 ) for the routing daemon to use when forwarding packets, but it'll only do so if it's explicitly told it should.

I believe this can be accomplished with the command routeadm -e ipv6-forwarding -u, which enables route forwarding and restarts the SMF service(s) necessary to make that apply to the running system. IIUC, at that point, when the GZ networking stack receives a packet from the nexus zone, with an IP address of the (non-global) DNS zone, it'll attempt to forward that, by consulting the routing table.

I'm hypothesizing, but it seems like we need two routes then:

  • A route that tells the nexus zone to use its VNIC's address as the gateway for DNS traffic
  • A route that tells the GZ how to reach the DNS zone's addresses

The former could be a default route, or a more constrained one listing the prefix for the DNS server. It seems like either should work, as long as the gateway is the IP address of the VNIC in the nexus zone, in this case fd00:1122:3344:101::3.

The latter can be accomplished by adding a route table that directs all the DNS traffic to the GZ's VNIC, I think. My understanding is that this would go onto the GZ VNIC, to the etherstub, and then forwarded to the non-global DNS zone VNIC.

I was initially confused as to why the "virtual switch" that man dladm describes under the create-etherstub command doesn't transparently do this. All the traffic is within that etherstub, and I'd have expected neighbor discovery and thus routing to be done automatically. So why do we need this?

The key is that the DNS addresses are in a different subnet. The etherstub will transparently create routes between all the other non-global zones, but once you're trying to reach an address in a different subnet, that has to involve routing. This explains why the -interface flag worked initially, too. That's effectively telling the etherstub that the other subnet can actually be routed to through the same L2 domain, even though it's on a different L3 subnet.

Robert pointed out that we may actually want a separate etherstub for the DNS zone. That'd more closely model the actual network we're emulating. In particular, we're trying to say that the GZ and all the non-DNS service zones are one little subnet, in the sled's /64. The DNS service is explicitly in a separate /64, for route summarization and the fact that it really is supposed to be a rack-wide or AZ-wide service.

To be clear, we should not add an additional etherstub in this PR. I think that's where we want to go longer-term, but we can defer it for sure.

So, summarizing everything: when Nexus wants to send a packet to the DNS server, it'll first go to the etherstub. The etherstub will not explicitly have a gateway for that, since it's in another subnet. It'll deliver it to the GZ. At that point, the IP stack in the GZ will take the packet and also note that it doesn't have that address. It'll instead consult the routing tables (assuming forwarding is enabled), and note that it can send that...back to the etherstub! That'll then go to the DNS zone.
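
The two-hop path described above can be sketched as two successive table lookups (a toy model of the behavior, not the actual netstack; the tables mirror the ones shown in this thread):

```python
import ipaddress

def best_route(dst, table):
    # Longest-prefix match; table maps network -> next-hop description
    nets = [n for n in table if dst in n]
    return table[max(nets, key=lambda n: n.prefixlen)]

# Nexus zone: only the broad /48 covers the DNS address.
nexus = {ipaddress.ip_network("fd00:1122:3344::/48"): "oxControlService1 -> etherstub -> GZ"}
# GZ (with ipv6-forwarding enabled): a more specific /64 points back at the etherstub.
gz = {
    ipaddress.ip_network("fd00:1122:3344:101::/64"): "underlay0 (sled subnet)",
    ipaddress.ip_network("fd00:1122:3344:1::/64"): "underlay0 -> etherstub -> DNS zone",
}

dns = ipaddress.ip_address("fd00:1122:3344:1::1")
hop1 = best_route(dns, nexus)   # packet leaves the nexus zone via the etherstub
hop2 = best_route(dns, gz)      # GZ forwards it back onto the etherstub
print(hop1)  # oxControlService1 -> etherstub -> GZ
print(hop2)  # underlay0 -> etherstub -> DNS zone
```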

Contributor commented:

I believe the following will work, which summarizes the above conversation in part, and also makes a few simplifications.

(diagram: proposed zone / etherstub network layout)

I've tested this out on a fresh VM by creating the zones and doing all the plumbing and things appear to work. Here is what the setup looks like live. It does still require routeadm -e ipv6-forwarding -u in the GZ.

GZ

root@sled# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
vioif0/v6         dhcp     ok           10.47.0.83/24
lo0/v6            static   ok           ::1/128
vnic0/v6          addrconf ok           fe80::8:20ff:fefc:a943/10
vnic0/omicron     static   ok           fd00:1122:3344:101::1/64
vnic0/dns         static   ok           fd00:1122:3344:1::2/64
root@sled# netstat -nr -f inet6

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      2      20 lo0
fd00:1122:3344:1::/64       fd00:1122:3344:1::2         U       3      16 vnic0
fd00:1122:3344:101::/64     fd00:1122:3344:101::1       U       3      11 vnic0
fe80::/10                   fe80::8:20ff:fefc:a943      U       2       0 vnic0

Ping the Omicron zone

root@han:/opt/cargo-bay# ping fd00:1122:3344:101::3
fd00:1122:3344:101::3 is alive

Ping the DNS zone

root@han:/opt/cargo-bay# ping fd00:1122:3344:1::1
fd00:1122:3344:1::1 is alive

DNS Zone

root@dns:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
vnic1/v6          addrconf ok           fe80::8:20ff:fe6a:2b/10
vnic1/underlay    static   ok           fd00:1122:3344:1::1/64
root@dns:~# netstat -nr -f inet6

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      2       0 lo0
fd00:1122:3344:1::/64       fd00:1122:3344:1::1         U       3       7 vnic1
fd00:1122:3344::/48         fd00:1122:3344:1::2         UG      2       3
fe80::/10                   fe80::8:20ff:fe6a:2b        U       2       0 vnic1

Ping the Omicron zone

root@dns:~# ping fd00:1122:3344:101::3
ICMPv6 redirect from gateway fe80::8:20ff:fefc:a943
 to fd00:1122:3344:101::3 for fd00:1122:3344:101::3
fd00:1122:3344:101::3 is alive

Omicron Zone

root@omicron:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
vnic2/v6          addrconf ok           fe80::8:20ff:fe54:569d/10
vnic2/underlay    static   ok           fd00:1122:3344:101::3/64
root@omicron:~# netstat -nr -f inet6
Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      2       0 lo0
fd00:1122:3344:101::/64     fd00:1122:3344:101::3       U       3       4 vnic2
fd00:1122:3344::/48         fd00:1122:3344:101::1       UG      2       5
fe80::/10                   fe80::8:20ff:fe54:569d      U       2       0 vnic2

Ping the DNS zone

root@omicron:~# ping fd00:1122:3344:0001::1
ICMPv6 redirect from gateway fe80::8:20ff:fefc:a943
 to fd00:1122:3344:1::1 for fd00:1122:3344:1::1
fd00:1122:3344:0001::1 is alive

@smklein (Collaborator, Author) commented:

As of 44dc885, I am automatically adding these routes within the Sled Agent, and have confirmed connectivity between all zones and the GZ.

@smklein changed the title from "WIP etherstub VNIC allocation" to "[sled-agent] Allocate VNICs over etherstubs, fix inter-zone routing" May 13, 2022
@smklein smklein marked this pull request as ready for review May 13, 2022 20:57
@smklein smklein requested review from bnaecker and rcgoodfellow May 13, 2022 20:57
@smklein (Collaborator, Author) commented May 13, 2022:

Underlay routing looks happy now!

 $ DNS_ADDRESS="fd00:1122:3344:1::1"
 $ NEXUS_ADDRESS="fd00:1122:3344:101::3"
 $ SLED_ADDRESS="fd00:1122:3344:101::1"
 
// Ping addresses from Global Zone
 $ ping $DNS_ADDRESS && ping $NEXUS_ADDRESS && ping $SLED_ADDRESS 
fd00:1122:3344:1::1 is alive
fd00:1122:3344:101::3 is alive
fd00:1122:3344:101::1 is alive

// Ping addresses from Nexus Zone
 $ pfexec zlogin oxz_nexus ping $DNS_ADDRESS && ping $NEXUS_ADDRESS && ping $SLED_ADDRESS
fd00:1122:3344:1::1 is alive
fd00:1122:3344:101::3 is alive
fd00:1122:3344:101::1 is alive

// Ping addresses from Internal DNS zone
 $ pfexec zlogin oxz_internal-dns ping $DNS_ADDRESS && ping $NEXUS_ADDRESS && ping $SLED_ADDRESS
fd00:1122:3344:1::1 is alive
fd00:1122:3344:101::3 is alive
fd00:1122:3344:101::1 is alive

@bnaecker (Collaborator) left a review:

Seems OK to me at this point. I'd love some confirmation from others, but I don't think that should block integration, since these changes are likely straightforward to modify.

@smklein smklein enabled auto-merge (squash) June 1, 2022 00:50
@smklein smklein merged commit 813a859 into main Jun 1, 2022
@smklein smklein deleted the etherstub branch June 1, 2022 01:45
jgallagher added a commit that referenced this pull request Jun 3, 2022
leftwo pushed a commit that referenced this pull request Jan 10, 2024
Propolis changes since the last update:
Gripe when using non-raw block device
Update zerocopy dependency
nvme: Wire up GetFeatures command
Make Viona more robust in the face of errors
bump softnpu (#577)
Modernize 16550 UART

Crucible changes since the last update:
Don't check ROP if the scrub is done (#1093)
Allow crutest cli to be quiet on generic test (#1070)
Offload write encryption (#1066)
Simplify handling of BlockReq at program exit (#1085)
Update Rust crate byte-unit to v5 (#1054)
Remove unused fields in match statements, downstairs edition (#1084)
Remove unused fields in match statements and consolidate (#1083)
Add logger to Guest (#1082)
Drive hash / decrypt tests from Upstairs::apply
Wait to reconnect if auto_promote is false
Change guest work id from u64 -> GuestWorkId
remove BlockOp::Commit (#1072)
Various clippy fixes (#1071)
Don't panic if tasks are destroyed out of order
Update Rust crate reedline to 0.27.1 (#1074)
Update Rust crate async-trait to 0.1.75 (#1073)
Buffer should destructure to Vec when single-referenced
Don't fail to make unencrypted regions (#1067)
Fix shadowing in downstairs (#1063)
Single-task refactoring (#1058)
Update Rust crate tokio to 1.35 (#1052)
Update Rust crate openapiv3 to 2.0.0 (#1050)
Update Rust crate libc to 0.2.151 (#1049)
Update Rust crate rusqlite to 0.30 (#1035)
leftwo added a commit that referenced this pull request Jan 11, 2024

Successfully merging this pull request may close these issues.

[sled-agent] Allocate VNICs from a per-sled etherstub device, rather than using the physical link
4 participants