[sled-agent] Allocate VNICs over etherstubs, fix inter-zone routing #1066
Conversation
sled-agent/src/illumos/zone.rs
@@ -122,7 +123,7 @@ impl AddressRequest {
     pub fn new_static(ip: IpAddr, prefix: Option<u8>) -> Self {
         let prefix = prefix.unwrap_or_else(|| match ip {
             IpAddr::V4(_) => 24,
-            IpAddr::V6(_) => 64,
+            IpAddr::V6(_) => AZ_PREFIX,
This ended up being a major aspect of this patch - without it, I could ping all /64 addresses between GZ / non-GZ zones, but not the DNS addresses. However, by opening it up to the AZ prefix, I can also communicate between arbitrary "sled-local" services and the internal-dns server, which resides outside the sled's /64.
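The effect of the one-line diff above can be sketched in isolation (this is a standalone sketch, not sled-agent code; the value 48 for `AZ_PREFIX` is assumed from the /48 discussed later in this thread):

```rust
use std::net::IpAddr;

// Assumed value: the thread identifies the AZ-wide prefix as /48.
const AZ_PREFIX: u8 = 48;

// Mirrors the defaulting logic in AddressRequest::new_static after the patch:
// an IPv6 address with no explicit prefix now defaults to the AZ-wide /48
// instead of the sled's /64, so the zone's interface route covers the AZ.
fn default_prefix(ip: IpAddr) -> u8 {
    match ip {
        IpAddr::V4(_) => 24,
        IpAddr::V6(_) => AZ_PREFIX, // previously 64
    }
}

fn main() {
    let nexus: IpAddr = "fd00:1122:3344:101::3".parse().unwrap();
    println!("{}", default_prefix(nexus)); // prints 48
}
```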
I can see how we need that, but I'm not entirely sure it's how we want to solve the problem of routing to the DNS server. IIUC, you're saying that the sled agent's VNICs for Oxide services (sled agent, nexus, propolis, etc.) now carry prefixes like fd00:1122:3344:101::/48. What does the sled's /64 prefix mean in this setup? I think @rcgoodfellow or @rmustacc should probably weigh in here, since it seems to me to kinda be skirting the real meaning of that prefix.
I think one option would be to add a separate route which specifies the DNS server's address / prefix. I believe DDM will ultimately be manipulating the OS's routing tables so that's actually true. But that may not be enough, in that traffic from the VNIC also needs a route pointing it to an interface for the DNS address. I don't know enough to be sure here.
For context, this is what my global zone looks like:
// Note, basically all `/64`
$ ipadm
...
underlay0/linklocal addrconf ok fe80::8:20ff:fea6:3b8/10
underlay0/bootstrap6 static ok fdb0:18c0:4d0c:f4e5::1/64
underlay0/sled6 static ok fd00:1122:3344:101::1/64
underlay0/internaldns static ok fd00:1122:3344:1::2/64
Meanwhile, in Nexus (non-global zone):
// Note, this is where the `/48` shows up - it's the AZ_PREFIX.
# ipadm
...
oxControlService1/linklocal addrconf ok fe80::8:20ff:fe35:d2a5/10
oxControlService1/omicron6 static ok fd00:1122:3344:101::3/48
This /48 specifically alters the routing within the non-global zone - netstat -rn -f inet6 in Nexus shows the following:
Routing Table: IPv6
Destination/Mask Gateway Flags Ref Use If
--------------------------- --------------------------- ----- --- ------- -----
::1 ::1 UH 2 0 lo0
fd00:1122:3344::/48 fd00:1122:3344:101::3 U 6 2259 oxControlService1
fe80::/10 fe80::8:20ff:fe35:d2a5 U 2 0 oxControlService1
Having all traffic destined for the AZ routed through the interface is the piece I really care about here.
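To see why the /48 matters, a small check (a sketch; addresses taken from the tables above) shows that the internal-dns address falls outside the sled's /64 but inside the AZ's /48, so only the /48 interface route matches it:

```rust
use std::net::Ipv6Addr;

// Returns true if `addr` falls inside `net`/`prefix`, by masking the
// top `prefix` bits of both and comparing.
fn in_prefix(addr: Ipv6Addr, net: Ipv6Addr, prefix: u32) -> bool {
    let mask = if prefix == 0 { 0 } else { u128::MAX << (128 - prefix) };
    (u128::from(addr) & mask) == (u128::from(net) & mask)
}

fn main() {
    let dns: Ipv6Addr = "fd00:1122:3344:1::1".parse().unwrap();
    let sled_net: Ipv6Addr = "fd00:1122:3344:101::".parse().unwrap();
    let az_net: Ipv6Addr = "fd00:1122:3344::".parse().unwrap();

    // The DNS server is outside the sled's /64...
    assert!(!in_prefix(dns, sled_net, 64));
    // ...but inside the AZ-wide /48, so the /48 interface route covers it.
    assert!(in_prefix(dns, az_net, 48));
    println!("ok");
}
```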
Do you think it would be preferable to:
- Continue allocating addresses within non-global zones as /64
- Call route to manually add this path to the AZ subnet?
Thanks for those details. I think that is what I expected, that we'd have a route entry that says "anything in fd00:1122:3344::/48 should go out the VNIC oxControlService1". The part I'm wondering about is that this would imply the netstack could use that interface for traffic from the Nexus zone to any other sled. As I write this, I realize that may be fine. If Nexus is trying to reach another service on the same sled, that packet will go out the VNIC, to the etherstub, and then presumably to the other zone's VNIC. If it's trying to reach something off the sled, I'm less sure what'll happen. It looks like it'll still go to the zone's VNIC, the etherstub, and then to whatever route you have in the GZ that matches it (if one exists).
I think this is probably fine. It also seems to be working for a single machine, and it's easy enough to update this if we find it doesn't work for multiple machines.
Do you think it would be preferable to:
- Continue allocating addresses within non-global zones as /64
- Call route to manually add this path to the AZ subnet?
I've verified that this method works too - I'm seeing routing between zones by using:
route add -inet6 <address>/48 <address> -interface
when setting up a non-GZ address.
Isn't the point of the default route that "if the destination address doesn't match the other rules, it should use this gateway"?
I tried your suggestion, but this doesn't seem to be working for me:
root@oxz_nexus:~# pfexec route add -inet6 fd00:1122:3344:1::1 fd00:1122:3344:101::3
add host fd00:1122:3344:1::1: gateway fd00:1122:3344:101::3: Network is unreachable
// Also does not work with the `/48` in the destination
root@oxz_nexus:~# pfexec route add -inet6 fd00:1122:3344:1::1/48 -inet6 fd00:1122:3344:101::3
add net fd00:1122:3344:1::1/48: gateway fd00:1122:3344:101::3: Network is unreachable
Chatting out-of-band with @bnaecker a bit: By issuing the following in the GZ:
routeadm -e ipv6-forwarding
routeadm -u
I'm seeing the routing make the extra hop, from NGZ -> GZ (and now) -> NGZ
That's true, I explained that very poorly. I was trying to point out that this command:
root@oxz_nexus:~# pfexec route add -inet6 default -inet6 fd00:1122:3344:101::1
isn't what I'd expect. In particular, it says that any traffic without a more specific route should be sent to the gateway fd00:1122:3344:101::1. But that's not a gateway that the nexus zone has! The netstat -rn output shows the gateway we need as fd00:1122:3344:101::3.
But in any case, Robert pointed out that these routing tables are necessary but not sufficient to get this all to work. Specifically, we need to tell the GZ to actually act as a router, forwarding packets between different networks. That is, we've provided rules (assuming we can figure out how to express them 😆 ) for the routing daemon to use when forwarding packets, but it'll only do so if it's explicitly told it should.
I believe this can be accomplished with the command routeadm -e ipv6-forwarding -u, which enables route forwarding and restarts the SMF service(s) necessary to make that apply to the running system. IIUC, at that point, when the GZ networking stack receives a packet from the nexus zone with an IP address of the (non-global) DNS zone, it'll attempt to forward it by consulting the routing table.
I'm hypothesizing, but it seems like we need two routes then:
- A route that tells the nexus zone to use its VNIC's address as the gateway for DNS traffic
- A route that tells the GZ how to reach the DNS zone's addresses
The former could be a default route, or a more constrained one listing the prefix for the DNS server. It seems like either should work, as long as the gateway is the IP address of the VNIC in the nexus zone, in this case fd00:1122:3344:101::3.
The latter can be accomplished by adding a route that directs all the DNS traffic to the GZ's VNIC, I think. My understanding is that this would go onto the GZ VNIC, to the etherstub, and then be forwarded to the non-global DNS zone's VNIC.
I was initially confused as to why the "virtual switch" that man dladm describes under the create-etherstub command doesn't transparently do this. All the traffic is within that etherstub, and I'd have expected neighbor discovery and thus routing to be done automatically. So why do we need this?
The key is that the DNS addresses are in a different subnet. The etherstub will transparently create routes between all the other non-global zones, but once you're trying to reach an address in a different subnet, that has to involve routing. This explains why the -interface flag worked initially, too. That's effectively telling the etherstub that the other subnet can actually be reached through the same L2 domain, even though it's on a different L3 subnet.
Robert pointed out that we may actually want a separate etherstub for the DNS zone. That'd more closely model the actual network we're emulating. In particular, we're trying to say that the GZ and all the non-DNS service zones are one little subnet, in the sled's /64. The DNS service is explicitly in a separate /64, for route summarization and the fact that it really is supposed to be a rack-wide or AZ-wide service.
To be clear, we should not add an additional etherstub in this PR. I think that's where we want to go longer-term, but we can defer it for sure.
So, summarizing everything: when nexus wants to send a packet to the DNS server, that'll first go to the etherstub. The etherstub will not explicitly have a gateway for that, since it's in another subnet. It'll deliver it to the GZ. At that point, the IP stack in the GZ will take the packet and also note that it doesn't have that address. It'll instead consult the routing tables (assuming forwarding is enabled), and note that it can send that...back to the etherstub! That'll then go to the DNS zone.
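The GZ's forwarding decision described above can be illustrated with a toy longest-prefix-match over the two /64 interface routes the GZ holds (a sketch only, not illumos code; addresses are the sled and DNS subnets from this thread, and the route labels are illustrative):

```rust
use std::net::Ipv6Addr;

// Mask off all but the top `prefix` bits of an IPv6 address.
fn masked(addr: Ipv6Addr, prefix: u32) -> u128 {
    let mask = if prefix == 0 { 0 } else { u128::MAX << (128 - prefix) };
    u128::from(addr) & mask
}

// Toy longest-prefix-match: of the routes whose network contains `dest`,
// return the label of the one with the longest prefix.
fn lookup<'a>(dest: Ipv6Addr, routes: &'a [(Ipv6Addr, u32, &'a str)]) -> Option<&'a str> {
    routes
        .iter()
        .filter(|(net, plen, _)| masked(dest, *plen) == masked(*net, *plen))
        .max_by_key(|(_, plen, _)| *plen)
        .map(|(_, _, label)| *label)
}

fn main() {
    // The GZ's two interface routes: the DNS /64 and the sled's /64.
    let routes = [
        ("fd00:1122:3344:1::".parse().unwrap(), 64, "DNS /64"),
        ("fd00:1122:3344:101::".parse().unwrap(), 64, "sled /64"),
    ];

    // A packet the GZ forwards toward the DNS server matches the DNS /64...
    let dns: Ipv6Addr = "fd00:1122:3344:1::1".parse().unwrap();
    println!("{}", lookup(dns, &routes).unwrap()); // prints "DNS /64"

    // ...while Nexus's own address matches the sled's /64.
    let nexus: Ipv6Addr = "fd00:1122:3344:101::3".parse().unwrap();
    println!("{}", lookup(nexus, &routes).unwrap()); // prints "sled /64"
}
```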
I believe the following will work, which summarizes the above conversation in part, and also makes a few simplifications.
I've tested this out on a fresh VM by creating the zones and doing all the plumbing, and things appear to work. Here is what the setup looks like live. It does still require routeadm -e ipv6-forwarding -u in the GZ.
GZ
root@sled# ipadm
ADDROBJ TYPE STATE ADDR
lo0/v4 static ok 127.0.0.1/8
vioif0/v6 dhcp ok 10.47.0.83/24
lo0/v6 static ok ::1/128
vnic0/v6 addrconf ok fe80::8:20ff:fefc:a943/10
vnic0/omicron static ok fd00:1122:3344:101::1/64
vnic0/dns static ok fd00:1122:3344:1::2/64
root@sled# netstat -nr -f inet6
Routing Table: IPv6
Destination/Mask Gateway Flags Ref Use If
--------------------------- --------------------------- ----- --- ------- -----
::1 ::1 UH 2 20 lo0
fd00:1122:3344:1::/64 fd00:1122:3344:1::2 U 3 16 vnic0
fd00:1122:3344:101::/64 fd00:1122:3344:101::1 U 3 11 vnic0
fe80::/10 fe80::8:20ff:fefc:a943 U 2 0 vnic0
Ping the Omicron zone
root@han:/opt/cargo-bay# ping fd00:1122:3344:101::3
fd00:1122:3344:101::3 is alive
Ping the DNS zone
root@han:/opt/cargo-bay# ping fd00:1122:3344:1::1
fd00:1122:3344:1::1 is alive
DNS Zone
root@dns:~# ipadm
ADDROBJ TYPE STATE ADDR
lo0/v4 static ok 127.0.0.1/8
lo0/v6 static ok ::1/128
vnic1/v6 addrconf ok fe80::8:20ff:fe6a:2b/10
vnic1/underlay static ok fd00:1122:3344:1::1/64
root@dns:~# netstat -nr -f inet6
Routing Table: IPv6
Destination/Mask Gateway Flags Ref Use If
--------------------------- --------------------------- ----- --- ------- -----
::1 ::1 UH 2 0 lo0
fd00:1122:3344:1::/64 fd00:1122:3344:1::1 U 3 7 vnic1
fd00:1122:3344::/48 fd00:1122:3344:1::2 UG 2 3
fe80::/10 fe80::8:20ff:fe6a:2b U 2 0 vnic1
Ping the Omicron zone
root@dns:~# ping fd00:1122:3344:101::3
ICMPv6 redirect from gateway fe80::8:20ff:fefc:a943
to fd00:1122:3344:101::3 for fd00:1122:3344:101::3
fd00:1122:3344:101::3 is alive
Omicron Zone
root@omicron:~# ipadm
ADDROBJ TYPE STATE ADDR
lo0/v4 static ok 127.0.0.1/8
lo0/v6 static ok ::1/128
vnic2/v6 addrconf ok fe80::8:20ff:fe54:569d/10
vnic2/underlay static ok fd00:1122:3344:101::3/64
root@omicron:~# netstat -nr -f inet6
Routing Table: IPv6
Destination/Mask Gateway Flags Ref Use If
--------------------------- --------------------------- ----- --- ------- -----
::1 ::1 UH 2 0 lo0
fd00:1122:3344:101::/64 fd00:1122:3344:101::3 U 3 4 vnic2
fd00:1122:3344::/48 fd00:1122:3344:101::1 UG 2 5
fe80::/10 fe80::8:20ff:fe54:569d U 2 0 vnic2
Ping the DNS zone
root@omicron:~# ping fd00:1122:3344:0001::1
ICMPv6 redirect from gateway fe80::8:20ff:fefc:a943
to fd00:1122:3344:1::1 for fd00:1122:3344:1::1
fd00:1122:3344:0001::1 is alive
As of 44dc885, I am automatically adding these routes within the Sled Agent, and have confirmed connectivity between all zones / the GZ.
Underlay routing looks happy now!
$ DNS_ADDRESS="fd00:1122:3344:1::1"
$ NEXUS_ADDRESS="fd00:1122:3344:101::3"
$ SLED_ADDRESS="fd00:1122:3344:101::1"
// Ping addresses from Global Zone
$ ping $DNS_ADDRESS && ping $NEXUS_ADDRESS && ping $SLED_ADDRESS
fd00:1122:3344:1::1 is alive
fd00:1122:3344:101::3 is alive
fd00:1122:3344:101::1 is alive
// Ping addresses from Nexus Zone
$ pfexec zlogin oxz_nexus ping $DNS_ADDRESS && ping $NEXUS_ADDRESS && ping $SLED_ADDRESS
fd00:1122:3344:1::1 is alive
fd00:1122:3344:101::3 is alive
fd00:1122:3344:101::1 is alive
// Ping addresses from Internal DNS zone
$ pfexec zlogin oxz_internal-dns ping $DNS_ADDRESS && ping $NEXUS_ADDRESS && ping $SLED_ADDRESS
fd00:1122:3344:1::1 is alive
fd00:1122:3344:101::3 is alive
fd00:1122:3344:101::1 is alive
Seems OK to me at this point. I'd love some confirmation from others, but I don't think that should block integration, since these changes are likely straightforward to modify.
Propolis changes since the last update:
- Gripe when using non-raw block device
- Update zerocopy dependency
- nvme: Wire up GetFeatures command
- Make Viona more robust in the face of errors
- bump softnpu (#577)
- Modernize 16550 UART
Crucible changes since the last update:
- Don't check ROP if the scrub is done (#1093)
- Allow crutest cli to be quiet on generic test (#1070)
- Offload write encryption (#1066)
- Simplify handling of BlockReq at program exit (#1085)
- Update Rust crate byte-unit to v5 (#1054)
- Remove unused fields in match statements, downstairs edition (#1084)
- Remove unused fields in match statements and consolidate (#1083)
- Add logger to Guest (#1082)
- Drive hash / decrypt tests from Upstairs::apply
- Wait to reconnect if auto_promote is false
- Change guest work id from u64 -> GuestWorkId
- remove BlockOp::Commit (#1072)
- Various clippy fixes (#1071)
- Don't panic if tasks are destroyed out of order
- Update Rust crate reedline to 0.27.1 (#1074)
- Update Rust crate async-trait to 0.1.75 (#1073)
- Buffer should destructure to Vec when single-referenced
- Don't fail to make unencrypted regions (#1067)
- Fix shadowing in downstairs (#1063)
- Single-task refactoring (#1058)
- Update Rust crate tokio to 1.35 (#1052)
- Update Rust crate openapiv3 to 2.0.0 (#1050)
- Update Rust crate libc to 0.2.151 (#1049)
- Update Rust crate rusqlite to 0.30 (#1035)
Co-authored-by: Alan Hanson <[email protected]>
Fixes #987
Goals of this PR:
Implementation details of this PR:
- Creates an etherstub, stub0.
- Creates a VNIC over the etherstub, underlay0.