OPTE for Control Plane Zone Comms #127

Closed
2 tasks
rcgoodfellow opened this issue Apr 21, 2022 · 7 comments

Comments

@rcgoodfellow
Contributor

The system-level zones we'll be running on sleds (for the control plane, storage, customer instances, dendrite, etc.) will require communication both directly on the underlay and on the boundary services overlay. An example of the latter is Nexus serving off-rack client requests to the user-facing API.

Given the requirement for both underlay and overlay communication, and OPTE's encapsulation capabilities combined with its general-purpose architecture, it seems like a win to leverage OPTE for service-zone communications in addition to customer instance communications.

The general situation would look like this:

                           overlay                   
             ╔═══════════destinations══════════╗     
             ║                                 ║     
             ║             underlay            ║     
             ║      ┌────destinations───┐      ║     
             ║      │                   │      ║     
             ║      │                   │      ║     
             ║  ┌───────┐           ┌───────┐  ║     
             ╚══│  phy  │           │  phy  │══╝     
                └───────┘           └───────┘        
                    │                   │            
               ┌────┴────────────────┬──┴───────┐    
               │                     │          │    
           ┌───────┐             ┌───────┐  ┌───────┐
           │ opte/ │             │ opte/ │  │ opte/ │
   fd00::1 │  xde  │             │  xde  │  │  xde  │
           └───────┘             └───────┘  └───────┘
               │                     │          │    
     ┌─────────┼─────────┐           │          │    
     │         │         │           │          │    
┌────────┐┌────────┐┌────────┐  ┌────────┐ ┌────────┐
│ system ││ system ││ system │  │        │ │        │
│  zone  ││  zone  ││  zone  │  │instance│ │instance│
└────────┘└────────┘└────────┘  │        │ │        │
 fd00::10  fd00::11  fd00::12   └────────┘ └────────┘

There are a few notable details in this diagram:

  • The xde device is plumbed with an IP interface and has an address on the underlay; this would replace the address we are currently adding to lo0.
  • There is an expectation that communications sourced from the xde IP interface in the GZ destined to services in system zones will work. I think this is in the spirit of OPTE's virtual switch architecture, treating each zone interface as being connected to a port. The GZ address could also be on a VNIC hanging off the xde in the GZ.

In an initial implementation, the IP addresses in the zones would be atop VNICs over the xde device. This presents the somewhat awkward situation that we need link-local addresses on these VNICs as well as on the xde device. I've got plans to relax that constraint for on-host communications, but for now, I think it's something we can probably live with.

For underlay traffic, OPTE would mostly be in pass-through mode, letting traffic flow between system-zone instances and external sources or the GZ. When OPTE detects overlay traffic, it behaves as it does for customer instances, performing encap/decap onto/from the boundary services overlay.

Required Work

  • DLPI implementation for xde for interface plumbing.
  • Multi-port support for xde. Right now xde assumes there is only one port, for the virtual network interface of the instance it's attached to.

@rcgoodfellow
Contributor Author

cc @rzezeski @bnaecker @rmustacc

@rmustacc

rmustacc commented Apr 21, 2022

So, the prior intent was that we would leverage OPTE when a given service needed to communicate with, say, the outside world, but not for in-rack traffic. A zone that needed to exist on both would have two interfaces: one that had the semantics of a traditional customer-style interface and one that did not.

Can you provide more details about the use of OPTE for non-external traffic? It seems like the underlying issue driving us here is that we want the different zones in exclusive netstacks to be able to talk to each other. Given the whole virtual switch you described, why isn't that just an etherstub?

@rmustacc

I guess, just to add another general thought here, the fundamental thing is that if we start using OPTE for something like control plane bootstrap, then we get into a chicken-and-egg scenario. One of the main earlier architectural decisions (which we can revisit) is that OPTE wasn't used in the main implementation of non-external control plane services because of this.

One of the ways in which OPTE isn't the same as an etherstub / virtual switch is that (I think) we don't really do any true L2 activity. While there is a local loopback, unlike a traditional virtual switch or etherstub, it has to be told what to do. That is, by default, OPTE provides no connectivity between things until it is told exactly what it should do, and the thing telling it (in part to avoid split brain) was always designed to be the general control plane, e.g. directives issued by omicron/nexus via sled agent.

To try to put together a bit more of a picture of this, I'd imagine something that looks somewhat like this for the general-case control plane zones (e.g. things without external connectivity):

+--------------+     +------+
|  etherstub   |---->| GZ   |
+--------------+     | vnic |
 |           |       +------+
 |           |
 V           v
+--------+  +--------+
| vnic   |  | vnic   |
| zone a |  | zone b |
+--------+  +--------+

This is a bit of a hasty sketch, so maybe not very clear. If we had a zone that needed to communicate both externally and internally, e.g. say something implementing the public API:

      +-----------+     +------+
      | Etherstub |     | opte |
      +-----------+     +------+
          |                 |     Programmed by Nexus
          |                 * . . On a specific VPC/geneve ID 
          |                 |     for external connectivity
          |                 |
----------|-----------------|--------------------
 Zone     v                 v   
       +-------+       +----------+
       | vnic  |       | opte/xde |
       +-------+       +----------+
       fd00::10/64      192.168.1.2

@rcgoodfellow
Contributor Author

Thanks for the feedback @rmustacc! I did not realize etherstub could be used in this way, and I think this likely simplifies things quite a bit. I'll do a bit of tinkering with this and report back.

@rcgoodfellow
Contributor Author

Ok this works great.

In the GZ

root@violin:/opt/cargo-bay# dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
vioif0      phys      1500   up       --         --
vioif1      phys      1500   up       --         --
vioif2      phys      1500   up       --         --
etherstub0  etherstub 9000   up       --         --
underlay0   vnic      9000   up       --         etherstub0
vnic0       vnic      9000   up       --         etherstub0
root@violin:/opt/cargo-bay# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
vioif2/v4         dhcp     ok           10.47.0.40/24
lo0/v6            static   ok           ::1/128
vioif0/v6         addrconf ok           fe80::8:20ff:fe52:d73b/10
underlay0/v6      addrconf ok           fe80::8:20ff:fe03:ffb6/10
underlay0/primary static   ok           fd00::1/64
root@violin:/opt/cargo-bay# ping fd00::2
fd00::2 is alive

In the system zone

root@iz1:~# dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
vnic0       vnic      9000   up       --         ?
root@iz1:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
vnic0/v6          addrconf ok           fe80::8:20ff:fe54:7c93/10
vnic0/underlay    static   ok           fd00::2/64
root@iz1:~# ping fd00::1
fd00::1 is alive
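
For anyone wanting to reproduce this, here is a rough sketch of commands that could produce the link and address state shown above. The output above only shows the end result, so the exact invocations, the `run`, `setup_gz`, and `setup_zone` helper names, and the `DRY_RUN` switch are all assumptions for illustration. By default the script only prints what it would run.

```shell
#!/bin/sh
# Sketch only: the thread shows the resulting dladm/ipadm state, not the
# commands used to create it, so every invocation here is an assumption.

# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Global zone: an etherstub with two VNICs on it, one kept in the GZ to
# carry the underlay address and one to hand to the system zone.
setup_gz() {
    stub="$1"; gz_vnic="$2"; zone_vnic="$3"; gz_addr="$4"
    run dladm create-etherstub "$stub"
    run dladm create-vnic -l "$stub" "$gz_vnic"
    run dladm create-vnic -l "$stub" "$zone_vnic"
    run ipadm create-if "$gz_vnic"
    run ipadm create-addr -T addrconf "$gz_vnic/v6"
    run ipadm create-addr -T static -a "$gz_addr" "$gz_vnic/primary"
}

# Inside the zone, after the VNIC has been delegated to it: a link-local
# address via addrconf, then the static underlay address.
setup_zone() {
    vnic="$1"; addr="$2"
    run ipadm create-if "$vnic"
    run ipadm create-addr -T addrconf "$vnic/v6"
    run ipadm create-addr -T static -a "$addr" "$vnic/underlay"
}

setup_gz etherstub0 underlay0 vnic0 fd00::1/64
setup_zone vnic0 fd00::2/64
```

The step that hands vnic0 to the zone is not shown; presumably it is a `net` resource in the zone's configuration with an exclusive IP stack, but that is also a guess on my part.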

Comms between hosts that use the GZ VNIC addresses as source/destination also work the same way as when the primary underlay address is on lo0.

From the GZ of a host directly connected to the violin host above:

root@piano:~# dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
vioif0      phys      1500   up       --         --
vioif1      phys      1500   up       --         --
vioif2      phys      1500   up       --         --
etherstub0  etherstub 9000   up       --         --
underlay0   vnic      9000   up       --         etherstub0
root@piano:~# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
lo0/v6            static   ok           ::1/128
vioif0/v6         addrconf ok           fe80::8:20ff:fe53:d0a6/10
underlay0/v6      addrconf ok           fe80::8:20ff:fead:d1b0/10
underlay0/primary static   ok           fd00:1::1/64
root@piano:~# ping -n fd00::1
fd00::1 is alive
root@piano:~# netstat -nr -f inet6

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If
--------------------------- --------------------------- ----- --- ------- -----
::1                         ::1                         UH      2     252 lo0
fd00::/64                   fe80::8:20ff:fe52:d73b      UG      2      32
fd00:1::/64                 fd00:1::1                   U       2       0 underlay0
fe80::/10                   fe80::8:20ff:fead:d1b0      U       3       3 underlay0
fe80::/10                   fe80::8:20ff:fe53:d0a6      U       4      22 vioif0
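
The routing table above does not say how the fd00::/64 route on piano was installed; it may well have come from a routing daemon. Assuming a static route instead, something like the following sketch would give an equivalent entry. The `add_underlay_route` helper, the `DRY_RUN` switch, and the choice of vioif0 as the interface facing violin are assumptions, not something the output confirms.

```shell
#!/bin/sh
# Sketch only: assumes a static route; the thread does not show how the
# fd00::/64 entry on piano was actually created.

# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Route a remote underlay prefix via a neighbor's link-local address.
# The interface is named explicitly because a link-local gateway is only
# meaningful on a specific link.
add_underlay_route() {
    prefix="$1"; gateway="$2"; ifname="$3"
    run route add -inet6 "$prefix" "$gateway" -ifp "$ifname"
}

# violin's underlay prefix via violin's vioif0 link-local address.
add_underlay_route fd00::/64 fe80::8:20ff:fe52:d73b vioif0
```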

@bnaecker
Contributor

@smklein is tackling this as part of Omicron #1066, and the work is tracked under Omicron #987. We can probably close this, but I defer to you @rcgoodfellow.

@rcgoodfellow
Contributor Author

SGTM
