Add on-demand communication probes. #4585
There are still a few |
In that case, my suggestion would be to keep it out of the documented API / CLI / etc. What do you think? |
I've moved the probes API endpoints to an "experimental" API in 4f58a4a |
(Just a comment on experimental API endpoints.)
Thank you for taking this on! I've been wanting something like this for a while, and I really like this solution (my own half-baked solution was to use sets of tags, but this is easier to read I think.)
Thanks Ry, from what I've looked over this is a very nice addition. I haven't yet had the chance to stand this up and test it locally, but I'm hoping to do so shortly. I've left some questions throughout.
On 'what' we can test with this, I guess this gives us really good mileage for any traffic carried by a VPC (Instance<->Probe and Probe<->Rack-external). If we wanted to ask more targeted questions about the underlay itself, would we create probes and `zlogin`?
Thanks for handling the nits/papercuts etc., looks good!
Unfortunately, I need to disable the The primary issue appears to be SMF service startup failures within zones. Sometimes it's CRDB, sometimes it's more basic things like ndp. These service start failures seem to correspond to periods of heavy I/O. I did try running The CI tests and machinery remain in place within this PR, but the buildomat |
This PR adds the following.
Probes
A new first-class Omicron element called a probe is introduced. A probe is similar to an instance but is underpinned by a zone instead of an HVM. They are managed much like instances in terms of lifecycle. They have network interfaces on a VPC that are externally reachable via ephemeral IP addresses. They also have network interfaces on the underlay network. The primary function of probes in this initial PR is for network testing. They come with daemons like `thundermuffin` pre-installed as SMF services to facilitate communications testing both through boundary services and within the rack. Probes may become more general over time to facilitate more kinds of testing beyond networking.

The idea for probes came from the desire to test the Oxide stack in an environment where the hardware is constructed by interconnected virtual machines. Nested virtualization is undesirable for a number of reasons, so probes plus virtual rack topologies give us the ability to test a significant surface area of the control plane and the underlying systems it manages, such as networking and storage. While the motivation for probes was to test in virtual environments, they're just as applicable to hardware environments. Because they're a first-class Omicron element, the same tests written using probes in virtual environments can run on real racks.
Probes currently require fleet admin privileges. This is enforced in the API handlers.
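To illustrate what "enforced in the API handlers" can look like, here is a minimal sketch of a handler that checks for fleet admin before doing any work. Every name in it (`FleetRole`, `Caller`, `ApiError`, `require_fleet_admin`, `probe_create`) is hypothetical and chosen for illustration only. It is not the code in this PR, and Omicron's actual authorization machinery is more involved.

```rust
// Hypothetical sketch: none of these types or functions come from Omicron.

/// Fleet-level role attached to the caller (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FleetRole {
    Admin,
    Collaborator,
    Viewer,
}

/// Error returned by the sketch handlers.
#[derive(Debug, PartialEq, Eq)]
enum ApiError {
    Forbidden,
}

/// The authenticated caller; `None` means no fleet-level role at all.
struct Caller {
    fleet_role: Option<FleetRole>,
}

/// Guard used at the top of every probe handler.
fn require_fleet_admin(caller: &Caller) -> Result<(), ApiError> {
    match caller.fleet_role {
        Some(FleetRole::Admin) => Ok(()),
        _ => Err(ApiError::Forbidden),
    }
}

/// Stand-in for a probe-creation handler: the privilege check runs
/// before anything else, so unprivileged callers never reach the
/// probe logic.
fn probe_create(caller: &Caller, name: &str) -> Result<String, ApiError> {
    require_fleet_admin(caller)?;
    Ok(format!("probe {name} created"))
}
```

Putting the check in the handler itself keeps the restriction easy to relax later, one endpoint at a time, if probes become available to less privileged users.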
A 4-gimlet 2-sidecar CI test
A primary goal of building the probes mechanism is to have automated multi-switch multi-sled tests that run in CI. This PR also takes the first step along that path. There are two new CI jobs added. The `a4x2-prepare` job builds and packages omicron for each of the 4 gimlets in the topology. This includes RSS configuration, individual sled configuration, and faux-mgs configuration for each scrimlet. The omicron packages and their configuration bits are then tarred up and provided as artifacts for the `a4x2-deploy` job.

The `a4x2-deploy` job extracts the artifacts from `a4x2-prepare` into a set of folders with omicron configuration and deployment archives for each gimlet in the topology. This is in a folder called `cargo-bay`. The cargo bay contains a top-level folder for each virtual gimlet. The a4x2 falcon topology knows to look for these folders and mounts each into the corresponding gimlet via P9fs mounts. In the cargo bay for each gimlet, there is also an initialization script that installs omicron in the VM using `omicron-install`. This initialization script is run automatically by falcon (this is in the code for the topology itself, not something that is hard coded). Once the topology has been launched, the control plane is on its way toward coming up.

We test that the virtual rack is functional through probes. There is a new testing program in this PR called `commtest` in the `end-to-end-tests` directory. This program takes the address at which to reach the Oxide API as a parameter and waits for the `/ping` endpoint to become responsive. Once that happens, a probe is launched on each sled in the topology and a basic connectivity test is run against each probe using ICMP. The test checks for packet loss within a configurable threshold for each probe. If the test passes, then the CI job passes.

In addition to the virtual rack, the test topology also contains a pair of routers running the Arista network stack that the sidecars are connected to. The connection between the rack and these routers is facilitated by BGP. There is also a third router called the customer edge (`ce`) that connects to both Arista routers. This router is running Linux+FRR. It advertises a default route to both Arista routers that propagates via BGP to the Oxide rack routers. Similarly, the IP pool address block the Oxide rack advertises propagates to the customer edge router via the Arista routers through BGP. This customer edge VM is also running an `iptables` configuration to decouple the testing network from the host network on the lab machine the entire topology is running on. All that needs to be done on the lab machine to connect to the rack is create a route to the IP pool block the rack is using that uses the IP address of the external interface on the customer edge machine as a nexthop. This is done in the `a4x2-deploy` job.

Depends on