[rss][nexus][sled-agent] Responsibility for deploying services should (where possible) migrate into Nexus #732

Open
7 of 15 tasks
Tracked by #824
smklein opened this issue Mar 7, 2022 · 0 comments

smklein commented Mar 7, 2022

At the time of writing, #686 introduces the first usage of the Rack Setup Service ("RSS"), which makes requests to the sled agent to create datasets and services.

Aside from the datasets necessary for initializing Nexus (that is, Nexus itself and CRDB), these service requests should be handled by Nexus "as much as possible" instead of the RSS.

Many operations can trigger a need to request these services or partitions:

  • Initializing a new sled
  • Adjusting service location to maintain availability while turning down an old sled
  • Adjusting capacity to serve request load
  • Etc

All these conditions are ongoing, and best handled by Nexus, which maintains a "global" view of the rack and exists beyond initialization.

Fortunately, the APIs defined by the Sled Agent should (more-or-less) remain the same - this issue just addresses the matter of "who calls them".
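
As a rough illustration of why a "global" view helps, here is a minimal sketch of the "turning down an old sled" case from the list above. Everything in it is hypothetical: `ServiceKind`, `Placement`, and `drain_sled` are invented for this example and are not Omicron types, and the "fewest services wins" policy is only an assumed placement rule.

```rust
use std::collections::BTreeMap;

/// Hypothetical service kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceKind {
    InternalDns,
    Oximeter,
}

pub type SledId = u32;

/// The rack-wide view: which services run on which sled.
pub type Placement = BTreeMap<SledId, Vec<ServiceKind>>;

/// Remove `retiring` from the placement and move its services to the
/// remaining sleds. Each service goes to the sled that currently runs the
/// fewest services (ties go to the lowest sled id). Returns the moves made.
/// If no sleds remain, the displaced services are dropped and no moves are
/// returned; a real system would treat that as an error.
pub fn drain_sled(
    placement: &mut Placement,
    retiring: SledId,
) -> Vec<(ServiceKind, SledId)> {
    let displaced = placement.remove(&retiring).unwrap_or_default();
    let mut moves = Vec::new();
    for service in displaced {
        let target = placement
            .iter()
            .min_by_key(|(id, services)| (services.len(), **id))
            .map(|(id, _)| *id);
        if let Some(target) = target {
            if let Some(services) = placement.get_mut(&target) {
                services.push(service);
                moves.push((service, target));
            }
        }
    }
    moves
}
```

Each returned `(service, sled)` pair would correspond to one request against the unchanged Sled Agent API; only the caller differs.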

@smklein smklein added Sled Agent Related to the Per-Sled Configuration and Management nexus Related to nexus labels Mar 7, 2022
@smklein smklein self-assigned this Mar 7, 2022
smklein added a commit that referenced this issue Dec 2, 2022
## Overview

- Implements https://rfd.shared.oxide.computer/rfd/0278 
- This PR moves much of the service configuration from the hard-coded
`config-rss.toml` file to RSS itself.
- In the future (See: #732) many of these services will be initialized
by Nexus. Decoupling their provisioning from the hard-coded versions is
the first step in this process.

### What Changed in the Sled Agent

- A new `get_zpools` endpoint is exposed from the Sled Agent. This is
invoked by RSS when figuring out where to provision datasets.
- The UUID for the sled agent is removed from the config file (it's
dynamic, and should not be shared among sleds)

### What Changed in RSS

- `HardcodedSledRequest` (and the corresponding entries in
`config-rss.toml`) has been removed
- A `plan` module was added, where plans for sled generation ("What
sleds should get what addresses?") and service generation ("What
services should run where?") are generated.
- Service and dataset initialization was refactored to insert entries
into DNS
- RSS now invokes `handoff_to_nexus`, informing Nexus of all
previously-owned-by-RSS services.
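
The two planning questions above can be sketched roughly as follows. This is not the actual `plan` module: the function names, the assumption that the rack owns a /56 with one /64 per sled, the `::1` host address, and the round-robin service placement are all illustrative choices made for this example.

```rust
use std::collections::BTreeMap;
use std::net::Ipv6Addr;

/// Sled plan: "What sleds should get what addresses?"
/// Assumption: the rack owns a /56 and sled N gets the Nth /64 beneath it.
pub fn plan_sled_addresses(rack_prefix: Ipv6Addr, sled_count: u8) -> Vec<Ipv6Addr> {
    let seg = rack_prefix.segments();
    (0..sled_count)
        .map(|i| {
            // The low 8 bits of the fourth segment select the /64 within
            // the /56. Subnet 0 is skipped (an arbitrary choice here).
            let subnet = (seg[3] & 0xff00) | (u16::from(i) + 1);
            Ipv6Addr::new(seg[0], seg[1], seg[2], subnet, 0, 0, 0, 1)
        })
        .collect()
}

/// Service plan: "What services should run where?"
/// Assumption: services are spread round-robin across the planned sleds.
pub fn plan_services(
    sleds: &[Ipv6Addr],
    services: &[&str],
) -> BTreeMap<Ipv6Addr, Vec<String>> {
    let mut plan: BTreeMap<Ipv6Addr, Vec<String>> = BTreeMap::new();
    if sleds.is_empty() {
        return plan;
    }
    for (i, service) in services.iter().enumerate() {
        plan.entry(sleds[i % sleds.len()])
            .or_default()
            .push(service.to_string());
    }
    plan
}
```

Because both plans are plain data, they can be generated once, then replayed against each sled agent and finally reported to Nexus during the handoff.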

### What Changed in Nexus

- Expand `RackInitializationRequest` to consider both services and
datasets
- `dataset_put` API removed -- beyond the initialization request, Nexus
should be responsible for provisioning new datasets, not the sled agent.

Fixes #1148
Part of #732
Part of #824
@smklein smklein added this to the MVP milestone Feb 10, 2023
smklein added a commit that referenced this issue Feb 21, 2023
#2358)

# Summary

My long-term goal is to have Nexus be in charge of provisioning all
services.

For that to be possible, Nexus must be able to internalize all input
during the handoff from RSS. This PR extends the RSS -> Nexus handoff to
include:

- What "Nexus Services" are being launched?
- What are the ranges of IP addresses that may be used for internal
services?
- What external IP addresses, from that pool, are currently in-use for
Nexus services?
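
One way to picture the extended handoff is as a single payload carrying those three pieces of information. The sketch below is hypothetical: `HandoffSketch`, `NexusServiceSketch`, and `IpRange` are names invented here (the real request type may look quite different), and IPv4 is used only to keep the example short.

```rust
use std::net::Ipv4Addr;

/// An inclusive range of addresses usable by internal services.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IpRange {
    pub first: Ipv4Addr,
    pub last: Ipv4Addr,
}

impl IpRange {
    pub fn contains(&self, ip: Ipv4Addr) -> bool {
        self.first <= ip && ip <= self.last
    }
}

/// A Nexus service launched by RSS, with the external IP it already uses.
#[derive(Debug, Clone)]
pub struct NexusServiceSketch {
    pub name: String,
    pub external_ip: Ipv4Addr,
}

/// Hypothetical shape of the RSS -> Nexus handoff described above.
#[derive(Debug, Clone)]
pub struct HandoffSketch {
    pub nexus_services: Vec<NexusServiceSketch>,
    pub internal_service_ranges: Vec<IpRange>,
}

impl HandoffSketch {
    /// Names of services whose in-use external IP lies outside every
    /// advertised range. A consistent handoff returns an empty list.
    pub fn out_of_pool(&self) -> Vec<&str> {
        self.nexus_services
            .iter()
            .filter(|svc| {
                !self
                    .internal_service_ranges
                    .iter()
                    .any(|range| range.contains(svc.external_ip))
            })
            .map(|svc| svc.name.as_str())
            .collect()
    }
}
```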

# Nexus Changes

## Database Records
 
- Adds a `nexus_service` record, which just includes the information
about the in-use external IP address.

## IP Address Allocation

- Adds an `explicit_ip` option, which lets callers perform an allocation
with an explicit request for a single IP address. You might ask the
question: "Why not just directly create a record with the IP address in
question, if you want to create it?" We could! But we'd need to recreate
all the logic which validates that the IP address exists within the
known-to-the-DB IP ranges within the pool.
- The ability for an operator to "request Nexus execute with a specific
IP address" is a feature we want anyway, so this isn't wasted work.
- The implementation and tests for this behavior are mostly within
`nexus/src/db/queries/external_ip.rs`
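
To make the argument for `explicit_ip` concrete, here is a minimal in-memory sketch of an allocator that shares one validation path between "give me any address" and "give me this address". It is not the code in `external_ip.rs`, which is a database query; `Pool`, `AllocError`, and `allocate` are hypothetical names, and IPv4 is assumed for brevity.

```rust
use std::collections::BTreeSet;
use std::net::Ipv4Addr;

#[derive(Debug, PartialEq, Eq)]
pub enum AllocError {
    /// The explicit address is not inside any range of the pool.
    OutsidePool,
    /// The explicit address is already allocated.
    InUse,
    /// No free address remains in the pool.
    Exhausted,
}

pub struct Pool {
    /// Inclusive (first, last) ranges known to the pool.
    ranges: Vec<(Ipv4Addr, Ipv4Addr)>,
    in_use: BTreeSet<Ipv4Addr>,
}

impl Pool {
    pub fn new(ranges: Vec<(Ipv4Addr, Ipv4Addr)>) -> Self {
        Pool { ranges, in_use: BTreeSet::new() }
    }

    fn in_pool(&self, ip: Ipv4Addr) -> bool {
        self.ranges.iter().any(|&(first, last)| first <= ip && ip <= last)
    }

    fn first_free(&self) -> Option<Ipv4Addr> {
        self.ranges
            .iter()
            .flat_map(|&(first, last)| u32::from(first)..=u32::from(last))
            .map(Ipv4Addr::from)
            .find(|ip| !self.in_use.contains(ip))
    }

    /// Allocate the explicit address if one is given, otherwise the first
    /// free address. Either way the result is checked against the ranges.
    pub fn allocate(
        &mut self,
        explicit_ip: Option<Ipv4Addr>,
    ) -> Result<Ipv4Addr, AllocError> {
        let ip = match explicit_ip {
            Some(ip) => {
                if !self.in_pool(ip) {
                    return Err(AllocError::OutsidePool);
                }
                if self.in_use.contains(&ip) {
                    return Err(AllocError::InUse);
                }
                ip
            }
            None => self.first_free().ok_or(AllocError::Exhausted)?,
        };
        self.in_use.insert(ip);
        Ok(ip)
    }
}
```

Inserting a record directly would bypass the `in_pool` and `in_use` checks, which is the duplication the bullet above argues against.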

## Rack Initialization

- Populates IP pools and Service records as a part of the RSS handoff.
- Implementation and tests exist within
`nexus/src/db/datastore/rack.rs`.

## Populate

- Move the body of some of the "populate" functions into their correct
spot in the datastore, which makes it easier to...
- ... call all the populate functions -- rather than just a chunk of
them -- from `omicron_nexus::db::datastore::datastore_test`.
- As a consequence, update some tests which assumed the rack would be
"half-populated" -- it's either fully populated, or not populated at
all.

# Sled Agent changes

- Explicitly pass the "IP pool ranges for internal services" up to
Nexus.
- In the future, it'll be possible to pass a larger range of addresses
than just those used by running Nexus services.

Fixes: #1958
Unblocks: #732