RSS <-> Nexus handoff: Set an initial (disabled) target blueprint #5244

Merged
jgallagher merged 13 commits into main from john/rss-handoff-sets-initial-blueprint
Mar 15, 2024

@jgallagher jgallagher commented Mar 12, 2024

I would like to test that after an RSS handoff, the initial blueprint we created in RSS and set in the Nexus handoff matches the blueprint we would generate from the first inventory collections. I can do that by hand on madrid, but I'm not sure there's a good way to do that in an automated test. Is this something I could lean on a4x2 to check?

This does not modify the services table or the services field of the RSS handoff. Removing those will come in a followup PR that will need an accompanying note for deployed systems (i.e., instructions to set a blueprint so the current users of services will continue to function by checking the current target blueprint).

Closes #5222.

nexus/db-queries/src/db/datastore/rack.rs
Comment on lines +666 to +670
    BlueprintTarget {
        target_id: blueprint.id,
        enabled: false,
        time_made_target: Utc::now(),
    },
Collaborator

We wouldn't really expect anything to be able to concurrently set a different target while we're doing this operation, would we? As in, other than this pathway, is "setting the target" still an operator/developer-driven operation?

Why I ask: I'm just curious whether some other background task in Nexus could attempt to concurrently set a blueprint and cause this blueprint_target_set call to fail.

Contributor Author

In principle a background task could sneak in, set a blueprint target, and cause this to fail. In practice the only thing that could do that is a support operator:

  1. A blueprint can only be made the initial target if its parent blueprint is None
  2. The only ways to get a blueprint with a parent of None are to generate one from an inventory collection (which is expected to be a support-driven operation only, forever: once we do it to deployed systems and merge this change, it will never be required again) or to construct one by hand, which is what RSS is doing in this PR.
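
For readers following along, the rule in point 1 can be sketched roughly as below. This is a minimal illustration, not the actual datastore code: `Blueprint`, `BlueprintTarget`, `SetTargetError`, and `check_new_target` are simplified, hypothetical stand-ins, and the second rule (a later target's parent must be the current target) is an assumption added only to show why a concurrently set target would make the RSS call fail.

```rust
// Hypothetical, simplified stand-ins for the real Nexus types.
#[derive(Debug, Clone, PartialEq)]
pub struct Blueprint {
    pub id: u32,
    pub parent_blueprint_id: Option<u32>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct BlueprintTarget {
    pub target_id: u32,
    pub enabled: bool,
}

#[derive(Debug, PartialEq)]
pub enum SetTargetError {
    /// No target exists yet, but the candidate has a parent.
    InitialTargetHasParent,
    /// A target exists, and the candidate's parent is not that target
    /// (assumed rule, not stated in this thread).
    ParentIsNotCurrentTarget,
}

/// Decide whether `candidate` may become the target given the current one.
pub fn check_new_target(
    current: Option<&BlueprintTarget>,
    candidate: &Blueprint,
) -> Result<(), SetTargetError> {
    match (current, candidate.parent_blueprint_id) {
        // First target ever: only a parentless blueprint qualifies.
        (None, None) => Ok(()),
        (None, Some(_)) => Err(SetTargetError::InitialTargetHasParent),
        // Later targets must descend from the current target.
        (Some(cur), Some(parent)) if parent == cur.target_id => Ok(()),
        (Some(_), _) => Err(SetTargetError::ParentIsNotCurrentTarget),
    }
}
```

Under these assumptions, the parentless blueprint RSS hands off is accepted when no target exists, and rejected if an operator has already set some other target first.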

@jgallagher jgallagher self-assigned this Mar 12, 2024
@@ -46,6 +46,7 @@ macaddr.workspace = true
mg-admin-client.workspace = true
nexus-client.workspace = true
nexus-config.workspace = true
nexus-types.workspace = true
Collaborator

This feels like potentially a big deal...but seems fine! (I say that because nexus_types I think is mostly implementation details of Nexus and it seems a little sketchy to leak that outside? But it doesn't seem worth creating a separate package for this right now.)

Come to think of it, it seems like this shouldn't be necessary because RSS would be using a progenitor-generated type. Is this necessary because our internal Nexus client uses replace to use this type instead of generating one? (And if I recall correctly, that's because we have functionality like diff that we want to use in omdb. Ugh.)

Contributor Author

Yes, it's the replace directive on Blueprint specifically that causes this. That is kind of a dicey thing to be replacing, especially as we're actively changing it. :-/

@@ -450,7 +489,16 @@ pub async fn run_standalone_server(
None => vec![],
};

let services =
Collaborator

If I'm understanding right, we're still constructing Services for these zones and still inserting those in the db. Is that right? (I think that's the right call until we finish #4947.)

Contributor Author

@jgallagher jgallagher Mar 13, 2024

That's right. I think there are three steps:

  1. Ensure new systems start off with a blueprint set (this PR)
  2. Ensure existing systems get a blueprint set (support operation)
  3. Remove the service table and all the fallout from it (including dropping Services from the RSS handoff). This will be a followup PR that can be in flight, or even landed, concurrently with 2, as long as we do 2 before deploying 3.

@jgallagher
Contributor Author

I ran through RSS on madrid and everything came out as expected. We had a disabled blueprint:

root@oxz_switch1:~# omdb nexus blueprints list
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:102::4]:12221
T ENA ID                                   PARENT TIME_CREATED
* no  30fc006d-5db7-4c9e-85d3-954100ad7c8e <none> 2024-03-14T16:20:51.377Z

Running blueprints show on it confirmed it came from RSS:

time_created: 2024-03-14T16:20:51.377863Z, creator: "RSS", comment: "initial blueprint from rack setup"

I then generated a blueprint from a collection, and the diff showed no changes between the blueprint from RSS and the blueprint from the collection.

//
// TODO-john How should we do this? Baking this knowledge of how Nexus
// sets up external DNS here seems bad.
external_dns_version: Generation::new().next(),
Contributor Author

@davepacheco Do you have thoughts on this change I made getting caught up with main? A few obvious options:

  • This is a little gross but not the end of the world
  • We export an "external DNS generation after rack setup" from nexus-db-queries and use that here
  • We set this to Generation::new() and let the planner sort out that there's a new external DNS later

Collaborator

What about:

  • we set it to Generation::new() here
  • during rack initialization, Nexus sets it to whatever it wants after receiving this blueprint but before writing it to the database

The explanation would be: as far as RSS is concerned, it's generation 1 -- the oldest possible generation should be a safe bet and it doesn't know any better. When Nexus receives it, it's in a position to say: "okay, RSS handed me this initial blueprint, but it didn't know I was going to write an initial external DNS generation, so I'm going to adjust that here based on the actions I'm taking [which are: creating generation 2 of external DNS]". What do you think?
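
A rough sketch of that split of responsibilities, for illustration only. The `Generation` and `Blueprint` types here are hypothetical, simplified stand-ins (the real types carry much more), and `rss_initial_blueprint` and `nexus_adjust_for_handoff` are made-up names; the only facts taken from the discussion are that `Generation::new()` is generation 1 and that Nexus creates generation 2 of external DNS during rack initialization.

```rust
/// Hypothetical stand-in for the real generation-number type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Generation(u64);

impl Generation {
    /// The oldest possible generation (generation 1).
    pub fn new() -> Self {
        Generation(1)
    }

    /// The generation immediately after this one.
    pub fn next(self) -> Self {
        Generation(self.0 + 1)
    }
}

/// Hypothetical, heavily trimmed blueprint.
#[derive(Debug, Clone, PartialEq)]
pub struct Blueprint {
    pub creator: String,
    pub external_dns_version: Generation,
}

/// RSS side: it does not know what Nexus will do to external DNS,
/// so it records the oldest possible generation.
pub fn rss_initial_blueprint() -> Blueprint {
    Blueprint {
        creator: "RSS".to_string(),
        external_dns_version: Generation::new(),
    }
}

/// Nexus side: before writing the blueprint to the database, bump the
/// external DNS generation to account for the generation Nexus itself
/// is about to create during rack initialization.
pub fn nexus_adjust_for_handoff(mut blueprint: Blueprint) -> Blueprint {
    blueprint.external_dns_version = blueprint.external_dns_version.next();
    blueprint
}
```

The point of the design is that knowledge of how Nexus sets up external DNS stays inside Nexus; RSS only ever emits `Generation::new()`.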

Contributor Author

I like that a lot - thanks! Done in 0b28298

@jgallagher jgallagher merged commit cc59eff into main Mar 15, 2024
22 checks passed
@jgallagher jgallagher deleted the john/rss-handoff-sets-initial-blueprint branch March 15, 2024 14:23
jgallagher added a commit that referenced this pull request Apr 16, 2024
I still need to test this on madrid and I do _not_ want to merge it
before we cut the next release, but I believe this is ready for review.

Related changes / fallout also included in this PR:

* `omdb db services ...` subcommands are all gone. I believe this
functionality is covered by `omdb`'s inspection of blueprints instead.
* I removed the `SledResourceKind::{Dataset,Service,Reserved}` variants
that were unused. This isn't required, strictly speaking, but
`SledResourceKind::Service` was intended to reference the `service`
table, so I thought it was clearer to drop these for now (we can add
them back when we get to the point of using them).

There are two major pieces of functionality that now _require_ systems
to have a current target blueprint set:

* `DataStore::nexus_external_addresses()` now takes an explicit
`Blueprint` instead of an `Option<Blueprint>`. Its callers (silo
creation and reconfigurator DNS updates) fail if they cannot read the
current target blueprint.
* `DataStore::vpc_resolve_to_sleds()` now _implicitly_ requires a target
blueprint to be set, if and only if the VPC being queried involves
control plane services. (In practice today, that means the VPC ID is
exactly `SERVICES_VPC_ID`, although in the future we could have multiple
service-related VPCs.) I didn't want to make this method take an
explicit blueprint, because I think its most frequent callers are
specifying instance VPCs, which do not need to access the blueprint.

These two together mean that deploying this change to a system without a
target blueprint will result in (a) the inability to create silos or
update external DNS via reconfigurator and (b) services (including
Nexus) will not get the OPTE firewall rules they need to allow incoming
traffic. All newly-deployed systems have a (disabled) blueprint set as
of #5244, but we'll need to perform the necessary support operation to
bring already-deployed systems in line.
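
To illustrate the first bullet above, here is a minimal sketch of moving from `Option<Blueprint>` to a required `Blueprint`. The types and the `silo_create_addresses` caller are hypothetical simplifications, and this `nexus_external_addresses` is a free function standing in for the real `DataStore` method, whose actual signature is not shown in this PR.

```rust
use std::net::IpAddr;

/// Hypothetical, heavily trimmed blueprint: just the Nexus external IPs.
#[derive(Debug, Clone)]
pub struct Blueprint {
    pub nexus_external_ips: Vec<IpAddr>,
}

#[derive(Debug, PartialEq)]
pub enum Error {
    /// The system has no current target blueprint.
    NoTargetBlueprint,
}

/// Stand-in for the changed method: the blueprint is now required,
/// so there is no fallback path inside this function.
pub fn nexus_external_addresses(blueprint: &Blueprint) -> Vec<IpAddr> {
    blueprint.nexus_external_ips.clone()
}

/// Hypothetical caller (think silo creation): it must load the current
/// target blueprint first and fails outright if there is none.
pub fn silo_create_addresses(
    current_target: Option<&Blueprint>,
) -> Result<Vec<IpAddr>, Error> {
    let blueprint = current_target.ok_or(Error::NoTargetBlueprint)?;
    Ok(nexus_external_addresses(blueprint))
}
```

In this sketch the failure surfaces in the caller rather than being papered over inside the lookup, which is the behavior described for systems deployed without a target blueprint.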

Closes #4947.
Successfully merging this pull request may close these issues.

every new system should have a target blueprint