-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RSS <-> Nexus handoff: Set an initial (disabled) target blueprint #5244
Conversation
BlueprintTarget { | ||
target_id: blueprint.id, | ||
enabled: false, | ||
time_made_target: Utc::now(), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We wouldn't really expect anything to be able to concurrently set a different target while we're doing this operation, would we? As in, other than this pathway, is "setting the target" still an operator/developer-driven operation?
Why I ask: I'm just curious if some other background task in Nexus could attempt to concurrently set a blueprint and cause this blueprint_target_set
call to fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle a background task could sneak in, set a blueprint target, and cause this to fail. In practice the only thing that could do that is a support operator:
- A blueprint can only be made the initial target if its parent blueprint is
None
- The only ways to get a blueprint with a parent of
None
are to generate one from an inventory collection (which is expected to only be a support-driven operator, forever, as once we do it to deployed systems and merge this change it will never be required again) or to construct one by hand, which is what RSS is doing on this PR.
@@ -46,6 +46,7 @@ macaddr.workspace = true | |||
mg-admin-client.workspace = true | |||
nexus-client.workspace = true | |||
nexus-config.workspace = true | |||
nexus-types.workspace = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like potentially a big deal...but seems fine! (I say that because nexus_types
I think is mostly implementation details of Nexus and it seems a little sketchy to leak that outside? But it doesn't seem worth creating a separate package for this right now.)
Come to think of it, it seems like this shouldn't be necessary because RSS would be using a progenitor-generated type. Is this necessary because our internal Nexus client uses replace
to use this type instead of generating one? (And if I recall correctly, that's because we have functionality like diff
that we want to use in omdb. Ugh.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's the replace
directive on Blueprint
specifically that causes this. That is kind of a dicey thing to be replace
ing, especially as we're actively changing it. :-/
@@ -450,7 +489,16 @@ pub async fn run_standalone_server( | |||
None => vec![], | |||
}; | |||
|
|||
let services = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm understanding right, we're still constructing Service
s for these zones and still inserting those in the db. Is that right? (I think that's the right call until we finish #4947.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. I think there are three steps:
- Ensure new systems start off with a blueprint set (this PR)
- Ensure existing systems get a blueprint set (support operation)
- Remove the
service
table and all the fallout from it (including droppingService
s from the RSS handoff) will be a followup PR that can be in flight or even landed concurrently with 2, as long as we do 2 before deploying 3
I ran through RSS on madrid and everything came out as expected. We had a disabled blueprint:
Running
I then generated a blueprint from a collection, and the diff showed no changes between the blueprint from RSS and the blueprint from the collection. |
sled-agent/src/rack_setup/service.rs
Outdated
// | ||
// TODO-john How should we do this? Baking this knowledge of how Nexus | ||
// sets up external DNS here seems bad. | ||
external_dns_version: Generation::new().next(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davepacheco Do you have thoughts on this change I made getting caught up with main
? A few obvious options:
- This is a little gross but not the end of the world
- We export an "external DNS generation after rack setup" from
nexus-db-queries
and use that here - We set this to
Generation::new()
and let the planner sort out that there's a new external DNS later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about:
- we set it to
Generation::new()
here - during rack initialization, Nexus sets it to whatever it wants after receiving this blueprint but before writing it to the database
The explanation would be: as far as RSS is concerned, it's generation 1 -- the oldest possible generation should be a safe bet and it doesn't know any better. When Nexus receives it, it's in a position to say: "okay, RSS handed me this initial blueprint, but it didn't know I was going to write an initial external DNS generation, so I'm going to adjust that here based on the actions I'm taking [which are: creating generation 2 of external DNS]". What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that a lot - thanks! Done in 0b28298
I still need to test this on madrid and I do _not_ want to merge it before we cut the next release, but I believe this is ready for review. Related changes / fallout also included in this PR: * `omdb db services ...` subcommands are all gone. I believe this functionality is covered by `omdb`'s inspection of blueprints instead. * I removed the `SledResourceKind::{Dataset,Service,Reserved}` variants that were unused. This isn't required, strictly speaking, but `SledResourceKind::Service` was intended to reference the `service` table, so I thought it was clearer to drop these for now (we can add them back when we get to the point of using them). There are two major pieces of functionality that now _require_ systems to have a current target blueprint set: * `DataStore::nexus_external_addresses()` now takes an explicit `Blueprint` instead of an `Option<Blueprint>`. Its callers (silo creation and reconfigurator DNS updates) fail if they cannot read the current target blueprint. * `DataStore::vpc_resolve_to_sleds()` now _implicitly_ requires a target blueprint to be set, if and only if the VPC being queried involves control plane services. (In practice today, that means the VPC ID is exactly `SERVICES_VPC_ID`, although in the future we could have multiple service-related VPCs.) I didn't want to make this method take an explicit blueprint, because I think its most frequent callers are specifying instance VPCs, which do not need to access the blueprint. These two together mean that deploying this change to a system without a target blueprint will result in (a) the inability to create silos or update external DNS via reconfigurator and (b) services (including Nexus) will not get the OPTE firewall rules they need to allow incoming traffic. All newly-deployed systems have a (disabled) blueprint set as of #5244, but we'll need to perform the necessary support operation to bring already-deployed systems in line. Closes #4947.
I would like to test that after an RSS handoff, the initial blueprint we created in RSS and set in the Nexus handoff matches the blueprint we would generate from the first inventory collections. I can do that by hand on
madrid
, but I'm not sure there's a good way to do that in an automated test. Is this something I could lean on a4x2 to check?This does not modify the
services
table or theservices
field of the RSS handoff. Removing those will come in a followup PR that will need an accompanying note for deployed systems (i.e., instructions to set a blueprint so the current users ofservices
will continue to function by checking the current target blueprint).Closes #5222.