Blueprint structure allows a variety of "illegal" combinations #7078

jgallagher · 2024-11-15T18:49:36Z

Blueprint currently has four different maps, all keyed by sled ID:

Lines 144 to 163 in 7cf372d

    /// A map of sled id -> desired state of the sled.
    ///
    /// A sled is considered part of the control plane cluster iff it has an
    /// entry in this map.
    pub sled_state: BTreeMap<SledUuid, SledState>,

    /// A map of sled id -> zones deployed on each sled, along with the
    /// [`BlueprintZoneDisposition`] for each zone.
    ///
    /// Unlike `sled_state`, this map may contain entries for sleds that are no
    /// longer a part of the control plane cluster (e.g., sleds that have been
    /// decommissioned, but still have expunged zones where cleanup has not yet
    /// completed).
    pub blueprint_zones: BTreeMap<SledUuid, BlueprintZonesConfig>,

    /// A map of sled id -> disks in use on each sled.
    pub blueprint_disks: BTreeMap<SledUuid, BlueprintPhysicalDisksConfig>,

    /// A map of sled id -> datasets in use on each sled
    pub blueprint_datasets: BTreeMap<SledUuid, BlueprintDatasetsConfig>,

In general we would expect all of those maps to have the same keys, but in practice that isn't true. A couple of examples:

- `sled_state` drops decommissioned sleds, but they may still exist in the other maps (where they should contain only expunged values). We have to work around this when diffing blueprints.
- Clients that use `BlueprintBuilder` directly rather than the planner, and that fail to call `sled_ensure_disks` / `sled_ensure_datasets` at appropriate times, will emit blueprints containing zones that reference datasets / disks that one would reasonably expect to exist in their respective maps, but those maps may be out of date or empty. Empty maps happen to work today because we allow them for backwards compatibility reasons, but doing so is a little messy, and that allowance will eventually be removed once all production systems can be expected to have populated maps. That will break any of these clients (tests and reconfigurator-cli, today).
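To make the first kind of mismatch concrete, here is a minimal sketch of a check that finds sleds present in one of the other maps but absent from `sled_state`. The `SledId` alias and the function name `sleds_missing_from_state` are assumptions made for this sketch, not omicron APIs; the real maps are keyed by `SledUuid`.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Stand-in for `SledUuid`; an assumption for this sketch.
type SledId = u32;

/// Returns the sled IDs that have an entry in `other` but no entry in
/// `sled_state`. With four independent maps, nothing prevents this set
/// from being non-empty.
fn sleds_missing_from_state<S, V>(
    sled_state: &BTreeMap<SledId, S>,
    other: &BTreeMap<SledId, V>,
) -> BTreeSet<SledId> {
    other
        .keys()
        .filter(|id| !sled_state.contains_key(*id))
        .copied()
        .collect()
}
```

A decommissioned sled that still has expunged zones would show up in the returned set when `other` is the zones map.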

I think (?) we should probably only have one map here, keyed by sled ID, with values that encompass all the blueprint details for a single sled (today: state + zones + disks + datasets, growing in the future to support update versioning as needed). This may require some fundamental rework of BlueprintBuilder and possibly the planner, as today they manage these maps mostly independently (which is problematic!).
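One possible shape for the single-map idea is sketched below. This is only an illustration under stated assumptions: `SledId`, `SledConfig`, `CombinedBlueprint`, and the `String` placeholders for zones, disks, and datasets are all hypothetical names chosen here, and the simplified `SledState` enum is a stand-in rather than the real omicron type.

```rust
use std::collections::BTreeMap;

// Stand-in for `SledUuid`; an assumption for this sketch.
type SledId = u32;

// Simplified stand-in for the real sled state type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SledState {
    Active,
    Decommissioned,
}

/// Hypothetical per-sled value: everything the blueprint records for one
/// sled lives together, so the pieces cannot disagree about which sleds exist.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SledConfig {
    state: SledState,
    zones: Vec<String>,
    disks: Vec<String>,
    datasets: Vec<String>,
}

/// Hypothetical blueprint with one map keyed by sled ID.
#[derive(Debug, Default)]
struct CombinedBlueprint {
    sleds: BTreeMap<SledId, SledConfig>,
}

impl CombinedBlueprint {
    /// Zones for a sled; `None` means the sled is not in the blueprint at all.
    fn zones(&self, sled: SledId) -> Option<&[String]> {
        self.sleds.get(&sled).map(|cfg| cfg.zones.as_slice())
    }
}
```

With this layout a decommissioned sled keeps its entry (with `SledState::Decommissioned`) until cleanup finishes, instead of vanishing from one map while lingering in the others.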

…ol, it must be from a disk present in the blueprint (#7106)

Builds and is staged on top of #7105.

The intended change here is in the first commit (8274174): In `BlueprintBuilder::sled_select_zpool()`, instead of only looking at the `PlanningInput`, we also look at the disks present in the blueprint, and only select a zpool that the planning input says is in service and that we have in the blueprint.

This had a surprisingly-large blast radius in terms of tests - we had _many_ tests which were adding zones (which implicitly selects a zpool) from a `BlueprintBuilder` where there were no disks configured at all, causing them to emit invalid blueprints. These should all be fixed as of this PR, but I'm a little worried about test fragility in general, particularly with an eye toward larger changes like #7078. Nothing to do about that at the moment, but something to keep an eye on.

Fixes #7079.
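The selection rule described in that commit message amounts to intersecting two sets. The sketch below shows only that core idea; `ZpoolId` and `select_zpool` are names assumed for illustration, and the real `sled_select_zpool()` may apply further criteria not mentioned in this thread.

```rust
use std::collections::BTreeSet;

// Placeholder ID type; an assumption for this sketch.
type ZpoolId = u32;

/// Pick a zpool that the planning input reports as in service AND that is
/// backed by a disk present in the blueprint. Returns `None` when no zpool
/// satisfies both conditions (e.g. a builder with no disks configured).
fn select_zpool(
    in_service: &BTreeSet<ZpoolId>,
    in_blueprint: &BTreeSet<ZpoolId>,
) -> Option<ZpoolId> {
    // `intersection` yields elements in ascending order, so this picks the
    // smallest qualifying ID; the real selection policy may differ.
    in_service.intersection(in_blueprint).next().copied()
}
```

Under this rule, a test that adds a zone without first configuring any disks gets `None` rather than a zpool the blueprint knows nothing about.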

…uilder` (#7204)

This is a step on the road to #7078. The current `BlueprintBuilder` internals would not at all be amenable to squishing the disparate maps in `Blueprint` together, so this PR tries to rework those internals.

`BlueprintBuilder` now does its own map squishing (inside `new_based_on()`), combining state+zones+disks+datasets into one helper (`SledEditor`). All modifications are performed via those editors, then `BlueprintBuilder::build()` breaks the combined editors' results back out into four maps for `Blueprint`.

There should be only one functional change in this PR (marked with a `TODO-john`): previously, expunging a zone checked that that zone had not been modified in the current planning cycle. I'm not sure that's really necessary; it felt like some defensiveness born out of the complexity of the builder itself, maybe? I think we could put it back (inside `ZonesEditor`, presumably) if folks think it would be good to keep.

My intention was to change the tests as little as possible to ensure I didn't break anything as I was moving functionality around, so `new_based_on()` and `build()` have some arguably-pretty-gross concessions to behave exactly the way the builder did before these changes. I'd like to _remove_ those concessions, but those will be nontrivial behavioral changes that I don't want to try to roll in with all of this cleanup. I think I landed this pretty well; there are only a few expectorate changes (due to the slightly reworked `ZonesEditor` producing a different ordering for a few tests), and none in `omdb` (which is where I've often seen incidental planner changes bubble out).

Apologies for the size of the PR. I'd lightly recommend that the new `blueprint_editor/` module and its subcomponents should be pretty quickly reviewable and should be looked at first; they're _supposed_ to be simple and obvious, and are largely ported over from the prior storage / disks / datasets / zones editors. I did rework how we're handling #6645 backwards compat to try to reduce how much we need to pass `SledResources` around, so that complicates things in `DatasetsEditor` some, but not too bad IMO. (I also added a test for this, since I changed it and I don't think we had one before.) The `BlueprintBuilder` changes should then be more or less the natural way one would use a collection of `SledEditor`s without changing its API or behavior (yet!).
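The combine-then-split flow described in that commit message can be sketched roughly as follows. Only the name `SledEditor` comes from the thread; its fields, the `FourMaps` struct, the `combine` / `split` functions, and the `String` placeholders are assumptions for illustration, and the backwards-compatibility concessions the message mentions are deliberately left out.

```rust
use std::collections::BTreeMap;

// Stand-in for `SledUuid`; an assumption for this sketch.
type SledId = u32;

/// Hypothetical per-sled editor holding all four pieces together.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct SledEditor {
    state: Option<String>,
    zones: Vec<String>,
    disks: Vec<String>,
    datasets: Vec<String>,
}

/// The four separate maps, mirroring the layout of today's `Blueprint`.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
struct FourMaps {
    sled_state: BTreeMap<SledId, String>,
    zones: BTreeMap<SledId, Vec<String>>,
    disks: BTreeMap<SledId, Vec<String>>,
    datasets: BTreeMap<SledId, Vec<String>>,
}

/// Combine the four maps into one editor per sled (the "squishing" step).
fn combine(maps: FourMaps) -> BTreeMap<SledId, SledEditor> {
    let mut editors: BTreeMap<SledId, SledEditor> = BTreeMap::new();
    for (id, state) in maps.sled_state {
        editors.entry(id).or_default().state = Some(state);
    }
    for (id, zones) in maps.zones {
        editors.entry(id).or_default().zones = zones;
    }
    for (id, disks) in maps.disks {
        editors.entry(id).or_default().disks = disks;
    }
    for (id, datasets) in maps.datasets {
        editors.entry(id).or_default().datasets = datasets;
    }
    editors
}

/// Break the editors back out into four maps. This simplified version always
/// emits zones/disks/datasets entries, even empty ones, for every sled.
fn split(editors: BTreeMap<SledId, SledEditor>) -> FourMaps {
    let mut maps = FourMaps::default();
    for (id, ed) in editors {
        if let Some(state) = ed.state {
            maps.sled_state.insert(id, state);
        }
        maps.zones.insert(id, ed.zones);
        maps.disks.insert(id, ed.disks);
        maps.datasets.insert(id, ed.datasets);
    }
    maps
}
```

Because every modification goes through a per-sled editor, any cross-checks between zones, disks, and datasets have a single natural home, even while the external representation remains four maps.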

jgallagher · 2024-12-19T21:22:29Z

This shouldn't have been closed; GitHub picked up on "fix #7078" in a commit message, but in context it's clear that the commit does not fix this issue.

This commit introduces an automated blueprint diff mechanism based on [diffus](https://github.com/oxidecomputer/diffus) along with the visitor pattern for traversing heterogeneous diff trees. It builds on several prior commits that introduced diffus support and added visitors for types: #7261 #7336 #7362 #7366

The new `BlueprintDiffer` implements the relevant visitors and accumulates change state from visitor method callbacks. The accumulated state is the `BlueprintDiff`, which replaces the old `BlueprintDiff`. The new structure intentionally groups resources by sled, as the 4 primary maps in a `Blueprint` (`sled_state`, `blueprint_zones`, `blueprint_disks`, `blueprint_datasets`) are going to be collapsed into a single map. This allows us to modify the visitors to traverse the new blueprint diffs in one place and not have to worry about modifying the `BlueprintDiff` itself or any consumers. More details about why we are collapsing these maps can be found in #7078.

We primarily use `BlueprintDiff`s for testing and omdb output, and therefore the printed representation is absolutely critical for us. This commit reuses the types and formatting contained in `nexus/types/src/deployment/blueprint_display.rs` and provides a new type, `BpDiffPrintable`, that can be created from a `BlueprintDiff`. `BpDiffPrintable` contains our formatted tables ready to be used by `BlueprintDiffDisplay` along with the original blueprints to render the diff. It's possible that we can collapse `BlueprintDiffDisplay` and `BpDiffPrintable` into one type and simplify a bit. I just haven't done that yet.

The printable output of the diffs has maintained backwards compatibility in all cases except for errors and warnings. Those have been specifically adapted to work with our visitors and be accumulated while walking the diffus diffs. This only resulted in a change to one test output file, which should leave us confident in the correctness of this new implementation.

An optional `show_unchanged` flag was added at key points in the code, and in the future we plan to change the default to only show actual changes. We didn't do that here so that we could ensure the output of the new implementation matches the existing code. We also plan to add more columns, such as `disposition` for datasets.

Usage going forward
-------------------

This is a significant change to our blueprint diff code, and in some cases it may take slightly more boilerplate to diff newly added fields or structs than in the older code. However, there are a few things this new style has going for it.

Figuring out the differences between blueprints is now done automatically and completely. There is no need to compute these differences by hand, which is both error-prone and complicated. See the old [BpDiffZones](https://github.com/oxidecomputer/omicron/blob/888f90db8b36ccdf667d96423ae7805824c48aa9/nexus/types/src/deployment/blueprint_diff.rs#L195-L340) code for what finding the differences for just zones in a blueprint looked like.

With automated diffs output from diffus, we now need to write code to expose the necessary change information to consumers. We want to do this in a unified and rigorous manner, and one that composes nicely. We chose a `visitor` pattern for this. For each somewhat complex type (use your judgement), we provide a visitor trait that walks the heterogeneous diff tree and calls trait methods when a change is found. User code implements only the visitor methods that it cares about and then can accumulate its own internal state based on which callbacks fire. We provide a top-level `VisitBlueprint` trait to detect all changes in a blueprint and implement it in `BlueprintDiffer` to construct a `BlueprintDiff` in a format we can use.

There are a few important things to note about our visitors. While diffus provides a complete tree of all changes as well as fields and variants that have not changed, the visitors as currently implemented only expose changes. These changes are unified for simplicity into a `Change` type. We also don't currently expose all changes. For most fields that we don't expect to ever change, there is no change callback. Mostly this is because the visitors were written before I had to use them :). It turns out it may actually be useful to expose these fields so that we can report errors or warnings in our diffs if they do change unexpectedly. You can see an example of error tracking in the `BlueprintDiffer::visit_zone_type_change` method and its test output in `nexus/reconfigurator/planning/tests/output/planner_nonprovisionable_2_2a.txt`.

It is tedious to write visitor traits to walk a diffus diff and then implement these traits to accumulate the state we want. It can reasonably be asked why we wouldn't just walk the tree and generate the state we want directly without the visitors. Indeed, this is quite reasonable for small one-off tasks. But we are building a foundation that will grow over time in terms of type structure and number of developers working on the system. For such a long-term project, it's useful to decouple the tree walking from the state accumulation. For one thing, it makes each bit easier to read and write on its own. For another, it allows us to build different implementations of the visitors for different use cases, such as testing. Remember, users only have to implement the methods they care about. Therefore, if they only want to see if a zone changed, they don't have to worry about parsing diffus types, but can just implement a callback. An example of this is the `TestVisitor` in `live-tests/tests/test_nexus_add_remove.rs`.

I anticipate we will find ways to make implementing and using diff visitors more ergonomic over time while maintaining the compositional rigor that the separation of concerns provides in this model.

Testing
-------

I mainly have run planning tests locally to ensure diffs work as expected. I still need to run the live tests and play with omdb.
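The visitor idea from that commit message (default no-op callbacks, a unified change type, and accumulation separated from tree walking) can be sketched in miniature as below. This does not use diffus: the walk compares fields by hand, and `Zone`, `VisitZone`, `walk_zone`, and `DispositionTracker` are hypothetical names. The `Change` struct here is only loosely modeled on the `Change` type the message mentions.

```rust
/// A before/after pair; a simplified stand-in for the `Change` type above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Change<'a, T> {
    before: &'a T,
    after: &'a T,
}

// Hypothetical zone record for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Zone {
    id: u32,
    kind: String,
    disposition: String,
}

/// Visitor with default no-op callbacks; implementors override only the
/// methods they care about.
trait VisitZone {
    fn visit_kind_change(&mut self, _zone_id: u32, _change: Change<'_, String>) {}
    fn visit_disposition_change(&mut self, _zone_id: u32, _change: Change<'_, String>) {}
}

/// The tree walk: compares fields and fires a callback only when one changed.
fn walk_zone<V: VisitZone>(visitor: &mut V, before: &Zone, after: &Zone) {
    if before.kind != after.kind {
        let change = Change { before: &before.kind, after: &after.kind };
        visitor.visit_kind_change(before.id, change);
    }
    if before.disposition != after.disposition {
        let change = Change { before: &before.disposition, after: &after.disposition };
        visitor.visit_disposition_change(before.id, change);
    }
}

/// State accumulation: records only disposition changes and ignores the rest.
#[derive(Default)]
struct DispositionTracker {
    changed: Vec<(u32, String, String)>,
}

impl VisitZone for DispositionTracker {
    fn visit_disposition_change(&mut self, zone_id: u32, change: Change<'_, String>) {
        self.changed.push((zone_id, change.before.clone(), change.after.clone()));
    }
}
```

A test-only visitor would look like `DispositionTracker`: it implements one callback and never touches the diff types directly, which is the decoupling the message argues for.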

jgallagher mentioned this issue Nov 15, 2024

[reconfigurator] BlueprintBuilder API allows non-planner clients to emit invalid blueprints #7080

Closed

jgallagher self-assigned this Nov 15, 2024

jgallagher mentioned this issue Nov 19, 2024

[reconfigurator] BlueprintBuilder cleanup 4/5 - when choosing a zpool, it must be from a disk present in the blueprint #7106

Merged

jgallagher mentioned this issue Dec 4, 2024

[reconfigurator] Introduce a combined SledEditor inside BlueprintBuilder #7204

Merged

jgallagher mentioned this issue Dec 11, 2024

[reconfigurator] SledEditor: be more strict about decommissioned sleds #7234

Merged

jgallagher closed this as completed in #7234 Dec 16, 2024

jgallagher closed this as completed in ca21fe7 Dec 16, 2024

jgallagher reopened this Dec 19, 2024

jgallagher mentioned this issue Jan 6, 2025

Reconfigurator "deploy datasets then zones" ordering will break for zone expungement #7309

Open

andrewjstone mentioned this issue Jan 25, 2025

New BlueprintDiff implementation #7402

Draft
