Nexus must create "Control Plane Disks" for disks automatically #5503

Closed · smklein opened this issue Apr 10, 2024 · 1 comment · Fixed by #5506

smklein commented Apr 10, 2024

This is the root cause of #5502.

(The decision to create a new issue is a little arbitrary; I just wanted to deep-dive into the root cause here, while letting the symptoms stay visible on #5502.)

As of #5172, RSS / Nexus are responsible for managing "control plane disks" and "control plane zpools". In this world, the sled agent no longer autonomously formats disks, but waits for the "go-ahead" from Nexus.

For bootstrapping, RSS also has this responsibility:

  • It creates a configuration based on inventory:

        for sled_info in sled_info.iter_mut() {
            let disks = sled_info
                .inventory
                .disks
                .iter()
                .filter(|disk| {
                    matches!(disk.variant, SledAgentTypes::DiskVariant::U2)
                })
                .map(|disk| OmicronPhysicalDiskConfig {
                    identity: disk.identity.clone(),
                    id: Uuid::new_v4(),
                    pool_id: Uuid::new_v4(),
                })
                .collect();
            sled_info.request.disks = OmicronPhysicalDisksConfig {
                generation: Generation::new(),
                disks,
            };
            sled_info.u2_zpools = sled_info
                .request
                .disks
                .disks
                .iter()
                .map(|disk| ZpoolName::new_external(disk.pool_id))
                .collect();
        }
  • Sends the configuration to sleds:

        let result = client
            .omicron_physical_disks_put(&storage_config.clone())
            .await;
  • And tells Nexus about it in the RSS handoff, where the info is stored in CRDB:

        for physical_disk in physical_disks {
            Self::physical_disk_upsert_on_connection(&conn, &opctx, physical_disk)
                .await
                .map_err(|e| {
                    error!(log, "Failed to upsert physical disk"; "err" => #%e);
                    err.set(RackInitError::PhysicalDiskInsert(e))
                        .unwrap();
                    DieselError::RollbackTransaction
                })?;
        }
        info!(log, "Inserted physical disks");
        for zpool in zpools {
            Self::zpool_upsert_on_connection(&conn, &opctx, zpool).await.map_err(
                |e| {
                    error!(log, "Failed to upsert zpool"; "err" => #%e);
                    err.set(RackInitError::ZpoolInsert(e)).unwrap();
                    DieselError::RollbackTransaction
                },
            )?;
        }

So, this works in the initialization case.

However, in the case of "sled addition", we're missing the follow-up parts:

  • Nexus needs to assemble the OmicronPhysicalDisksConfig that gets sent to sleds (a rough sketch of this assembly step follows this list)
    • For "new disks": This means scanning the inventory for physical disks, and creating a new control plane object if one does not exist. Note that "expunged disks" still exist, and should not be automatically re-added.
    • For "existing disks": This means using control plane disks that are not expunged.
  • As a part of blueprint execution, Nexus needs to actually send OmicronPhysicalDisksConfig to individual sleds
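
A rough sketch of that assembly step, assuming the control plane tracks an in-service vs. expunged state for each disk record. The types below are simplified stand-ins written for illustration; only the field names that already appear in the snippets above come from the actual code:

    // Sketch only: simplified stand-ins, not the real omicron types.
    use std::collections::BTreeMap;
    use uuid::Uuid;

    #[derive(Clone)]
    struct DiskIdentity {
        vendor: String,
        serial: String,
        model: String,
    }

    #[derive(Clone, PartialEq)]
    enum ControlPlaneDiskState {
        InService,
        Expunged,
    }

    #[derive(Clone)]
    struct OmicronPhysicalDiskConfig {
        identity: DiskIdentity,
        id: Uuid,
        pool_id: Uuid,
    }

    struct OmicronPhysicalDisksConfig {
        generation: u64,
        disks: Vec<OmicronPhysicalDiskConfig>,
    }

    /// Assemble the per-sled disk config from (a) the control plane's existing
    /// records for that sled and (b) the U.2 disks reported in inventory.
    fn plan_disks_for_sled(
        // Existing control plane disks for this sled, keyed by serial number.
        current: &BTreeMap<String, (ControlPlaneDiskState, OmicronPhysicalDiskConfig)>,
        // U.2 disks reported by the sled agent in the latest inventory collection.
        inventory: &[DiskIdentity],
        parent_generation: u64,
    ) -> OmicronPhysicalDisksConfig {
        let mut disks = Vec::new();

        // Existing disks: carry forward everything that is still in service.
        for (state, config) in current.values() {
            if *state == ControlPlaneDiskState::InService {
                disks.push(config.clone());
            }
        }

        // New disks: adopt anything in inventory with no control plane record,
        // minting fresh disk and pool IDs. Expunged disks still have a record,
        // so they are skipped here rather than auto-re-added.
        for identity in inventory {
            if !current.contains_key(&identity.serial) {
                disks.push(OmicronPhysicalDiskConfig {
                    identity: identity.clone(),
                    id: Uuid::new_v4(),
                    pool_id: Uuid::new_v4(),
                });
            }
        }

        OmicronPhysicalDisksConfig {
            generation: parent_generation + 1,
            disks,
        }
    }

Blueprint execution would then push the resulting OmicronPhysicalDisksConfig to each sled using the same omicron_physical_disks_put call shown in the RSS snippet above.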

Regarding new disk addition, there's a bit of uncertainty about "when we should perform this action".

  • We could do it once during "sled add" - take a specific point-in-time view of inventory, adopt everything then, and subsequently refuse to auto-online disks
  • ... or we could have "auto-online" be the default policy, until the disk is expunged, at which time it must be explicitly re-added

My opinion incoming: I think we've gone back-and-forth on this a bit because it's a policy decision. An operator could explicitly opt into an "add physical disk" API for each addition. Or they could just want all disks to auto-attach and start working. We could save this as a toggle, and enable both pathways? But that might be unnecessary effort, if we end up recommending / observing that customers prefer a particular option.
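
A minimal sketch of what such a toggle could look like, assuming a hypothetical DiskAdoptionPolicy setting consulted during planning; the names below are illustrative, not from the codebase:

    // Sketch only: `DiskAdoptionPolicy` and `should_adopt` are hypothetical.
    #[derive(Clone, Copy, PartialEq)]
    enum DiskAdoptionPolicy {
        /// Adopt any non-expunged disk that appears in inventory.
        AutoOnline,
        /// Only adopt disks an operator has explicitly added via an API call.
        ExplicitAdd,
    }

    fn should_adopt(
        policy: DiskAdoptionPolicy,
        operator_requested: bool,
        previously_expunged: bool,
    ) -> bool {
        // Under either policy, an expunged disk is only re-added explicitly.
        if previously_expunged {
            return operator_requested;
        }
        match policy {
            DiskAdoptionPolicy::AutoOnline => true,
            DiskAdoptionPolicy::ExplicitAdd => operator_requested,
        }
    }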

andrewjstone commented Apr 10, 2024

> Regarding new disk addition, there's a bit of uncertainty about "when we should perform this action".
>
>   • We could do it once during "sled add" - take a specific point-in-time view of inventory, adopt everything then, and subsequently refuse to auto-online disks
>   • ... or we could have "auto-online" be the default policy, until the disk is expunged, at which time it must be explicitly re-added

I think my preference is to always 'auto-online' right now. We don't know, timing-wise, which disks we'll get first, and we still need to add APIs for explicit adoption. Once we add those APIs we can choose not to auto-online. If we don't auto-online, we may also want to show customers which disks they are getting when they add a sled by listing them in list-uninitialized.

smklein self-assigned this Apr 11, 2024
smklein added a commit that referenced this issue Apr 14, 2024
… via blueprints (#5506)

Automatically adopt disks and deploy them to sleds using blueprints

- Queries for physical disk info during reconfigurator planning phase
- Adds "physical disks" to blueprint, in-memory as well as the database
schema
- Blueprint planning now ensures that in-service physical disks appear
in the blueprint
- Blueprint execution sends a request to sled agents via
`omicron-physical-disks PUT`
- A background task has been added to automatically adopt new physical
disks as control plane objects, and to insert them into the database
- "Physical disk upsert" has largely been changed to "Physical disk
insert", to avoid potential overwriting issues. "Zpool upsert" has also
been updated to "Zpool insert".
- The physical disk "vendor/serial/model" uniqueness constraint has been
removed for decommissioned disks. This will provide a pathway to
eventually re-provisioning deleted disks, if an operator asks for it.

Fixes #5503, #5502
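
A rough sketch of the shape of the adoption background task described in the commit above, with placeholder traits standing in for the datastore and inventory queries; the actual task uses Nexus's background-task framework and datastore types:

    // Sketch only: `Datastore` and `Inventory` are placeholder traits, not the
    // real Nexus interfaces.
    use uuid::Uuid;

    #[derive(Clone, PartialEq)]
    struct DiskIdentity {
        vendor: String,
        serial: String,
        model: String,
    }

    trait Datastore {
        /// Identities of all physical disks the control plane already knows
        /// about, including expunged and decommissioned ones.
        fn known_disk_identities(&self) -> Vec<DiskIdentity>;
        /// Insert a new physical disk record together with its zpool.
        fn physical_disk_and_zpool_insert(
            &self,
            identity: DiskIdentity,
            disk_id: Uuid,
            pool_id: Uuid,
        );
    }

    trait Inventory {
        /// U.2 disks reported by sled agents in the latest collection.
        fn u2_disks(&self) -> Vec<DiskIdentity>;
    }

    /// One activation of the task: insert control plane records (disk + zpool)
    /// for any inventory disk the database has never seen. Disks that were
    /// expunged still have a record, so they are not re-adopted here.
    fn adopt_new_disks(datastore: &impl Datastore, inventory: &impl Inventory) {
        let known = datastore.known_disk_identities();
        for identity in inventory.u2_disks() {
            if !known.contains(&identity) {
                datastore.physical_disk_and_zpool_insert(
                    identity,
                    Uuid::new_v4(),
                    Uuid::new_v4(),
                );
            }
        }
    }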