
Nexus needs reconciliation process for Sled failure / removal #612

Closed

smklein opened this issue Jan 19, 2022 · 4 comments

smklein (Collaborator) commented Jan 19, 2022

Nexus currently stores information about Sled IDs within CRDB, including:

  • Instances belonging to a particular Sled
  • Zpools, datasets, and (pending) regions belonging to a particular Sled
  • (soon) Versioning and inventory information for a sled

If a sled irrecoverably fails - not just reboots, but is either destroyed or unplugged - Nexus needs a process for recovering all data that has been detached from the sled.
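
As a rough illustration of the kind of sled-scoped records involved (the struct names here are invented for illustration, not Nexus's actual CRDB schema):

```rust
// Hypothetical sketch only: invented types showing how several kinds
// of records are rooted at a sled ID.
use uuid::Uuid;

/// An instance record points back at the sled hosting it.
struct Instance {
    id: Uuid,
    active_sled_id: Uuid, // dangles if the sled is permanently removed
}

/// Storage resources are likewise rooted at a sled.
struct Zpool {
    id: Uuid,
    sled_id: Uuid,
}

struct Dataset {
    id: Uuid,
    zpool_id: Uuid, // transitively sled-scoped via the zpool
}
```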

This will likely involve (sketched roughly below):

  • Marking all instances previously running on the sled as faulted, if they do not have backups
  • Migrating regions to new locations, if available
  • Cleaning sled-specific information from the database
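
A minimal sketch of those three steps, using invented names; the real process would involve sagas, Crucible re-replication, and capacity checks:

```rust
// Hedged sketch: none of these functions exist in Nexus; this only
// illustrates the ordering of the cleanup steps listed above.
use uuid::Uuid;

fn reconcile_removed_sled(
    sled_id: Uuid,
    instance_ids: &[Uuid],
    region_ids: &[Uuid],
) -> Vec<String> {
    let mut plan = Vec::new();
    // 1. Instances on the dead sled cannot be live-migrated; mark them
    //    faulted so the failure is visible to their owners.
    for id in instance_ids {
        plan.push(format!("mark instance {id} faulted"));
    }
    // 2. Regions should be re-replicated onto healthy sleds, capacity
    //    permitting.
    for id in region_ids {
        plan.push(format!("migrate region {id} to a new sled"));
    }
    // 3. Only then is it safe to drop the sled-scoped rows from CRDB.
    plan.push(format!("delete CRDB records for sled {sled_id}"));
    plan
}
```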

Additionally, it is important to note that "unplugging a sled" is unpredictable and may occur at any time. There is a separate, critical question of distinguishing between a temporary and a permanent failure (we would want to treat a crash + reboot much differently from a long-term sled removal). However, in the period while we're still making that decision, we need a policy for dealing with ongoing operations to that sled; one possible shape is sketched below.
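
A hypothetical state machine for that policy window, with invented names; in line with the discussion below, only an explicit operator action would mark a failure permanent:

```rust
// Sketch only: these are not Nexus's actual sled states.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SledState {
    /// Healthy; requests flow normally.
    InService,
    /// Not responding; fate undecided. Requests need a holding policy.
    Unreachable,
    /// Operator confirmed permanent removal; reconciliation may run.
    Expunged,
}

/// A crash + reboot never triggers this transition; only a controlled
/// operator action does.
fn operator_expunge(state: SledState) -> Result<SledState, &'static str> {
    match state {
        SledState::Unreachable => Ok(SledState::Expunged),
        SledState::InService => Err("sled is still in service"),
        SledState::Expunged => Err("sled was already expunged"),
    }
}
```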

Arguably, the "deletion" operations are particularly nasty in this time period: we don't know whether we can discard those requests ("the sled is gone, so we just need to clean the DB"), store them for later ("the sled is offline now, but if it comes back, we'll tell it to perform the deletion"), or fail them ("deletion is not possible while this sled is offline!"). The options are sketched below.
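
The three options, as a hypothetical enum (names invented for illustration); which one Nexus should pick, possibly per resource type, is the open question:

```rust
// Sketch of the possible policies for a deletion request that targets
// an offline sled.
enum OfflineDeletionPolicy {
    /// "The sled is gone": drop the request and just clean the DB.
    Discard,
    /// "It may come back": persist the request and replay it later.
    Defer,
    /// "Deletion is not possible while this sled is offline": fail.
    Reject,
}

fn describe(policy: &OfflineDeletionPolicy) -> &'static str {
    match policy {
        OfflineDeletionPolicy::Discard => "clean the DB immediately",
        OfflineDeletionPolicy::Defer => "queue for replay on return",
        OfflineDeletionPolicy::Reject => "return an error to the caller",
    }
}
```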

andrewjstone (Contributor) commented:

Great writeup @smklein. I would just like to add that we have discussed sled removal in the past and I believe tentatively decided that "detecting" permanent removal should be done by a rack admin in a controlled operation. @rmustacc

leftwo (Contributor) commented Jan 19, 2022

Yeah, @smklein, great write-up. I was thinking about this as well, specifically the case where a Crucible region on a sled is deleted while that sled is not present, and how we deal with it when the sled comes back.

smklein added this to the FCS milestone Jul 7, 2023
morlandi7 added the "known issue" label (to include in customer documentation and training) Jul 11, 2023
morlandi7 modified the milestones: FCS, 1.0.3 Aug 15, 2023
morlandi7 modified the milestones: 1.0.3, 3, 4 Oct 2, 2023
morlandi7 modified the milestones: 4, 5 Nov 14, 2023
morlandi7 modified the milestones: 5, 6 Dec 4, 2023
morlandi7 modified the milestones: 6, 7 Jan 25, 2024
askfongjojo removed the "known issue" label Mar 9, 2024
morlandi7 modified the milestones: 7, 8 Mar 12, 2024
davepacheco (Collaborator) commented:

This process is fleshed out in RFD 457.

davepacheco self-assigned this Apr 16, 2024
davepacheco (Collaborator) commented:

Closing this issue, not because it's wholly done, but because the more specific pieces are tracked by other issues and the Reconfigurator project/board, so I don't think this represents any particular open work or milestones anymore. Feel free to reopen if that's wrong!
