
Nexus needs reconciliation process for Sled failure / removal #612

Closed

smklein opened this issue Jan 19, 2022 · 4 comments

smklein (Collaborator) commented Jan 19, 2022

Nexus currently stores information about Sled IDs within CRDB, including:

  • Instances belonging to a particular Sled
  • Zpools, datasets, and (pending) regions belonging to a particular Sled
  • (soon) Versioning and inventory information for a sled

If a sled irrecoverably fails - not just reboots, but is either destroyed or unplugged - Nexus needs a process for recovering all data that has been detached from the sled.
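
As a rough illustration of the kind of sled-scoped records involved (the struct names here are invented for illustration, not Nexus's actual CRDB schema):

```rust
// Hypothetical sketch only: invented types showing how several kinds
// of records are rooted at a sled ID.
use uuid::Uuid;

/// An instance record points back at the sled hosting it.
struct Instance {
    id: Uuid,
    active_sled_id: Uuid, // dangles if the sled is permanently removed
}

/// Storage resources are likewise rooted at a sled.
struct Zpool {
    id: Uuid,
    sled_id: Uuid,
}

struct Dataset {
    id: Uuid,
    zpool_id: Uuid, // transitively sled-scoped via the zpool
}
```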

This will likely involve (sketched roughly below):

  • Marking all instances previously running on the sled as faulted, if they do not have backups
  • Migrating regions to new locations, if available
  • Cleaning sled-specific information from the database
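
A minimal sketch of those three steps, using invented names; the real process would involve sagas, Crucible re-replication, and capacity checks:

```rust
// Hedged sketch: none of these functions exist in Nexus; this only
// illustrates the ordering of the cleanup steps listed above.
use uuid::Uuid;

fn reconcile_removed_sled(
    sled_id: Uuid,
    instance_ids: &[Uuid],
    region_ids: &[Uuid],
) -> Vec<String> {
    let mut plan = Vec::new();
    // 1. Instances on the dead sled cannot be live-migrated; mark them
    //    faulted so the failure is visible to their owners.
    for id in instance_ids {
        plan.push(format!("mark instance {id} faulted"));
    }
    // 2. Regions should be re-replicated onto healthy sleds, capacity
    //    permitting.
    for id in region_ids {
        plan.push(format!("migrate region {id} to a new sled"));
    }
    // 3. Only then is it safe to drop the sled-scoped rows from CRDB.
    plan.push(format!("delete CRDB records for sled {sled_id}"));
    plan
}
```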

Additionally, it is important to note that "unplugging a sled" is unpredictable and may occur at any time. There is a separate, critical question of distinguishing between a temporary and a permanent failure (we would want to treat a crash + reboot much differently from a long-term sled removal). However, in the period while we're still making that decision, we need a policy for dealing with ongoing operations to that sled; one possible shape is sketched below.
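
A hypothetical state machine for that policy window, with invented names; in line with the discussion below, only an explicit operator action would mark a failure permanent:

```rust
// Sketch only: these are not Nexus's actual sled states.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SledState {
    /// Healthy; requests flow normally.
    InService,
    /// Not responding; fate undecided. Requests need a holding policy.
    Unreachable,
    /// Operator confirmed permanent removal; reconciliation may run.
    Expunged,
}

/// A crash + reboot never triggers this transition; only a controlled
/// operator action does.
fn operator_expunge(state: SledState) -> Result<SledState, &'static str> {
    match state {
        SledState::Unreachable => Ok(SledState::Expunged),
        SledState::InService => Err("sled is still in service"),
        SledState::Expunged => Err("sled was already expunged"),
    }
}
```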

Arguably, the "deletion" operations are particularly nasty in this time period: we don't know whether we can discard those requests ("the sled is gone, so we just need to clean the DB"), store them for later ("the sled is offline now, but if it comes back, we'll tell it to perform the deletion"), or fail them ("deletion is not possible while this sled is offline!"). The options are sketched below.
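
The three options, as a hypothetical enum (names invented for illustration); which one Nexus should pick, possibly per resource type, is the open question:

```rust
// Sketch of the possible policies for a deletion request that targets
// an offline sled.
enum OfflineDeletionPolicy {
    /// "The sled is gone": drop the request and just clean the DB.
    Discard,
    /// "It may come back": persist the request and replay it later.
    Defer,
    /// "Deletion is not possible while this sled is offline": fail.
    Reject,
}

fn describe(policy: &OfflineDeletionPolicy) -> &'static str {
    match policy {
        OfflineDeletionPolicy::Discard => "clean the DB immediately",
        OfflineDeletionPolicy::Defer => "queue for replay on return",
        OfflineDeletionPolicy::Reject => "return an error to the caller",
    }
}
```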

andrewjstone (Contributor) commented:

Great writeup @smklein. I would just like to add that we have discussed sled removal in the past and I believe tentatively decided that "detecting" permanent removal should be done by a rack admin in a controlled operation. @rmustacc

leftwo (Contributor) commented Jan 19, 2022

Yeah, @smklein, great write-up. I was thinking about this as well, specifically the case where a Crucible region on a sled is deleted while that sled is not present, and how we deal with it when the sled comes back.

smklein added this to the FCS milestone Jul 7, 2023
morlandi7 added the "known issue" label (to include in customer documentation and training) Jul 11, 2023
morlandi7 modified the milestones: FCS, 1.0.3 Aug 15, 2023
morlandi7 modified the milestones: 1.0.3, 3, 4 Oct 2, 2023
morlandi7 modified the milestones: 4, 5 Nov 14, 2023
morlandi7 modified the milestones: 5, 6 Dec 4, 2023
morlandi7 modified the milestones: 6, 7 Jan 25, 2024
askfongjojo removed the "known issue" label Mar 9, 2024
morlandi7 modified the milestones: 7, 8 Mar 12, 2024
davepacheco (Collaborator) commented:

This process is fleshed out in RFD 457.

davepacheco self-assigned this Apr 16, 2024
davepacheco (Collaborator) commented:

Closing this issue, not because it's wholly done, but because the more specific pieces are tracked by other issues and the Reconfigurator project/board, so I don't think this represents any particular open work or milestones anymore. Feel free to reopen if that's wrong!
