Nexus needs reconciliation process for Sled failure / removal #612
Comments
Yeah, @smklein, great write-up. I was thinking about this as well, specifically the case where a Crucible region on a sled is deleted while that sled is not present, and how we deal with it when the sled comes back.
This process is fleshed out in RFD 457.
Closing this issue, not because it's wholly done, but because more specific pieces are tracked by other issues and the Reconfigurator project/board, so I don't think this represents any particular open work or milestones any more. Feel free to reopen if that's wrong!
Nexus currently stores information about Sled IDs within CRDB, including:
If a sled irrecoverably fails - not just reboots, but is either destroyed or unplugged - Nexus needs to have a process for recovering all data which has been detached from the Sled.
This will likely involve:
Additionally, it is important to note that "unplugging a sled" is unpredictable, and may occur at any time. There is a separate, critical question of distinguishing between a temporary and a permanent failure (we would want to treat a crash + reboot much differently from a long-term sled removal). However, in the period of time while we're still making that decision, we need to have a policy for dealing with ongoing operations to that sled.
Arguably, the "deletion" operations are particularly nasty in this time period - we don't know if we can discard those requests ("the sled is gone, so we just need to clean the DB"), store them for later ("the sled is offline now, but if it comes back, we'll tell it to perform the deletion"), or fail those requests ("deletion is not possible while this sled is offline!").
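To make the three options concrete, here is a minimal sketch of how they might be modeled. Everything in it is hypothetical and for illustration only: `SledState`, `OfflineDeletePolicy`, `DeleteOutcome`, and `DeletePlanner` are not existing Nexus types, resources are reduced to bare `u64` IDs, and the choice to do a DB-only cleanup for a sled already declared removed is an assumption rather than something decided in this issue.

```rust
use std::collections::VecDeque;

/// Hypothetical view of a sled's reachability from Nexus.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SledState {
    Online,
    /// Unreachable; temporary vs permanent has not been decided yet.
    Undetermined,
    /// Declared permanently gone.
    Removed,
}

/// The three candidate policies for deletions during the undetermined window.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OfflineDeletePolicy {
    /// "The sled is gone, so we just need to clean the DB."
    Discard,
    /// "If it comes back, we'll tell it to perform the deletion."
    Defer,
    /// "Deletion is not possible while this sled is offline!"
    Reject,
}

#[derive(Debug, PartialEq, Eq)]
enum DeleteOutcome {
    SentToSled,
    DbOnlyCleanup,
    Deferred,
    Rejected,
}

/// Hypothetical planner that applies one policy and remembers deferred work.
struct DeletePlanner {
    policy: OfflineDeletePolicy,
    deferred: VecDeque<u64>,
}

impl DeletePlanner {
    fn new(policy: OfflineDeletePolicy) -> Self {
        DeletePlanner { policy, deferred: VecDeque::new() }
    }

    /// Decide what to do with a deletion request for a resource on a sled.
    fn request_delete(&mut self, state: SledState, resource_id: u64) -> DeleteOutcome {
        match state {
            SledState::Online => DeleteOutcome::SentToSled,
            // Assumption: once a sled is declared removed, only DB state remains.
            SledState::Removed => DeleteOutcome::DbOnlyCleanup,
            SledState::Undetermined => match self.policy {
                OfflineDeletePolicy::Discard => DeleteOutcome::DbOnlyCleanup,
                OfflineDeletePolicy::Defer => {
                    self.deferred.push_back(resource_id);
                    DeleteOutcome::Deferred
                }
                OfflineDeletePolicy::Reject => DeleteOutcome::Rejected,
            },
        }
    }

    /// If the sled comes back, hand back the deletions that still need replaying,
    /// in the order they were requested.
    fn sled_returned(&mut self) -> Vec<u64> {
        self.deferred.drain(..).collect()
    }
}
```

The sketch only makes the trade-off visible: `Discard` risks leaving stale data on a sled that later returns, `Defer` requires durable storage for the queue (an in-memory `VecDeque` would not survive a Nexus restart), and `Reject` pushes the problem back to the caller.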