storage: prevent startup after physical corruption #103899
Labels
A-storage
Relating to our storage engine (Pebble) on-disk storage.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
E-quick-win
Likely to be a quick win for someone experienced.
O-sre
For issues SRE opened or otherwise cares about tracking.
T-storage
Storage Team
Comments
jbowens added the C-bug, A-storage, T-storage, and E-quick-win labels on May 25, 2023
@RahulAggarwal1016 is going to pick this one up. Thanks, Rahul!
raggar added a commit to raggar/cockroach that referenced this issue on Jul 28, 2023:
Currently if a node faces sstable corruption, that node will crash and try to automatically restart. Since it is likely that the node may crash again, we would like to prevent the node from attempting to restart itself. As a result, this pr created a `PreventStartupFile` when a node experiences sstable corruption.
Fixes: cockroachdb#103899
Release-note: None
raggar added six more commits to raggar/cockroach that referenced this issue, on Aug 1 (twice), Aug 2, Aug 3, Aug 8, and Aug 9, 2023. Each repeats the commit message above; from Aug 2 onward the trailer is spelled `Release note: None`.
joshimhoff added the O-sre label on Aug 14, 2023
craig bot pushed a commit that referenced this issue on Aug 14, 2023:
107828: storage: Write PreventStartupFile on Node SSTFile Corruption r=RahulAggarwal1016 a=RahulAggarwal1016
Currently if a node faces sstable corruption, that node will crash and try to automatically restart. Since it is likely that the node may crash again, we would like to prevent the node from attempting to restart itself. As a result, this pr created a `PreventStartupFile` when a node experiences sstable corruption.
Note: There is another place where a node crashes due to sstable corruption:
https://github.com/cockroachdb/cockroach/blob/ba053dd3ff75dbca557e3b21195ed4cdff250660/pkg/storage/pebble_iterator.go#L958
however I am not quite sure how to access a `Pebble` instance without it getting a little messy.
Fixes: #103899
Release-note: None
Co-authored-by: Rahul Aggarwal <[email protected]>
Replication-level corruption crashes the node and also writes a special file (`base.PreventedStartupFile`) to prevent node startup:
cockroach/pkg/kv/kvserver/replica_corruption.go
Lines 57 to 69 in df21930
This is done because the node requires manual intervention. Today, corruption of physical media (e.g., bit rot resulting in node-local data loss) similarly requires manual intervention to replace the node. Although we hope to support automatic recovery from physical corruption using other replicas some day (#67568), in the meantime we require that the node with local corruption be decommissioned and replaced. We should consider writing `base.PreventedStartupFile` before crashing these nodes, to prevent automatic restarts of a node that we fully expect to crash again.
Jira issue: CRDB-28250
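For readers unfamiliar with the pattern, here is a minimal, self-contained Go sketch of the idea described above: persist a marker file when corruption is detected, then refuse to start while that file exists. It is an illustration only, not the CockroachDB implementation. The names `markerName`, `recordCorruption`, `checkStartupAllowed`, and `demo` are hypothetical, and the real code paths (`base.PreventedStartupFile`, `replica_corruption.go`) are not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// markerName is a hypothetical file name chosen for this sketch. It is not
// taken from the CockroachDB source.
const markerName = "PREVENT_STARTUP.txt"

// markerPath returns where the marker file lives inside a store directory.
func markerPath(storeDir string) string {
	return filepath.Join(storeDir, markerName)
}

// recordCorruption writes the marker file describing why the node stopped.
// The caller is expected to crash the process afterwards. A production
// version would likely also fsync the file and its directory; that is
// omitted here for brevity.
func recordCorruption(storeDir string, cause error) error {
	msg := fmt.Sprintf(
		"ATTENTION: this node stopped after detecting on-disk corruption:\n\n%v\n\n"+
			"Manual intervention is required before the node can be restarted.\n",
		cause)
	return os.WriteFile(markerPath(storeDir), []byte(msg), 0o644)
}

// checkStartupAllowed returns nil when no marker file exists, and an error
// carrying the marker's contents when startup must be refused.
func checkStartupAllowed(storeDir string) error {
	contents, err := os.ReadFile(markerPath(storeDir))
	if errors.Is(err, fs.ErrNotExist) {
		return nil // no marker: startup may proceed
	}
	if err != nil {
		return err // could not tell whether a marker exists: fail closed
	}
	return fmt.Errorf("startup forbidden by %s:\n%s", markerPath(storeDir), contents)
}

// demo exercises both helpers against a temporary directory and reports
// whether startup was blocked before and after corruption was recorded.
func demo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "store")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	blockedBefore := checkStartupAllowed(dir) != nil
	if err := recordCorruption(dir, errors.New("sstable checksum mismatch")); err != nil {
		return false, false, err
	}
	blockedAfter := checkStartupAllowed(dir) != nil
	return blockedBefore, blockedAfter, nil
}

func main() {
	before, after, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("startup blocked before corruption: %t, after: %t\n", before, after)
}
```

In this sketch the marker is plain text so that an operator who finds it can read why the node refuses to start. Deleting the file after replacing or repairing the node is the manual step that re-enables startup.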