storage: prevent startup after physical corruption #103899

Closed
jbowens opened this issue May 25, 2023 · 1 comment · Fixed by #107828
Labels
A-storage: Relating to our storage engine (Pebble) on-disk storage.
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
E-quick-win: Likely to be a quick win for someone experienced.
O-sre: For issues SRE opened or otherwise cares about tracking.
T-storage: Storage Team

jbowens commented May 25, 2023

Replication-level corruption crashes the node and also writes a special file (base.PreventedStartupFile) that prevents the node from starting up again:

preventStartupMsg := fmt.Sprintf(`ATTENTION:
this node is terminating because replica %s detected an inconsistent state.
Please contact the CockroachDB support team. It is not necessarily safe
to replace this node; cluster data may still be at risk of corruption.
A file preventing this node from restarting was placed at:
%s
`, r, path)
if err := fs.WriteFile(r.store.TODOEngine(), path, []byte(preventStartupMsg)); err != nil {
	log.Warningf(ctx, "%v", err)
}

This is done because the node requires manual intervention. Today, corruption of physical media (e.g., bit rot resulting in node-local data loss) similarly requires manual intervention to replace the node. Although we hope to support automatic recovery from physical corruption using other replicas some day (#67568), in the meantime we require that the node with local corruption be decommissioned and replaced. We should consider writing base.PreventedStartupFile before crashing these nodes, to prevent automatic restarts of a node that we fully expect to crash again.

Jira issue: CRDB-28250

@jbowens jbowens added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team E-quick-win Likely to be a quick win for someone experienced. labels May 25, 2023
@jbowens jbowens self-assigned this May 31, 2023
@jbowens jbowens removed their assignment Jul 24, 2023
@nicktrav

@RahulAggarwal1016 is going to pick this one up. Thanks, Rahul!

raggar added a commit to raggar/cockroach that referenced this issue Jul 28, 2023
Currently if a node faces sstable corruption, that node will crash and
try to automatically restart. Since it is likely that the node may crash
again, we would like to prevent the node from attempting to restart
itself. As a result, this PR creates a `PreventStartupFile` when a node
experiences sstable corruption.

Fixes: cockroachdb#103899
Release-note: None
raggar added a commit to raggar/cockroach that referenced this issue Aug 1, 2023
raggar added a commit to raggar/cockroach that referenced this issue Aug 1, 2023
raggar added a commit to raggar/cockroach that referenced this issue Aug 2, 2023
raggar added a commit to raggar/cockroach that referenced this issue Aug 3, 2023
raggar added a commit to raggar/cockroach that referenced this issue Aug 8, 2023
raggar added a commit to raggar/cockroach that referenced this issue Aug 9, 2023
@joshimhoff joshimhoff added the O-sre For issues SRE opened or otherwise cares about tracking. label Aug 14, 2023
craig bot pushed a commit that referenced this issue Aug 14, 2023
107828: storage: Write PreventStartupFile on Node SSTFile Corruption r=RahulAggarwal1016 a=RahulAggarwal1016

Currently, if a node faces sstable corruption, that node will crash and try to automatically restart. Since the node is likely to crash again, we would like to prevent it from attempting to restart itself. As a result, this PR creates a `PreventStartupFile` when a node experiences sstable corruption.

Note: There is another place where a node crashes due to sstable corruption: https://github.com/cockroachdb/cockroach/blob/ba053dd3ff75dbca557e3b21195ed4cdff250660/pkg/storage/pebble_iterator.go#L958. However, I am not quite sure how to access a `Pebble` instance there without it getting a little messy.

Fixes: #103899
Release-note: None

Co-authored-by: Rahul Aggarwal <[email protected]>
@craig craig bot closed this as completed in bbfd162 Aug 14, 2023
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024