storage,kv: tolerate corruption of sideloaded sstables #91029

Open
jbowens opened this issue Oct 31, 2022 · 3 comments
Labels
A-kv-replication: Relating to Raft, consensus, and coordination.
A-storage: Relating to our storage engine (Pebble) on-disk storage.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-storage: Storage Team

Comments

@jbowens
Collaborator

jbowens commented Oct 31, 2022

If an AddSSTable's sstable is sideloaded and later becomes corrupted (e.g., due to a bad disk), the operator has no recourse other than to replace the node.

This issue is intended to track isolation of corruption of the raft log / sideloaded sstables, in contrast to #67568 which tracks recovery from corruption of already-applied state.

See #90834 for an example.

Jira issue: CRDB-21080

@jbowens jbowens added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team T-kv-replication labels Oct 31, 2022
@blathers-crl

blathers-crl bot commented Oct 31, 2022

cc @cockroachdb/replication

@erikgrinaker
Contributor

This is related to #75903, in that any failure to apply a Raft command will crash the node anyway. The proposed solution there is to cordon the replica (and then discard the faulty replica and upreplicate elsewhere, unless all replicas are faulty), which is likely the preferable approach here as well. That said, if the SST is corrupt then the disks are likely faulty, so we may not want to keep the node running and risk further corruption anyway.


We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
