-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage, kv: replica corruption reporting and recovery #67568
Labels
A-storage
Relating to our storage engine (Pebble) on-disk storage.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-sre
For issues SRE opened or otherwise cares about tracking.
Comments
itsbilal
added
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
A-storage
Relating to our storage engine (Pebble) on-disk storage.
labels
Jul 13, 2021
If there is a Pebble checksum failure we may not want to delete the sstable on disk. That is, we might want to logically delete the sstable from the LSM, but move the sstable to some holding pen. I'm not up to date on what SSDs do nowadays with regard to hardware failures, but it seems safest to not allow the storage corresponding to a checksum failure to be reused. |
itsbilal
added a commit
to itsbilal/cockroach
that referenced
this issue
Aug 3, 2021
This PR proposes a design for responding to sstable corruption by marking matching replicas as corrupt and letting Cockroach replication delete / replace them with data from other nodes. Informs cockroachdb#67568. Release note: None.
itsbilal
added a commit
to itsbilal/cockroach
that referenced
this issue
Aug 24, 2021
This PR proposes a design for responding to sstable corruption by marking matching replicas as corrupt and letting Cockroach replication delete / replace them with data from other nodes. Informs cockroachdb#67568. Release note: None. Release justification:
nicktrav
added a commit
to nicktrav/pebble
that referenced
this issue
Aug 27, 2021
To provide better detection of corruption upon ingestion of an SSTable, add the ability for an `sstable.Reader` to validate the checksums of all blocks in the underlying file for the SSTable. The new `(*sstable.Reader).ValidateBlockChecksums` will be used by a subsequent change to perform validation of an SSTable upon ingestion, with any corruption that may be detected detected passed to the corruption reporting and recovery mechanism outlined in cockroachdb/cockroach#67568. See also: cockroachdb#1203.
nicktrav
added a commit
to nicktrav/pebble
that referenced
this issue
Aug 27, 2021
To provide better detection of corruption upon ingestion of an SSTable, add the ability for an `sstable.Reader` to validate the checksums of all blocks in the underlying file for the SSTable. The new `(*sstable.Reader).ValidateBlockChecksums` will be used by a subsequent change to perform validation of an SSTable upon ingestion, with any corruption that may be detected detected passed to the corruption reporting and recovery mechanism outlined in cockroachdb/cockroach#67568. See also: cockroachdb#1203.
nicktrav
added a commit
to nicktrav/pebble
that referenced
this issue
Aug 30, 2021
To provide better detection of corruption upon ingestion of an SSTable, add the ability for an `sstable.Reader` to validate the checksums of all blocks in the underlying file for the SSTable. The new `(*sstable.Reader).ValidateBlockChecksums` will be used by a subsequent change to perform validation of an SSTable upon ingestion, with any corruption that may be detected detected passed to the corruption reporting and recovery mechanism outlined in cockroachdb/cockroach#67568. See also: cockroachdb#1203.
nicktrav
added a commit
to nicktrav/pebble
that referenced
this issue
Aug 30, 2021
To provide better detection of corruption upon ingestion of an SSTable, add the ability for an `sstable.Reader` to validate the checksums of all blocks in the underlying file for the SSTable. The new `(*sstable.Reader).ValidateBlockChecksums` will be used by a subsequent change to perform validation of an SSTable upon ingestion, with any corruption that may be detected detected passed to the corruption reporting and recovery mechanism outlined in cockroachdb/cockroach#67568. See also: cockroachdb#1203.
itsbilal
added a commit
to itsbilal/cockroach
that referenced
this issue
Sep 2, 2021
This PR proposes a design for responding to sstable corruption by marking matching replicas as corrupt and letting Cockroach replication delete / replace them with data from other nodes. Informs cockroachdb#67568. Release note: None. Release justification:
itsbilal
added a commit
to itsbilal/cockroach
that referenced
this issue
Sep 28, 2021
Adds a new roachtest, sstable-corruption/table, that imports TPCC, then finds an sstable that contains table keys (which are replicated) on one node, then calls `dd` to write random bytes into the middle of them. It then checks that the corrupt node either crashes on restart, or crashes after the tpcc workload is run on it. Informs cockroachdb#67568. Release note: None.
itsbilal
added a commit
to itsbilal/cockroach
that referenced
this issue
Sep 30, 2021
This PR proposes a design for responding to sstable corruption by marking matching replicas as corrupt and letting Cockroach replication delete / replace them with data from other nodes. Informs cockroachdb#67568. Release note: None. Release justification:
itsbilal
added a commit
to itsbilal/cockroach
that referenced
this issue
Oct 1, 2021
This PR proposes a design for responding to sstable corruption by marking matching replicas as corrupt and letting Cockroach replication delete / replace them with data from other nodes. Informs cockroachdb#67568. Release note: None. Release justification:
itsbilal
added a commit
to itsbilal/cockroach
that referenced
this issue
Oct 14, 2021
Adds a new roachtest, sstable-corruption/table, that imports TPCC, then finds an sstable that contains table keys (which are replicated) on one node, then calls `dd` to write random bytes into the middle of them. It then checks that the corrupt node either crashes on restart, or crashes after the tpcc workload is run on it. Informs cockroachdb#67568. Release note: None.
itsbilal
added a commit
to itsbilal/cockroach
that referenced
this issue
Oct 14, 2021
Adds a new roachtest, sstable-corruption/table, that imports TPCC, then finds an sstable that contains table keys (which are replicated) on one node, then calls `dd` to write random bytes into the middle of them. It then checks that the corrupt node either crashes on restart, or crashes after the tpcc workload is run on it. Informs cockroachdb#67568. Release note: None.
itsbilal
added a commit
to itsbilal/cockroach
that referenced
this issue
Oct 15, 2021
Adds a new roachtest, sstable-corruption/table, that imports TPCC, then finds an sstable that contains table keys (which are replicated) on one node, then calls `dd` to write random bytes into the middle of them. It then checks that the corrupt node either crashes on restart, or crashes after the tpcc workload is run on it. Informs cockroachdb#67568. Release note: None.
itsbilal
added a commit
to itsbilal/cockroach
that referenced
this issue
Oct 18, 2021
Adds a new roachtest, sstable-corruption/table, that imports TPCC, then finds an sstable that contains table keys (which are replicated) on one node, then calls `dd` to write random bytes into the middle of them. It then checks that the corrupt node either crashes on restart, or crashes after the tpcc workload is run on it. Informs cockroachdb#67568. Release note: None.
craig bot
pushed a commit
that referenced
this issue
Oct 18, 2021
70845: roachtest: Add sstable-corruption roachtest r=tbg a=itsbilal Adds a new roachtest, sstable-corruption/table, that imports TPCC, then finds an sstable that contains table keys (which are replicated) on one node, then calls `dd` to write random bytes into the middle of them. It then checks that the corrupt node either crashes on restart, or crashes after the tpcc workload is run on it. Informs #67568. Release note: None. Co-authored-by: Bilal Akhtar <[email protected]>
This was referenced Oct 31, 2022
joshimhoff
added
the
O-sre
For issues SRE opened or otherwise cares about tracking.
label
Aug 14, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-storage
Relating to our storage engine (Pebble) on-disk storage.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-sre
For issues SRE opened or otherwise cares about tracking.
This is a meta issue to track the issue of recovering from corrupt sstables.
Currently, cockroach just crashes with an error like the one(s) in
#57934 . From sentry reports,
it's clear that this is a fairly frequent occurrence in the wild.
But we can do better than crashing the node. Cockroach could just mark the corresponding
replica(s) as corrupt, and up-replicate from other nodes, which should hopefully have
clean, non-corrupt copies of those keys. This reduces the impact on the cluster
and reduces the chances that a bunch of sporadic instances of hardware-level
sstable corruption would significantly impact uptime of the entire cluster.
Parts of this project (details TBD):
Jira issue: CRDB-8611
The text was updated successfully, but these errors were encountered: