
storage: background corruption should fatal the node #101101

Closed
jbowens opened this issue Apr 10, 2023 · 1 comment · Fixed by #102252
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. E-quick-win Likely to be a quick win for someone experienced. T-storage Storage Team

Comments

jbowens (Collaborator) commented Apr 10, 2023

Currently, when an iterator encounters a corruption error (e.g., an sstable block checksum failure), the node is fataled with a descriptive error message. However, if a background process like a compaction encounters a corruption error, the error is logged and the node continues. If the background process was a compaction, the same compaction is likely to be retried over and over unproductively. See cockroachdb/pebble#270. See #67568 for automatic recovery from corruption.

We should fatal the process in these background corruption cases.

Jira issue: CRDB-26803
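
For illustration, a minimal sketch of what this could look like. It assumes Pebble's `EventListener.BackgroundError` hook and an `IsCorruptionError` helper, and uses the standard library logger rather than CockroachDB's `util/log`; it is an illustration of the idea, not the actual change.

```go
package storage

import (
	"log"

	"github.com/cockroachdb/pebble"
)

// fatalOnCorruptionListener returns an EventListener whose BackgroundError
// hook exits the process when a background job (e.g. a compaction) reports
// corruption, instead of logging and letting the same compaction retry
// indefinitely.
func fatalOnCorruptionListener() pebble.EventListener {
	return pebble.EventListener{
		BackgroundError: func(err error) {
			if pebble.IsCorruptionError(err) {
				// Crash loudly: retrying the compaction cannot succeed and
				// only lets L0 grow while queries get slower.
				log.Fatalf("background error indicates disk corruption: %v", err)
			}
			log.Printf("pebble background error: %v", err)
		},
	}
}
```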

jbowens added the C-bug, A-storage, and T-storage labels Apr 10, 2023
jbowens added the E-quick-win label Apr 17, 2023
a-robinson (Contributor) commented:
+1 for this, it'd be a big improvement. I saw a situation recently where a node hit a checksum failure during compaction and just kept retrying the same compaction over and over. The whole time there were a bunch of new files accumulating in L0 and operations on the node just kept getting slower and slower as a result. Crashing would have been far preferable.

craig bot pushed a commit that referenced this issue Apr 25, 2023
100929: cdc: fix sarama case sensitivity r=miretskiy a=HonoreDB

Fixes #100706 by lowercasing compression codecs and passing them through transparently to Sarama rather than doing our own validation.

Release note (enterprise change): The kafka_sink_config Compression and RequiredAcks options are now case-insensitive.

102252: storage: fatal on corruption encountered in background r=RaduBerinde a=jbowens

Previously, on-disk corruption would only fatal the node if an iterator observed it. Corruption encountered by a background job like a compaction would not fatal the node. This can result in busily churning through compactions that repeatedly fail, impacting cluster stability and user query latencies.

Now, on-disk corruption results in immediately exiting the node.

Epic: none
Fixes: #101101
Release note (ops change): When local corruption of data is encountered by a background job, a node will now exit immediately.

Co-authored-by: Aaron Zinger <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
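
As a hypothetical usage example, the listener sketched earlier could be wired in when opening a Pebble store. The function name below is made up for illustration, and the real CockroachDB integration lives in `pkg/storage`.

```go
package storage

import "github.com/cockroachdb/pebble"

// openWithCorruptionFatal opens a Pebble store that exits the process on
// background corruption, reusing fatalOnCorruptionListener from the earlier
// sketch. In recent Pebble versions Options.EventListener is a pointer
// field; older versions use a value field, so adjust accordingly.
func openWithCorruptionFatal(dir string) (*pebble.DB, error) {
	el := fatalOnCorruptionListener()
	opts := &pebble.Options{EventListener: &el}
	return pebble.Open(dir, opts)
}
```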
craig bot closed this as completed in 4c5be04 Apr 25, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 25, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 25, 2023
jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024