
storage: background corruption should fatal the node #101101

Closed
jbowens opened this issue Apr 10, 2023 · 1 comment · Fixed by #102252
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. E-quick-win Likely to be a quick win for someone experienced. T-storage Storage Team

Comments

jbowens (Collaborator) commented Apr 10, 2023

Currently, when an iterator encounters a corruption error (e.g., an sstable block checksum failure), the node is fataled with a descriptive error message. However, if a background process like a compaction encounters a corruption error, the error is logged and the node continues. If the background process was a compaction, the same compaction is likely to be retried over and over unproductively. See cockroachdb/pebble#270. See #67568 for automatic recovery from corruption.

We should fatal the process in these background corruption cases.

Jira issue: CRDB-26803
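
For illustration, a minimal sketch of what this could look like. It assumes Pebble's `EventListener.BackgroundError` hook and an `IsCorruptionError` helper, and uses the standard library logger rather than CockroachDB's `util/log`; it is an illustration of the idea, not the actual change.

```go
package storage

import (
	"log"

	"github.com/cockroachdb/pebble"
)

// fatalOnCorruptionListener returns an EventListener whose BackgroundError
// hook exits the process when a background job (e.g. a compaction) reports
// corruption, instead of logging and letting the same compaction retry
// indefinitely.
func fatalOnCorruptionListener() pebble.EventListener {
	return pebble.EventListener{
		BackgroundError: func(err error) {
			if pebble.IsCorruptionError(err) {
				// Crash loudly: retrying the compaction cannot succeed and
				// only lets L0 grow while queries get slower.
				log.Fatalf("background error indicates disk corruption: %v", err)
			}
			log.Printf("pebble background error: %v", err)
		},
	}
}
```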

jbowens added the C-bug, A-storage, and T-storage labels Apr 10, 2023
jbowens added the E-quick-win label Apr 17, 2023
a-robinson (Contributor) commented:
+1 for this, it'd be a big improvement. I saw a situation recently where a node hit a checksum failure during compaction and just kept retrying the same compaction over and over. The whole time there were a bunch of new files accumulating in L0 and operations on the node just kept getting slower and slower as a result. Crashing would have been far preferable.

craig bot pushed a commit that referenced this issue Apr 25, 2023
100929: cdc: fix sarama case sensitivity r=miretskiy a=HonoreDB

Fixes #100706 by lowercasing compression codecs and passing them through transparently to Sarama rather than doing our own validation.

Release note (enterprise change): The kafka_sink_config Compression and RequiredAcks options are now case-insensitive.

102252: storage: fatal on corruption encountered in background r=RaduBerinde a=jbowens

Previously, on-disk corruption would only fatal the node if an iterator observed it. Corruption encountered by a background job like a compaction would not fatal the node. This can result in busily churning through compactions that repeatedly fail, impacting cluster stability and user query latencies.

Now, on-disk corruption results in immediately exiting the node.

Epic: none
Fixes: #101101
Release note (ops change): When local corruption of data is encountered by a background job, a node will now exit immediately.

Co-authored-by: Aaron Zinger <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
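
As a hypothetical usage example, the listener sketched earlier could be wired in when opening a Pebble store. The function name below is made up for illustration, and the real CockroachDB integration lives in `pkg/storage`.

```go
package storage

import "github.com/cockroachdb/pebble"

// openWithCorruptionFatal opens a Pebble store that exits the process on
// background corruption, reusing fatalOnCorruptionListener from the earlier
// sketch. In recent Pebble versions Options.EventListener is a pointer
// field; older versions use a value field, so adjust accordingly.
func openWithCorruptionFatal(dir string) (*pebble.DB, error) {
	el := fatalOnCorruptionListener()
	opts := &pebble.Options{EventListener: &el}
	return pebble.Open(dir, opts)
}
```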
craig bot closed this as completed in 4c5be04 Apr 25, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 25, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 25, 2023
jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024