storage: background corruption should fatal the node #101101
Labels
A-storage
Relating to our storage engine (Pebble) on-disk storage.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
E-quick-win
Likely to be a quick win for someone experienced.
T-storage
Storage Team
Comments
jbowens added the C-bug, A-storage, and T-storage labels on Apr 10, 2023
+1 for this, it'd be a big improvement. I saw a situation recently where a node hit a checksum failure during compaction and just kept retrying the same compaction over and over. The whole time there were a bunch of new files accumulating in L0 and operations on the node just kept getting slower and slower as a result. Crashing would have been far preferable.
craig bot pushed a commit that referenced this issue on Apr 25, 2023
100929: cdc: fix sarama case sensitivity r=miretskiy a=HonoreDB

Fixes #100706 by lowercasing compression codecs and passing them through transparently to Sarama rather than doing our own validation.

Release note (enterprise change): The kafka_sink_config Compression and RequiredAcks options are now case-insensitive.

102252: storage: fatal on corruption encountered in background r=RaduBerinde a=jbowens

Previously, on-disk corruption would only fatal the node if an iterator observed it. Corruption encountered by a background job like a compaction would not fatal the node. This can result in busily churning through compactions that repeatedly fail, impacting cluster stability and user query latencies. Now, on-disk corruption results in immediately exiting the node.

Epic: none
Fixes: #101101

Release note (ops change): When local corruption of data is encountered by a background job, a node will now exit immediately.

Co-authored-by: Aaron Zinger <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
blathers-crl bot pushed a commit that referenced this issue on Apr 25, 2023
Previously, on-disk corruption would only fatal the node if an iterator observed it. Corruption encountered by a background job like a compaction would not fatal the node. This can result in busily churning through compactions that repeatedly fail, impacting cluster stability and user query latencies. Now, on-disk corruption results in immediately exiting the node.

Epic: none
Fixes: #101101

Release note (ops change): When local corruption of data is encountered by a background job, a node will now exit immediately.
Currently, when an iterator encounters a corruption error (e.g., an sstable block checksum failure), the node is fataled with a descriptive error message. However, if a background process like a compaction encounters a corruption error, the error is only logged and the node continues running. If the background process was a compaction, the same compaction is likely to be retried over and over unproductively. See cockroachdb/pebble#270. See #67568 for automatic recovery from corruption.
We should fatal the process in these background corruption cases.
Jira issue: CRDB-26803
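To illustrate the behavior this issue asks for, here is a minimal Go sketch of the pattern: a handler for errors raised by background work that distinguishes corruption from other failures and exits the process rather than logging and retrying. The names here (errCorruption, onBackgroundError) are invented for the sketch and are not the actual CockroachDB or Pebble APIs.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errCorruption stands in for an on-disk corruption error (e.g. an sstable
// block checksum mismatch). In a real engine this error would originate in
// the storage layer; it is constructed here only for illustration.
var errCorruption = errors.New("on-disk corruption")

// onBackgroundError handles errors raised by background work such as
// compactions. Before the fix, corruption reaching this path was only
// logged and the failing compaction was retried indefinitely; the behavior
// requested in this issue is to exit the node immediately instead.
func onBackgroundError(err error) {
	if errors.Is(err, errCorruption) {
		// log.Fatalf logs and then exits the process, i.e. "fatals the node".
		log.Fatalf("background job encountered corruption; exiting: %v", err)
	}
	// Other background errors remain log-and-continue.
	log.Printf("background error (non-fatal): %v", err)
}

func main() {
	// A transient, non-corruption error is logged and the node keeps running.
	onBackgroundError(fmt.Errorf("compaction failed: %w", errors.New("disk temporarily unavailable")))

	// A corruption error terminates the process immediately.
	onBackgroundError(fmt.Errorf("compaction failed: %w", errCorruption))
}
```

Per the release note above, the actual change puts this check on the storage engine's background-error path, so any background job that encounters local corruption now exits the node immediately instead of retrying.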