WAL corruptions are silently ignored #453
@lni You are correct about the Pebble behavior. This is intentional and I believe matches the behavior of RocksDB's
Note that this case can occur without actual hardware corruption due to WAL recycling.
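For illustration, here is a minimal sketch of why a reader can hit an invalid chunk without any hardware corruption when WAL files are recycled. The chunk layout and names below are invented; as I understand it, Pebble's actual recyclable record format differs but similarly stores a log number per chunk so that leftover chunks from the file's previous life can be treated as the end of the log:

```go
package walsketch

import "hash/crc32"

// chunk is a simplified WAL chunk; the layout is invented for this sketch.
type chunk struct {
	logNum  uint32
	crc     uint32
	payload []byte
}

// replayChunks walks the chunks of a (possibly recycled) WAL file. A recycled
// file still contains chunks from its previous life, so a checksum mismatch
// or a stale log number is treated as the logical end of the log rather than
// as proof of hardware corruption; the two cases look identical at this layer.
func replayChunks(chunks []chunk, curLogNum uint32) []chunk {
	var valid []chunk
	for _, c := range chunks {
		if c.crc != crc32.ChecksumIEEE(c.payload) {
			// Torn write, leftover bytes from the recycled file, or real
			// corruption: indistinguishable here.
			break
		}
		if c.logNum != curLogNum {
			// Chunk written by the WAL that previously owned this file.
			break
		}
		valid = append(valid, c)
	}
	return valid
}
```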
@petermattis thanks for the quick response! Will you use pebble to store Raft logs in CockroachDB? For such a use case I guess RocksDB's kAbsoluteConsistency mode is required?
Yes.
Why?
Thanks for the detailed explanation. I spent some time reading various papers/articles on this topic; it is more complicated than I previously knew. My concern is still the hardware corruption part: all records after a corrupted record (say a bit flip) are silently dropped and the corruption itself is never reported. This is going to cause problems for Raft.
It's a good observation. FWIW, RocksDB's
@ajkr Thanks for the info. I think kPointInTimeRecovery is not ideal for Raft. According to SSD vendor specs, uncorrectable bit errors happen at a certain advertised rate (e.g. 1 sector per 10^17 bits read for Intel enterprise SSDs). This can cause some Raft log entries reported as persisted to disappear after a reboot. As mentioned by @petermattis above, that directly violates a core requirement of Raft. Any plan to support something similar to kTolerateCorruptedTailRecords as an option?
@lni Raft lives in a blissful world where data that has been written durably to disk is never later corrupted. CockroachDB adds additional verification of the consistency of Raft replicas on top of Raft, performing these consistency checks once every 24 hours. The story of what happens when there is an inconsistency is still a bit lacking, though an attempt is made to evict the replica in the minority and consider the majority to be correct. This isn't a Raft-level concept, but something CRDB built on top of Raft. Recovery of the RocksDB/Pebble WAL is also only one aspect of Raft replica state. For the most part, the RocksDB/Pebble WAL is never read, since records written to it are also written to sstables. Corruption of a Raft log record written to an sstable is the more likely scenario, given that the sstable data is rewritten many times during compactions and there is only a small window during which a record exists only in the WAL. The Raft library we're using (
Not at this time. We wouldn't want to add support for it until we had a plan to use it for CRDB. I don't currently see how we'd use
@petermattis Regarding just the WAL, where we can't distinguish between a harmless corruption and a harmful one, I wonder if the story becomes slightly simpler when we separate the (Pebble/Rocks)DB instance that stores the state machine, with no WAL, from the raft state, which will have a WAL. The only raft state in the state-machine-db will be the
I'm not seeing any problems with this logic. I'll think about this more, though.
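For illustration, here is a minimal sketch of the split described above, assuming Pebble's DisableWAL option and invented directory parameters and function names:

```go
package walsketch

import "github.com/cockroachdb/pebble"

// openEngines opens the two instances discussed above: the state machine
// lives in an engine with the WAL disabled (it can be rebuilt by reapplying
// raft log entries), while the raft log and HardState live in a separate
// engine that keeps its WAL.
func openEngines(stateDir, raftDir string) (stateDB, raftDB *pebble.DB, err error) {
	stateDB, err = pebble.Open(stateDir, &pebble.Options{DisableWAL: true})
	if err != nil {
		return nil, nil, err
	}
	raftDB, err = pebble.Open(raftDir, &pebble.Options{})
	if err != nil {
		stateDB.Close()
		return nil, nil, err
	}
	return stateDB, raftDB, nil
}
```

Raft log entries and HardState would be written to raftDB with pebble.Sync, while the state machine engine can afford to lose a suffix of writes since it can be rebuilt by reapplying the raft log.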
@petermattis Thanks a lot for the detailed reply above; I spent a bit more time digging deeper on this. I just read this FAST 2018 paper on how to handle and recover from raft log corruptions. It covers quite a few of the issues we discussed in this thread. It also has a taxonomy of possible responses to corruption; kPointInTimeRecovery is basically the 'truncate' approach described in the paper, and the paper provides an example (figure 2) of how such an approach can cause committed entries to disappear from the entire cluster. 'Truncate' is thus marked as unsafe. I won't worry too much about how to recover from a hardware-corrupted log. When there is a checksum mismatch that can be proven to be actual hardware data corruption (e.g. when sync write = true and the next record is complete and valid), I just want to be notified so I can at least crash to ensure safety.
For PRs related to new features that won't be used by CRDB, will they be considered and accepted?
The FAST paper is a good read and I think it highlights that the work involved traverses both the storage layer and the replication layer. Naively skipping over corrupted log records at the storage layer is extremely problematic (which is what
Do you have a proposal for how to do this? The FAST paper uses knowledge at the replication layer to make this determination.
From the README:
My understanding is that kPointInTimeRecovery silently truncates the WAL from any hardware-corrupted record onward, while kTolerateCorruptedTailRecords returns an error on such a record. As explained by @ajkr above, kTolerateCorruptedTailRecords only skips a partially written tail record at the end of the WAL caused by a crash. Anything I missed here?
Section 3.3.3 "Disentangling Crashes and Corruption in Log" of the paper describes how hardware corruption is determined in the storage layer. Could you please shed some light on which part of the paper uses replication-layer info to determine hardware corruption?
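For illustration, a hypothetical sketch of how the two modes differ during replay, as described above. Neither mode exists as a Pebble option; the types and names are invented:

```go
package walsketch

import "errors"

// walRecord is a simplified, already-framed WAL record.
type walRecord struct {
	valid bool   // checksum and framing check out
	torn  bool   // record was cut off mid-write (incomplete tail)
	data  []byte // payload
}

// recoveryMode mirrors the two RocksDB modes discussed above; Pebble exposes
// no such option.
type recoveryMode int

const (
	tolerateCorruptedTailRecords recoveryMode = iota
	pointInTimeRecovery
)

var errCorruptWAL = errors.New("wal: corrupt record before end of log")

// replay applies records until the end of the log. The two modes differ only
// in what they do when they hit an invalid record.
func replay(records []walRecord, mode recoveryMode, apply func(walRecord)) error {
	for i, r := range records {
		if r.valid {
			apply(r)
			continue
		}
		if mode == pointInTimeRecovery {
			// Treat the bad record as EOF and silently drop the rest.
			return nil
		}
		// tolerateCorruptedTailRecords: only a partially written record at
		// the very tail (a torn write from a crash) is dropped; any other
		// mismatch is surfaced to the caller.
		if i == len(records)-1 && r.torn {
			return nil
		}
		return errCorruptWAL
	}
	return nil
}
```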
Fair enough. Thanks for the info and this wonderful project!
Slightly related to this conversation: @petermattis what happens if we use CRDB (be it on RocksDB or Pebble) on a checksumming file system (like ZFS), which will return an error if something wants to read a bad block (instead of returning corrupted data)?
We have marked this issue as stale because it has been inactive for
TODO: try to summarize this discussion here, or in another issue. It isn't clear we want to do anything here, but I've swapped out most state on this discussion and there is enough here to (re)consume that we shouldn't close this issue yet.
@sumeerbhola - given your recent work / investigation in this space, I'm wondering if you have an inclination one way or the other? cc: @tbg @erikgrinaker for visibility, in case there are strong feelings here from a CRDB replication perspective.
There's been one development since this discussion was active: since #871, Pebble has required closed WALs to have clean tails. This guards against corruption in WALs other than the last active one, but does nothing for the most recent WAL. In the most recent WAL, Pebble continues to assume an invalid entry is the end of the log. As I understand it, there are two broad categories of solutions discussed here:
In Cockroach, we do expect missing raft entries to be detected promptly by the Raft implementation, which makes this not a priority from Cockroach's perspective.
How so? The replication layer relies on durability of the storage engine. Silently discarded data would not necessarily be detected and could lead to basically any split-brain scenario if we're exceptionally unlucky.
This was derived from Ben's Slack comment. I don't personally have any context on how likely it is for dropped raft log entries to go undetected.
@tbg split brain because of raft HardState loss? If we only lost raft log entries, am I correct in assuming that we would quickly detect this, so it would not be silent corruption? And for the state machine we can afford to lose a suffix, since we can reapply from the raft log.
Maybe I'm misunderstanding the question, but if a replica loses a tail of raft log entries, this is a classic violation of durability and wouldn't necessarily be detectable. For the most unlikely but clearest example: if all three replicas crash and lose the last log entry 100, then who should detect that? The entry is gone even though it was acked. Note that the ideas in cockroachdb/cockroach#88442 are applicable here. We could say that if there's WAL tail corruption, we put all voters on this node into the mode described in the issue until they're known to have all indexes that they could've possibly acked in their previous life.
Just to clarify a bit, we don't know when there's WAL tail corruption. We can't distinguish between WAL tail corruption and an in-progress but incomplete write. Maybe that's exactly what you're saying? After every crash, on boot, all nodes' voters enter the described mode? Clean exits that closed the WAL by emitting an EOF trailer could skip the process on boot, because it's unambiguous where the true end of the WAL is.
See also: Judging from what was written above, it sounds as though pebble would stop replaying the WAL if there's a corrupted block somewhere in the middle of the committed part of the WAL. That could have bad consequences; are we sure we don't want to at least fail in such cases?
With WAL reuse, there's no way to know what the committed part of the WAL is. There are ways we can reduce the number of scenarios where corruption goes undetected, but given that they can't eliminate them, I think we should leverage replication.
If we began padding WAL writes to block boundaries, like we would do if we began using direct I/O, we could read ahead to see whether any future blocks successfully decode to log entries encoding the same log number. If they do, we've detected corruption in the middle of the WAL. This still does nothing for the case where a section of the tail of the WAL is corrupted, in which case we can't distinguish between an in-progress, unacknowledged write and a committed, acknowledged, but corrupted write. To detect that without internode communication would require a second synchronous fsync, I think.
I no longer think this is possible in the cloud environment as written. My understanding is that on at least one cloud provider, writes to blocks N and N+1 in parallel will not necessarily complete in order; block N+1 may be persisted while block N is not. We'd need to include in each block some record of the offset to which we believe we've already durably synced. Then, if we read ahead and observed an intact block that claimed we durably synced up to a higher offset than we successfully replayed, we'd know corruption occurred.
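For illustration, a hypothetical sketch of that read-ahead check, assuming block-aligned writes and an invented block header that carries the claimed durable offset (nothing like this exists in Pebble's record format today):

```go
package walsketch

// blockHeader is a hypothetical header for a block-aligned WAL write. Each
// block records the offset through which the writer believed the log had
// already been durably synced when the block was written.
type blockHeader struct {
	logNum     uint64
	offset     int64 // file offset of this block
	syncedUpTo int64 // durable offset claimed by the writer at write time
}

// detectMidLogCorruption scans intact blocks found beyond the point where
// replay stopped. If a later block from the same log claims durability past
// the offset we managed to replay, then a block we could not read had been
// acknowledged as durable: corruption, not an in-progress write.
func detectMidLogCorruption(replayedUpTo int64, laterBlocks []blockHeader, curLogNum uint64) bool {
	for _, b := range laterBlocks {
		if b.logNum != curLogNum {
			continue // leftover block from a recycled WAL file
		}
		if b.syncedUpTo > replayedUpTo {
			return true
		}
	}
	return false
}
```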
When replaying the WAL, ErrInvalidChunk is returned if the recorded CRC doesn't match the data (say, due to disk hardware); code is here.
This causes the replayWAL method to stop processing further WAL records without notifying the user of the corruption; code is here.
I'd expect pebble.Open() to return an error for such corruption.
Jira issue: PEBBLE-179
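For illustration, a minimal sketch of the behavior the issue asks for, written against a simplified reader interface rather than Pebble's actual record.Reader API (the interface and error value here are invented stand-ins):

```go
package walsketch

import (
	"errors"
	"io"
)

// errInvalidChunk stands in for the record package's checksum-mismatch error.
var errInvalidChunk = errors.New("wal: invalid chunk (checksum mismatch)")

// walReader is a simplified stand-in for Pebble's record reader: Next returns
// the next record's payload, io.EOF at a clean end of log, or an error.
type walReader interface {
	Next() ([]byte, error)
}

// replayStrict surfaces a checksum mismatch to the caller (where pebble.Open
// could return it as an error) instead of silently truncating the remainder
// of the WAL. Whether a given mismatch is hardware corruption or a harmless
// torn/recycled tail remains ambiguous; this only makes the condition visible.
func replayStrict(r walReader, apply func(payload []byte) error) error {
	for {
		payload, err := r.Next()
		switch {
		case err == io.EOF:
			return nil // clean end of log
		case errors.Is(err, errInvalidChunk):
			return err // report the suspicious chunk
		case err != nil:
			return err
		}
		if err := apply(payload); err != nil {
			return err
		}
	}
}
```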