storage: fix possible raft log panic after fsync error #37102
Conversation
Excellent find!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajkr)
pkg/storage/engine/rocksdb.go, line 717 at r1 (raw file):
s.Unlock()
if r.cfg.Dir != "" && err == nil {
Please add some commentary around this, including mention of the kernel behavior around fsyncs.
Detected with cockroachdb#36989 applied by running `./bin/roachtest run --local '^system-crash/sync-errors=true$'`. With some slight modification to that test's constants it could repro errors like this within a minute:

```
panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?
```

Debugging showed `DBSyncWAL` can be called even after a sync failure. If it returns success any time after it has failed, it will ack writes that aren't recoverable from the WAL. They aren't recoverable because RocksDB stops recovery upon hitting the offset corresponding to the lost write (typically there should be a corruption there). Meanwhile, there are still successfully synced writes at later offsets in the file.

The fix is simple: if `DBSyncWAL` returns an error once, keep track of that error and return it for all future writes.

Release note: None
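For context on why the error has to be sticky: on Linux, a failed fsync can leave the file's dirty pages marked clean, so a later fsync on the same file may report success even though the lost writes never became durable. Below is a minimal sketch of the latch-the-first-error idea; `walSyncer` and `syncFn` are illustrative names, not CockroachDB's actual `rocksdb.go` code, and `syncFn` stands in for the cgo call to `DBSyncWAL`.

```go
// Minimal sketch of the fix described above: once a WAL sync fails, latch the
// error and return it for every later sync instead of calling into the engine
// again. Names here are illustrative, not the actual implementation.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type walSyncer struct {
	mu      sync.Mutex
	syncErr error        // first sync failure, if any; sticky once set
	syncFn  func() error // underlying sync, standing in for DBSyncWAL
}

// Sync returns the latched error if any previous sync failed; otherwise it
// runs the underlying sync and records a failure permanently.
func (w *walSyncer) Sync() error {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.syncErr != nil {
		return w.syncErr
	}
	if err := w.syncFn(); err != nil {
		w.syncErr = err
		return err
	}
	return nil
}

func main() {
	calls := 0
	s := &walSyncer{syncFn: func() error {
		calls++
		if calls == 2 {
			return errors.New("fsync: input/output error")
		}
		return nil
	}}
	for i := 0; i < 4; i++ {
		// Prints <nil> once, then the I/O error three times: after the
		// failure, even syncs that would have "succeeded" are rejected.
		fmt.Println(s.Sync())
	}
}
```

The point is simply that no write gets acknowledged as durable once the WAL has already lost data, which matches the fix described in the commit message.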
Force-pushed from 2ef0a60 to 13a54c0
bors r+

Will check whether Pebble needs the same fix and backport this to 19.1. Also some day I'd like to look into recoverable I/O errors like
37102: storage: fix possible raft log panic after fsync error r=ajkr a=ajkr

Co-authored-by: Andrew Kryczka <[email protected]>
Build succeeded
That is an excellent find (our very own fsyncgate?). Are #32927, #36522, #36567, #36592 all explained by this?

Also, this change definitely needs a release note. I'm also surprised that a unit test wasn't even discussed. If we fix something without adding a test, there ought to be a justification (such as: #37102 repros this and will be merged shortly; I personally would still add the unit test, though).

The ship for the release note on
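For what a unit test of this behavior could look like, here is a hedged sketch that reuses the illustrative `walSyncer` wrapper from the sketch above; the failure is injected through the hypothetical `syncFn` hook rather than through the real engine:

```go
// Verifies the sticky-error behavior: after one injected sync failure, later
// syncs must keep failing even though the underlying sync would now succeed.
package main

import (
	"errors"
	"testing"
)

func TestWALSyncErrorIsSticky(t *testing.T) {
	injected := errors.New("injected fsync failure")
	failNext := true
	s := &walSyncer{syncFn: func() error {
		if failNext {
			failNext = false
			return injected
		}
		return nil
	}}
	if err := s.Sync(); !errors.Is(err, injected) {
		t.Fatalf("expected injected error, got %v", err)
	}
	// The underlying sync would succeed now, but acknowledging it would hide
	// the writes already lost from the WAL.
	if err := s.Sync(); !errors.Is(err, injected) {
		t.Fatalf("expected sticky error, got %v", err)
	}
}
```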
Yes. We'll let this be documented in 19.1.1 when that is released, and then it doesn't also need to be documented in the first 19.2 alpha.
Got it. It is a little bit confusing: for example, if I'm upgrading from 19.1.0 to a 19.2 alpha, it is unclear whether I'm getting this fix. I can try making an empty commit with the release notes errata for master if that's desirable.
Do we have a debug zip for any of those? I couldn't find one. If the dmesg contains messages like "lost async page write" leading up to the crash, then it seems highly likely. If not, I am not sure.
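Checking for those kernel messages is easy to script. A small hypothetical helper (the `dmesg.txt` path is an assumption, e.g. a dmesg capture pulled from the affected node):

```go
// Scan a saved dmesg dump for the "lost async page write" messages mentioned
// above, which would point at a failed writeback before the crash.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("dmesg.txt") // assumed path to a captured dmesg log
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for line := 1; sc.Scan(); line++ {
		if strings.Contains(sc.Text(), "lost async page write") {
			fmt.Printf("dmesg.txt:%d: %s\n", line, sc.Text())
		}
	}
}
```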
Right, will come up with a release note before landing the backports at least.
Thanks!
No, Sentry issues don't provide us with much information (we'd need to implement a version of debug zip without PII... seems pretty hopeless). Our best bet is a user contacting support.