release-2.1: storage: fix possible raft log panic after fsync error #37214
Backport 1/1 commits from #37102.
/cc @cockroachdb/release
Detected with #36989 applied by running:
./bin/roachtest run --local '^system-crash/sync-errors=true$'
With some slight modification to that test's constants it could repro
errors like this within a minute:
Debugging showed that DBSyncWAL can be called even after a sync failure. I
guess if it returns success any time after it fails, it will ack writes that
aren't recoverable from the WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.
The fix is simple. If DBSyncWAL returns an error once, keep track of that
error and return it for all future writes.
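
To illustrate the idea, here is a minimal Go sketch of a "sticky" sync error. It is not the code from this patch: stickySyncer, newStickySyncer, and failOnCall are hypothetical names made up for the example, and the wrapped function merely stands in for whatever performs the WAL sync.

```go
package main

import (
	"errors"
	"sync"
)

// stickySyncer is a hypothetical wrapper around a sync function. Once the
// wrapped function fails, the first error is remembered and returned for
// every later call, so a later "successful" sync can never ack writes that
// sit behind an unrecoverable gap in the log.
type stickySyncer struct {
	mu     sync.Mutex
	syncFn func() error // stand-in for the real WAL sync call
	err    error        // first sync error observed, if any
}

func newStickySyncer(syncFn func() error) *stickySyncer {
	return &stickySyncer{syncFn: syncFn}
}

// Sync returns the remembered error if a previous sync failed. Otherwise it
// calls the wrapped function and records its error, if any.
func (s *stickySyncer) Sync() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.err != nil {
		return s.err
	}
	if err := s.syncFn(); err != nil {
		s.err = err
	}
	return s.err
}

// failOnCall returns a sync function that fails only on its n-th call
// (1-based), simulating a transient fsync error for demonstration.
func failOnCall(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls == n {
			return errors.New("injected fsync failure")
		}
		return nil
	}
}
```

With newStickySyncer(failOnCall(2)), the first Sync succeeds, the second fails, and every call after that keeps returning the same error even though the underlying function would have succeeded again.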
Release note (bug fix): Fixed possible panic while recovering from a WAL
on which a sync operation failed.