Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-2.1: storage: fix possible raft log panic after fsync error #37214

Merged
merged 1 commit into from
Apr 30, 2019

Conversation

ajkr
Copy link
Contributor

@ajkr ajkr commented Apr 30, 2019

Backport 1/1 commits from #37102.

/cc @cockroachdb/release


Detected with #36989 applied by running
./bin/roachtest run --local '^system-crash/sync-errors=true$'.
With some slight modification to that test's constants it could repro
errors like this within a minute:

panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?

Debugging showed DBSyncWAL can be called even after a sync failure.
I guess if it returns success any time after it fails it will ack
writes that aren't recoverable in WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple. If DBSyncWAL returns an error once, keep track of
that error and return it for all future writes.

Release note (bug fix): Fixed possible panic while recovering from a WAL
on which a sync operation failed.

@ajkr ajkr requested a review from a team April 30, 2019 19:04
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Detected with cockroachdb#36989 applied by running
`./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
With some slight modification to that test's constants it could repro
errors like this within a minute:

```
panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?
```

Debugging showed `DBSyncWAL` can be called even after a sync failure.
I guess if it returns success any time after it fails it will ack
writes that aren't recoverable in WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple. If `DBSyncWAL` returns an error once, keep track of
that error and return it for all future writes.

Release note (bug fix): Fixed possible panic while recovering from a WAL
on which a sync operation failed.
@ajkr ajkr force-pushed the backport2.1-37102 branch from 352b702 to 0eb7e95 Compare April 30, 2019 19:10
@ajkr ajkr merged commit dab475d into cockroachdb:release-2.1 Apr 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants