Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: fix possible raft log panic after fsync error #37102

Merged
merged 1 commit into from
Apr 25, 2019

Conversation

ajkr
Copy link
Contributor

@ajkr ajkr commented Apr 24, 2019

Detected with #36989 applied by running ./bin/roachtest run --local '^system-crash/sync-errors=true$'.
With some slight modification to that test's constants it could repro
errors like this within a minute:

panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?

Debugging showed DBSyncWAL can be called even after a sync failure.
I guess if it returns success any time after it fails it will ack
writes that aren't recoverable in WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple. If DBSyncWAL returns an error once, keep track of
that error and return it for all future writes.

Release note: None

@ajkr ajkr requested a review from a team April 24, 2019 22:46
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Excellent find!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajkr)


pkg/storage/engine/rocksdb.go, line 717 at r1 (raw file):

		s.Unlock()

		if r.cfg.Dir != "" && err == nil {

Please add some commentary around this, including mention of the kernel behavior around fsyncs.

Detected with cockroachdb#36989 applied by running `./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
With some slight modification to that test's constants it could repro
errors like this within a minute:

```
panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?
```

Debugging showed `DBSyncWAL` can be called even after a sync failure.
I guess if it returns success any time after it fails it will ack
writes that aren't recoverable in WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple. If `DBSyncWAL` returns an error once, keep track of
that error and return it for all future writes.

Release note: None
@ajkr ajkr force-pushed the fix-fsync-error-raft-panic branch from 2ef0a60 to 13a54c0 Compare April 25, 2019 02:14
@ajkr
Copy link
Contributor Author

ajkr commented Apr 25, 2019

bors r+

Will check whether Pebble needs the same fix and backport this to 19.1. Also some day I'd like to look into recoverable I/O errors like ENOSPC. For example if the user clears up some space after hitting it can cockroach continue?

craig bot pushed a commit that referenced this pull request Apr 25, 2019
37102: storage: fix possible raft log panic after fsync error r=ajkr a=ajkr

Detected with #36989 applied by running `./bin/roachtest run --local '^system-crash/sync-errors=true$'`.
With some slight modification to that test's constants it could repro
errors like this within a minute:

```
panic: tocommit(375) is out of range [lastIndex(374)]. Was the raft log corrupted, truncated, or lost?
```

Debugging showed `DBSyncWAL` can be called even after a sync failure.
I guess if it returns success any time after it fails it will ack
writes that aren't recoverable in WAL. They aren't recoverable because
RocksDB stops recovery upon hitting the offset corresponding to the
lost write (typically there should be a corruption there). Meanwhile,
there are still successfully synced writes at later offsets in the
file.

The fix is simple. If `DBSyncWAL` returns an error once, keep track of
that error and return it for all future writes.

Release note: None

Co-authored-by: Andrew Kryczka <[email protected]>
@craig
Copy link
Contributor

craig bot commented Apr 25, 2019

Build succeeded

@craig craig bot merged commit 13a54c0 into cockroachdb:master Apr 25, 2019
@tbg
Copy link
Member

tbg commented Apr 25, 2019

That is an excellent find (our very own fsyncgate?)

Are #32927, #36522, #36567, #36592 all explained by this?

Also, this change definitely needs a release note. I'm also surprised that a unit test wasn't even discussed. If we fix something without adding a test, there ought to be a justification (such as: #37102 repros this and will be merged shortly, I personally would still add the unit test though).

The ship for the release note on master has sailed (right @bdarnell) but please update #37109 to include a comprehensive release note that lists the kinds of errors potentially explained by this bug. (SEs and Support like to search for errors they see in release notes).

@bdarnell
Copy link
Contributor

The ship for the release note on master has sailed (right @bdarnell)

Yes. We'll let this be documented in 19.1.1 when that is released and then it doesn't also need to be documented in the first 19.2 alpha.

@ajkr
Copy link
Contributor Author

ajkr commented Apr 25, 2019

The ship for the release note on master has sailed (right @bdarnell)

Yes. We'll let this be documented in 19.1.1 when that is released and then it doesn't also need to be documented in the first 19.2 alpha.

Got it. It is a little bit confusing, for example, if I'm upgrading from 19.1.0 to 19.2 alpha, it is unclear whether I'm getting this fix. I can try making an empty commit with the release notes errata for master if it's desirable.

@ajkr
Copy link
Contributor Author

ajkr commented Apr 25, 2019

That is an excellent find (our very own fsyncgate?)

Are #32927, #36522, #36567, #36592 all explained by this?

Do we have debug zip for any of those? Couldn't find it. If the dmesg contains messages like "lost async page write" leading up to the crash then it seems highly likely. If not, I am not sure.

Also, this change definitely needs a release note. I'm also surprised that a unit test wasn't even discussed. If we fix something without adding a test, there ought to be a justification (such as: #37102 repros this and will be merged shortly, I personally would still add the unit test though).

The ship for the release note on master has sailed (right @bdarnell) but please update #37109 to include a comprehensive release note that lists the kinds of errors potentially explained by this bug. (SEs and Support like to search for errors they see in release notes).

Right, will come up with a release note before landing the backports at least.

@tbg
Copy link
Member

tbg commented Apr 25, 2019

Thanks!

Do we have debug zip for any of those?

No, Sentry issues don't provide us with much information (we'd need to impl a version of debug zip without PII... seems pretty hopeless). Our best bet is a user contacting support.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants