db: better handling of background errors #270

Open
petermattis opened this issue Sep 14, 2019 · 7 comments

@petermattis (Collaborator) commented Sep 14, 2019

When a background error occurs during a flush or compaction, Pebble currently just logs the error and retries the operation indefinitely. Depending on the error, this isn't terribly helpful. If the error was due to a lack of disk space, a retry loop makes sense. But if the error is due to a logic bug in Pebble or some sort of corruption, it would be better to place the DB in a read-only mode. This is the strategy used by RocksDB.

Jira issue: PEBBLE-181
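
For concreteness, a minimal sketch (in Go, with entirely hypothetical names, not Pebble's actual API) of the classification described above: keep retrying on out-of-disk, switch to read-only mode on anything that looks like corruption or a logic bug.

```go
// Sketch only: a hypothetical classification of background errors, roughly
// following the RocksDB strategy described above. Out-of-disk errors keep
// the retry loop; anything else moves the DB into read-only mode.
// None of these names are Pebble's actual API.
package sketch

import (
	"errors"
	"syscall"
	"time"
)

type severity int

const (
	// retryable errors (e.g. ENOSPC) leave the DB writable and retry the
	// flush/compaction after a backoff.
	retryable severity = iota
	// fatal errors (corruption, logic bugs) stop new writes; reads continue.
	fatal
)

func classify(err error) severity {
	if errors.Is(err, syscall.ENOSPC) {
		return retryable
	}
	return fatal
}

// handleBackgroundError is a hypothetical hook called when a flush or
// compaction fails in the background.
func handleBackgroundError(err error, setReadOnly func(error), retry func()) {
	switch classify(err) {
	case retryable:
		time.Sleep(time.Second) // back off, then retry the operation
		retry()
	case fatal:
		setReadOnly(err) // subsequent writes fail with err; reads still work
	}
}
```

The key design point is that fatal errors stop accepting writes but leave the read path untouched.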

@petermattis (Collaborator, Author) commented

@rohansuri has volunteered to take a look at this issue.

@rohansuri (Contributor) commented

Overview of RocksDB's handling: https://github.com/facebook/rocksdb/wiki/Background-Error-Handling. I'll work through all the behaviours there and follow up with a plan and any questions I have.

rohansuri added a commit to rohansuri/pebble that referenced this issue Mar 16, 2020
rohansuri added a commit to rohansuri/pebble that referenced this issue May 11, 2020
rohansuri added a commit to rohansuri/pebble that referenced this issue May 11, 2020
rohansuri added a commit to rohansuri/pebble that referenced this issue May 11, 2020
rohansuri added a commit to rohansuri/pebble that referenced this issue Jul 17, 2020
Once we stop panicking on commitPipeline failures and start placing the DB into
read-only mode, all queued writers waiting for their sync need to be notified
of the sync error so that they can run to completion.

See also: cockroachdb#270
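
A rough sketch of the idea in the commit message above: when a sync fails, every queued writer is handed the sync error and unblocked instead of the process panicking. The `writer` and `syncQueue` types here are hypothetical, not Pebble's actual commit pipeline.

```go
// Sketch only: once the commit pipeline stops panicking on a sync failure,
// every queued writer waiting on that sync must observe the error so it can
// unblock and return it to the caller. Names here are hypothetical.
package sketch

import "sync"

type writer struct {
	err  error
	done chan struct{} // closed once the writer's sync outcome is known
}

type syncQueue struct {
	mu      sync.Mutex
	waiting []*writer
}

func (q *syncQueue) enqueue(w *writer) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.waiting = append(q.waiting, w)
}

// syncDone is invoked with the result of an fsync. On error, every queued
// writer is handed the same error instead of blocking forever or crashing
// the process.
func (q *syncQueue) syncDone(err error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for _, w := range q.waiting {
		w.err = err
		close(w.done) // wake the writer; it runs to completion with err set
	}
	q.waiting = q.waiting[:0]
}
```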
rohansuri added two more commits to rohansuri/pebble that referenced this issue Jul 21, 2020, with the same commit message as above
@rohansuri (Contributor) commented

@petermattis Does the scope of this also cover not panicking on MANIFEST write failures and continuing to serve reads? That would require changes along the lines of facebook/rocksdb#6316 and facebook/rocksdb#5379.

@petermattis (Collaborator, Author) commented

> Does the scope of this also cover not panicking on MANIFEST write failures and continuing to serve reads? That would require changes along the lines of facebook/rocksdb#6316 and facebook/rocksdb#5379.

Yes, we'd want to be able to recover from out-of-disk errors during WAL writes, flushes/compactions, and MANIFEST writes.
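
To illustrate the "continue serving reads" half of that, a toy sketch of a store that records a background error which the write path checks while the read path ignores it. The types are hypothetical and not how Pebble is structured.

```go
// Sketch only: one way a store could keep serving reads after a MANIFEST or
// WAL write failure, by recording a background error that write paths check
// and read paths ignore. Hypothetical types, not Pebble's implementation.
package sketch

import "sync"

type store struct {
	mu    sync.RWMutex
	bgErr error             // set once a non-recoverable background error occurs
	kv    map[string][]byte // stand-in for the real LSM
}

func newStore() *store {
	return &store{kv: make(map[string][]byte)}
}

// setBackgroundError places the store in read-only mode.
func (s *store) setBackgroundError(err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.bgErr == nil {
		s.bgErr = err
	}
}

// Set fails once a background error has been recorded.
func (s *store) Set(key string, value []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.bgErr != nil {
		return s.bgErr // writes are rejected in read-only mode
	}
	s.kv[key] = value
	return nil
}

// Get keeps working even in read-only mode.
func (s *store) Get(key string) ([]byte, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.kv[key]
	return v, ok
}
```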

@sumeerbhola (Collaborator) commented

We also need more visible error reporting so that we can quickly find errors and not have to scan a very verbose Pebble log. This came up when discussing a CockroachDB postmortem.
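
One hedged sketch of what that could look like: count background errors in a metric in addition to logging them, via a BackgroundError-style event-listener hook. The listener type below is a hypothetical stand-in, not Pebble's actual EventListener.

```go
// Sketch only: surfacing background errors through a counter in addition to
// the log, so operators don't have to scan a verbose Pebble log. The listener
// shape mirrors a BackgroundError event-listener hook, but the types here are
// hypothetical.
package sketch

import (
	"log"
	"sync/atomic"
)

type eventListener struct {
	// BackgroundError is invoked whenever a background operation (flush,
	// compaction, WAL sync, ...) fails.
	BackgroundError func(error)
}

var backgroundErrorCount atomic.Int64 // exported to metrics/alerting elsewhere

func newListener() eventListener {
	return eventListener{
		BackgroundError: func(err error) {
			backgroundErrorCount.Add(1)             // cheap to alert on
			log.Printf("background error: %v", err) // still logged for detail
		},
	}
}
```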

@github-actions bot commented

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it
in 10 days to keep the issue queue tidy. Thank you for your
contribution to Pebble!

jbowens added a commit to jbowens/pebble that referenced this issue Oct 6, 2023
After a table is ingested, an asynchronous job reads the table and validates
its checksums. If corruption is uncovered, the job fatals the process.
Previously, a transient I/O error would also fatal the process, even if the I/O
error did not indicate corruption. This commit adapts this code path to log the
error to the BackgroundError event listener and re-queue the file for
validation.

Informs cockroachdb#270.
Informs cockroachdb#1115.
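
A small sketch of the decision the commit message above describes: corruption remains fatal, while a transient I/O error is reported and the table is re-queued for validation. All names are hypothetical stand-ins, not Pebble's ingest-validation code.

```go
// Sketch only: treat corruption as fatal, but re-queue the table for
// validation when the error is a transient I/O failure. Hypothetical names.
package sketch

import "log"

type validator struct {
	queue        chan string        // pending table paths to validate
	validate     func(string) error // reads the table and checks its checksums
	isCorruption func(error) bool   // distinguishes corruption from transient errors
}

// run drains the validation queue. Corruption crashes the process; any other
// error (e.g. a transient I/O failure) is reported and the table re-queued.
func (v *validator) run(report func(error)) {
	for path := range v.queue {
		err := v.validate(path)
		if err == nil {
			continue
		}
		if v.isCorruption(err) {
			log.Fatalf("corruption in %s: %v", path, err) // unrecoverable
		}
		report(err) // e.g. forward to the BackgroundError event listener
		// Re-queue asynchronously so a full channel can't block the loop.
		go func(p string) { v.queue <- p }(path)
	}
}
```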
jbowens added two more commits to jbowens/pebble and one further commit that referenced this issue Oct 9, 2023, all with the same commit message as above

@github-actions bot commented Dec 6, 2023

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it
in 10 days to keep the issue queue tidy. Thank you for your
contribution to Pebble!
