db: better handling of background errors #270
@rohansuri has volunteered to take a look at this issue.
Overview of RocksDB's handling: https://github.com/facebook/rocksdb/wiki/Background-Error-Handling. I'll work through all the behaviours and put up a plan along with any questions I have.
Once we stop panicking on commitPipeline failures and start placing DB into read-only mode, all queued writers waiting for their sync need to be notified of the sync error so that they can run to completion. See also: cockroachdb#270
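For illustration, a minimal sketch of the notification pattern described above. The syncQueue and waiter types below are hypothetical stand-ins, not Pebble's actual commit pipeline; the point is only that every queued writer receives the sync error and runs to completion instead of blocking or panicking:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// waiter represents a single writer blocked until its write is synced.
// These names are illustrative, not Pebble's internal types.
type waiter struct {
	done chan error // receives nil on success, or the sync error on failure
}

// syncQueue collects writers waiting for the next WAL sync.
type syncQueue struct {
	mu      sync.Mutex
	waiters []*waiter
}

func (q *syncQueue) enqueue() *waiter {
	w := &waiter{done: make(chan error, 1)}
	q.mu.Lock()
	q.waiters = append(q.waiters, w)
	q.mu.Unlock()
	return w
}

// finish notifies every queued waiter of the sync outcome. On error the
// waiters still run to completion instead of waiting forever.
func (q *syncQueue) finish(err error) {
	q.mu.Lock()
	waiters := q.waiters
	q.waiters = nil
	q.mu.Unlock()
	for _, w := range waiters {
		w.done <- err
	}
}

func main() {
	q := &syncQueue{}
	w := q.enqueue()
	// Simulate a failed WAL sync: propagate the error instead of panicking.
	q.finish(errors.New("wal sync: no space left on device"))
	if err := <-w.done; err != nil {
		fmt.Println("write completed with sync error:", err)
	}
}
```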
@petermattis Does the scope of this also cover not panicking on manifest write failures and continuing to serve reads? That will require the following changes: facebook/rocksdb#6316, facebook/rocksdb#5379.
Yes, we'd want to be able to recover from out-of-disk errors during WAL writes, flushes/compactions, and MANIFEST writes.
We also need more visible error reporting so that we can quickly find errors and not have to scan a very verbose Pebble log. This came up when discussing a CockroachDB postmortem.
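As a rough illustration of what more visible reporting could look like, here is a hedged sketch that routes Pebble's BackgroundError event to a dedicated, easy-to-grep log line. The Options wiring is an assumption: whether Options.EventListener is a value or a pointer has differed between Pebble versions.

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Route background errors to a dedicated log line rather than relying
	// on scanning the full Pebble log. Field name and pointer-vs-value for
	// Options.EventListener may differ by Pebble version.
	listener := pebble.EventListener{
		BackgroundError: func(err error) {
			log.Printf("PEBBLE BACKGROUND ERROR: %v", err)
		},
	}
	opts := &pebble.Options{EventListener: &listener}

	db, err := pebble.Open("demo-db", opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```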
After a table is ingested, an asynchronous job reads the table and validates its checksums. If corruption is uncovered, the job fatals the process. Previously, a transient I/O error would also fatal the process, even if the I/O error did not indicate corruption. This commit adapts this code path to log the error to the BackgroundError event listener and re-queue the file for validation. Informs cockroachdb#270. Informs cockroachdb#1115.
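A simplified sketch of the behavior that commit describes; validateTable, errCorruption, and the re-queue channel are hypothetical stand-ins for Pebble's actual ingest-validation code, not its API:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// errCorruption is a stand-in sentinel; Pebble has its own way of marking
// corruption errors, which this sketch does not reproduce.
var errCorruption = errors.New("checksum mismatch")

// validateTable is a placeholder for reading an ingested sstable and
// verifying its block checksums.
func validateTable(path string) error {
	return nil // pretend validation succeeded
}

// validationLoop mirrors the intended behavior: fatal only on corruption,
// report and re-queue on transient I/O errors.
func validationLoop(pending chan string, reportBackgroundError func(error)) {
	for path := range pending {
		err := validateTable(path)
		switch {
		case err == nil:
			// Validated successfully.
		case errors.Is(err, errCorruption):
			log.Fatalf("ingested table %s is corrupt: %v", path, err)
		default:
			// Transient I/O error: surface it via the BackgroundError
			// listener and try this table again later (illustrative only).
			reportBackgroundError(err)
			go func(p string) {
				time.Sleep(time.Second)
				pending <- p
			}(path)
		}
	}
}

func main() {
	pending := make(chan string, 8)
	pending <- "000123.sst"
	close(pending)
	validationLoop(pending, func(err error) { log.Println("background error:", err) })
}
```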
When a background error occurs during a flush or compaction, Pebble currently just logs the error and retries the operation indefinitely. Depending on the error that occurred, this isn't terribly helpful. If the error was due to lack of disk space, a retry loop can be helpful. But if the error is due to a logic bug in Pebble or some sort of corruption, it would be better to place the DB in a read-only mode. This is the strategy used by RocksDB.
Jira issue: PEBBLE-181
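A rough sketch of that policy, using hypothetical helper names rather than Pebble's internal API: retry transient out-of-disk errors, but switch the DB to read-only mode on corruption or logic bugs:

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
	"time"
)

// errCorruption is an illustrative sentinel for detected corruption.
var errCorruption = errors.New("corruption detected")

// db is a toy stand-in for the DB state relevant to this policy.
type db struct {
	readOnly bool
}

// handleBackgroundError applies the RocksDB-style policy described above:
// out-of-disk errors are treated as transient and retried, while corruption
// (or an unexpected logic error) places the DB into read-only mode.
func (d *db) handleBackgroundError(op string, err error) (retry bool) {
	switch {
	case errors.Is(err, syscall.ENOSPC):
		fmt.Printf("%s failed (out of disk), will retry: %v\n", op, err)
		return true
	default:
		// Corruption or a logic bug: stop accepting writes rather than
		// retrying indefinitely or panicking.
		fmt.Printf("%s failed, entering read-only mode: %v\n", op, err)
		d.readOnly = true
		return false
	}
}

func main() {
	d := &db{}
	// Transient failure: retry after a backoff.
	if d.handleBackgroundError("flush", syscall.ENOSPC) {
		time.Sleep(10 * time.Millisecond) // backoff before retrying
	}
	// Permanent failure: the DB becomes read-only.
	d.handleBackgroundError("compaction", errCorruption)
	fmt.Println("read-only:", d.readOnly)
}
```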