db: better handling of background errors #270

Open
petermattis opened this issue Sep 14, 2019 · 7 comments

@petermattis (Collaborator) commented Sep 14, 2019

When a background error occurs during a flush or compaction, Pebble currently just logs the error and retries the operation indefinitely. Depending on the error, this isn't terribly helpful. If the error was due to a lack of disk space, a retry loop makes sense. But if the error is due to a logic bug in Pebble or some sort of corruption, it would be better to place the DB in a read-only mode. This is the strategy used by RocksDB.

Jira issue: PEBBLE-181
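
For concreteness, a minimal sketch (in Go, with entirely hypothetical names, not Pebble's actual API) of the classification described above: keep retrying on out-of-disk, switch to read-only mode on anything that looks like corruption or a logic bug.

```go
// Sketch only: a hypothetical classification of background errors, roughly
// following the RocksDB strategy described above. Out-of-disk errors keep
// the retry loop; anything else moves the DB into read-only mode.
// None of these names are Pebble's actual API.
package sketch

import (
	"errors"
	"syscall"
	"time"
)

type severity int

const (
	// retryable errors (e.g. ENOSPC) leave the DB writable and retry the
	// flush/compaction after a backoff.
	retryable severity = iota
	// fatal errors (corruption, logic bugs) stop new writes; reads continue.
	fatal
)

func classify(err error) severity {
	if errors.Is(err, syscall.ENOSPC) {
		return retryable
	}
	return fatal
}

// handleBackgroundError is a hypothetical hook called when a flush or
// compaction fails in the background.
func handleBackgroundError(err error, setReadOnly func(error), retry func()) {
	switch classify(err) {
	case retryable:
		time.Sleep(time.Second) // back off, then retry the operation
		retry()
	case fatal:
		setReadOnly(err) // subsequent writes fail with err; reads still work
	}
}
```

The key design point is that fatal errors stop accepting writes but leave the read path untouched.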

@petermattis (Collaborator, Author) commented

@rohansuri has volunteered to take a look at this issue.

@rohansuri (Contributor) commented

Overview of RocksDB's handling: https://github.com/facebook/rocksdb/wiki/Background-Error-Handling. I'll work through all the behaviours there and follow up with a plan and any questions I have.

rohansuri added a commit to rohansuri/pebble that referenced this issue Mar 16, 2020
rohansuri added a commit to rohansuri/pebble that referenced this issue May 11, 2020
rohansuri added a commit to rohansuri/pebble that referenced this issue May 11, 2020
rohansuri added a commit to rohansuri/pebble that referenced this issue May 11, 2020
rohansuri added a commit to rohansuri/pebble that referenced this issue Jul 17, 2020
Once we stop panicking on commitPipeline failures and start placing the DB into
read-only mode, all queued writers waiting for their sync need to be notified
of the sync error so that they can run to completion.

See also: cockroachdb#270
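
A rough sketch of the idea in the commit message above: when a sync fails, every queued writer is handed the sync error and unblocked instead of the process panicking. The `writer` and `syncQueue` types here are hypothetical, not Pebble's actual commit pipeline.

```go
// Sketch only: once the commit pipeline stops panicking on a sync failure,
// every queued writer waiting on that sync must observe the error so it can
// unblock and return it to the caller. Names here are hypothetical.
package sketch

import "sync"

type writer struct {
	err  error
	done chan struct{} // closed once the writer's sync outcome is known
}

type syncQueue struct {
	mu      sync.Mutex
	waiting []*writer
}

func (q *syncQueue) enqueue(w *writer) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.waiting = append(q.waiting, w)
}

// syncDone is invoked with the result of an fsync. On error, every queued
// writer is handed the same error instead of blocking forever or crashing
// the process.
func (q *syncQueue) syncDone(err error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for _, w := range q.waiting {
		w.err = err
		close(w.done) // wake the writer; it runs to completion with err set
	}
	q.waiting = q.waiting[:0]
}
```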
rohansuri added two more commits to rohansuri/pebble that referenced this issue Jul 21, 2020, with the same commit message as above
@rohansuri (Contributor) commented

@petermattis Does the scope of this also cover not panicking on MANIFEST write failures and continuing to serve reads? That would require changes along the lines of facebook/rocksdb#6316 and facebook/rocksdb#5379.

@petermattis (Collaborator, Author) commented

> Does the scope of this also cover not panicking on MANIFEST write failures and continuing to serve reads? That would require changes along the lines of facebook/rocksdb#6316 and facebook/rocksdb#5379.

Yes, we'd want to be able to recover from out-of-disk errors during WAL writes, flushes/compactions, and MANIFEST writes.
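
To illustrate the "continue serving reads" half of that, a toy sketch of a store that records a background error which the write path checks while the read path ignores it. The types are hypothetical and not how Pebble is structured.

```go
// Sketch only: one way a store could keep serving reads after a MANIFEST or
// WAL write failure, by recording a background error that write paths check
// and read paths ignore. Hypothetical types, not Pebble's implementation.
package sketch

import "sync"

type store struct {
	mu    sync.RWMutex
	bgErr error             // set once a non-recoverable background error occurs
	kv    map[string][]byte // stand-in for the real LSM
}

func newStore() *store {
	return &store{kv: make(map[string][]byte)}
}

// setBackgroundError places the store in read-only mode.
func (s *store) setBackgroundError(err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.bgErr == nil {
		s.bgErr = err
	}
}

// Set fails once a background error has been recorded.
func (s *store) Set(key string, value []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.bgErr != nil {
		return s.bgErr // writes are rejected in read-only mode
	}
	s.kv[key] = value
	return nil
}

// Get keeps working even in read-only mode.
func (s *store) Get(key string) ([]byte, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.kv[key]
	return v, ok
}
```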

@sumeerbhola (Collaborator) commented

We also need more visible error reporting so that we can quickly find errors and not have to scan a very verbose Pebble log. This came up when discussing a CockroachDB postmortem.
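
One hedged sketch of what that could look like: count background errors in a metric in addition to logging them, via a BackgroundError-style event-listener hook. The listener type below is a hypothetical stand-in, not Pebble's actual EventListener.

```go
// Sketch only: surfacing background errors through a counter in addition to
// the log, so operators don't have to scan a verbose Pebble log. The listener
// shape mirrors a BackgroundError event-listener hook, but the types here are
// hypothetical.
package sketch

import (
	"log"
	"sync/atomic"
)

type eventListener struct {
	// BackgroundError is invoked whenever a background operation (flush,
	// compaction, WAL sync, ...) fails.
	BackgroundError func(error)
}

var backgroundErrorCount atomic.Int64 // exported to metrics/alerting elsewhere

func newListener() eventListener {
	return eventListener{
		BackgroundError: func(err error) {
			backgroundErrorCount.Add(1)             // cheap to alert on
			log.Printf("background error: %v", err) // still logged for detail
		},
	}
}
```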

@github-actions bot commented

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it
in 10 days to keep the issue queue tidy. Thank you for your
contribution to Pebble!

jbowens added a commit to jbowens/pebble that referenced this issue Oct 6, 2023
After a table is ingested, an asynchronous job reads the table and validates
its checksums. If corruption is uncovered, the job fatals the process.
Previously, a transient I/O error would also fatal the process, even if the I/O
error did not indicate corruption. This commit adapts this code path to log the
error to the BackgroundError event listener and re-queue the file for
validation.

Informs cockroachdb#270.
Informs cockroachdb#1115.
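
A small sketch of the decision the commit message above describes: corruption remains fatal, while a transient I/O error is reported and the table is re-queued for validation. All names are hypothetical stand-ins, not Pebble's ingest-validation code.

```go
// Sketch only: treat corruption as fatal, but re-queue the table for
// validation when the error is a transient I/O failure. Hypothetical names.
package sketch

import "log"

type validator struct {
	queue        chan string        // pending table paths to validate
	validate     func(string) error // reads the table and checks its checksums
	isCorruption func(error) bool   // distinguishes corruption from transient errors
}

// run drains the validation queue. Corruption crashes the process; any other
// error (e.g. a transient I/O failure) is reported and the table re-queued.
func (v *validator) run(report func(error)) {
	for path := range v.queue {
		err := v.validate(path)
		if err == nil {
			continue
		}
		if v.isCorruption(err) {
			log.Fatalf("corruption in %s: %v", path, err) // unrecoverable
		}
		report(err) // e.g. forward to the BackgroundError event listener
		// Re-queue asynchronously so a full channel can't block the loop.
		go func(p string) { v.queue <- p }(path)
	}
}
```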
jbowens added two more commits to jbowens/pebble and one further commit that referenced this issue Oct 9, 2023, all with the same commit message as above

@github-actions bot commented Dec 6, 2023

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it
in 10 days to keep the issue queue tidy. Thank you for your
contribution to Pebble!
