backupccl: restore/pause/* roachtest fails on checkForKeyCollisions error #98779

msbutler · 2023-03-16T15:59:50Z

The restore/pause/tpce/15GB/aws/nodes=4/cpus=8 test on this commit failed after a restore resumed and re-ingested some keys that were already added before the pause. These keys should be allowed to collide when the disallowShadowingBelow option is set to a timestamp of 1. This failure is also likely reproducible on the larger restore/pause/tpce/80GB/aws/nodes=4/cpus=8 test on master. Note: I was testing a crdb SHA from before the checkpoint frontier was added in #97862.

@itsbilal mentioned that some updates were made to checkForKeyCollision code in the past week, so there may have been a bug introduced in these changes.

A debug zip is here.

Jira issue: CRDB-25516

blathers-crl · 2023-03-16T15:59:54Z

cc @cockroachdb/disaster-recovery

msbutler · 2023-03-17T00:43:05Z

With the slimmer restore-with-pause test and a more verbose checkForKeyCollision error, I've verified that the colliding key has a different value than the one initially ingested. The failure showed up on the fifth run of:

roachtest run restore/pause/tpce/15GB/aws/nodes=4/cpus=8 --user=$CLUSTER --cockroach=artifacts/cockroach  --workload artifacts/workload --cloud aws --count 10

RESTORE job 848502650761150465: stepping through state failed with error: importing 146 ranges: addsstable [/Table/116/2/‹43000000016›/‹"TWR"›/‹2006-02-21T15:47:16.309Z›/‹200000017269455›/‹0›,/Table/116/2/‹43000009994›/‹"TRCMPRB"›/‹2006-02-22T10:58:43.105Z›/‹200000017292658›/‹0›/‹NULL›): checking for key collisions: ingested key collides with an existing one: /Table/116/2/‹43000001121›/‹"TWLB"›/‹2006-02-23T14:08:12.808Z›/‹200000017373801›/‹0›; extValue: [‹236› ‹172› ‹155› ‹5› ‹3› ‹85› ‹4› ‹52› ‹138› ‹11› ‹70› ‹19› ‹200› ‹1›]; sstValue [‹112› ‹192› ‹45› ‹117› ‹3› ‹85› ‹4› ‹52› ‹138› ‹11› ‹70› ‹19› ‹144› ‹3›]; timestamp ‹1679012880.841814117,0›

This could be real bad. It implies that the restore_data_processor does not always return the latest mvcc value for a given key.

Next steps:

Check if I can repro this on 22.2. If I can, I'll block the release.
Check if this occurs before some of the generative split and scatter work.
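The two values reported in the error can be compared byte by byte. The Go sketch below is only an illustration: `diffIndexes` is a hypothetical helper, not part of CockroachDB, and the byte slices are copied from the log line above.

```go
package main

import "fmt"

// diffIndexes returns the offsets at which a and b differ. Offsets past the
// end of the shorter slice are reported as differences too.
// This is a hypothetical helper written for this illustration.
func diffIndexes(a, b []byte) []int {
	n := len(a)
	if len(b) > n {
		n = len(b)
	}
	var out []int
	for i := 0; i < n; i++ {
		if i >= len(a) || i >= len(b) || a[i] != b[i] {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	// Raw bytes copied from the checkForKeyCollisions error message.
	extValue := []byte{236, 172, 155, 5, 3, 85, 4, 52, 138, 11, 70, 19, 200, 1}
	sstValue := []byte{112, 192, 45, 117, 3, 85, 4, 52, 138, 11, 70, 19, 144, 3}
	fmt.Println(diffIndexes(extValue, sstValue))
}
```

For these inputs the values differ at offsets 0 through 3 and at offsets 12 and 13. Assuming the encoded value starts with a 4-byte checksum followed by a type tag and the payload, that would suggest the payloads differ in their last two bytes, so the ingested value is not a byte-identical copy of the existing one.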

msbutler · 2023-03-17T02:08:59Z

I've hit the bug on 22.2 as well.

msbutler · 2023-03-17T14:06:38Z

On the sha before the generative split and scatter processor merged, this test is passing under stress. So, I suspect one of the commits described in this backport is leading to this potential data corruption.

itsbilal · 2023-03-20T16:58:30Z

Nice find, thanks for digging into this @msbutler ! Feel free to ping me again if you need more storage eyes on this.

…rtSpans

Currently generateAndSendImportSpans has a bug in which, on resume, the first spans in the covering may be missing files. This happens because when the algorithm iteratively generates a covering for the current span, it needs to also append the covering files from the previous cover span. This becomes an issue when spans below the watermark are skipped, thus not allowing the files covering those spans to be considered as a "previous" cover span.

This patch fixes the issue by never skipping spans even in the presence of a watermark. For the case where generateAndSendImportSpans generates a cover whose span is below the watermark, it just chooses to not send it to the output channel.

Fixes cockroachdb#98779

Release note (bug fix): fixed a bug introduced in 22.2.6 in which a restore job, on resume, can miss files for the first few spans that are being restored.
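The failure mode described in this commit message can be reproduced in a toy model. The sketch below is not the real generateAndSendImportSpans: keys are plain integers, and the `file`, `entry`, and `generate` names, along with the `skipBelow` switch, are assumptions made for illustration. It only shows why skipping spans below the watermark drops files that start in a skipped span but extend past it.

```go
package main

import "fmt"

// file is a hypothetical backup file covering the key range [start, end).
type file struct {
	name       string
	start, end int
}

// entry is a hypothetical import span together with the files that cover it.
type entry struct {
	start, end int
	files      []string
}

// generate walks consecutive boundaries and builds a cover for each span from
// (1) files carried over from the previous cover that still overlap and
// (2) files that start inside the current span.
// With skipBelow=true, spans at or below the watermark are skipped outright,
// which models the bug. With skipBelow=false, every cover is computed and
// covers below the watermark are simply not emitted, which models the fix.
func generate(bounds []int, files []file, watermark int, skipBelow bool) []entry {
	var out []entry
	var prev []file
	for i := 0; i+1 < len(bounds); i++ {
		lo, hi := bounds[i], bounds[i+1]
		if skipBelow && hi <= watermark {
			continue // buggy path: prev never learns about files in this span
		}
		var cur []file
		for _, f := range prev {
			if f.end > lo {
				cur = append(cur, f)
			}
		}
		for _, f := range files {
			if f.start >= lo && f.start < hi {
				cur = append(cur, f)
			}
		}
		prev = cur
		if hi <= watermark {
			continue // fixed path: cover computed but not sent
		}
		e := entry{start: lo, end: hi}
		for _, f := range cur {
			e.files = append(e.files, f.name)
		}
		out = append(out, e)
	}
	return out
}

func main() {
	bounds := []int{0, 10, 20, 30}
	files := []file{{"a", 0, 25}, {"b", 10, 20}, {"c", 20, 30}}
	// The span [0, 10) was completed before the pause, so the watermark is 10.
	fmt.Println("skip below watermark:", generate(bounds, files, 10, true))
	fmt.Println("never skip:          ", generate(bounds, files, 10, false))
}
```

In this model, file a starts in the completed span [0, 10) but extends to key 25. When that span is skipped, the covers for [10, 20) and [20, 30) contain only b and c respectively. When the span is processed but not emitted, a is carried forward into both covers.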

Set the default value of bulkio.restore.use_simple_import_spans to true, as the generative import spans algorithm has a bug that causes it to emit import spans with missing files on resume.

Fixes: cockroachdb#98779

Release note (bug fix): sets bulkio.restore.use_simple_import_spans to true. If the setting is false, restore can miss files from the first few spans on job resume.

msbutler · 2023-03-27T14:19:06Z

removing release blocker as Rui's tests are green #99304

Previously, restore roachtests had little ability to detect data corruption regressions across runs. This patch introduces this ability. Specifically, this commit allows the restore roachtest writer to easily run a stripped fingerprint after a restore, and assert a match to the hardcoded fingerprint in the test spec.

For now, the fingerprint check is only run on the restore roachtests that restore 15GB of data. The check takes about the same amount of time it takes to run the restore (around 3 minutes), so before we use it on larger tests, we ought to consider adding performance improvements to the fingerprinting tool. These tests include:
- restore/nodeShutdown/coordinator
- restore/pause/tpce/15GB/aws/nodes=4/cpus=8 (used to restore 80GB)
- restore/tpce/15GB/aws/nodes=4/cpus=8 (new test)
- restore/nodeShutdown/worker (used to restore 80GB)
- restore/nodeShutdown/coordinator (used to restore 80GB)

This patch also changes the node shutdown tests and the paused restore test to run the smaller 15GB tpce fixture, as it speeds the test run up.

Informs cockroachdb#98779

Release note: none
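The shape of such a check, comparing fingerprints observed after a restore against values hardcoded in a test spec, might look roughly like the sketch below. It is not the roachtest code: `checkFingerprints`, the table names, and the fingerprint strings are all hypothetical, and how the fingerprints are actually computed is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// checkFingerprints compares the fingerprints observed after a restore with
// the expected ones and returns an error listing every table that is missing
// or whose fingerprint differs. Hypothetical helper for illustration only.
func checkFingerprints(expected, actual map[string]string) error {
	var bad []string
	for table, want := range expected {
		got, ok := actual[table]
		if !ok {
			bad = append(bad, fmt.Sprintf("%s: missing", table))
		} else if got != want {
			bad = append(bad, fmt.Sprintf("%s: got %s, want %s", table, got, want))
		}
	}
	if len(bad) == 0 {
		return nil
	}
	sort.Strings(bad) // map iteration order is random; sort for stable output
	return fmt.Errorf("fingerprint mismatch: %v", bad)
}

func main() {
	// Made-up fingerprints. Real values would come from the test spec and
	// from a query run against the restored cluster.
	expected := map[string]string{"tpce.trade": "1234", "tpce.customer": "5678"}
	actual := map[string]string{"tpce.trade": "1234", "tpce.customer": "9999"}
	if err := checkFingerprints(expected, actual); err != nil {
		fmt.Println(err)
	}
}
```

Hardcoding the expected values keeps the check cheap to reason about, at the cost of having to update the spec whenever the fixture changes.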

…ryCover

Add some testing with randomized completed spans to TestRestoreEntryCover. This testing should demonstrate the correctness of generateAndSendImportSpans in the presence of completed spans.

Informs: cockroachdb#98779

Release note: None

98983: server: only set default tenant if login successful r=knz a=dhartunian

Previously, we would always set the default tenant cookie to the default tenant cluster setting regardless of what tenants the user logged-in to successfully. This change ensures that the default tenant selection is only used when the successful logins include that tenant. Otherwise, we select the first tenant from the list of successful logins.

Epic: CRDB-12100

Release note: None

99792: backupccl: fingerprint 15GB restore roachtests r=rhu713 a=msbutler

Previously, restore roachtests had little ability to detect data corruption regressions across runs. This patch introduces this ability. Specifically, this commit allows the restore roachtest writer to easily run a stripped fingerprint after a restore, and assert a match to the hardcoded fingerprint in the test spec.

For now, the fingerprint check is only run on the restore roachtests that restore 15GB of data. The check takes about the same amount of time it takes to run the restore (around 3 minutes), so before we use it on larger tests, we ought to consider adding performance improvements to the fingerprinting tool. These tests include:
- restore/nodeShutdown/coordinator
- restore/pause/tpce/15GB/aws/nodes=4/cpus=8 (used to restore 80GB)
- restore/tpce/15GB/aws/nodes=4/cpus=8 (new test)
- restore/nodeShutdown/worker (used to restore 80GB)
- restore/nodeShutdown/coordinator (used to restore 80GB)

This patch also changes the node shutdown tests and the paused restore test to run the smaller 15GB tpce fixture, as it speeds the test run up.

Informs #98779

Release note: none

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Michael Butler <[email protected]>

99304: backupccl: add test with randomized completed spans to TestRestoreEntryCover r=rhu713 a=rhu713

Add some testing with randomized completed spans to TestRestoreEntryCover. This testing should demonstrate the correctness of generateAndSendImportSpans in the presence of completed spans.

Informs: #98779

Release note: None

100287: kvserver: add `leases.requests.latency` metric r=erikgrinaker a=erikgrinaker

This patch adds a histogram of lease request latencies. It includes all request types (acquisitions, transfers, and extensions) and all outcomes (successes and errors), but only considers the coalesced lease requests regardless of the number of waiters and how long they have been waiting for.

Epic: none

Release note (ops change): Added a metric `leases.requests.latency` recording a histogram of lease request latencies.

100450: roachtest: unskip `acceptance/gossip/peerings` r=erikgrinaker a=erikgrinaker

Addressed by bfed880.

Resolves #96091.
Touches #100213.

Epic: none

Release note: None

Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>

msbutler added the C-bug, T-disaster-recovery, GA-blocker, T-storage, and branch-release-23.1 labels Mar 16, 2023

msbutler assigned itsbilal Mar 16, 2023

blathers-crl bot added the A-storage and A-disaster-recovery labels Mar 16, 2023

msbutler changed the title ~~backupccl: restore/pause/* roachtest fails on false positive checkForKeyCollisions error~~ backupccl: restore/pause/* roachtest fails on checkForKeyCollisions error Mar 16, 2023

msbutler self-assigned this Mar 17, 2023

msbutler unassigned itsbilal Mar 17, 2023

rhu713 mentioned this issue Mar 20, 2023

release-22.2: backupccl: fix missing cover entries on resume in generateAndSendImportSpans #99046

Merged

rhu713 mentioned this issue Mar 20, 2023

release-22.2: revert slim manifests #99066

Merged

rhu713 mentioned this issue Mar 20, 2023

release-22.2: backupccl: default bulkio.restore.use_simple_import_spans to true #99068

Merged

celiala added the blocks-23.1.0-beta.1 label Mar 21, 2023

msbutler removed the blocks-23.1.0-beta.1 label Mar 27, 2023

msbutler mentioned this issue Mar 28, 2023

backupccl: fingerprint 15GB restore roachtests #99792

Merged

blathers-crl bot mentioned this issue Mar 30, 2023

release-23.1: backupccl: fingerprint 15GB restore roachtests #100124

Merged

blathers-crl bot mentioned this issue Apr 3, 2023

release-23.1: backupccl: add test with randomized completed spans to TestRestoreEntryCover #100473

Merged

dt closed this as completed Apr 5, 2023

msbutler added the S-0-corruption-or-data-loss label May 5, 2023

rytaft added the C-technical-advisory and branch-release-22.2 labels and removed the branch-release-23.1 and GA-blocker labels Dec 6, 2023

github-project-automation bot added this to Disaster Recovery Backlog Aug 28, 2024

github-project-automation bot moved this to Done in Disaster Recovery Backlog Aug 28, 2024
