backupccl: restore/pause/* roachtest fails on checkForKeyCollisions error #98779
Comments
cc @cockroachdb/disaster-recovery
With the slimmer restore-with-pause test and a more verbose checkForKeyCollisions error, I've verified that the colliding key has a different value than the one initially ingested. On the fifth run of:
This could be real bad. It implies that the restore_data_processor does not always return the latest mvcc value for a given key. Next steps:
I've hit the bug on 22.2 as well.
Nice find, thanks for digging into this @msbutler! Feel free to ping me again if you need more storage eyes on this.
…rtSpans

Currently, generateAndSendImportSpans has a bug in which, on resume, the first spans in the covering may be missing files. This happens because, when the algorithm iteratively generates a covering for the current span, it also needs to append the covering files from the previous cover span. This becomes an issue when spans below the watermark are skipped, because the files covering those spans are then never considered as part of a "previous" cover span.

This patch fixes the issue by never skipping spans, even in the presence of a watermark. When generateAndSendImportSpans generates a cover whose span is below the watermark, it simply does not send it to the output channel.

Fixes cockroachdb#98779

Release note (bug fix): fixed a bug introduced in 22.2.6 in which a restore job, on resume, can miss files for the first few spans that are being restored.
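The resume bug described in this commit message can be illustrated with a toy model. The sketch below is not the backupccl code: the `span` and `file` types, the integer keys, the `cover` function, and the `skipBelowWatermark` flag are all hypothetical names made up for illustration. It assumes that a file is discovered directly only by the span containing its start key, and that later spans learn about it solely through the previous span's cover.

```go
package main

import "fmt"

// span is a half-open key range [start, end) over integer keys (toy model).
type span struct{ start, end int }

// file is a hypothetical backup file covering the key range [start, end).
type file struct {
	name       string
	start, end int
}

// Example data: file "a" starts in the first span and extends into the second.
var demoSpans = []span{{0, 10}, {10, 20}, {20, 30}}
var demoFiles = []file{{"a", 0, 15}, {"b", 12, 25}, {"c", 20, 30}}

// cover returns, keyed by span start, the files needed to restore each span
// that is not already completed (i.e. not entirely below the watermark).
// With skipBelowWatermark=true, completed spans are skipped outright, which
// loses the carry-over of files from the previous cover (the bug).
// With skipBelowWatermark=false, every span is processed, but completed
// spans are not emitted (the fix described in the commit message).
func cover(spans []span, files []file, watermark int, skipBelowWatermark bool) map[int][]string {
	out := make(map[int][]string)
	var prev []file
	for _, sp := range spans {
		completed := sp.end <= watermark
		if completed && skipBelowWatermark {
			continue // buggy path: prev is never populated for this span
		}
		var cur []file
		// Carry over files from the previous cover that still overlap this span.
		for _, f := range prev {
			if f.end > sp.start {
				cur = append(cur, f)
			}
		}
		// Pick up files whose start key falls inside this span.
		for _, f := range files {
			if f.start >= sp.start && f.start < sp.end {
				cur = append(cur, f)
			}
		}
		prev = cur
		if completed {
			continue // fixed path: cover computed, just not emitted
		}
		names := make([]string, 0, len(cur))
		for _, f := range cur {
			names = append(names, f.name)
		}
		out[sp.start] = names
	}
	return out
}

func main() {
	fmt.Println("skipping completed spans:", cover(demoSpans, demoFiles, 10, true))
	fmt.Println("never skipping spans:    ", cover(demoSpans, demoFiles, 10, false))
}
```

In this model, with a watermark of 10, the skipping variant omits file `a` from the cover of span [10, 20), while the non-skipping variant includes it and still emits nothing for the completed span [0, 10).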
Set the default value of bulkio.restore.use_simple_import_spans to true, as the generative import spans algorithm has a bug that causes it to emit import spans with missing files on resume.

Fixes: cockroachdb#98779

Release note (bug fix): sets bulkio.restore.use_simple_import_spans to true. If the setting is false, restore can miss files for the first few spans on job resume.
removing release blocker as Rui's tests are green #99304 |
Previously, restore roachtests had little ability to detect data corruption regressions across runs. This patch introduces this ability. Specifically, this commit allows the restore roachtest writer to easily run a stripped fingerprint after a restore, and assert a match to the hardcoded fingerprint in the test spec.

For now, the fingerprint check is only run on the restore roachtests that restore 15GB of data. The check takes about the same amount of time it takes to run the restore (around 3 minutes), so before we use it on larger tests, we ought to consider adding performance improvements to the fingerprinting tool. These tests include:
- restore/nodeShutdown/coordinator
- restore/pause/tpce/15GB/aws/nodes=4/cpus=8 (used to restore 80GB)
- restore/tpce/15GB/aws/nodes=4/cpus=8 (new test)
- restore/nodeShutdown/worker (used to restore 80GB)
- restore/nodeShutdown/coordinator (used to restore 80GB)

This patch also changes the node shutdown tests and the paused restore test to run the smaller 15GB tpce fixture, as it speeds the test run up.

Informs cockroachdb#98779

Release note: none
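The idea of asserting a post-restore fingerprint against a hardcoded value can be sketched as follows. This is only an illustration under stated assumptions: it does not use the actual fingerprinting tool or the roachtest framework, and the `fingerprint` and `checkFingerprint` functions, the in-memory `map[string]string` standing in for restored data, and the XOR-of-FNV hashing scheme are all hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fingerprint computes an order-independent hash of a set of key/value rows
// by XOR-ing a per-row FNV-64a hash. This is a stand-in for a real
// fingerprinting tool, not the one used by the restore roachtests.
func fingerprint(rows map[string]string) uint64 {
	var fp uint64
	for k, v := range rows {
		h := fnv.New64a()
		h.Write([]byte(k))
		h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") differ
		h.Write([]byte(v))
		fp ^= h.Sum64()
	}
	return fp
}

// checkFingerprint returns an error if the restored data does not match the
// fingerprint recorded in the (hypothetical) test spec.
func checkFingerprint(restored map[string]string, want uint64) error {
	if got := fingerprint(restored); got != want {
		return fmt.Errorf("fingerprint mismatch: got %x, want %x", got, want)
	}
	return nil
}

func main() {
	original := map[string]string{"k1": "v1", "k2": "v2"}
	want := fingerprint(original) // in a real test this would be hardcoded

	restored := map[string]string{"k2": "v2", "k1": "v1"}
	fmt.Println("clean restore:", checkFingerprint(restored, want))

	corrupted := map[string]string{"k1": "v1", "k2": "v3"}
	fmt.Println("corrupted restore:", checkFingerprint(corrupted, want))
}
```

A check of this shape would flag a restore that ingested a different value for an existing key, which is the class of regression this issue uncovered.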
…ryCover Add some testing with randomized completed spans to TestRestoreEntryCover. This testing should demonstrate the correctness of generateAndSendImportSpans in the presence of completed spans. Informs: cockroachdb#98779 Release note: None
98983: server: only set default tenant if login successful r=knz a=dhartunian

Previously, we would always set the default tenant cookie to the default tenant cluster setting regardless of what tenants the user logged-in to successfully. This change ensures that the default tenant selection is only used when the successful logins include that tenant. Otherwise, we select the first tenant from the list of successful logins.

Epic: CRDB-12100
Release note: None

99792: backupccl: fingerprint 15GB restore roachtests r=rhu713 a=msbutler

Previously, restore roachtests had little ability to detect data corruption regressions across runs. This patch introduces this ability. Specifically, this commit allows the restore roachtest writer to easily run a stripped fingerprint after a restore, and assert a match to the hardcoded fingerprint in the test spec.

For now, the fingerprint check is only run on the restore roachtests that restore 15GB of data. The check takes about the same amount of time it takes to run the restore (around 3 minutes), so before we use it on larger tests, we ought to consider adding performance improvements to the fingerprinting tool. These tests include:
- restore/nodeShutdown/coordinator
- restore/pause/tpce/15GB/aws/nodes=4/cpus=8 (used to restore 80GB)
- restore/tpce/15GB/aws/nodes=4/cpus=8 (new test)
- restore/nodeShutdown/worker (used to restore 80GB)
- restore/nodeShutdown/coordinator (used to restore 80GB)

This patch also changes the node shutdown tests and the paused restore test to run the smaller 15GB tpce fixture, as it speeds the test run up.

Informs #98779
Release note: none

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
99304: backupccl: add test with randomized completed spans to TestRestoreEntryCover r=rhu713 a=rhu713

Add some testing with randomized completed spans to TestRestoreEntryCover. This testing should demonstrate the correctness of generateAndSendImportSpans in the presence of completed spans.

Informs: #98779
Release note: None

100287: kvserver: add `leases.requests.latency` metric r=erikgrinaker a=erikgrinaker

This patch adds a histogram of lease request latencies. It includes all request types (acquisitions, transfers, and extensions) and all outcomes (successes and errors), but only considers the coalesced lease requests regardless of the number of waiters and how long they have been waiting for.

Epic: none
Release note (ops change): Added a metric `leases.requests.latency` recording a histogram of lease request latencies.

100450: roachtest: unskip `acceptance/gossip/peerings` r=erikgrinaker a=erikgrinaker

Addressed by bfed880.

Resolves #96091.
Touches #100213.

Epic: none
Release note: None

Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
The restore/pause/tpce/15GB/aws/nodes=4/cpus=8 test on this commit failed after a restore resumed and re-ingested some keys that had already been added before the pause. These keys should be allowed to collide when the disallowShadowingBelow option is set to a timestamp of 1. This failure is also likely reproducible on the larger restore/pause/tpce/80GB/aws/nodes=4/cpus=8 test on master. Note: I was testing a crdb SHA from before the checkpoint frontier was added in #97862.

@itsbilal mentioned that some updates were made to the checkForKeyCollisions code in the past week, so a bug may have been introduced by these changes.
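The collision behavior this report relies on can be modeled roughly as below. This is a simplified sketch based only on what the thread describes (re-ingesting the same value after a resume should be tolerated, a different value should not); it is not the real checkForKeyCollisions, which works on MVCC iterators and handles more cases. The `versionedValue` type, the `checkCollision` function, and the in-memory map are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// versionedValue is a hypothetical stand-in for an MVCC value at a timestamp.
type versionedValue struct {
	ts    int64
	value []byte
}

// checkCollision models, under the assumptions stated above, whether writing
// `incoming` at `key` should be rejected. An existing key may be re-ingested
// only if its timestamp is at or above disallowShadowingBelow and the incoming
// value is byte-for-byte identical (an idempotent retry).
func checkCollision(
	existing map[string]versionedValue,
	key string,
	incoming versionedValue,
	disallowShadowingBelow int64,
) error {
	cur, ok := existing[key]
	if !ok {
		return nil // no existing key, nothing to collide with
	}
	if cur.ts >= disallowShadowingBelow && bytes.Equal(cur.value, incoming.value) {
		return nil // same value re-ingested after a resume: allowed
	}
	return fmt.Errorf("key collision on %q: existing value %q, incoming value %q",
		key, cur.value, incoming.value)
}

func main() {
	existing := map[string]versionedValue{"k": {ts: 5, value: []byte("v")}}

	// Re-ingesting the same value with the threshold set to 1 is allowed.
	fmt.Println(checkCollision(existing, "k", versionedValue{ts: 5, value: []byte("v")}, 1))

	// A different value for the same key is reported as a collision, which is
	// the situation observed in this issue.
	fmt.Println(checkCollision(existing, "k", versionedValue{ts: 5, value: []byte("w")}, 1))
}
```

Under this model, a collision error on resume implies the second ingestion produced a different value for the key, which is why the failure points at the restore producing inconsistent data rather than at the collision check itself.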
A debug zip is here.
Jira issue: CRDB-25516