
backupccl: investigate if restore on retry gradually slows down because of CheckForKeyCollisions #81116

Closed
adityamaru opened this issue May 6, 2022 · 6 comments
Labels
A-disaster-recovery, C-bug, T-disaster-recovery

Comments

@adityamaru
Contributor

adityamaru commented May 6, 2022

In a support issue, we noticed that CPU profiles from several nodes in a cluster (v21.2.10) running a restore of a few TB were dominated by CheckForKeyCollisions.

[Screenshots, May 6, 2022: CPU profiles dominated by CheckForKeyCollisions]

Restore ingests into an empty keyspace, so the expectation is that this check is essentially a no-op. If, however, the restore hits a transient error or some condition that makes it retry from the last checkpoint, we can end up ingesting data into a non-empty keyspace. In that case, the check has the potential to become increasingly expensive. In the support issue, we watched the restore slow to a crawl.
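For intuition on the cost, here is a rough sketch of the kind of merge-style scan such a collision check performs (simplified types and made-up names, not the real storage API); the work grows with however much already-ingested data overlaps the incoming SST:

package main

import (
    "bytes"
    "fmt"
)

// kv is a simplified key/value pair; the real check also considers MVCC
// timestamps, omitted here for brevity.
type kv struct {
    key, value []byte
}

// firstCollision walks two sorted streams -- the existing engine data that
// overlaps the SST's span, and the incoming SST -- and reports the first key
// present in both with a different value. Exact duplicates are tolerated,
// which is why a perfectly re-ingested span is harmless but still costly to
// verify: every overlapping existing key must be visited.
func firstCollision(existing, sst []kv) ([]byte, bool) {
    i, j := 0, 0
    for i < len(existing) && j < len(sst) {
        switch cmp := bytes.Compare(existing[i].key, sst[j].key); {
        case cmp < 0:
            i++ // existing key not in the SST
        case cmp > 0:
            j++ // SST key not in the existing data
        default:
            if !bytes.Equal(existing[i].value, sst[j].value) {
                return existing[i].key, true
            }
            i++
            j++
        }
    }
    return nil, false
}

func main() {
    existing := []kv{{[]byte("a"), []byte("1")}, {[]byte("b"), []byte("2")}}
    sst := []kv{{[]byte("b"), []byte("2")}, {[]byte("c"), []byte("3")}}
    key, found := firstCollision(existing, sst)
    fmt.Println(key, found) // [] false: "b" perfectly shadows the existing key
}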

We have investigated similar slowdowns in out-of-order imports (#66410) and are working on optimizations in #80980. Since restore and import share the ingestion pipeline, this issue will benefit from the work being done for import, but we should attempt to reproduce what we saw in the support issue by forcing a restore to retry at different stages and checking whether it has a long tail because of the SST collision check.

Jira issue: CRDB-15138

Epic CRDB-20916

@adityamaru adityamaru added the C-bug and A-disaster-recovery labels May 6, 2022
@blathers-crl

blathers-crl bot commented May 6, 2022

cc @cockroachdb/bulk-io

@mwang1026

It seemed to complete once they took away the secondary indexes. Maybe a red herring -- it could point to secondary index overlaps, and/or nefarious behavior near disk full, and/or something else completely. @smcvey were we able to pull a CPU profile during the modified IMPORT?

@adityamaru
Contributor Author

adityamaru commented May 23, 2022

I started looking into this as part of support#1539 and I think there is some smoke here. Just dumping some initial findings.

Basic Repro Steps

# Build a v21.2.7 release binary.
./build/builder.sh mkrelease
# Create and start a 3-node cluster, then open a SQL shell on node 1.
roachprod create adityamaru-test -n 3
roachprod put adityamaru-test cockroach-linux-2.6.32-gnu-amd64 cockroach
roachprod start adityamaru-test
roachprod sql adityamaru-test:1

-- In the SQL shell:
CREATE DATABASE bank;
RESTORE bank.bank FROM 'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=100,ranges=10,rows=100000000,seed=1/bank?AUTH=implicit';

Vanilla Run:
00:01:53

Run with 2 pauses at 25% and 50%:
00:11:26

I instrumented the binary to better understand restore's checkpointing logic, in particular https://github.com/cockroachdb/cockroach/blob/v21.2.7/pkg/ccl/backupccl/restore_job.go#L518. In short, whenever a RestoreSpanEntry is ingested we receive a progress update. At planning time, each span entry is assigned a progress idx, and when the span completes it marks its progress idx as completed in a slice. We then iterate over this slice and find the maximum progress idx of a contiguous run, i.e. the high watermark of this restore job. On resume, the restore uses this high watermark as the low watermark below which no spans read from the backup need to be reprocessed.
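To make that concrete, here is a minimal sketch (hypothetical code, not the actual restore_job.go logic) of how a high watermark derived from a slice of completed indexes stalls on the first incomplete entry, no matter how many later spans have finished:

package main

import "fmt"

// highWater returns the count of leading contiguous completed entries, i.e.
// the index below which every span has been ingested, plus the number of
// completed entries beyond it (the "missed count" in the logs below).
func highWater(completed []bool) (hw, missed int) {
    for hw < len(completed) && completed[hw] {
        hw++
    }
    for _, done := range completed[hw:] {
        if done {
            missed++
        }
    }
    return hw, missed
}

func main() {
    // 389 spans, all ingested except span 6.
    completed := make([]bool, 389)
    for i := range completed {
        completed[i] = true
    }
    completed[6] = false

    hw, missed := highWater(completed)
    fmt.Printf("high water %d, missed %d of %d\n", hw, missed, len(completed))
    // Prints: high water 6, missed 382 of 389
}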

In a vanilla run:

[n1,job=764273104410968065] 606 amaru: resuming with import spans 389

We start with 389 spans.

220523 12:15:55.594275 7057 ccl/backupccl/restore_job.go:525 ⋮ [n1,job=764273104410968065] 2795 amaru: 389 / 389 requests completed
220523 12:15:55.594320 7057 ccl/backupccl/restore_job.go:538 ⋮ [n1,job=764273104410968065] 2797 amaru: missed count is now 340

340 of the 389 progress updates do not result in the high watermark being forwarded, because of the order in which they were received. While this might not be very useful on its own, it indicates that we can easily get into a situation where we've ingested far more than the high watermark reflects. Let's see how this behaves in a paused and resumed restore.

In a run paused a couple of times:

[n1,job=764274209991655425] 606 amaru: resuming with import spans 389

Starts with 389 spans.

220523 12:20:15.092579 31051 ccl/backupccl/restore_job.go:525 ⋮ [n1,job=764274209991655425] 3835 amaru: 144 / 389 requests completed
220523 12:20:15.092598 31051 ccl/backupccl/restore_job.go:537 ⋮ [n1,job=764274209991655425] 3836 amaru: high water is now 6
220523 12:20:15.092609 31051 ccl/backupccl/restore_job.go:538 ⋮ [n1,job=764274209991655425] 3837 amaru: missed count is now 138

At the first pause, we've received 144 updates but only 6 contiguous ones, so our high watermark is only 6. We're effectively throwing away 138 restore spans that have already been ingested?

220523 12:21:36.199695 42063 ccl/backupccl/restore_job.go:431 ⋮ [n1,job=764274209991655425] 3913 amaru: resuming with import spans 383
220523 12:25:48.092775 42097 ccl/backupccl/restore_job.go:525 ⋮ [n1,job=764274209991655425] 5028 amaru: 167 / 383 requests completed
220523 12:25:48.092790 42097 ccl/backupccl/restore_job.go:537 ⋮ [n1,job=764274209991655425] 5029 amaru: high water is now 104
220523 12:25:48.092800 42097 ccl/backupccl/restore_job.go:538 ⋮ [n1,job=764274209991655425] 5030 amaru: missed count is now 142

At the second pause, we received 167 progress updates but only progressed the high watermark to 104.

220523 12:25:54.195965 54352 ccl/backupccl/restore_job.go:431 ⋮ [n1,job=764274209991655425] 5043 amaru: resuming with import spans 301

I have to step away for a bit, but the next step for me is to instrument checkForKeyCollisions to aggregate the time we spend in it across pauses. Eyeballing some durations, we go from spending on the order of microseconds in checkForKeyCollisions to seconds after a pause, which could explain part of the slowdown.
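For reference, the instrumentation can be as simple as the following sketch (made-up counter names, not the actual SSTBatcher fields), which accumulates the time spent in the check and logs it next to the send wait:

package main

import (
    "log"
    "sync/atomic"
    "time"
)

// cfkWaitNanos accumulates time spent in the collision check across all
// AddSSTable requests sent by a batcher; sendWaitNanos would accumulate the
// total request latency. Both are hypothetical counters for illustration.
var cfkWaitNanos, sendWaitNanos int64

// timedCheckForKeyCollisions wraps a (hypothetical) collision check and adds
// its duration to the running total.
func timedCheckForKeyCollisions(check func() error) error {
    start := time.Now()
    err := check()
    atomic.AddInt64(&cfkWaitNanos, int64(time.Since(start)))
    return err
}

// logWaits mirrors the "sendWait"/"cfkWait" log lines quoted below.
func logWaits() {
    log.Printf("sendWait %s", time.Duration(atomic.LoadInt64(&sendWaitNanos)))
    log.Printf("cfkWait %s", time.Duration(atomic.LoadInt64(&cfkWaitNanos)))
}

func main() {
    _ = timedCheckForKeyCollisions(func() error {
        time.Sleep(10 * time.Millisecond) // stand-in for the real check
        return nil
    })
    logWaits()
}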

@adityamaru
Contributor Author

I disabled disallowShadowing and the restore after a pause took 3m45s, compared to 7m31s with disallowShadowing enabled.

Every restore span we ingest uses its own SSTBatcher; here are some entries before the pause:

220523 20:11:15.055326 40482 kv/bulk/sst_batcher.go:276 ⋮ [n3] 1091 amaru: sendWait ‹198.204222ms›
220523 20:11:15.055353 40482 kv/bulk/sst_batcher.go:277 ⋮ [n3] 1092 amaru: cfkWait ‹120.252µs›

Here are some entries after the pause. I'm still curious where the other part of the sendWait is coming from -- maybe the concurrent AddSSTable throttler?

220523 20:17:38.970415 48746 kv/bulk/sst_batcher.go:276 ⋮ [n3] 1813 amaru: sendWait ‹6.174582174s›
220523 20:17:38.970456 48746 kv/bulk/sst_batcher.go:277 ⋮ [n3] 1814 amaru: cfkWait ‹3.648934212s›

220523 20:17:48.016196 48746 kv/bulk/sst_batcher.go:276 ⋮ [n3] 1826 amaru: sendWait ‹7.013382731s›
220523 20:17:48.016262 48746 kv/bulk/sst_batcher.go:277 ⋮ [n3] 1827 amaru: cfkWait ‹3.654787309s›

@adityamaru
Contributor Author

Proposal:

We switch restore progress to use a span frontier instead of the slice of progress indexes it uses today. The first time we run a restore job, we compute the importSpans based on the backups we are restoring. These spans are added to a span frontier with their timestamps set to 0; the frontier essentially "watches" the spans we need to restore.

Every time we receive a progress update from the restore data processor for a particular span, we forward its timestamp in the frontier to the restore's AOST, essentially marking it as restored up to that timestamp. On subsequent job resumptions we fetch the frontier entries whose timestamp is less than the restore AOST and consider those the remaining work. This way we avoid redoing work just because a lagging restore span has held up the high watermark, as it does in the current implementation.
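
A minimal sketch of the bookkeeping (using a simplified map keyed by span rather than the real span frontier utility):

package main

import "fmt"

// span and timestamp are simplified stand-ins for roachpb.Span and
// hlc.Timestamp; the real implementation would use the span frontier
// utility, but the bookkeeping is the same.
type span struct{ start, end string }
type timestamp int64

// restoreFrontier tracks, per import span, the timestamp up to which it has
// been restored. Zero means "not restored yet".
type restoreFrontier struct {
    progress map[span]timestamp
}

func newRestoreFrontier(importSpans []span) *restoreFrontier {
    f := &restoreFrontier{progress: make(map[span]timestamp)}
    for _, sp := range importSpans {
        f.progress[sp] = 0 // watch every span we need to restore
    }
    return f
}

// forward marks a span as restored up to ts when its restore data processor
// reports completion.
func (f *restoreFrontier) forward(sp span, ts timestamp) {
    if ts > f.progress[sp] {
        f.progress[sp] = ts
    }
}

// remaining returns the spans whose timestamp is still below the restore's
// AOST; on resume only these need re-ingesting, regardless of the order in
// which earlier completions arrived.
func (f *restoreFrontier) remaining(aost timestamp) []span {
    var todo []span
    for sp, ts := range f.progress {
        if ts < aost {
            todo = append(todo, sp)
        }
    }
    return todo
}

func main() {
    aost := timestamp(100)
    f := newRestoreFrontier([]span{{"a", "b"}, {"b", "c"}, {"c", "d"}})
    // Completions arrive out of order, yet no completed work is thrown away.
    f.forward(span{"c", "d"}, aost)
    f.forward(span{"a", "b"}, aost)
    fmt.Println("remaining:", f.remaining(aost)) // only [{b c}]
}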

Thoughts?

@adityamaru adityamaru assigned adityamaru and unassigned stevendanna and rhu713 May 26, 2022
adityamaru added a commit to adityamaru/cockroach that referenced this issue May 31, 2022
This change flips the `disallowShadowing` boolean in the SST batcher
used by RESTORE to false, in release builds.

disallowShadowing is set to `false` because RESTORE, which is the sole user
of this SSTBatcher, is expected to be ingesting into an empty keyspace.
If a restore job is resumed, the un-checkpointed spans that are
re-ingested will perfectly shadow (equal key, value and ts) the already
ingested keys.

disallowShadowing used to be set to `true` because it was believed
that, even across resumptions of a restore job,
`checkForKeyCollisions` would be inexpensive because of our frequent job
checkpointing. Further investigation in
cockroachdb#81116 revealed that our
progress checkpointing could significantly lag behind the spans we have
ingested, making a resumed restore spend a lot of time in
`checkForKeyCollisions` leading to severely degraded performance. We have
*never* seen a restore fail because of the invariant enforced when
`disallowShadowing` is set to true, and so we feel comfortable flipping
this check to false. A future version will work on fixing our progress
checkpointing so that we do not have a buildup of un-checkpointed work,
at which point we can reassess flipping `disallowShadowing` to true.

Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Jun 1, 2022
This roachtest pauses the restore job every 10 minutes and
resumes it after it has reached a paused state. This was an area
of testing we were lacking, which allowed performance degradations
like cockroachdb#81116 to go unnoticed.

Informs: cockroachdb#81116

Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Jun 2, 2022
disallowShadowingBelow is set to an empty hlc.Timestamp in release builds
i.e. allow all shadowing without AddSSTable having to check for overlapping
keys. This is because RESTORE is expected to ingest into an empty keyspace.
If a restore job is resumed, the un-checkpointed spans that are re-ingested
will perfectly shadow (equal key, value and ts) the already ingested keys.

disallowShadowingBelow used to be unconditionally set to logical=1.
This permissive value allowed shadowing in case the RESTORE had to retry
ingestions, but still forced AddSSTable evaluation to check for
overlapping keys. It was believed that even across resumptions of a restore
job, `checkForKeyCollisions` would be inexpensive because of our frequent
job checkpointing. Further investigation in
cockroachdb#81116 revealed that our
progress checkpointing could significantly lag behind the spans we have
ingested, making a resumed restore spend a lot of time in
`checkForKeyCollisions` leading to severely degraded performance. We have
*never* seen a restore fail because of the invariant enforced by setting
`disallowShadowingBelow` to a non-empty value, and so we feel comfortable
disabling this check entirely. A future release will work on fixing our
progress checkpointing so that we do not have a buildup of un-checkpointed
work, at which point we can reassess reverting to logical=1.

Informs: cockroachdb#81116

Release note: None
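
Conceptually, the gating described in the commit message amounts to something like this sketch (hypothetical names, not the actual AddSSTable evaluation code):

package main

import "fmt"

// hlcTimestamp is a simplified stand-in for hlc.Timestamp.
type hlcTimestamp struct{ wallTime, logical int64 }

func (t hlcTimestamp) IsEmpty() bool {
    var zero hlcTimestamp
    return t == zero
}

// evalAddSSTable sketches the decision described above: a non-empty
// disallowShadowingBelow forces the overlapping-key check, while an empty
// value skips it entirely.
func evalAddSSTable(disallowShadowingBelow hlcTimestamp, checkForKeyCollisions func() error) error {
    if !disallowShadowingBelow.IsEmpty() {
        // Expensive when re-ingesting spans that were written but never
        // checkpointed.
        return checkForKeyCollisions()
    }
    return nil // shadowing allowed: skip the scan
}

func main() {
    check := func() error { fmt.Println("scanning existing keys..."); return nil }
    // Old behavior: logical=1 forces the check on every AddSSTable.
    _ = evalAddSSTable(hlcTimestamp{logical: 1}, check)
    // New release-build behavior: empty timestamp, no check.
    _ = evalAddSSTable(hlcTimestamp{}, check)
}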
baoalvin1 added a commit to baoalvin1/cockroach that referenced this issue Dec 7, 2022
Fixes: cockroachdb#81116, cockroachdb#87843

Release note (performance improvement): Previously, whenever a user resumed a paused `RESTORE` job,
the checkpointing mechanism would potentially not account for completed work. This change
allows completed spans to be skipped over when restoring.
@msbutler
Collaborator

This should no longer be an issue with our new checkpointing procedure (#97862).
