
backupccl: investigate if restore on retry gradually slows down because of CheckForKeyCollisions #81116

Closed
adityamaru opened this issue May 6, 2022 · 6 comments
Labels
A-disaster-recovery, C-bug, T-disaster-recovery

Comments

@adityamaru
Contributor

adityamaru commented May 6, 2022

In a support issue, we noticed that CPU profiles from several nodes in a cluster (v21.2.10) running a restore of a few TB were dominated by CheckForKeyCollisions.

[Screenshots, May 6, 2022: CPU profiles dominated by CheckForKeyCollisions]

Restore ingests into an empty keyspace, so the expectation is that this check is essentially a no-op. If, however, the restore hits a transient error or some condition that makes it retry from the last checkpoint, we can end up ingesting data into a non-empty keyspace. In that case, the check has the potential to become increasingly expensive. In the support issue, we watched the restore slow to a crawl.
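For intuition on the cost, here is a rough sketch of the kind of merge-style scan such a collision check performs (simplified types and made-up names, not the real storage API); the work grows with however much already-ingested data overlaps the incoming SST:

package main

import (
    "bytes"
    "fmt"
)

// kv is a simplified key/value pair; the real check also considers MVCC
// timestamps, omitted here for brevity.
type kv struct {
    key, value []byte
}

// firstCollision walks two sorted streams -- the existing engine data that
// overlaps the SST's span, and the incoming SST -- and reports the first key
// present in both with a different value. Exact duplicates are tolerated,
// which is why a perfectly re-ingested span is harmless but still costly to
// verify: every overlapping existing key must be visited.
func firstCollision(existing, sst []kv) ([]byte, bool) {
    i, j := 0, 0
    for i < len(existing) && j < len(sst) {
        switch cmp := bytes.Compare(existing[i].key, sst[j].key); {
        case cmp < 0:
            i++ // existing key not in the SST
        case cmp > 0:
            j++ // SST key not in the existing data
        default:
            if !bytes.Equal(existing[i].value, sst[j].value) {
                return existing[i].key, true
            }
            i++
            j++
        }
    }
    return nil, false
}

func main() {
    existing := []kv{{[]byte("a"), []byte("1")}, {[]byte("b"), []byte("2")}}
    sst := []kv{{[]byte("b"), []byte("2")}, {[]byte("c"), []byte("3")}}
    key, found := firstCollision(existing, sst)
    fmt.Println(key, found) // [] false: "b" perfectly shadows the existing key
}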

We have investigated similar slowdowns in out-of-order imports (#66410) and are working on optimizations in #80980. Since restore and import share the ingestion pipeline, this issue will benefit from the work being done for import, but we should attempt to reproduce what we saw in the support issue by forcing a restore to retry at different stages and checking whether it has a long tail because of the SST collision check.

Jira issue: CRDB-15138

Epic CRDB-20916

@adityamaru adityamaru added the C-bug and A-disaster-recovery labels May 6, 2022
@blathers-crl

blathers-crl bot commented May 6, 2022

cc @cockroachdb/bulk-io

@mwang1026

It seemed to complete once they took away the secondary indexes. Maybe a red herring -- it could point to secondary index overlaps, and/or nefarious behavior near disk full, and/or something else completely. @smcvey were we able to pull a CPU profile during the modified IMPORT?

@adityamaru
Contributor Author

adityamaru commented May 23, 2022

I started looking into this as part of support#1539 and I think there is some smoke here. Just dumping some initial findings.

Basic Repro Steps

# Build a v21.2.7 release binary.
./build/builder.sh mkrelease
# Create and start a 3-node cluster, then open a SQL shell on node 1.
roachprod create adityamaru-test -n 3
roachprod put adityamaru-test cockroach-linux-2.6.32-gnu-amd64 cockroach
roachprod start adityamaru-test
roachprod sql adityamaru-test:1

-- In the SQL shell:
CREATE DATABASE bank;
RESTORE bank.bank FROM 'gs://cockroach-fixtures/workload/bank/version=1.0.0,payload-bytes=100,ranges=10,rows=100000000,seed=1/bank?AUTH=implicit';

Vanilla Run:
00:01:53

Run with 2 pauses at 25% and 50%:
00:11:26

I instrumented the binary to better understand restore's checkpointing logic, in particular https://github.com/cockroachdb/cockroach/blob/v21.2.7/pkg/ccl/backupccl/restore_job.go#L518. In short, whenever a RestoreSpanEntry is ingested we receive a progress update. At planning time, each span entry is assigned a progress idx, and when the span completes it marks its progress idx as completed in a slice. We then iterate over this slice and find the maximum progress idx of a contiguous run, i.e. the high watermark of this restore job. On resume, the restore uses this high watermark as the low watermark below which no spans read from the backup need to be reprocessed.
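To make that concrete, here is a minimal sketch (hypothetical code, not the actual restore_job.go logic) of how a high watermark derived from a slice of completed indexes stalls on the first incomplete entry, no matter how many later spans have finished:

package main

import "fmt"

// highWater returns the count of leading contiguous completed entries, i.e.
// the index below which every span has been ingested, plus the number of
// completed entries beyond it (the "missed count" in the logs below).
func highWater(completed []bool) (hw, missed int) {
    for hw < len(completed) && completed[hw] {
        hw++
    }
    for _, done := range completed[hw:] {
        if done {
            missed++
        }
    }
    return hw, missed
}

func main() {
    // 389 spans, all ingested except span 6.
    completed := make([]bool, 389)
    for i := range completed {
        completed[i] = true
    }
    completed[6] = false

    hw, missed := highWater(completed)
    fmt.Printf("high water %d, missed %d of %d\n", hw, missed, len(completed))
    // Prints: high water 6, missed 382 of 389
}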

In a vanilla run:

[n1,job=764273104410968065] 606 amaru: resuming with import spans 389

We start with 389 spans.

220523 12:15:55.594275 7057 ccl/backupccl/restore_job.go:525 ⋮ [n1,job=764273104410968065] 2795 amaru: 389 / 389 requests completed
220523 12:15:55.594320 7057 ccl/backupccl/restore_job.go:538 ⋮ [n1,job=764273104410968065] 2797 amaru: missed count is now 340

340 of the 389 progress updates do not result in the high watermark being forwarded, because of the order in which they were received. While this might not be very useful on its own, it indicates that we can easily get into a situation where we've ingested far more than the high watermark reflects. Let's see how this behaves in a paused and resumed restore.

In a run paused a couple of times:

[n1,job=764274209991655425] 606 amaru: resuming with import spans 389

Starts with 389 spans.

220523 12:20:15.092579 31051 ccl/backupccl/restore_job.go:525 ⋮ [n1,job=764274209991655425] 3835 amaru: 144 / 389 requests completed
220523 12:20:15.092598 31051 ccl/backupccl/restore_job.go:537 ⋮ [n1,job=764274209991655425] 3836 amaru: high water is now 6
220523 12:20:15.092609 31051 ccl/backupccl/restore_job.go:538 ⋮ [n1,job=764274209991655425] 3837 amaru: missed count is now 138

At the first pause, we've received 144 updates but only 6 contiguous ones, so our high watermark is only 6. We're effectively throwing away 138 restore spans that have already been ingested?

220523 12:21:36.199695 42063 ccl/backupccl/restore_job.go:431 ⋮ [n1,job=764274209991655425] 3913 amaru: resuming with import spans 383
220523 12:25:48.092775 42097 ccl/backupccl/restore_job.go:525 ⋮ [n1,job=764274209991655425] 5028 amaru: 167 / 383 requests completed
220523 12:25:48.092790 42097 ccl/backupccl/restore_job.go:537 ⋮ [n1,job=764274209991655425] 5029 amaru: high water is now 104
220523 12:25:48.092800 42097 ccl/backupccl/restore_job.go:538 ⋮ [n1,job=764274209991655425] 5030 amaru: missed count is now 142

At the second pause, we received 167 progress updates but only progressed the high watermark to 104.

220523 12:25:54.195965 54352 ccl/backupccl/restore_job.go:431 ⋮ [n1,job=764274209991655425] 5043 amaru: resuming with import spans 301

I have to step away for a bit, but the next step for me is to instrument checkForKeyCollisions to aggregate the time we spend in it across pauses. Eyeballing some durations, we go from spending on the order of microseconds in checkForKeyCollisions to seconds after a pause, which could explain part of the slowdown.
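For reference, the instrumentation can be as simple as the following sketch (made-up counter names, not the actual SSTBatcher fields), which accumulates the time spent in the check and logs it next to the send wait:

package main

import (
    "log"
    "sync/atomic"
    "time"
)

// cfkWaitNanos accumulates time spent in the collision check across all
// AddSSTable requests sent by a batcher; sendWaitNanos would accumulate the
// total request latency. Both are hypothetical counters for illustration.
var cfkWaitNanos, sendWaitNanos int64

// timedCheckForKeyCollisions wraps a (hypothetical) collision check and adds
// its duration to the running total.
func timedCheckForKeyCollisions(check func() error) error {
    start := time.Now()
    err := check()
    atomic.AddInt64(&cfkWaitNanos, int64(time.Since(start)))
    return err
}

// logWaits mirrors the "sendWait"/"cfkWait" log lines quoted below.
func logWaits() {
    log.Printf("sendWait %s", time.Duration(atomic.LoadInt64(&sendWaitNanos)))
    log.Printf("cfkWait %s", time.Duration(atomic.LoadInt64(&cfkWaitNanos)))
}

func main() {
    _ = timedCheckForKeyCollisions(func() error {
        time.Sleep(10 * time.Millisecond) // stand-in for the real check
        return nil
    })
    logWaits()
}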

@adityamaru
Contributor Author

I disabled disallowShadowing and the restore after a pause took 3m45s, compared to 7m31s with disallowShadowing enabled.

Every restore span we ingest uses its own SSTBatcher; here are some entries before the pause:

220523 20:11:15.055326 40482 kv/bulk/sst_batcher.go:276 ⋮ [n3] 1091 amaru: sendWait ‹198.204222ms›
220523 20:11:15.055353 40482 kv/bulk/sst_batcher.go:277 ⋮ [n3] 1092 amaru: cfkWait ‹120.252µs›

Here are some entries after the pause. I'm still curious where the other part of the sendWait is coming from -- maybe the concurrent AddSSTable throttler?

220523 20:17:38.970415 48746 kv/bulk/sst_batcher.go:276 ⋮ [n3] 1813 amaru: sendWait ‹6.174582174s›
220523 20:17:38.970456 48746 kv/bulk/sst_batcher.go:277 ⋮ [n3] 1814 amaru: cfkWait ‹3.648934212s›

220523 20:17:48.016196 48746 kv/bulk/sst_batcher.go:276 ⋮ [n3] 1826 amaru: sendWait ‹7.013382731s›
220523 20:17:48.016262 48746 kv/bulk/sst_batcher.go:277 ⋮ [n3] 1827 amaru: cfkWait ‹3.654787309s›

@adityamaru
Contributor Author

Proposal:

We switch restore progress to use a span frontier instead of the slice of progress indexes it uses today. The first time we run a restore job, we compute the importSpans based on the backups we are restoring. These spans are added to a span frontier with their timestamps set to 0; the frontier essentially "watches" the spans we need to restore.

Every time we receive a progress update from the restore data processor for a particular span, we forward its timestamp in the frontier to the restore's AOST, essentially marking it as restored up to that timestamp. On subsequent job resumptions we fetch the frontier entries whose timestamp is less than the restore AOST and consider those the remaining work. This way we avoid redoing work just because a lagging restore span has held up the high watermark, as it does in the current implementation.
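
A minimal sketch of the bookkeeping (using a simplified map keyed by span rather than the real span frontier utility):

package main

import "fmt"

// span and timestamp are simplified stand-ins for roachpb.Span and
// hlc.Timestamp; the real implementation would use the span frontier
// utility, but the bookkeeping is the same.
type span struct{ start, end string }
type timestamp int64

// restoreFrontier tracks, per import span, the timestamp up to which it has
// been restored. Zero means "not restored yet".
type restoreFrontier struct {
    progress map[span]timestamp
}

func newRestoreFrontier(importSpans []span) *restoreFrontier {
    f := &restoreFrontier{progress: make(map[span]timestamp)}
    for _, sp := range importSpans {
        f.progress[sp] = 0 // watch every span we need to restore
    }
    return f
}

// forward marks a span as restored up to ts when its restore data processor
// reports completion.
func (f *restoreFrontier) forward(sp span, ts timestamp) {
    if ts > f.progress[sp] {
        f.progress[sp] = ts
    }
}

// remaining returns the spans whose timestamp is still below the restore's
// AOST; on resume only these need re-ingesting, regardless of the order in
// which earlier completions arrived.
func (f *restoreFrontier) remaining(aost timestamp) []span {
    var todo []span
    for sp, ts := range f.progress {
        if ts < aost {
            todo = append(todo, sp)
        }
    }
    return todo
}

func main() {
    aost := timestamp(100)
    f := newRestoreFrontier([]span{{"a", "b"}, {"b", "c"}, {"c", "d"}})
    // Completions arrive out of order, yet no completed work is thrown away.
    f.forward(span{"c", "d"}, aost)
    f.forward(span{"a", "b"}, aost)
    fmt.Println("remaining:", f.remaining(aost)) // only [{b c}]
}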

Thoughts?

@adityamaru adityamaru assigned adityamaru and unassigned stevendanna and rhu713 May 26, 2022
adityamaru added a commit to adityamaru/cockroach that referenced this issue May 31, 2022
This change flips the `disallowShadowing` boolean in the SST batcher
used by RESTORE to false, in release builds.

disallowShadowing is set to `false` because RESTORE, which is the sole user
of this SSTBatcher, is expected to be ingesting into an empty keyspace.
If a restore job is resumed, the un-checkpointed spans that are
re-ingested will perfectly shadow (equal key, value and ts) the already
ingested keys.

disallowShadowing used to be set to `true` because it was believed
that, even across resumptions of a restore job,
`checkForKeyCollisions` would be inexpensive because of our frequent job
checkpointing. Further investigation in
cockroachdb#81116 revealed that our
progress checkpointing could significantly lag behind the spans we have
ingested, making a resumed restore spend a lot of time in
`checkForKeyCollisions` leading to severely degraded performance. We have
*never* seen a restore fail because of the invariant enforced when
`disallowShadowing` is set to true, and so we feel comfortable flipping
this check to false. A future version will work on fixing our progress
checkpointing so that we do not have a buildup of un-checkpointed work,
at which point we can reassess flipping `disallowShadowing` to true.

Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Jun 1, 2022
This roachtest pauses the restore job every 10 minutes and
resumes it after it has reached a paused state. This was an area
of testing we were lacking, which allowed performance degradations
like cockroachdb#81116 to go unnoticed.

Informs: cockroachdb#81116

Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Jun 2, 2022
disallowShadowingBelow is set to an empty hlc.Timestamp in release builds
i.e. allow all shadowing without AddSSTable having to check for overlapping
keys. This is because RESTORE is expected to ingest into an empty keyspace.
If a restore job is resumed, the un-checkpointed spans that are re-ingested
will perfectly shadow (equal key, value and ts) the already ingested keys.

disallowShadowingBelow used to be unconditionally set to logical=1.
This permissive value allowed shadowing in case the RESTORE had to retry
ingestions, but still forced AddSSTable evaluation to check for
overlapping keys. It was believed that even across resumptions of a restore
job, `checkForKeyCollisions` would be inexpensive because of our frequent
job checkpointing. Further investigation in
cockroachdb#81116 revealed that our
progress checkpointing could significantly lag behind the spans we have
ingested, making a resumed restore spend a lot of time in
`checkForKeyCollisions` leading to severely degraded performance. We have
*never* seen a restore fail because of the invariant enforced by setting
`disallowShadowingBelow` to a non-empty value, and so we feel comfortable
disabling this check entirely. A future release will work on fixing our
progress checkpointing so that we do not have a buildup of un-checkpointed
work, at which point we can reassess reverting to logical=1.

Informs: cockroachdb#81116

Release note: None
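
Conceptually, the gating described in the commit message amounts to something like this sketch (hypothetical names, not the actual AddSSTable evaluation code):

package main

import "fmt"

// hlcTimestamp is a simplified stand-in for hlc.Timestamp.
type hlcTimestamp struct{ wallTime, logical int64 }

func (t hlcTimestamp) IsEmpty() bool {
    var zero hlcTimestamp
    return t == zero
}

// evalAddSSTable sketches the decision described above: a non-empty
// disallowShadowingBelow forces the overlapping-key check, while an empty
// value skips it entirely.
func evalAddSSTable(disallowShadowingBelow hlcTimestamp, checkForKeyCollisions func() error) error {
    if !disallowShadowingBelow.IsEmpty() {
        // Expensive when re-ingesting spans that were written but never
        // checkpointed.
        return checkForKeyCollisions()
    }
    return nil // shadowing allowed: skip the scan
}

func main() {
    check := func() error { fmt.Println("scanning existing keys..."); return nil }
    // Old behavior: logical=1 forces the check on every AddSSTable.
    _ = evalAddSSTable(hlcTimestamp{logical: 1}, check)
    // New release-build behavior: empty timestamp, no check.
    _ = evalAddSSTable(hlcTimestamp{}, check)
}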
baoalvin1 added a commit to baoalvin1/cockroach that referenced this issue Dec 7, 2022
Fixes: cockroachdb#81116, cockroachdb#87843

Release note (performance improvement): Previously, whenever a user resumed a paused `RESTORE` job,
the checkpointing mechanism would potentially not account for completed work. This change
allows completed spans to be skipped over when restoring.
@msbutler
Collaborator

This should no longer be an issue with our new checkpointing procedure (#97862).
