
ccl/backupccl: add memory monitor to external SST iterators in restore #93324

Merged: 1 commit into cockroachdb:master on May 11, 2023

Conversation

rhu713
Contributor

@rhu713 rhu713 commented Dec 9, 2022

Previously, there was no limit on the amount of memory that could be used while
constructing external SST iterators during restore. This patch adds a memory
monitor to limit the amount of memory that can be used to construct external
SST iterators. If a restore processor fails to acquire enough memory to open
the next file for a restore span, it sends an iterator over all of the files it
has opened so far, then waits until it can acquire the memory needed to resume
constructing the iterator over the remaining files.

The memory limit can be controlled via the new cluster setting
bulkio.restore.per_processor_memory_limit. Regardless of the setting,
however, the amount of memory used will not exceed
COCKROACH_RESTORE_MEM_FRACTION * max SQL memory. The new environment
variable COCKROACH_RESTORE_MEM_FRACTION defaults to 0.5.
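
As a rough illustration of the acquire-or-flush behavior described above, here is a minimal Go sketch; the types and names (memQuota, buildIters, estimateIterMemory, and so on) are hypothetical placeholders, not the actual code added by this PR.

```go
package restoresketch

import (
	"context"
)

// The identifiers below are illustrative placeholders, not the names used in
// this PR.

type fileEntry struct {
	path string
	size int64
}

type sstIter struct{ files []fileEntry }

// memQuota stands in for the memory-backed quota pool this PR adds.
type memQuota interface {
	// TryAcquire reports whether the requested bytes could be reserved
	// without blocking.
	TryAcquire(ctx context.Context, bytes int64) bool
	// Acquire blocks until the requested bytes are reserved or ctx is done.
	Acquire(ctx context.Context, bytes int64) error
}

// effectiveLimit mirrors the cap described above: the cluster setting value,
// but never more than memFraction of max SQL memory.
func effectiveLimit(settingLimit, maxSQLMemory int64, memFraction float64) int64 {
	hardCap := int64(memFraction * float64(maxSQLMemory))
	if settingLimit > hardCap {
		return hardCap
	}
	return settingLimit
}

// buildIters opens the files of a restore span, emitting a partial iterator
// whenever the memory budget is exhausted mid-span and resuming once the
// memory for the next file can be acquired.
func buildIters(
	ctx context.Context, files []fileEntry, q memQuota, emit func(sstIter),
) error {
	var open []fileEntry
	for _, f := range files {
		need := estimateIterMemory(f)
		if !q.TryAcquire(ctx, need) {
			// Out of budget: send an iterator over the files opened so far,
			// then wait for enough memory to continue with the rest.
			if len(open) > 0 {
				emit(sstIter{files: open})
				open = nil
			}
			if err := q.Acquire(ctx, need); err != nil {
				return err
			}
		}
		open = append(open, f)
	}
	if len(open) > 0 {
		emit(sstIter{files: open})
	}
	return nil
}

func estimateIterMemory(f fileEntry) int64 {
	// Placeholder estimate; the real per-file cost varies with encryption and
	// storage provider, as the benchmark numbers below show.
	return f.size / 4
}
```

The key property is that a span's files are still handled in order; the iterator is just split into several partial iterators when memory runs short.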

Benchmarked using 10 iterators of 100 files each, with each file 24 MiB in size.

./dev bench ./pkg/ccl/backupccl/  --filter BenchmarkIteratorMemory/fileCount=100$/iterCount=10$/rows=200000$/enc=false  --bench-mem -- --test_env=COCKROACH_GCS_SST_DIR=gs://rui-backup-test/bench --test_env=COCKROACH_S3_SST_DIR=s3://rui-crl/bench (plus credentials environment variables...)

GCS unencrypted: 5.0 MiB/file (5246162544 B/op)
GCS encrypted:   8.9 MiB/file (9404394496 B/op)
S3 unencrypted:  1.4 MiB/file (1511477488 B/op)
S3 encrypted:    1.7 MiB/file (1834325232 B/op)

Fixes: #102722

Release note: None

@rhu713 rhu713 requested review from a team as code owners December 9, 2022 15:48
@rhu713 rhu713 requested review from benbardin and removed request for a team December 9, 2022 15:48
@cockroach-teamcity
Member

This change is Reviewable

@shermanCRL
Contributor

What are next steps here?

@rhu713 rhu713 requested review from dt and adityamaru and removed request for dt February 14, 2023 21:34
@rhu713 rhu713 force-pushed the restore-processor-mem branch from 0177995 to 99ef476 Compare February 16, 2023 22:12
@rhu713 rhu713 requested a review from a team as a code owner February 16, 2023 22:12
@rhu713 rhu713 requested review from cucaroach and benbardin and removed request for a team and cucaroach February 16, 2023 22:12
@rhu713 rhu713 force-pushed the restore-processor-mem branch 3 times, most recently from 216a2ef to d00cee9 Compare February 23, 2023 15:31
pkg/ccl/backupccl/restore_job.go Outdated Show resolved Hide resolved
pkg/ccl/backupccl/restore_data_processor.go Show resolved Hide resolved
pkg/ccl/backupccl/restore_data_processor.go Outdated Show resolved Hide resolved
@rhu713 rhu713 force-pushed the restore-processor-mem branch 2 times, most recently from 11f6b83 to 134b1c5 Compare February 27, 2023 21:58
@rhu713 rhu713 force-pushed the restore-processor-mem branch from 134b1c5 to 7f4b46c Compare March 1, 2023 22:41
@rhu713 rhu713 force-pushed the restore-processor-mem branch from 7f4b46c to 6585756 Compare March 2, 2023 18:13
@rhu713 rhu713 force-pushed the restore-processor-mem branch from d9ea94b to fca9148 Compare April 3, 2023 20:37
package backuputils

import (
"context"
Contributor

nit: why is the tab here 8 characters and not 2? here and in the rest of this file.

Contributor Author

I'm not sure, I don't see the 8 character tab in the editor now?

Comment on lines +223 to +241
// First we reserve minWorkerMemReservation for each restore worker, and
// making sure that we always have enough memory for at least one worker. The
// maximum number of workers is based on the cluster setting. If the cluster
// setting is updated, the job should be PAUSEd and RESUMEd for the new
// setting to take effect.
numWorkers, err := reserveRestoreWorkerMemory(ctx, rd.flowCtx.Cfg.Settings, rd.qp)
if err != nil {
log.Warningf(ctx, "cannot reserve restore worker memory: %v", err)
rd.MoveToDraining(err)
return
}
Contributor

Sorry, I think I had a comment about this logic: don't you think we need to always succeed with at least one worker, even if there is no memory at all? It doesn't make sense to me that if some heavy query is running just now, we would fail a restore.

There is another issue here that may be worth mentioning: if a heavy SQL query is running, we might pick a single worker for the next 2 hours of restore, even if the query would finish within a minute. But that we can solve in some other PR if this is an issue.

Contributor Author

Having the memory reservation happen before the processor does any work is a way to at least make sure the job fails fast if we suspect that there's not enough memory on the node to continue the restore. Restore is something that can take up a lot of cluster resources, and I think failing the restore before it starts is better than failing some heavy SQL query that's in progress. This pattern also exists in our backup job currently, where we reserve the worker memory at the beginning and fail the job if we are not able to do so.

We do have an issue here if the job replans at an unfortunate time and some other queries in the cluster prevent the job from resuming due to lack of memory, but I think in that case it's still better to fail the restore since it wasn't going to be able to continue anyway, and the user always has the option to pause on error should this become a recurring problem.
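
A minimal sketch of the fail-fast, up-front reservation pattern this reply describes; the interface, function, and parameter names below are hypothetical and not the PR's actual reserveRestoreWorkerMemory implementation.

```go
package restoresketch

import (
	"context"
	"errors"
)

// workerQuota stands in for the processor's memory quota pool; the real type
// and method names differ.
type workerQuota interface {
	TryAcquire(ctx context.Context, bytes int64) bool
}

// reserveWorkerMemorySketch reserves per-worker memory before the processor
// does any work, settles for fewer workers when memory is tight, and fails
// the processor only if not even the minimum can be reserved.
func reserveWorkerMemorySketch(
	ctx context.Context, q workerQuota, maxWorkers, minWorkers int, perWorker int64,
) (int, error) {
	numWorkers := 0
	for i := 0; i < maxWorkers; i++ {
		if !q.TryAcquire(ctx, perWorker) {
			break
		}
		numWorkers++
	}
	if numWorkers < minWorkers {
		return 0, errors.New("insufficient memory to reserve minimum restore worker memory")
	}
	return numWorkers, nil
}
```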

Contributor

should this PR point to an issue? I'd like to understand the motivation for this change better. I can imagine 2 different motivations: one is that the current restore frequently causes issues such as OOMing nodes or causing queries to fail, and another is that everything is pretty much ok except for some rare failures but we want to increase the number of workers and therefore we need some memory limits. I guess I have the latter in my mind but maybe I'm wrong.
cc @dt to see what you think.

Contributor Author

I think we agreed that this approach of reserving memory and failing the restore if we can't reserve it is okay. I added the number of restore workers to the "starting restore data" log line to help in debugging issues that may result from picking too low a worker number.

@rhu713 rhu713 force-pushed the restore-processor-mem branch 4 times, most recently from 30da21d to ec97494 Compare April 12, 2023 14:11
Contributor

@lidorcarmel lidorcarmel left a comment

looks good except for that one issue.

related to that issue - can you explain the motivation for backporting this change?


@rhu713 rhu713 force-pushed the restore-processor-mem branch from ec97494 to 68c1a5d Compare April 25, 2023 16:13
Contributor Author

@rhu713 rhu713 left a comment

The summary of the offline thread is that

  1. we think that reserving worker memory for restore ahead of time (and failing the restore if we fail to reserve the minimum amount of memory) is okay, and
  2. there will be a future PR to be smarter about the actual minimum number of workers.

w.r.t. the backport, that was initially added so it could go along with the 22.2 backport for slim manifests. I removed the backport label now since we've reverted the slim manifests backport as well.


@rhu713 rhu713 force-pushed the restore-processor-mem branch 2 times, most recently from 7f767ad to 65d3943 Compare April 25, 2023 17:35
@lidorcarmel
Contributor

  1. there will be a future PR to be smarter about the actual minimum number of workers

Unfortunately I think we need a minimum in this PR, because without it we might run a restore on large machines using a single thread, which would be a slow restore, and that would be a regression. I do agree that we don't need to be too smart about the min and max numbers. How about half of the threads as the min? Meaning, if today we use N workers, with this PR we will only run the restore if we can reserve max(1, N/2) workers. WDYT?

@rhu713 rhu713 force-pushed the restore-processor-mem branch 2 times, most recently from a048862 to 4fdaa5b Compare May 2, 2023 14:12
@rhu713 rhu713 requested a review from lidorcarmel May 3, 2023 20:26
@rhu713
Contributor Author

rhu713 commented May 4, 2023

with this PR we will only run the restore if we can reserve max(1, N/2) workers. WDYT?

Okay, I've changed the PR so that our minimum number of workers is numRestoreWorkers/2. Since by default the numRestoreWorkers setting is GOMAXPROCS - 1, I think this should suffice as something that scales with the number of cores but can also be overridden by a setting.

I've also made a small change to reduce the minimum worker memory to 15MB, which is still more than the memory required by 1 file. This is so that we can pass unit tests if the metamorphic value makes restore choose a minimum of 4 workers (by choosing defaultNumWorkers=8).
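
For illustration, a small sketch of the minimum-worker rule under these assumptions (GOMAXPROCS - 1 default worker count, ~15MB per-worker reservation); the identifiers are hypothetical, not the actual setting or constant names.

```go
package restoresketch

import "runtime"

// minWorkerMemReservation reflects the ~15 MB per-worker reservation
// mentioned above; the constant name is illustrative.
const minWorkerMemReservation = 15 << 20

// minRestoreWorkers sketches the minimum-worker rule: at least half of the
// configured worker count, and never fewer than one.
func minRestoreWorkers(configuredWorkers int) int {
	if configuredWorkers <= 0 {
		configuredWorkers = runtime.GOMAXPROCS(0) - 1 // default worker count
	}
	if m := configuredWorkers / 2; m > 1 {
		return m
	}
	return 1
}
```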

Contributor

@lidorcarmel lidorcarmel left a comment

please fix the test failure, but other than that, lgtm!

@rhu713 rhu713 force-pushed the restore-processor-mem branch from 4fdaa5b to 0ed12ed Compare May 8, 2023 18:40
Previously, there was no limit on the amount of memory that could be used while
constructing external SST iterators during restore. This patch adds a memory
monitor to limit the amount of memory that can be used to construct external
SST iterators. If a restore processor fails to acquire enough memory to open
the next file for a restore span, it sends an iterator over all of the files it
has opened so far, then waits until it can acquire the memory needed to resume
constructing the iterator over the remaining files.

The memory limit can be controlled via the new cluster setting
bulkio.restore.per_processor_memory_limit. Regardless of the setting,
however, the amount of memory used will not exceed
COCKROACH_RESTORE_MEM_FRACTION * max SQL memory. The new environment
variable COCKROACH_RESTORE_MEM_FRACTION defaults to 0.5.

Release note: None
@rhu713 rhu713 force-pushed the restore-processor-mem branch from 0ed12ed to 84ed8ac Compare May 10, 2023 17:01
@rhu713
Contributor Author

rhu713 commented May 11, 2023

bors r=lidorcarmel

@craig
Contributor

craig bot commented May 11, 2023

Build failed:

@rhu713
Contributor Author

rhu713 commented May 11, 2023

bors retry

@craig
Contributor

craig bot commented May 11, 2023

Build succeeded:

@craig craig bot merged commit f04439c into cockroachdb:master May 11, 2023
msbutler pushed a commit to msbutler/cockroach that referenced this pull request May 18, 2023
Currently we see openSSTs being a bottleneck during restore when using
more than 4 workers.

This patch moves the openSSTs call into the worker itself, so that this
work can be parallelized. This is needed for a later PR which will increase
the number of workers.

Also, this change simplifies the code a bit and makes it easier to
implement cockroachdb#93324, because in that PR we want to produce a few partial SSTs
that need to be processed serially. Before this patch it wasn't trivial to
make sure that the N workers would not process those partial SSTs in the wrong
order; with this patch each worker processes a single mergedSST, and can
therefore serialize the partial SSTs created from that mergedSST.

Tested by running a roachtest (4 nodes, 8 cores) with and without this
change. The fixed version was faster: 80MB/s/node vs 60, but some of that is
noise. We do expect a perf improvement when using more workers and other
params tuned, which is the next step.

Informs: cockroachdb#98015
Epic: CRDB-20916

Release note: None
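
A simplified sketch of the worker structure this commit message describes; the type and function names below are hypothetical stand-ins for the real ones in restore_data_processor.go.

```go
package restoresketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// mergedSST and partialSST are illustrative stand-ins for the types handled
// by the restore data processor; openSSTs and processPartialSST are stubs.
type mergedSST struct{ files []string }
type partialSST struct{ span string }

func openSSTs(ctx context.Context, m mergedSST) ([]partialSST, error) {
	// In the real code this constructs the external SST iterators, which is
	// the expensive step being moved into the workers.
	return nil, nil
}

func processPartialSST(ctx context.Context, p partialSST) error { return nil }

// runWorkers sketches the post-change structure: each worker pulls one
// merged SST at a time and opens its files itself, so the openSSTs work is
// parallelized across workers while the partial SSTs produced from a single
// merged SST are still processed in order by the worker that owns it.
func runWorkers(ctx context.Context, jobs <-chan mergedSST, numWorkers int) error {
	g, ctx := errgroup.WithContext(ctx)
	for i := 0; i < numWorkers; i++ {
		g.Go(func() error {
			for job := range jobs {
				parts, err := openSSTs(ctx, job)
				if err != nil {
					return err
				}
				for _, p := range parts {
					if err := processPartialSST(ctx, p); err != nil {
						return err
					}
				}
			}
			return nil
		})
	}
	return g.Wait()
}
```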