ccl/backupccl: add memory monitor to external SST iterators in restore #93324
Conversation
Force-pushed from df09a9c to 0177995
What are next steps here?
Force-pushed from 0177995 to 99ef476
Force-pushed from 216a2ef to d00cee9
Force-pushed from 11f6b83 to 134b1c5
Force-pushed from 134b1c5 to 7f4b46c
Force-pushed from 7f4b46c to 6585756
Force-pushed from d9ea94b to fca9148
package backuputils

import (
	"context"
nit: why is the tab here 8 characters and not 2? here and in the rest of this file.
I'm not sure, I don't see the 8 character tab in the editor now?
// First we reserve minWorkerMemReservation for each restore worker, and
// making sure that we always have enough memory for at least one worker. The
// maximum number of workers is based on the cluster setting. If the cluster
// setting is updated, the job should be PAUSEd and RESUMEd for the new
// setting to take effect.
numWorkers, err := reserveRestoreWorkerMemory(ctx, rd.flowCtx.Cfg.Settings, rd.qp)
if err != nil {
	log.Warningf(ctx, "cannot reserve restore worker memory: %v", err)
	rd.MoveToDraining(err)
	return
}
Sorry, I think I had a comment about this logic - don't you think we need to always succeed with at least one worker, even if there is no memory at all? It doesn't make sense to me that if some heavy query is running just now, we would fail a restore.
There is another issue here that may be worth mentioning - if a heavy SQL query is running, we might pick a single worker for the next 2 hours of restore, even if the query would finish within a minute. But we can solve that in some other PR if this is an issue.
Having the memory reservation happen before the processor does any work is a way to at least make sure the job fails fast if we suspect that there's not enough memory in the node to continue the restore. Restore is something that can take up a lot of cluster resources, and I think failing the restore before it starts is better than failing some heavy SQL query that's in progress. This pattern also exists in our backup job currently, where we reserve the worker memory in the beginning and fail the job if we are not able to do so.
We do have an issue here if the job replans at an unfortunate time and some other queries in the cluster prevent the job from resuming due to lack of memory, but I think in that case it's still better to fail the restore, since it wasn't going to be able to continue anyway, and the user always has the option to pause on error should this become a recurring problem.
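For illustration only, here is a minimal, self-contained sketch of the reserve-or-fail-fast pattern being debated above. The memBudget type, the sizes, and the helper names are assumptions made up for this example; they are not the actual reserveRestoreWorkerMemory implementation.

package main

import (
	"errors"
	"fmt"
)

// memBudget is a hypothetical stand-in for the processor's memory account.
type memBudget struct {
	available int64
}

func (b *memBudget) reserve(n int64) error {
	if n > b.available {
		return errors.New("memory budget exhausted")
	}
	b.available -= n
	return nil
}

// reserveWorkerMemory reserves perWorker bytes for up to maxWorkers workers.
// It fails fast if it cannot reserve memory for even a single worker, which
// mirrors the fail-fast behavior discussed in this thread.
func reserveWorkerMemory(b *memBudget, maxWorkers int, perWorker int64) (int, error) {
	numWorkers := 0
	for i := 0; i < maxWorkers; i++ {
		if err := b.reserve(perWorker); err != nil {
			break
		}
		numWorkers++
	}
	if numWorkers == 0 {
		return 0, errors.New("insufficient memory to run any restore worker")
	}
	return numWorkers, nil
}

func main() {
	b := &memBudget{available: 96 << 20} // hypothetical 96 MiB free
	workers, err := reserveWorkerMemory(b, 4, 64<<20)
	if err != nil {
		fmt.Println("restore would fail fast:", err)
		return
	}
	fmt.Printf("restore proceeds with %d worker(s)\n", workers)
}

With a 96 MiB budget and 64 MiB per worker, the sketch proceeds with a single worker; with less than 64 MiB available it fails fast, which is exactly the trade-off being discussed here.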
should this PR point to an issue? I'd like to understand the motivation for this change better. I can imagine 2 different motivations: one is that the current restore frequently causes issues such as OOMing nodes or causing queries to fail, and another is that everything is pretty much ok except for some rare failures but we want to increase the number of workers and therefore we need some memory limits. I guess I have the latter in my mind but maybe I'm wrong.
cc @dt to see what you think.
I think we agreed that this approach of reserving mem and failing the restore if we can't reserve is okay. I added the number of restore workers to the "starting restore data" log line to help in debugging issues that may result from picking too low of a worker number.
Force-pushed from 30da21d to ec97494
looks good except for that one issue.
related to that issue - can you explain the motivation for backporting this change?
Force-pushed from ec97494 to 68c1a5d
The summary of the offline thread is that:
- we think that reserving worker memory for restore ahead of time (and failing the restore if we fail to reserve the minimum amount of memory) is okay, and
- there will be a future PR to be smarter about the actual minimum number of workers.
W.r.t. the backport, that was initially added so it could go along with the 22.2 backport for slim manifests. I removed the backport label now since we've reverted the slim manifests backport as well.
Force-pushed from 7f767ad to 65d3943
Unfortunately I think we need a minimum in this PR, because without it we might run a restore on large machines using a single thread, which would be a slow restore, and this would be a regression. I do agree that we don't need to be too smart about the min and max numbers. How about half of the threads as the min? Meaning, if today we use N workers, with this PR we will only run restore if we can reserve memory for N/2 workers.
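As a rough sketch of the half-of-the-configured-maximum minimum being proposed here (the function name and the rounding-down choice are assumptions for illustration, not the final implementation):

package main

import "fmt"

// minRequiredWorkers returns the minimum worker count under the proposal
// above: half of the configured maximum, but never less than one.
func minRequiredWorkers(maxWorkers int) int {
	minWorkers := maxWorkers / 2
	if minWorkers < 1 {
		minWorkers = 1
	}
	return minWorkers
}

func main() {
	for _, max := range []int{1, 4, 8} {
		fmt.Printf("max=%d -> min=%d\n", max, minRequiredWorkers(max))
	}
}

Rounding down and clamping to one keeps a single-worker floor on small configurations while still requiring a meaningful fraction of the configured parallelism on large machines.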
Force-pushed from a048862 to 4fdaa5b
Okay, I've changed the PR so that our min workers is half of the max. I've also made a small change to reduce the minimum worker memory.
please fix the test failure, but other than that, lgtm!
Force-pushed from 4fdaa5b to 0ed12ed
Previously, there was no limit on the amount of memory that can be used while constructing external SST iterators during restore. This patch adds a memory monitor to limit the amount of memory that can be used to construct external SST iterators. If a restore processor fails to acquire enough memory to open the next file for a restore span, it will send the iterator for all of the open files it has accumulated so far, and wait until it can acquire the memory to resume constructing the iterator for the remaining files.

The memory limit can be controlled via the new cluster setting bulkio.restore.per_processor_memory_limit. Regardless of the setting, however, the amount of memory used will not exceed COCKROACH_RESTORE_MEM_FRACTION * max SQL memory. The new environment variable COCKROACH_RESTORE_MEM_FRACTION defaults to 0.5.

Release note: None
Force-pushed from 0ed12ed to 84ed8ac
bors r=lidorcarmel
Build failed:
bors retry
Build succeeded:
Currently we see openSSTs being a bottleneck during restore when using more than 4 workers. This patch moves the openSSTs call into the worker itself, so that this work can be parallelized. This is needed for a later PR that will increase the number of workers.

Also, this change simplifies the code a bit and makes it easier to implement cockroachdb#93324, because in that PR we want to produce a few partial SSTs that need to be processed serially. Before this patch it wasn't trivial to make sure that the N workers would not process those partial SSTs in the wrong order; with this patch each worker processes a single mergedSST, and can therefore serialize the partial SSTs created from that mergedSST.

Tested by running a roachtest (4 nodes, 8 cores) with and without this change. The fixed version was faster: 80MB/s/node vs 60, but some of that is noise. We do expect a perf improvement once more workers are used and other params are tuned, which is the next step.

Informs: cockroachdb#98015

Epic: CRDB-20916

Release note: None
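As an illustration of the worker-per-merged-SST design this commit message describes, here is a hedged Go sketch; the mergedSST type, its field names, and the worker count are invented for the example and are not the restore processor's actual types.

package main

import (
	"fmt"
	"sync"
)

// mergedSST is a hypothetical unit of work: one merged SST that may be split
// into several partial SSTs that must be processed in order.
type mergedSST struct {
	name     string
	partials []string
}

// processMergedSST opens and processes a single merged SST inside the worker,
// so the openSSTs-style work is parallelized across workers while the partial
// SSTs derived from one merged SST remain serialized.
func processMergedSST(s mergedSST) {
	for _, p := range s.partials {
		fmt.Printf("worker processing %s / %s\n", s.name, p)
	}
}

func main() {
	work := make(chan mergedSST)
	var wg sync.WaitGroup
	const numWorkers = 4
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for s := range work {
				processMergedSST(s)
			}
		}()
	}
	for i := 0; i < 8; i++ {
		work <- mergedSST{
			name:     fmt.Sprintf("merged-%d", i),
			partials: []string{"part-0", "part-1"},
		}
	}
	close(work)
	wg.Wait()
}

Because each worker drains one merged SST at a time, the partial SSTs derived from it are naturally processed in order without any extra coordination between workers.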
Previously, there was no limit on the amount of memory that can be used while constructing external SST iterators during restore. This patch adds a memory monitor to limit the amount of memory that can be used to construct external SST iterators. If a restore processor fails to acquire enough memory to open the next file for a restore span, it will send the iterator for all of the open files it has accumulated so far, and wait until it can acquire the memory to resume constructing the iterator for the remaining files.

The memory limit can be controlled via the new cluster setting bulkio.restore.per_processor_memory_limit. Regardless of the setting, however, the amount of memory used will not exceed COCKROACH_RESTORE_MEM_FRACTION * max SQL memory. The new environment variable COCKROACH_RESTORE_MEM_FRACTION defaults to 0.5.

Benchmarking used 10 iterators of 100 files each; each file is 24MiB in size.

Fixes: #102722

Release note: None
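To make the described open-until-the-budget-runs-out behavior concrete, the following is a small, self-contained Go sketch; the monitor type, the send callback, and the sizes are assumptions for illustration and do not reflect the actual restore processor or CockroachDB's memory monitor API.

package main

import (
	"errors"
	"fmt"
)

// monitor is a hypothetical stand-in for the per-processor memory limit.
type monitor struct{ available int64 }

func (m *monitor) grow(n int64) error {
	if n > m.available {
		return errors.New("memory limit reached")
	}
	m.available -= n
	return nil
}

// buildIterators mimics the described behavior: open files for a restore span
// until a memory reservation fails, send an iterator over the files opened so
// far, then continue with the remaining files once memory is available again
// (here we simply reclaim the batch's reservation after "sending").
func buildIterators(m *monitor, fileSizes []int64, send func(batch []int64)) {
	var open []int64
	for _, sz := range fileSizes {
		if err := m.grow(sz); err != nil {
			// Can't open the next file: flush what we have so far.
			send(open)
			// In the real processor we'd wait for memory to be released;
			// here we just release the batch's reservation and continue.
			for _, o := range open {
				m.available += o
			}
			open = open[:0]
			if err := m.grow(sz); err != nil {
				continue // still over budget; skip in this sketch
			}
		}
		open = append(open, sz)
	}
	if len(open) > 0 {
		send(open)
	}
}

func main() {
	m := &monitor{available: 64 << 20} // hypothetical 64 MiB per-processor limit
	files := []int64{24 << 20, 24 << 20, 24 << 20, 24 << 20}
	batch := 0
	buildIterators(m, files, func(b []int64) {
		batch++
		fmt.Printf("sending iterator over %d file(s) (batch %d)\n", len(b), batch)
	})
}

With a 64 MiB budget and 24 MiB files, the sketch emits one iterator over the first two files, then another over the remaining two, mirroring the partial-iterator behavior the description outlines.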