release-22.2: kv,rangefeed: integrate catchup scans with elastic cpu #92439
Conversation
Pure code movement/renaming. We also renamed the cluster setting kv.store.admission.provisioned_bandwidth to kvadmission.store.provisioned_bandwidth. Release note (general note): The cluster setting kv.store.admission.provisioned_bandwidth was renamed to kvadmission.store.provisioned_bandwidth.
There's only ever one URL. Release note: None
Part of cockroachdb#65957.

Changefeed backfills, given their scan-heavy nature, can be fairly CPU-intensive. In cockroachdb#89656 we introduced a roachtest demonstrating the latency impact backfills can have on a moderately CPU-saturated cluster. Similar to what we saw for backups, this CPU-heavy nature can elevate Go scheduling latencies, which in turn translate into foreground latency impact. This commit integrates rangefeed catchup scans with the elastic CPU limiter we introduced in cockroachdb#86638; this is one of two optional halves of changefeed backfills. The second half is the initial scan -- scan requests issued over some keyspan as of some timestamp. For that we simply rely on the existing slots mechanism, but now setting a lower priority bit (BulkNormalPri) -- cockroachdb#88733. Experimentally we observed that during initial scans the encoding routines in changefeedccl are the most impactful CPU-wise, something cockroachdb#89589 can help with. We leave admission integration of parallel worker goroutines to future work (cockroachdb#90089).

Unlike export requests, rangefeed catchup scans are non-preemptible. The rangefeed RPC is a streaming one, and the catchup scan is done during stream setup, so we don't have resumption tokens to propagate up to the caller like we did for backups. We still want CPU-bound work going through admission control to only use 100ms of CPU time, to exercise better control over scheduling latencies. To that end, we introduce the following component used within the rangefeed catchup iterator:

```go
// Pacer is used in tight loops (CPU-bound) for non-preemptible
// elastic work. Callers are expected to invoke Pace() every loop
// iteration and Close() once done. Internally this type integrates
// with the elastic CPU work queue, acquiring tokens for the CPU work
// being done, and blocking if tokens are unavailable. This allows
// for a form of cooperative scheduling with elastic CPU granters.
type Pacer struct { ... }

func (p *Pacer) Pace(ctx context.Context) error { ... }
func (p *Pacer) Close() { ... }
```

Release note: None
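To make the call pattern concrete, here is a minimal usage sketch. The interface below stands in for the concrete kvadmission.Pacer, and the loop body is invented for illustration; only the Pace()/Close() contract comes from the excerpt above.

```go
package rangefeedsketch

import (
	"context"
	"fmt"
)

// pacer mirrors the two Pacer methods described above; the concrete
// type lives in the kvadmission package per this PR.
type pacer interface {
	Pace(ctx context.Context) error
	Close()
}

// catchupScan is a hypothetical tight loop showing the intended call
// pattern: Pace() on every iteration, Close() once done.
func catchupScan(ctx context.Context, p pacer, events []string) error {
	defer p.Close()
	for _, ev := range events {
		// Blocks if elastic CPU tokens are unavailable, cooperatively
		// yielding to other work instead of monopolizing the processor.
		if err := p.Pace(ctx); err != nil {
			return err
		}
		fmt.Println("emit:", ev) // stand-in for real event processing
	}
	return nil
}
```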
.. catch-up scans, introduce a private cluster setting (kvadmission.rangefeed_catchup_scan_elastic_control.enabled) to selectively switch off catch-up scan integration if needed, and plumb kvadmission.Pacer explicitly into the rangefeed catch-up scan loop instead of opaquely through the surrounding context. Release note: None
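For context, a sketch of how such a private setting might be declared; the variable name, class, and description here are assumptions, with only the setting key and the 22.2 default taken from this PR.

```go
import "github.com/cockroachdb/cockroach/pkg/settings"

// Illustrative declaration of the kill switch described above; the
// actual registration in the kvadmission package may differ in class,
// description, and default.
var rangefeedCatchupScanElasticControlEnabled = settings.RegisterBoolSetting(
	settings.SystemOnly,
	"kvadmission.rangefeed_catchup_scan_elastic_control.enabled",
	"determines whether rangefeed catch-up scans integrate with elastic CPU control",
	false, // disabled by default on release-22.2, per the discussion below
)
```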
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied instead.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
We want this to be disabled by default in 22.2, used only selectively. Release note: None
This is a fairly large PR; even though this functionality is turned off by default, do we still want to merge it?
@miretskiy Well, we agree that CPU spikes, and the resulting SQL latency during changefeed backfills, are a fairly universal problem. Your question is about the value/risk of the backport, I assume? We could give it more bake time on v23.1 and then re-evaluate the backport, or @irfansharif can make the case on risk/return.
Cool -- I just want to make sure we have existing pain points we are trying to solve.
Backport 4/4 commits from #89709.
/cc @cockroachdb/release
Part of #65957. First commit is from #89656.
Changefeed backfills, given their scan-heavy nature, can be fairly
CPU-intensive. In #89656 we introduced a roachtest demonstrating the
latency impact backfills can have on a moderately CPU-saturated cluster.
Similar to what we saw for backups, this CPU-heavy nature can elevate Go
scheduling latencies, which in turn translate into foreground latency
impact. This commit integrates rangefeed catchup scans with the elastic
CPU limiter we introduced in #86638; this is one of two optional halves
of changefeed backfills. The second half is the initial scan -- scan
requests issued over some keyspan as of some timestamp. For that we
simply rely on the existing slots mechanism but now setting a lower
priority bit (BulkNormalPri). Experimentally we observed that during
initial scans the encoding routines in changefeedccl are the most
impactful CPU-wise, something #89589 can help with. We leave admission
integration of parallel worker goroutines to future work.
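As a rough illustration of the lower-priority tagging for the initial-scan half: requests carry an admission header marked with BulkNormalPri. The helper and field values below are illustrative, not this PR's exact code; only BulkNormalPri and the admission-header mechanism come from the description above.

```go
import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/admission/admissionpb"
	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

// bulkAdmissionHeader sketches how initial-scan requests can be tagged
// with a lower admission priority so they queue behind regular
// foreground work; exact call sites in the changefeed code differ.
func bulkAdmissionHeader() roachpb.AdmissionHeader {
	return roachpb.AdmissionHeader{
		Priority:   int32(admissionpb.BulkNormalPri),
		CreateTime: timeutil.Now().UnixNano(),
		Source:     roachpb.AdmissionHeader_FROM_SQL,
	}
}
```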
Unlike export requests, rangefeed catchup scans are non-preemptible. The
rangefeed RPC is a streaming one, and the catchup scan is done during
stream setup. So we don't have resumption tokens to propagate up to the
caller like we did for backups. We still want CPU-bound work going
through admission control to use at most 100ms of CPU time (it makes
for an ineffective controller otherwise), and to that end we introduce
the following component used within the rangefeed catchup iterator:
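```go
// Pacer is used in tight loops (CPU-bound) for non-preemptible
// elastic work. Callers are expected to invoke Pace() every loop
// iteration and Close() once done. Internally this type integrates
// with the elastic CPU work queue, acquiring tokens for the CPU work
// being done, and blocking if tokens are unavailable. This allows
// for a form of cooperative scheduling with elastic CPU granters.
type Pacer struct { ... }

func (p *Pacer) Pace(ctx context.Context) error { ... }
func (p *Pacer) Close() { ... }
```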
Release note: None
Release justification: Non-production code. In 22.2 this machinery is all switched off by default.