admission: investigate TPC-E online index creation problem #85641
Some additional results from running the TPC-E index backfill experiments on the 9-node cluster. Specifically, I was changing two settings:
The patch itself is not ready for all customers, especially multi-tenant: if the setting is enabled, there are two possible drawbacks:
Based on how encouraging the results are, this might be something we backport to 22.2 in its current state (with a flag to enable/disable it) while working on a better patch for 23.1.

Additional work needs to happen to better understand why the results with the default settings are so bad. I suspect that we greatly overshoot the L0 file limit and that, in some cases, it takes seconds of compaction to recover. During this time all work is halted and we grow large queues. Once we come back under the limit we do all the KV work first, but finish it pretty quickly; then we process another large batch of AddSSTable requests, which causes us to overshoot again (sketched below). One reason this hypothesis seems plausible is that if I lower the L0 setting below the current L0 file count, I am unable to run any operations against the system until compaction has dropped the count back below the new limit.

The summary of results is here: https://docs.google.com/spreadsheets/d/12cE0tL5zihm8ymN_r07ag3-LzeOgRHamD8vb3RSdcR0/edit?usp=sharing

SQL statements during index creation with the current code / default settings: note the correlation between when IO admission is red and when the SQL statement count drops. When in overload (LSM L0 file count > 1000), no new transactions can begin; when "at" overload, only user jobs run since they have higher priority; when "below" overload, index AddSSTable work also runs.

SQL statements during the run with the patch:
Statement bundle showing a slow transaction:
IO load logs on the "hot" node:
Logs from the run:
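To make the hypothesis above concrete, here is a minimal, self-contained Go sketch of the suspected admission cycle: while the L0 file count is over the limit nothing is admitted and the queues grow, and once compaction brings it back under the limit the queued KV work drains quickly and a single large AddSSTable batch pushes the count over the limit again. All names and numbers here are illustrative assumptions, not the actual admission package.

```go
// Illustrative model of the hypothesized overshoot/recovery cycle; not
// CockroachDB code. Numbers are made up to show the oscillation.
package main

import "fmt"

const l0FileLimit = 1000 // stand-in for the L0 file count overload threshold

type workQueue struct {
	kvWaiting     int // latency-sensitive user requests
	addSSTWaiting int // elastic index-backfill AddSSTable batches
}

// admit models one admission tick: nothing is admitted while L0 is over the
// limit; once under it, user KV work drains first, then one bulk batch is
// admitted, which (per the hypothesis) overshoots the limit again.
func admit(l0Files int, q *workQueue) int {
	if l0Files >= l0FileLimit {
		// Overloaded: all admission halts and queues grow while compaction runs.
		q.kvWaiting += 50
		q.addSSTWaiting += 5
		return l0Files - 200 // compaction slowly recovers
	}
	// Under the limit: higher-priority user KV work finishes quickly.
	q.kvWaiting = 0
	if q.addSSTWaiting > 0 {
		// Admitting a large AddSSTable batch overshoots the L0 limit again.
		q.addSSTWaiting--
		return l0Files + 600
	}
	return l0Files
}

func main() {
	l0 := 900
	q := &workQueue{addSSTWaiting: 5}
	for tick := 0; tick < 12; tick++ {
		l0 = admit(l0, q)
		fmt.Printf("tick=%2d l0=%4d kvQueued=%3d addSSTQueued=%d\n",
			tick, l0, q.kvWaiting, q.addSSTWaiting)
	}
}
```

Running this prints the oscillation the hypothesis predicts: periods over the limit where queues grow, followed by a single tick under the limit that immediately re-overshoots.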
informs: cockroachdb#85641 Only leave admission control in place for elastic / bulk requests. Release note: None
informs: cockroachdb#85641 Allow disabling admission control for non-elastic / bulk requests. Release note (ops change): Add a new config parameter `admission.kv.bulk_only.enabled` to allow partially disabling admission control. It is disabled by default but can be enabled in cases where bulk operations are causing high user latency due to io_overload.
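For illustration, here is a hedged Go sketch of the bypass the release note describes, assuming a boolean setting `admission.kv.bulk_only.enabled`: when it is on, regular user requests skip the admission queue while elastic/bulk work (such as AddSSTable) is still throttled. The types and the `needsAdmission` helper are made up for the sketch and are not the actual CockroachDB API.

```go
// Sketch of the "bulk only" partial disablement of admission control.
package main

import "fmt"

type workClass int

const (
	regularWork workClass = iota // latency-sensitive user requests
	elasticWork                  // bulk work such as index-backfill AddSSTable
)

// bulkOnlyEnabled stands in for the admission.kv.bulk_only.enabled cluster setting.
var bulkOnlyEnabled = true

// needsAdmission reports whether a request of the given class must wait in the
// admission queue before being served.
func needsAdmission(class workClass) bool {
	if bulkOnlyEnabled && class == regularWork {
		// Partially disabled: user requests bypass admission control, trading
		// read-amplification protection for lower foreground latency.
		return false
	}
	return true
}

func main() {
	fmt.Println("regular waits:", needsAdmission(regularWork)) // false while bulk-only is on
	fmt.Println("elastic waits:", needsAdmission(elasticWork)) // true: bulk work is still throttled
}
```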
informs: cockroachdb#85641 Temporary knob to allow disabling admission control for non-elastic / bulk requests. Release note: None
90448: admission: Disable admission control for user requests r=andrewbaptist a=andrewbaptist
informs: #85641
Only leave admission control in place for elastic / bulk requests.
Release note: None

91146: storage: enable MVCC range tombstones by default r=erikgrinaker a=erikgrinaker
Resolves #88708.
Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
Going to de-dup in favor of #95563.
Motivation: experiments run by @nvanbenschoten in https://cockroachlabs.slack.com/archives/C038JEXC5AT/p1658247509643359?thread_ts=1657630075.576439&cid=C038JEXC5AT and https://docs.google.com/document/d/1wzkBXaA3Ap_daMV1oY1AhQqlnAjO3pIVLZTXY53m0Xk/edit#heading=h.fhho371lyula
@nvanbenschoten has packaged this as a roachtest in #85002.
Summary: AC was able to keep read amplification in check but foreground throughput and latency were severely affected. Our assumption is that the rate at which foreground work is being admitted has dropped.
admission.kv.pause_replication_io_threshold = 0.8
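As an illustration of the knob above, here is a self-contained Go sketch of the pause-replication decision, assuming the semantics suggested by the setting name: replication traffic to a follower store is paused once its IO overload score reaches 0.8, but never for so many followers that the range would lose quorum. The `follower` type, the quorum handling, and `followersToPause` are simplifications for illustration, not the actual implementation.

```go
// Sketch of pausing replication to IO-overloaded followers; illustrative only.
package main

import "fmt"

// follower is a simplified view of a raft follower's store.
type follower struct {
	storeID         int
	ioOverloadScore float64 // e.g. derived from L0 file / sub-level counts
}

// pauseReplicationIOThreshold mirrors admission.kv.pause_replication_io_threshold = 0.8.
const pauseReplicationIOThreshold = 0.8

// followersToPause returns the store IDs whose replication streams would be
// paused, while never pausing so many followers that quorum is lost.
func followersToPause(followers []follower, quorum int) []int {
	live := len(followers) + 1 // +1 for the leaseholder itself
	var paused []int
	for _, f := range followers {
		if f.ioOverloadScore >= pauseReplicationIOThreshold && live-1 >= quorum {
			paused = append(paused, f.storeID)
			live--
		}
	}
	return paused
}

func main() {
	followers := []follower{{storeID: 2, ioOverloadScore: 0.95}, {storeID: 3, ioOverloadScore: 0.1}}
	fmt.Println(followersToPause(followers, 2)) // [2]: only the overloaded follower is paused
}
```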
@irfansharif @nvanbenschoten
Jira issue: CRDB-18354