admission: CPU slot adjustment and utilization metrics #95007

sumeerbhola · 2023-01-10T16:52:14Z

Our existing metrics are gauges (total and used slots) which don't give us insight into what is happening at smaller time scales. This creates uncertainty when we observe admission queueing but the gauge samples show total slots consistenly greater than used slots. Additionally, if total slots is steady during queuing, it doesn't tell us whether that was because of roughly matching increments or decrements, or no increments/decrements.

The following metrics are added:

admission.granter.slots_exhausted_duration.kv: cumulative duration when the slots were exhausted. This can give insight into how much exhaustion was occurring. It is insufficient to tell us whether 0.5sec/sec of exhaustion is due to a long 500ms of exhaustion and then non-exhaustion or alternating 1ms of exhaustion and non-exhaustion. But this is an improvement over what we have.
admission.granter.slot_adjuster_{increments,decrements}.kv: Counts the increments and decrements of the total slots.
admission.granter.cpu_load_{short,long}_period_duration.kv: cumulative duration of short and long ticks, as indicated by the period in the CPULoad callback. We don't expect long period ticks when admission control is active (and we explicitly disable enforcement during long period ticks), but it helps us eliminate some hypothesis during incidents (e.g. long period ticks alternating with short period ticks causing a slow down in how fast we increment slots). Additionally, the sum of the rate of these two, if significantly < 1, would indicate that CPULoad frequency is lower than expected, say due to CPU overload.

Fixes #92673

Epic: none

Release note: None

cockroach-teamcity · 2023-01-10T16:52:41Z

This change is

irfansharif

If you haven't sanity checked these metrics by hand, next week you could look at the results of the weekly admission-control/multitenant-fairness/read-heavy tests on the prometheus instance the test-eng folks have running. Or actually, that test spits out a prometheus dump in the weekly artifacts (I'm not sure if they get nuked on successful runs).

irfansharif · 2023-01-17T22:47:52Z

pkg/util/admission/granter.go

+	}
+	kvSlotAdjusterIncrements = metric.Metadata{
+		Name:        "admission.granter.slot_adjuster_increments.kv",
+		Help:        "Number of increments of the total KV slots ",


Extra space at the end of this help text and the one below.

irfansharif · 2023-01-17T22:48:55Z

pkg/util/admission/grant_coordinator.go

+		KVTotalSlots: metric.NewGauge(totalSlots),
+		KVUsedSlots:  metric.NewGauge(addName(workKindString(KVWork), usedSlots)),
+		// TODO(sumeer): remove moderate load slots and soft slots code and
+		// metrics.


Refer to #88032? Which will be closed once removed.

sumeerbhola

Thanks for the pointer.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)

pkg/util/admission/grant_coordinator.go line 976 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Refer to #88032? Which will be closed once removed.

Done

pkg/util/admission/granter.go line 760 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Extra space at the end of this help text and the one below.

Done

Our existing metrics are gauges (total and used slots) which don't give us insight into what is happening at smaller time scales. This creates uncertainty when we observe admission queueing but the gauge samples show total slots consistenly greater than used slots. Additionally, if total slots is steady during queuing, it doesn't tell us whether that was because of roughly matching increments or decrements, or no increments/decrements. The following metrics are added: - admission.granter.slots_exhausted_duration.kv: cumulative duration when the slots were exhausted. This can give insight into how much exhaustion was occurring. It is insufficient to tell us whether 0.5sec/sec of exhaustion is due to a long 500ms of exhaustion and then non-exhaustion or alternating 1ms of exhaustion and non-exhaustion. But this is an improvement over what we have. - admission.granter.slot_adjuster_{increments,decrements}.kv: Counts the increments and decrements of the total slots. - admission.granter.cpu_load_{short,long}_period_duration.kv: cumulative duration of short and long ticks, as indicated by the period in the CPULoad callback. We don't expect long period ticks when admission control is active (and we explicitly disable enforcement during long period ticks), but it helps us eliminate some hypothesis during incidents (e.g. long period ticks alternating with short period ticks causing a slow down in how fast we increment slots). Additionally, the sum of the rate of these two, if significantly < 1, would indicate that CPULoad frequency is lower than expected, say due to CPU overload. Fixes cockroachdb#92673 Epic: none Release note: None

sumeerbhola · 2023-01-20T22:09:22Z

bors r=irfansharif

craig · 2023-01-20T23:53:22Z

Build succeeded:

Bazel Essential CI (Cockroach)

blathers-crl · 2023-01-20T23:53:26Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from 774df8d to blathers/backport-release-22.1-95007: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.

error creating merge commit from 774df8d to blathers/backport-release-22.2-95007: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

95590: admission: remove soft/moderate load slots r=irfansharif a=irfansharif We originally introduced these notions in admission control (#78519) for additional threads for Pebble compaction compression. We envisioned granting these "squishy" slots to background activities and permit work only under periods of low load. In working through #86638 (as part of \#75066), we observed experimentally that the moderate-slots count was not sensitive enough to scheduling latency, and consequently latency observed by foreground traffic. Elastic CPU tokens, the kind now being used for backups, offer an alternative to soft slots. We've since replaced uses of soft slots with elastic CPU tokens. This PR just removes the now dead-code code around soft/moderate load slots (it's better to minimize the number of mechanisms in the admission package). Fixes #88032. Release note: None --- First commit is from #95007. Co-authored-by: irfan sharif <[email protected]>

sumeerbhola requested a review from irfansharif January 10, 2023 16:52

sumeerbhola requested a review from a team as a code owner January 10, 2023 16:52

sumeerbhola force-pushed the ac_obs branch from 062a31e to 343b76a Compare January 10, 2023 23:26

irfansharif approved these changes Jan 17, 2023

View reviewed changes

sumeerbhola force-pushed the ac_obs branch from 343b76a to 8db3e1d Compare January 18, 2023 15:43

sumeerbhola commented Jan 18, 2023

View reviewed changes

sumeerbhola force-pushed the ac_obs branch from 8db3e1d to f07755d Compare January 19, 2023 18:52

sumeerbhola requested a review from a team January 19, 2023 18:52

irfansharif added backport-22.1.x labels Jan 19, 2023

sumeerbhola force-pushed the ac_obs branch from f07755d to bef86de Compare January 19, 2023 22:38

sumeerbhola force-pushed the ac_obs branch from bef86de to 774df8d Compare January 20, 2023 02:12

irfansharif mentioned this pull request Jan 20, 2023

admission: remove soft/moderate load slots #95590

Merged

craig bot merged commit 1b79102 into cockroachdb:master Jan 20, 2023

irfansharif mentioned this pull request Feb 1, 2023

release-22.2: admission: CPU slot adjustment and utilization metrics #96356

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

admission: CPU slot adjustment and utilization metrics #95007

admission: CPU slot adjustment and utilization metrics #95007

sumeerbhola commented Jan 10, 2023 •

edited

Loading

cockroach-teamcity commented Jan 10, 2023

irfansharif left a comment

irfansharif Jan 17, 2023

irfansharif Jan 17, 2023

sumeerbhola left a comment

sumeerbhola commented Jan 20, 2023

craig bot commented Jan 20, 2023

blathers-crl bot commented Jan 20, 2023

admission: CPU slot adjustment and utilization metrics #95007

admission: CPU slot adjustment and utilization metrics #95007

Conversation

sumeerbhola commented Jan 10, 2023 • edited Loading

cockroach-teamcity commented Jan 10, 2023

irfansharif left a comment

Choose a reason for hiding this comment

irfansharif Jan 17, 2023

Choose a reason for hiding this comment

irfansharif Jan 17, 2023

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

sumeerbhola commented Jan 20, 2023

craig bot commented Jan 20, 2023

blathers-crl bot commented Jan 20, 2023

sumeerbhola commented Jan 10, 2023 •

edited

Loading