Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

admission: CPU slot adjustment and utilization metrics #95007

Merged
merged 1 commit into from
Jan 20, 2023

Conversation

sumeerbhola
Copy link
Collaborator

@sumeerbhola sumeerbhola commented Jan 10, 2023

Our existing metrics are gauges (total and used slots) which don't give us insight into what is happening at smaller time scales. This creates uncertainty when we observe admission queueing but the gauge samples show total slots consistenly greater than used slots. Additionally, if total slots is steady during queuing, it doesn't tell us whether that was because of roughly matching increments or decrements, or no increments/decrements.

The following metrics are added:

  • admission.granter.slots_exhausted_duration.kv: cumulative duration when the slots were exhausted. This can give insight into how much exhaustion was occurring. It is insufficient to tell us whether 0.5sec/sec of exhaustion is due to a long 500ms of exhaustion and then non-exhaustion or alternating 1ms of exhaustion and non-exhaustion. But this is an improvement over what we have.
  • admission.granter.slot_adjuster_{increments,decrements}.kv: Counts the increments and decrements of the total slots.
  • admission.granter.cpu_load_{short,long}_period_duration.kv: cumulative duration of short and long ticks, as indicated by the period in the CPULoad callback. We don't expect long period ticks when admission control is active (and we explicitly disable enforcement during long period ticks), but it helps us eliminate some hypothesis during incidents (e.g. long period ticks alternating with short period ticks causing a slow down in how fast we increment slots). Additionally, the sum of the rate of these two, if significantly < 1, would indicate that CPULoad frequency is lower than expected, say due to CPU overload.

Fixes #92673

Epic: none

Release note: None

@sumeerbhola sumeerbhola requested a review from a team as a code owner January 10, 2023 16:52
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you haven't sanity checked these metrics by hand, next week you could look at the results of the weekly admission-control/multitenant-fairness/read-heavy tests on the prometheus instance the test-eng folks have running. Or actually, that test spits out a prometheus dump in the weekly artifacts (I'm not sure if they get nuked on successful runs).

}
kvSlotAdjusterIncrements = metric.Metadata{
Name: "admission.granter.slot_adjuster_increments.kv",
Help: "Number of increments of the total KV slots ",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra space at the end of this help text and the one below.

KVTotalSlots: metric.NewGauge(totalSlots),
KVUsedSlots: metric.NewGauge(addName(workKindString(KVWork), usedSlots)),
// TODO(sumeer): remove moderate load slots and soft slots code and
// metrics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer to #88032? Which will be closed once removed.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/util/admission/grant_coordinator.go line 976 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Refer to #88032? Which will be closed once removed.

Done


pkg/util/admission/granter.go line 760 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Extra space at the end of this help text and the one below.

Done

Our existing metrics are gauges (total and used slots) which don't give us
insight into what is happening at smaller time scales. This creates
uncertainty when we observe admission queueing but the gauge samples show
total slots consistenly greater than used slots. Additionally, if total
slots is steady during queuing, it doesn't tell us whether that was because
of roughly matching increments or decrements, or no increments/decrements.

The following metrics are added:
- admission.granter.slots_exhausted_duration.kv: cumulative duration when
  the slots were exhausted. This can give insight into how much exhaustion
  was occurring. It is insufficient to tell us whether 0.5sec/sec of
  exhaustion is due to a long 500ms of exhaustion and then non-exhaustion
  or alternating 1ms of exhaustion and non-exhaustion. But this is an
  improvement over what we have.
- admission.granter.slot_adjuster_{increments,decrements}.kv: Counts the
  increments and decrements of the total slots.
- admission.granter.cpu_load_{short,long}_period_duration.kv: cumulative
  duration of short and long ticks, as indicated by the period in the
  CPULoad callback. We don't expect long period ticks when admission
  control is active (and we explicitly disable enforcement during long
  period ticks), but it helps us eliminate some hypothesis during
  incidents (e.g. long period ticks alternating with short period ticks
  causing a slow down in how fast we increment slots). Additionally, the
  sum of the rate of these two, if significantly < 1, would indicate that
  CPULoad frequency is lower than expected, say due to CPU overload.

Fixes cockroachdb#92673

Epic: none

Release note: None
@sumeerbhola
Copy link
Collaborator Author

bors r=irfansharif

@craig craig bot merged commit 1b79102 into cockroachdb:master Jan 20, 2023
@craig
Copy link
Contributor

craig bot commented Jan 20, 2023

Build succeeded:

@blathers-crl
Copy link

blathers-crl bot commented Jan 20, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 774df8d to blathers/backport-release-22.1-95007: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


error creating merge commit from 774df8d to blathers/backport-release-22.2-95007: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

craig bot pushed a commit that referenced this pull request Jan 23, 2023
95590: admission: remove soft/moderate load slots r=irfansharif a=irfansharif

We originally introduced these notions in admission control (#78519) for additional threads for Pebble compaction compression. We envisioned granting these "squishy" slots to background activities and permit work only under periods of low load. In working through #86638 (as part of \#75066), we observed experimentally that the moderate-slots count was not sensitive enough to scheduling latency, and consequently latency observed by foreground traffic. Elastic CPU tokens, the kind now being used for backups, offer an alternative to soft slots. We've since replaced uses of soft slots with elastic CPU tokens.

This PR just removes the now dead-code code around soft/moderate load slots (it's better to minimize the number of mechanisms in the admission package). Fixes #88032.

Release note: None

---

First commit is from #95007.

Co-authored-by: irfan sharif <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

admission: better observability of slot adjustment behavior
3 participants