admission: CPU slot adjustment and utilization metrics #95007
Conversation
If you haven't sanity checked these metrics by hand, next week you could look at the results of the weekly admission-control/multitenant-fairness/read-heavy tests on the prometheus instance the test-eng folks have running. Or actually, that test spits out a prometheus dump in the weekly artifacts (I'm not sure if they get nuked on successful runs).
pkg/util/admission/granter.go
}
kvSlotAdjusterIncrements = metric.Metadata{
	Name: "admission.granter.slot_adjuster_increments.kv",
	Help: "Number of increments of the total KV slots ",
Extra space at the end of this help text and the one below.
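A sketch of the cleaned-up metadata, assuming the usual pkg/util/metric fields (the Measurement and Unit values below are illustrative, not copied from the PR):

```go
kvSlotAdjusterIncrements = metric.Metadata{
	Name:        "admission.granter.slot_adjuster_increments.kv",
	Help:        "Number of increments of the total KV slots", // trailing space dropped
	Measurement: "Slot Adjustments",                           // illustrative
	Unit:        metric.Unit_COUNT,                            // illustrative
}
```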
KVTotalSlots: metric.NewGauge(totalSlots),
KVUsedSlots:  metric.NewGauge(addName(workKindString(KVWork), usedSlots)),
// TODO(sumeer): remove moderate load slots and soft slots code and
// metrics.
Refer to #88032? Which will be closed once removed.
Thanks for the pointer.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/util/admission/grant_coordinator.go
line 976 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Refer to #88032? Which will be closed once removed.
Done
pkg/util/admission/granter.go
line 760 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Extra space at the end of this help text and the one below.
Done
Our existing metrics are gauges (total and used slots), which don't give us insight into what is happening at smaller time scales. This creates uncertainty when we observe admission queueing but the gauge samples show total slots consistently greater than used slots. Additionally, if total slots is steady during queueing, it doesn't tell us whether that was because of roughly matching increments and decrements, or no increments/decrements at all.

The following metrics are added:

- admission.granter.slots_exhausted_duration.kv: cumulative duration for which the slots were exhausted. This gives insight into how much exhaustion was occurring. It is insufficient to tell us whether 0.5sec/sec of exhaustion is due to one long 500ms stretch of exhaustion or alternating 1ms intervals of exhaustion and non-exhaustion, but it is an improvement over what we have.

- admission.granter.slot_adjuster_{increments,decrements}.kv: counts the increments and decrements of the total slots.

- admission.granter.cpu_load_{short,long}_period_duration.kv: cumulative duration of short and long ticks, as indicated by the period in the CPULoad callback. We don't expect long-period ticks when admission control is active (and we explicitly disable enforcement during long-period ticks), but this helps us rule out some hypotheses during incidents (e.g. long-period ticks alternating with short-period ticks, slowing down how fast we increment slots). Additionally, if the sum of the rates of these two is significantly < 1, that would indicate the CPULoad frequency is lower than expected, say due to CPU overload.

Fixes cockroachdb#92673

Epic: none

Release note: None
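To make the semantics of these counters concrete, here is a minimal Go sketch of how a tick-driven CPULoad callback could accumulate them. This is not the actual granter code: the type, the method names, and the 1ms short/long threshold are assumptions for illustration; the real implementation uses pkg/util/metric counters rather than raw fields.

```go
package main

import (
	"fmt"
	"time"
)

// slotMetrics is a hypothetical, simplified stand-in for the granter's
// metrics; each field corresponds to one of the new cumulative metrics.
type slotMetrics struct {
	slotsExhaustedDuration time.Duration // admission.granter.slots_exhausted_duration.kv
	slotAdjusterIncrements int64         // admission.granter.slot_adjuster_increments.kv
	slotAdjusterDecrements int64         // admission.granter.slot_adjuster_decrements.kv
	cpuLoadShortPeriodDur  time.Duration // admission.granter.cpu_load_short_period_duration.kv
	cpuLoadLongPeriodDur   time.Duration // admission.granter.cpu_load_long_period_duration.kv
}

// onCPULoad mimics the shape of a CPULoad tick: every tick's period is
// accumulated into the short- or long-period duration, and ticks during
// which the slots were exhausted also accumulate into the exhausted
// duration (the assumed threshold: periods of at most 1ms count as short).
func (m *slotMetrics) onCPULoad(period time.Duration, usedSlots, totalSlots int) {
	if period <= time.Millisecond {
		m.cpuLoadShortPeriodDur += period
	} else {
		m.cpuLoadLongPeriodDur += period
	}
	if usedSlots >= totalSlots {
		m.slotsExhaustedDuration += period
	}
}

// adjustTotalSlots counts every increment and decrement separately, so a
// steady total-slots gauge can be distinguished from matching +1/-1 churn.
func (m *slotMetrics) adjustTotalSlots(delta int) {
	switch {
	case delta > 0:
		m.slotAdjusterIncrements++
	case delta < 0:
		m.slotAdjusterDecrements++
	}
}

func main() {
	var m slotMetrics
	// Two short ticks with slots exhausted, one increment, one long tick.
	m.onCPULoad(time.Millisecond, 8, 8)
	m.onCPULoad(time.Millisecond, 8, 8)
	m.adjustTotalSlots(+1)
	m.onCPULoad(250*time.Millisecond, 5, 9)
	fmt.Println(m.slotsExhaustedDuration, m.slotAdjusterIncrements,
		m.cpuLoadShortPeriodDur, m.cpuLoadLongPeriodDur)
}
```

With the two period durations exported as cumulative metrics, a rate of short-period plus long-period duration summing to significantly less than 1 sec/sec would flag the under-ticking (e.g. from CPU overload) that the description above mentions.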
bors r=irfansharif

Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:

You might need to create your backport manually using the backport tool.

error creating merge commit from 774df8d to blathers/backport-release-22.1-95007: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.

error creating merge commit from 774df8d to blathers/backport-release-22.2-95007: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
95590: admission: remove soft/moderate load slots r=irfansharif a=irfansharif

We originally introduced these notions in admission control (#78519) for additional threads for Pebble compaction compression. We envisioned granting these "squishy" slots to background activities, permitting work only under periods of low load. In working through #86638 (as part of #75066), we observed experimentally that the moderate-slots count was not sensitive enough to scheduling latency, and consequently to the latency observed by foreground traffic. Elastic CPU tokens, the kind now being used for backups, offer an alternative to soft slots, and we've since replaced all uses of soft slots with elastic CPU tokens. This PR removes the now-dead code around soft/moderate load slots (it's better to minimize the number of mechanisms in the admission package).

Fixes #88032.

Release note: None

---

First commit is from #95007.

Co-authored-by: irfan sharif <[email protected]>