Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.2: admission: CPU slot adjustment and utilization metrics #96356

Merged

Conversation

irfansharif
Copy link
Contributor

Backport 1/1 commits from #95007. Clean backport (some benign merge conflicts around struct initialization due to other added metrics + linter exclusions due to other added linter exclusions).

/cc @cockroachdb/release


Our existing metrics are gauges (total and used slots) which don't give us insight into what is happening at smaller time scales. This creates uncertainty when we observe admission queueing but the gauge samples show total slots consistenly greater than used slots. Additionally, if total slots is steady during queuing, it doesn't tell us whether that was because of roughly matching increments or decrements, or no increments/decrements.

The following metrics are added:

  • admission.granter.slots_exhausted_duration.kv: cumulative duration when the slots were exhausted. This can give insight into how much exhaustion was occurring. It is insufficient to tell us whether 0.5sec/sec of exhaustion is due to a long 500ms of exhaustion and then non-exhaustion or alternating 1ms of exhaustion and non-exhaustion. But this is an improvement over what we have.
  • admission.granter.slot_adjuster_{increments,decrements}.kv: Counts the increments and decrements of the total slots.
  • admission.granter.cpu_load_{short,long}_period_duration.kv: cumulative duration of short and long ticks, as indicated by the period in the CPULoad callback. We don't expect long period ticks when admission control is active (and we explicitly disable enforcement during long period ticks), but it helps us eliminate some hypothesis during incidents (e.g. long period ticks alternating with short period ticks causing a slow down in how fast we increment slots). Additionally, the sum of the rate of these two, if significantly < 1, would indicate that CPULoad frequency is lower than expected, say due to CPU overload.

Fixes #92673

Epic: none

Release note: None

Our existing metrics are gauges (total and used slots) which don't give us
insight into what is happening at smaller time scales. This creates
uncertainty when we observe admission queueing but the gauge samples show
total slots consistenly greater than used slots. Additionally, if total
slots is steady during queuing, it doesn't tell us whether that was because
of roughly matching increments or decrements, or no increments/decrements.

The following metrics are added:
- admission.granter.slots_exhausted_duration.kv: cumulative duration when
  the slots were exhausted. This can give insight into how much exhaustion
  was occurring. It is insufficient to tell us whether 0.5sec/sec of
  exhaustion is due to a long 500ms of exhaustion and then non-exhaustion
  or alternating 1ms of exhaustion and non-exhaustion. But this is an
  improvement over what we have.
- admission.granter.slot_adjuster_{increments,decrements}.kv: Counts the
  increments and decrements of the total slots.
- admission.granter.cpu_load_{short,long}_period_duration.kv: cumulative
  duration of short and long ticks, as indicated by the period in the
  CPULoad callback. We don't expect long period ticks when admission
  control is active (and we explicitly disable enforcement during long
  period ticks), but it helps us eliminate some hypothesis during
  incidents (e.g. long period ticks alternating with short period ticks
  causing a slow down in how fast we increment slots). Additionally, the
  sum of the rate of these two, if significantly < 1, would indicate that
  CPULoad frequency is lower than expected, say due to CPU overload.

Fixes cockroachdb#92673

Epic: none

Release note: None
@irfansharif irfansharif requested a review from a team as a code owner February 1, 2023 16:13
@irfansharif irfansharif requested a review from a team February 1, 2023 16:13
@blathers-crl
Copy link

blathers-crl bot commented Feb 1, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for backporting!
:lgtm:

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)

@irfansharif irfansharif merged commit 1b80198 into cockroachdb:release-22.2 Feb 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants