More seat metrics for APF #105873

Merged
merged 2 commits into from
Nov 10, 2021

MikeSpreitzer
Member

@MikeSpreitzer MikeSpreitzer commented Oct 25, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds some more metrics to API Priority and Fairness regarding seat usage.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This PR is two commits, one adding the seat_count metrics and one adding the watch_count metrics, so that they can be viewed separately.

Does this PR introduce a user-facing change?

This PR adds the following metrics for API Priority and Fairness.
- **apiserver_flowcontrol_priority_level_seat_count_samples**: histograms of seats occupied by executing requests (both regular and final-delay phases included), broken down by priority_level; the observations are taken once per millisecond.
- **apiserver_flowcontrol_priority_level_seat_count_watermarks**: histograms of high and low watermarks of number of seats occupied by executing requests (both regular and final-delay phases included), broken down by priority_level.
- **apiserver_flowcontrol_watch_count_samples**: histograms of number of watches relevant to a given mutating request, broken down by that request's priority_level and flow_schema.
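
To make the difference between the `_samples` and `_watermarks` histograms concrete, here is a minimal sketch, not the code in this PR. The `bucketed` type, `seatMetrics`, `record`, the bucket bounds, and the window length are all hypothetical names and values chosen for illustration; in particular, the window over which each watermark is taken is an assumption, since the note above only states the sampling period.

```go
package main

import "fmt"

// bucketed is a hypothetical, simplified histogram. counts[i] holds the
// observations that fall at or below bounds[i] and above the previous bound;
// the final slot counts everything larger. Real Prometheus histograms expose
// cumulative buckets; this sketch keeps them separate for readability.
type bucketed struct {
	bounds []int
	counts []int
}

func newBucketed(bounds ...int) *bucketed {
	return &bucketed{bounds: bounds, counts: make([]int, len(bounds)+1)}
}

func (b *bucketed) observe(v int) {
	for i, ub := range b.bounds {
		if v <= ub {
			b.counts[i]++
			return
		}
	}
	b.counts[len(b.bounds)]++
}

// seatMetrics groups the three series sketched here for one priority level.
type seatMetrics struct {
	samples, high, low *bucketed
}

// record consumes seat-occupancy readings in windows of perWindow readings.
// For each full window it observes the last reading as the sample, and the
// maximum and minimum as the high and low watermarks. Trailing readings that
// do not fill a window are ignored.
func record(readings []int, perWindow int, m *seatMetrics) {
	if perWindow <= 0 {
		return
	}
	for start := 0; start+perWindow <= len(readings); start += perWindow {
		w := readings[start : start+perWindow]
		hi, lo := w[0], w[0]
		for _, r := range w[1:] {
			if r > hi {
				hi = r
			}
			if r < lo {
				lo = r
			}
		}
		m.samples.observe(w[len(w)-1])
		m.high.observe(hi)
		m.low.observe(lo)
	}
}

func main() {
	m := &seatMetrics{
		samples: newBucketed(2, 5),
		high:    newBucketed(2, 5),
		low:     newBucketed(2, 5),
	}
	// Two windows of four readings; the spike to 9 falls between samples.
	record([]int{1, 9, 3, 2, 4, 4, 6, 1}, 4, m)
	fmt.Println("samples:", m.samples.counts)
	fmt.Println("high:   ", m.high.counts)
	fmt.Println("low:    ", m.low.counts)
}
```

In this toy input both samples land in the lowest bucket, while both high watermarks land in the overflow bucket, which is the kind of short-lived peak the watermark histograms are meant to expose.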

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


/sig api-machinery
/sig instrumentation
/cc @wojtek-t
/cc @deads2k
/cc @tkashem
/cc @lavalamp

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Oct 25, 2021
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/apiserver area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Oct 25, 2021
Member

@wojtek-t wojtek-t left a comment

@MikeSpreitzer - can you split the commits into separate PRs?
I like the second commit, but I would like to discuss the first a bit more.

name string // varies in tests of fighting controllers
clock clock.PassiveClock
queueSetFactory fq.QueueSetFactory
reqsObsPairGenerator metrics.TimedObserverPairGenerator
Member

Given we're touching this code - I would like to ask questions that I never asked before.

Why do we need these "TimedObservers" (and related) concepts?

Why can't the P&F code just use metrics the way everything else does, by simply hard-coding them and exposing them?

I think this code would be much easier to follow (not just this PR, but the whole P&F code) if we simplified it. And I've never really understood why it has to be so complicated.

Member

Somewhat related question:

Maybe I'm missing something, but why do we even try to implement watermark metrics ourselves in the code?
Conceptually, if we want the highest/lowest value over time, that is not something the metrics themselves should do. It should be the task of the metrics engine/processor, etc.
That is, we report a metric that shows the current value, and we can easily compute the highest/lowest from that (all metrics agents expose queries for that).

If I'm not missing something above, I would really like us to get rid of those metrics and not do the job of the metrics engine as part of P&F...

Member Author

Watermarking has been there since before I got involved. I found that the max-in-flight filter was already watermarking the number in flight. That makes sense, because this is a gauge that can vary much more quickly than we can expect scrapes to happen.

Oddly, the watermarking that the max-in-flight filter has been doing is itself something that surely will not be scraped frequently enough. The watermarking in that filter is only over the last second, and we certainly can expect the apiserver metrics scraping period to be longer than one second!

We could try to pick a better watermarking period, but that would be a difficult exercise in satisfying everybody when we do not even know who everybody is. What the watermark histogram does is take observations of watermarks, so that no scraping period misses watermark observations.
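
The scraping argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not apiserver code: `scrapedMax` and `highWatermarkCounts` are hypothetical helpers, time is modeled as integer ticks, and the tick counts and bucket bound are arbitrary.

```go
package main

import "fmt"

// scrapedMax reports the largest value a scraper would see when it reads,
// once every period ticks, a gauge that holds the maximum of the last window
// ticks. window == 1 models a plain instantaneous gauge.
func scrapedMax(readings []int, period, window int) int {
	if period <= 0 || window <= 0 {
		return 0
	}
	best := 0
	for t := period - 1; t < len(readings); t += period {
		from := t - window + 1
		if from < 0 {
			from = 0
		}
		for _, r := range readings[from : t+1] {
			if r > best {
				best = r
			}
		}
	}
	return best
}

// highWatermarkCounts observes the maximum of every full window of readings
// into a histogram with the given ascending upper bounds. The returned counts
// only ever grow, so a scrape at any later time still sees every observation.
func highWatermarkCounts(readings []int, window int, bounds []int) []int {
	counts := make([]int, len(bounds)+1)
	if window <= 0 {
		return counts
	}
	for start := 0; start+window <= len(readings); start += window {
		hi := readings[start]
		for _, r := range readings[start+1 : start+window] {
			if r > hi {
				hi = r
			}
		}
		slot := len(bounds) // overflow bucket unless a bound fits
		for i, ub := range bounds {
			if hi <= ub {
				slot = i
				break
			}
		}
		counts[slot]++
	}
	return counts
}

func main() {
	// Twenty ticks of low occupancy with one brief spike at tick 3.
	readings := make([]int, 20)
	for i := range readings {
		readings[i] = 1
	}
	readings[3] = 8

	fmt.Println("instantaneous gauge, scraped every 10 ticks:", scrapedMax(readings, 10, 1))
	fmt.Println("2-tick windowed max, scraped every 10 ticks:", scrapedMax(readings, 10, 2))
	fmt.Println("watermark histogram counts (<=4, >4):", highWatermarkCounts(readings, 2, []int{4}))
}
```

With this input neither scraped gauge ever reports the spike, because the scrapes happen at ticks 9 and 19, while the histogram records one high watermark above 4 that remains visible to any later scrape.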

Member

I don't fully agree, but I don't want to block this PR on it either (sorry for sitting on it for so long).
It's not introducing anything new - and we should discuss how to improve it separately.

I opened #106302 to discuss that further.

@caesarxuchao
Member

/assign @tkashem
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 26, 2021
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 1, 2021
@MikeSpreitzer
Member Author

The force-push to 154bf6a is a rebase onto master.

@wojtek-t
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 10, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MikeSpreitzer, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 10, 2021
@k8s-ci-robot k8s-ci-robot merged commit 9351ea2 into kubernetes:master Nov 10, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Nov 10, 2021
@MikeSpreitzer MikeSpreitzer deleted the more-seat-metrics branch November 11, 2021 02:52
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
5 participants