More seat metrics for APF #105873

MikeSpreitzer · 2021-10-25T07:41:04Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds some more metrics to API Priority and Fairness regarding seat usage.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This PR consists of two commits, one adding the seat_count metrics and one adding the watch_count metrics, so that they can be viewed separately.

Does this PR introduce a user-facing change?

This PR adds the following metrics for API Priority and Fairness.
- **apiserver_flowcontrol_priority_level_seat_count_samples**: histograms of seats occupied by executing requests (both regular and final-delay phases included), broken down by priority_level; the observations are taken once per millisecond.
- **apiserver_flowcontrol_priority_level_seat_count_watermarks**: histograms of high and low watermarks of the number of seats occupied by executing requests (both regular and final-delay phases included), broken down by priority_level.
- **apiserver_flowcontrol_watch_count_samples**: histograms of the number of watches relevant to a given mutating request, broken down by that request's priority_level and flow_schema.
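A "samples" histogram records periodic observations of a gauge, not every change to it. Below is a minimal Go sketch of that idea. It assumes a simulated clock in which each slice element is the seat count at one millisecond tick. `seatHistogram` and `sampleSeats` are hypothetical names, not the apiserver's types, and the bucket bounds are made up. The real metrics are Prometheus histograms with their own bucket layout.

```go
package main

import "fmt"

// seatHistogram is a hypothetical, minimal histogram used only for
// illustration; it is not the Prometheus or component-base API.
type seatHistogram struct {
	upperBounds []float64 // bucket upper bounds, ascending
	counts      []uint64  // per-bucket counts; the last slot is the +Inf bucket
	sum         float64
	count       uint64
}

func newSeatHistogram(bounds []float64) *seatHistogram {
	return &seatHistogram{upperBounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// Observe records one sample in the first bucket whose upper bound is >= v.
func (h *seatHistogram) Observe(v float64) {
	i := 0
	for i < len(h.upperBounds) && v > h.upperBounds[i] {
		i++
	}
	h.counts[i]++
	h.sum += v
	h.count++
}

// sampleSeats takes one observation of the occupied-seat gauge per simulated
// millisecond tick, regardless of how often the gauge itself changed.
func sampleSeats(h *seatHistogram, seatsPerMillisecond []int) {
	for _, seats := range seatsPerMillisecond {
		h.Observe(float64(seats))
	}
}

func main() {
	h := newSeatHistogram([]float64{0, 1, 2, 4, 8})
	// Hypothetical seat counts over 10 ms: mostly idle with a short burst.
	sampleSeats(h, []int{0, 0, 1, 3, 6, 6, 2, 0, 0, 0})
	fmt.Println(h.counts, h.count, h.sum) // [5 1 1 1 2 0] 10 18
}
```

Because the sample rate is fixed, the histogram's count grows with elapsed time. Its bucket distribution then approximates the fraction of time spent at each occupancy level.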

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

/sig api-machinery
/sig instrumentation
/cc @wojtek-t
/cc @deads2k
/cc @tkashem
/cc @lavalamp

wojtek-t

@MikeSpreitzer - can you split the commits into separate PRs?
I like the second commit, but I would like to discuss the first a bit more.

wojtek-t · 2021-10-25T08:16:45Z

staging/src/k8s.io/apiserver/pkg/util/flowcontrol/apf_controller.go

+	name                  string // varies in tests of fighting controllers
+	clock                 clock.PassiveClock
+	queueSetFactory       fq.QueueSetFactory
+	reqsObsPairGenerator  metrics.TimedObserverPairGenerator


Given that we're touching this code, I would like to ask some questions that I never asked before.

Why do we need these "TimedObservers" (and related) concepts?

Why can't the P&F code just use metrics the way everything else does, by simply hard-coding the metrics and exposing them in the usual way?

I think this code would get much easier to follow (not just this PR, but the whole P&F code), if we would simplify it. And I've never really understood why it has to be so complicated.

Somewhat related question:

Maybe I'm missing something, but why do we even try to implement watermark metrics ourselves in the code?
Conceptually, if we want the highest/lowest value over time, that is not something that should be done via the metrics themselves. It should be the task of the metrics engine/processor etc.
That is, we report a metric that shows the current value, and the highest/lowest can easily be computed from it (all metrics agents expose queries for that).

If I'm not missing something above, I would really like us to get rid of those metrics and not do the metrics engine's job as part of P&F...

Watermarking has been there since before I got involved. I found that the max-in-flight filter was already watermarking the number in flight. That makes sense, because this is a gauge that can vary much more quickly than we can expect scrapes to happen.
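The concern that a fast-moving gauge can outrun scrapes can be made concrete with a small Go sketch. It assumes a simulated per-millisecond in-flight series and a scraper that reads the gauge once per second. `scrapedPeak` and `truePeak` are hypothetical helpers, not apiserver code.

```go
package main

import "fmt"

// scrapedPeak returns the largest value a scraper would see if it read the
// gauge only every scrapeEvery milliseconds, starting at t=0.
func scrapedPeak(gauge []int, scrapeEvery int) int {
	peak := 0
	for t := 0; t < len(gauge); t += scrapeEvery {
		if gauge[t] > peak {
			peak = gauge[t]
		}
	}
	return peak
}

// truePeak returns the actual maximum over every millisecond.
func truePeak(gauge []int) int {
	peak := 0
	for _, v := range gauge {
		if v > peak {
			peak = v
		}
	}
	return peak
}

func main() {
	// Hypothetical in-flight counts over 3 s: idle except for a 50 ms burst
	// to 40 requests starting at t=1200 ms.
	gauge := make([]int, 3000)
	for t := 1200; t < 1250; t++ {
		gauge[t] = 40
	}
	fmt.Println(scrapedPeak(gauge, 1000), truePeak(gauge)) // 0 40
}
```

Any max/min query over the scraped series can only see the values at scrape instants. A burst that starts and ends between scrapes is therefore invisible to it.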

Oddly, the watermarking that the max-in-flight filter has been doing is itself something that surely will not be scraped frequently enough. The watermarking in that filter is only over the last second, and we certainly can expect the apiserver metrics scraping period to be longer than one second!

We could try to pick a better watermarking period, but that would be a difficult exercise in satisfying everybody when we do not even know who everybody is. What the watermark histogram does is take observations of watermarks, so that no scraping period misses watermark observations.
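The watermark-histogram idea can be sketched in Go as follows. This is only an illustration under stated assumptions, not the apiserver's implementation. A hypothetical `watermarkTracker` keeps the high and low values of a gauge within each period. At the end of a period it records both as observations, with plain slices standing in for histograms, and restarts both watermarks at the current value.

```go
package main

import "fmt"

// watermarkTracker is a hypothetical sketch: it remembers the highest and
// lowest gauge values seen during the current period.
type watermarkTracker struct {
	current, high, low int
	highObs, lowObs    []int // stand-ins for high/low watermark histograms
}

func newWatermarkTracker(initial int) *watermarkTracker {
	return &watermarkTracker{current: initial, high: initial, low: initial}
}

// Set updates the gauge and widens the current period's watermarks if needed.
func (w *watermarkTracker) Set(v int) {
	w.current = v
	if v > w.high {
		w.high = v
	}
	if v < w.low {
		w.low = v
	}
}

// EndPeriod records this period's watermarks as observations and starts a
// new period whose watermarks both begin at the current value.
func (w *watermarkTracker) EndPeriod() {
	w.highObs = append(w.highObs, w.high)
	w.lowObs = append(w.lowObs, w.low)
	w.high, w.low = w.current, w.current
}

func main() {
	w := newWatermarkTracker(0)
	// Period 1: a brief burst to 40 that falls back to 2 before the period ends.
	w.Set(40)
	w.Set(2)
	w.EndPeriod()
	// Period 2: drops to 0, then settles at 5.
	w.Set(0)
	w.Set(5)
	w.EndPeriod()
	fmt.Println(w.highObs, w.lowObs) // [40 5] [0 0]
}
```

A scraper reading an accumulated histogram of these observations sees every period's watermarks, however long its scrape interval is. This is why no scraping period misses the burst to 40.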

I don't fully agree, but I don't want to block this PR on it either (sorry for sitting on it so long).
It doesn't introduce anything new, and we should discuss how to improve it separately.

I opened #106302 to discuss that further.

caesarxuchao · 2021-10-26T20:05:30Z

/assign @tkashem
/triage accepted

MikeSpreitzer · 2021-11-01T20:04:18Z

The force-push to 154bf6a is a rebase onto master.

wojtek-t · 2021-11-10T09:19:47Z

/lgtm
/approve

k8s-ci-robot · 2021-11-10T09:20:28Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MikeSpreitzer, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~staging/src/k8s.io/apiserver/OWNERS~~ [wojtek-t]
~~test/integration/apiserver/OWNERS~~ [wojtek-t]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the release-note label Oct 25, 2021

k8s-ci-robot requested review from deads2k, lavalamp, tkashem and wojtek-t October 25, 2021 07:41

wojtek-t reviewed Oct 25, 2021


k8s-ci-robot assigned tkashem Oct 26, 2021

k8s-ci-robot added the triage/accepted and needs-rebase labels and removed the needs-triage label Oct 26, 2021

MikeSpreitzer added 2 commits November 1, 2021 15:59

Add sample-and-watermark for seats occupied during all of execution

945f960

Add metrics about watch counts seen by APF

154bf6a

MikeSpreitzer force-pushed the more-seat-metrics branch from a4c9075 to 154bf6a November 1, 2021 20:03

k8s-ci-robot removed the needs-rebase label Nov 1, 2021

MikeSpreitzer mentioned this pull request Nov 8, 2021

apf: kubemark-500 scale test fails with mutating request estimator enabled #105804

Closed

k8s-ci-robot assigned wojtek-t Nov 10, 2021

k8s-ci-robot added the lgtm label Nov 10, 2021

k8s-ci-robot added the approved label Nov 10, 2021

k8s-ci-robot merged commit 9351ea2 into kubernetes:master Nov 10, 2021

k8s-ci-robot added this to the v1.23 milestone Nov 10, 2021

MikeSpreitzer deleted the more-seat-metrics branch November 11, 2021 02:52

MikeSpreitzer mentioned this pull request Nov 11, 2021

[WIP] [TEST ONLY] run kubemark-500 test with watch enabled #105768

Closed

MikeSpreitzer mentioned this pull request Dec 6, 2021

apf: exempt request does not note flowschema and prioritylevelconfiguration in the response header #106826

Closed


MikeSpreitzer commented Oct 25, 2021 • edited

wojtek-t left a comment

wojtek-t Oct 25, 2021

wojtek-t Oct 27, 2021

MikeSpreitzer Oct 29, 2021

wojtek-t Nov 10, 2021

caesarxuchao commented Oct 26, 2021

MikeSpreitzer commented Nov 1, 2021

wojtek-t commented Nov 10, 2021

k8s-ci-robot commented Nov 10, 2021
