Revert commit "Add a namespace label to admission metrics and expand histogram range to 0-10s" #104033
/lgtm
/triage accepted
@@ -45,8 +45,8 @@ const (
)

var (
	// Use buckets ranging from 5 ms to 10 seconds (admission webhooks timeout at 30 seconds by default).
	latencyBuckets = []float64{0.005, 0.025, 0.1, 0.5, 2.5, 5.0, 10.0}
Was the issue the namespace label addition or the extra buckets?
We didn't identify which one it was, but the additional buckets certainly amplified the cardinality issue even more. For the most part, the namespace label is the biggest cardinality contributor.
Primarily, the namespace label.
How many series in total did you observe this adding?
It was the churn from e2e tests. They basically create namespaces on a per-test basis.
We even have a test which creates 100 namespaces and deletes them.
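To make the cardinality argument in this thread concrete, here is a rough back-of-the-envelope sketch. It assumes the usual Prometheus histogram exposition (one series per configured bucket, one for the implicit +Inf bucket, plus _sum and _count). The helper names and the figure of 50 existing label combinations are hypothetical and were not measured on any cluster. The result is an upper bound, since not every namespace necessarily hits every label combination.

```go
package main

import "fmt"

// seriesPerLabelSet returns how many time series a single histogram exposes
// for one combination of label values: one per configured bucket, one for
// the implicit +Inf bucket, plus the _sum and _count series.
func seriesPerLabelSet(buckets []float64) int {
	return len(buckets) + 1 + 2
}

// totalSeries gives an upper bound on the series count once a namespace
// label is added: every existing label combination can be multiplied by
// every namespace that was ever observed.
func totalSeries(buckets []float64, labelSets, namespaces int) int {
	return seriesPerLabelSet(buckets) * labelSets * namespaces
}

func main() {
	// Bucket layout taken from the diff under review.
	latencyBuckets := []float64{0.005, 0.025, 0.1, 0.5, 2.5, 5.0, 10.0}

	// Hypothetical number of combinations of the pre-existing labels.
	const labelSets = 50

	for _, namespaces := range []int{1, 100, 1000} {
		fmt.Printf("namespaces=%d upper-bound series=%d\n",
			namespaces, totalSeries(latencyBuckets, labelSets, namespaces))
	}
}
```

Under these assumptions each extra bucket adds one series per label combination, while each extra namespace can add a whole new copy of every combination. That matches the observation that the namespace label is the dominant contributor, especially when tests create and delete namespaces continuously.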
cc @kubernetes/sig-instrumentation-approvers
/lgtm
/approve
/approve
/approve
Please pick this to the release-1.22 branch as well, and give the @kubernetes/release-managers a heads-up that this is incoming.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, liggitt, logicalhan, s-urbaniak, soltysh
The full list of commands accepted by this bot can be found here. The pull request process is described here
I pinged the release managers on Slack and CCed them on the PR.
…04033-upstream-release-1.22 Automated cherry pick of #104033: Revert "Add a namespace label to admission metrics and expand
/kind bug
What this PR does / why we need it:
After a namespace label was added to admission metrics, we found that Prometheus is overwhelmed by out-of-memory errors within seconds because of the amplified series cardinality. This caused OOMs and raised Prometheus memory usage from a steady ~1.5 GiB of RAM to ~8 GiB (note: these numbers are from OpenShift).
Which issue(s) this PR fixes:
Fixes #104008
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Revert "Add a namespace label to admission metrics and expand histogram range to 0-10s".