Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2371: clarify plan for cAdvisor and target at Beta for 1.24 #3003

Merged
merged 5 commits into from
Feb 2, 2022

Conversation

haircommander
Copy link
Contributor

Signed-off-by: Peter Hunt [email protected]

  • Issue link:
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 4, 2021
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 4, 2021
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Oct 4, 2021
@@ -225,7 +225,7 @@ As such, this proposal will be broken up into what needs to be done for each of
2. Add CRI Pod Level Stats message to CRI protobuf that includes all [Pod Level Stats](#summary-pod-stats-object) metrics from Summary API.
3. Add support for the new CRI additions in supported container runtimes (CRI-O and containerd).
4. Switch Kubelet's CRI stats provider from querying container and pod level stats from cAdvisor to newly added CRI pod and container level stats
5. cAdvisor should stop collecting container and pod level stats. If any other components need container or pod level stats from cAdvisor, the CRI implementation should be queried instead.
5. cAdvisor should be updated to support no longer collecting stats that are duplicated with CRI implementation. Any client that requires the metrics that are reported by the CRI should gather them from the CRI instead of cAdvisor.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this somewhat go against the eventual goal of removing /metrics/cadvisor/externalizing metrics, at least unless we add a step 6 equivalent to the existing step 5?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm yeah we're going to have to figure out how to merge the two. I believe there are metrics that cadvisor provides that are out of scope for CRI that should be provided by cadvisor, but there are a set that should no longer be provided by cadvisor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts @bobbypage

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, my thinking is that the high level goal is to provide ability for clients to obtain metrics from CRI (i.e. summary api) rather than cAdvisor.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2022
Some pieces that were initially slated for Alpha have been pushed to Beta

Signed-off-by: Peter Hunt <[email protected]>
@haircommander haircommander changed the title KEP-2371: clarify plan for cAdvisor KEP-2371: clarify plan for cAdvisor and target at Beta for 1.24 Jan 25, 2022
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 25, 2022
@haircommander
Copy link
Contributor Author

I've taken over my own PR and updated it to cover its status movign to beta in 1.24

@haircommander
Copy link
Contributor Author

/cc @bobbypage

- CRI implementations should report any difficulties collecting the stats, and by Beta the CRI stats implementation should perform better than they did with CRI+cAdvisor.
- CRI implementations should support their equivalent of `/metrics/cadvisor`, passing the performance and conformance suite created in Alpha.
- Conduct research to find the set of metrics from `/metrics/cadvisor` that compliant CRI implementations must expose.
- Conformance tests for the fields in `/metrics/cadvisor` should be created.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full k8s conformance tests will be significant effort, perhaps instead we should just ensure that the fields we want to support have overlap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how will we continually verify that without testing? especially as time goes on. If we consider these fields a sort of API of kube, then we should probably do something to verify they're there

- CRI implementations should support their equivalent of `/metrics/cadvisor`, passing the performance and conformance suite created in Alpha.
- Conduct research to find the set of metrics from `/metrics/cadvisor` that compliant CRI implementations must expose.
- Conformance tests for the fields in `/metrics/cadvisor` should be created.
- Performance tests for CPU/memory usage between Kubelet/cAdvisor/CRI implementation should be added.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think k8s already has OSS perf tests, perhaps we should rephrase this to say "Validate performance impact of this feature"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated!

Kubelet should fail early if the CRI implementation won't report the expected metrics.
- For Beta/GA releases, components that rely on `/metrics/cadvisor` should take the decided action (use `/stats/summary`, or use the Kubelet provided `/metrics/cadvisor`).
Kubelet should fallback to use cAdvisor if the CRI implementation won't report the expected metrics.
- For Beta/GA releases, components that rely on `/metrics/cadvisor` should take the decided action (use `/stats/summary`, or use the CRI provided `/metrics/cadvisor`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename, CRI provided -> CRI provided endpoint

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reworded

@@ -651,12 +668,17 @@ _This section must be completed when targeting beta graduation to a release._
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.

The source of `/metrics/cadvisor` is the CRI implementation, not cAdvisor. Further, if the CRI implementation was using the
Copy link
Member

@bobbypage bobbypage Jan 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe avoid providing name for endpoint directly...

The source of the "new" prometheus endpoint will be CRI ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworeded a bit

@@ -669,6 +691,9 @@ the health of the service?**
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code

- 100% coverage of metrics reported by cAdvisor reported by CRI
Copy link
Member

@bobbypage bobbypage Jan 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some of the metrics we said may not be needed; so promising 100% coverage may be a bit too optimistic :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright I dropped it, though I think we may need to provide close to it..

@haircommander
Copy link
Contributor Author

updated with the feedback!

@bobbypage
Copy link
Member

Thank you for updating!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2022
@ehashman
Copy link
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2022
@k8s-ci-robot k8s-ci-robot requested a review from sftim January 31, 2022 21:31
@sftim
Copy link
Contributor

sftim commented Jan 31, 2022

/uncc

@k8s-ci-robot k8s-ci-robot removed the request for review from sftim January 31, 2022 22:51
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 1, 2022
Copy link
Member

@ehashman ehashman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
for PRR

@mrunalp
Copy link
Contributor

mrunalp commented Feb 2, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2022
Copy link
Member

@derekwaynecarr derekwaynecarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for updates.

the <2% metric should be qualified a little further when we do testing for particular density of pods.

/approve
/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, ehashman, haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2022
@k8s-ci-robot k8s-ci-robot merged commit 4530b43 into kubernetes:master Feb 2, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

9 participants