
Remove/propose different "no metrics known for pod" log #349

Closed
serathius opened this issue Nov 5, 2019 · 12 comments

@serathius (Contributor) commented Nov 5, 2019

There are a lot of issues where users see metrics-server reporting the error "no metrics known for pod" and ask for help.

To my understanding this error is expected to occur in a normal, healthy metrics-server.
Metrics Server periodically scrapes all nodes to gather metrics and populate its internal cache. When there is a request to the Metrics API, metrics-server looks up the pod in this cache. If there is no cached value for a pod that exists in k8s, metrics-server reports the error "no metrics known for pod". This means the error can happen when:

  • a fresh metrics-server is deployed with a clean cache
  • the query is about a fresh pod/node that has not been scraped yet

Providing better information to users would greatly reduce the volume of tickets.
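For illustration, here is a minimal sketch of the lookup described above. The names (podCache, PodMetrics, getPodMetrics) are hypothetical and do not mirror the actual metrics-server code; the point is only to show where the message comes from.

```go
// Hypothetical sketch of the cache lookup described above; names and types are
// illustrative only and do not mirror the real metrics-server implementation.
package main

import (
	"fmt"
	"time"
)

type PodMetrics struct {
	CPUMillicores int64
	MemoryBytes   int64
	Timestamp     time.Time
}

// podCache is only populated after a scrape of the pod's node has completed.
var podCache = map[string]PodMetrics{}

func getPodMetrics(namespace, name string) (PodMetrics, error) {
	key := namespace + "/" + name
	m, ok := podCache[key]
	if !ok {
		// A pod that exists in the cluster but has not been scraped yet ends up
		// here, e.g. right after metrics-server starts or right after pod creation.
		return PodMetrics{}, fmt.Errorf("no metrics known for pod %q", key)
	}
	return m, nil
}

func main() {
	if _, err := getPodMetrics("kube-system", "example-pod"); err != nil {
		fmt.Println(err) // prints: no metrics known for pod "kube-system/example-pod"
	}
}
```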

@serathius serathius added this to the v0.4.0 milestone Nov 5, 2019
@serathius (Contributor, Author)

/help

@k8s-ci-robot (Contributor)

@serathius:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Nov 5, 2019
@tomknock

I am seeing the same metrics-server problem in my cluster: unable to fetch pod metrics for pod kube-system/POD_NAME: no metrics known for pod.
Also, Dashboard couldn't be deployed successfully on k8s v1.16.2 without metrics-server/heapster: No metric client provided. Skipping metrics.

@zhangyu84848245

@serathius

I had the same problem before. Adding the following resources block made it work for me; you can try it:

resources:
  limits:
    cpu: 100m
    memory: 300Mi
  requests:
    cpu: 5m
    memory: 50Mi

@serathius (Contributor, Author)

Hey @zhangyu84848245,
This message is related to metrics-server not having its cache pre-filled. Increasing resources as you suggested can reduce the chance of the log message appearing, but will not fully remove it (additional CPU will shorten the time needed to generate the self-signed cert).

@serathius serathius added the kind/bug Categorizes issue or PR as related to a bug. label Dec 12, 2019
@serathius (Contributor, Author)

/assign

@serathius serathius added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed kind/bug Categorizes issue or PR as related to a bug. labels Feb 7, 2020
@serathius (Contributor, Author) commented Mar 15, 2020

Possible solutions (in order of implementation complexity):

  1. [Preferred] Don't treat missing metrics as errors (remove the log calls)
  2. Don't report the error for newly created nodes/containers (time.Now().Sub(startTime) < metricResolution + cAdvisorHousekeepingTime); a rough sketch of this check is included below
  3. Don't report the error if the node was not scraped long enough after the container start (keep the last scrape time per node and check scrapeTime.Sub(startTime) < cAdvisorHousekeepingTime)

Where cAdvisorHousekeepingTime = 15s
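A rough sketch of what the option 2 check could look like; the constant values, names, and the klog call are assumptions for illustration, not the actual metrics-server implementation:

```go
// Rough sketch of option 2: suppress the "no metrics known" error for targets
// that are too young to have been scraped. Names and values are illustrative.
package scraper

import "time"

const (
	metricResolution         = 60 * time.Second // example scrape interval
	cAdvisorHousekeepingTime = 15 * time.Second
)

// tooYoungToHaveMetrics reports whether missing metrics are expected simply
// because the container/node started very recently.
func tooYoungToHaveMetrics(startTime time.Time) bool {
	return time.Since(startTime) < metricResolution+cAdvisorHousekeepingTime
}

// The error would then only be reported for targets that are old enough to
// have been scraped at least once, e.g.:
//
//	if !tooYoungToHaveMetrics(pod.StartTime) {
//	    klog.Errorf("no metrics known for pod %s/%s", pod.Namespace, pod.Name)
//	}
```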

I propose to remove the error logs, as they can be caused by:

  • A failure in scraping. Even then, they don't provide useful information for debugging and still require reading the scrape failure logs.
  • Metric availability delay. In that case they misinform users that there is a problem with the pipeline, instead of informing them that this is expected behavior.

Other ways we can improve visibility into metric availability delay:

  • Document the expected delay of metrics for freshly created containers and nodes.
  • Create a histogram metric that measures the freshness of served metrics (sketched below).
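Such a freshness histogram could be built with the standard Prometheus client. A minimal sketch, where the metric name and bucket layout are assumptions rather than an existing metrics-server metric:

```go
// Minimal sketch of a metric-freshness histogram using prometheus/client_golang.
// Metric name and buckets are assumptions, not an existing metrics-server metric.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var metricFreshness = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "metrics_server",
	Name:      "api_metric_freshness_seconds",
	Help:      "Age of the metric values returned by the Metrics API.",
	Buckets:   prometheus.ExponentialBuckets(1, 2, 8), // 1s, 2s, 4s, ..., 128s
})

func init() {
	prometheus.MustRegister(metricFreshness)
}

// observeFreshness records how old the served value is; it would be called
// when handling a Metrics API request, with the scrape timestamp of the
// returned sample.
func observeFreshness(scrapedAt time.Time) {
	metricFreshness.Observe(time.Since(scrapedAt).Seconds())
}
```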

Both options 2 & 3 try to add logic that guesses the health of the metrics pipeline. They complicate the code without providing any additional benefit. Measuring the health of the pipeline should be done by defining proper metrics and externally monitored SLOs.

/cc @s-urbaniak
Do you agree with this approach?

@serathius (Contributor, Author)

/cc @kawych

@JoseThen

Thank you for this @serathius. I was concerned about the issue, but noticed it stops logging after some time; everything is looking good so far 🙇

@serathius serathius added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 5, 2020
@serathius (Contributor, Author)

ping @s-urbaniak

@s-urbaniak (Contributor)

I agree with just going forward with option 1. I think options 2 and 3 should be solved via a higher-level alerting system.

@serathius (Contributor, Author)

Looks like the work was done.
