kubelet: pod metrics not visible with Docker 18 (19.03 works) #94281

Closed

pmyjavec opened this issue Aug 27, 2020 · 19 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/instrumentation: Categorizes an issue or PR as relevant to SIG Instrumentation.
sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments


pmyjavec commented Aug 27, 2020

What happened:

When deploying a cluster on v1.19.0 using kubeadm, we no longer see the kubelet reporting pod metrics from the node. This breaks metrics-server to some degree, since we're missing pod statistics:

{                    
 "node": {               
  "nodeName": "k8s-node-0",
  "systemContainers": [
   {
    "name": "kubelet",                
    "startTime": "2020-08-27T11:17:49Z",
    "cpu": {                      
     "time": "2020-08-27T12:19:43Z",
     "usageNanoCores": 18925650,        
     "usageCoreNanoSeconds": 71340541215
    },       
    "memory": {                   
     "time": "2020-08-27T12:19:43Z",
     "usageBytes": 40591360,
     "workingSetBytes": 39964672,
     "rssBytes": 39510016,
     "pageFaults": 345372,
     "majorPageFaults": 10
    }
   },
   {       
    "name": "runtime",                 
    "startTime": "2020-08-27T04:23:40Z",
    "cpu": {
     "time": "2020-08-27T12:19:51Z",
     "usageNanoCores": 13500270,
     "usageCoreNanoSeconds": 339260029980
    },
    "memory": {
     "time": "2020-08-27T12:19:51Z",
     "usageBytes": 686555136,
     "workingSetBytes": 172347392,
     "rssBytes": 58580992,
     "pageFaults": 349662,
     "majorPageFaults": 309
    }
   },
   {
    "name": "pods",
    "startTime": "2020-08-27T04:23:40Z",
    "cpu": {
     "time": "2020-08-27T12:19:48Z",
     "usageNanoCores": 53850241,
     "usageCoreNanoSeconds": 1162308046776
    },
    "memory": {
     "time": "2020-08-27T12:19:48Z",
     "availableBytes": 16232984576,
     "usageBytes": 168726528,
     "workingSetBytes": 107012096,
     "rssBytes": 36741120,
     "pageFaults": 0,
     "majorPageFaults": 0
    }
   }
  ],
  "startTime": "2020-08-27T04:22:56Z",
  "cpu": {
   "time": "2020-08-27T12:19:48Z",
   "usageNanoCores": 85723273,
   "usageCoreNanoSeconds": 2118444201905
  },
  "memory": {
   "time": "2020-08-27T12:19:48Z",
   "availableBytes": 15762960384,
   "usageBytes": 1376034816,
   "workingSetBytes": 577036288,
   "rssBytes": 213475328,
   "pageFaults": 38512,
   "majorPageFaults": 564
  }
 },
 "pods": []

Note the empty pods section

What you expected to happen:

We expected to see a populated list of pod metrics, as we did before, looking something like:

 "pods": [                                                                                                                                                                                                
  {                                                                                                                                                                                                       
   "podRef": {                                                                                       
    "name": "metrics-server-64b57fd654-bcmlv",                                                       
    "namespace": "kube-system",                   
    "uid": "8f08ed66-2f6f-4ded-89d7-b49a717f2127"
   },                                                                                                                                                                                                     
   "startTime": "2020-08-27T11:53:27Z",
   "containers": [
    {
     "name": "metrics-server",
     "startTime": "2020-08-27T11:53:28Z",
     "cpu": {
      "time": "2020-08-27T12:23:00Z",
      "usageNanoCores": 178862,
      "usageCoreNanoSeconds": 1451974591
     },
     "memory": {
      "time": "2020-08-27T12:23:00Z",
      "usageBytes": 42545152,
      "workingSetBytes": 15671296,
      "rssBytes": 12148736,
      "pageFaults": 5633,
      "majorPageFaults": 223
     }
    }
   ],
   "cpu": {
    "time": "2020-08-27T12:23:04Z",
    "usageNanoCores": 348505,
    "usageCoreNanoSeconds": 1496841316
   },
   "memory": {
    "time": "2020-08-27T12:23:04Z",
    "usageBytes": 44687360,
    "workingSetBytes": 17813504,
    "rssBytes": 12185600,
    "pageFaults": 0,
    "majorPageFaults": 0
   }
  }]

How to reproduce it (as minimally and precisely as possible):

Use kubeadm init on 1.19.0 to create a new cluster and query the stats endpoint using:

curl -X GET  https://10.250.0.5:10250/stats/summary?only_cpu_and_memory=true --header "Authorization: Bearer $TOKEN" --insecure 

We ran the same configuration on v1.18.2 and didn't hit the issue.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:30:33Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
  AWS EC2 Instance: c5.2xlarge
  • OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="16.04.7 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.7 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
  • Kernel (e.g. uname -a):
Linux k8s-master-0 4.4.0-1109-aws #120-Ubuntu SMP Fri Jun 5 01:26:57 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: deb
  • Network plugin and version (if this is a network-related bug):
  • Others:
    N/A
pmyjavec added the kind/bug label Aug 27, 2020
k8s-ci-robot added the needs-sig label Aug 27, 2020
@pmyjavec (Author) commented:

/sig node
/sig instrumentation

k8s-ci-robot added the sig/node and sig/instrumentation labels and removed the needs-sig label Aug 27, 2020

pmyjavec commented Aug 27, 2020

We've narrowed this down to being a problem with Docker 18; it looks like the issue went away once we upgraded to Docker 19.03. Please feel free to close this if it's no longer relevant.

I'm wondering if this would be better documented somewhere? Maybe it is and we missed it?

Thanks.

@neolit123 (Member) commented:

hi,

We ran the same configuration on v1.18.2 and didn't hit the issue.

^ you've mentioned in the OP that this does not happen in 1.18.2.

We've narrowed this down to being a problem with Docker 18; it looks like the issue went away once we upgraded to Docker 19.03. Please feel free to close this if it's no longer relevant.

to clarify, with 1.19.0 and Docker 18 the problem happens, but not with 1.19 and Docker 19.03?

@pmyjavec (Author) commented:

@neolit123,

to clarify, with 1.19.0 and Docker 18 the problem happens, but not with 1.19 and Docker 19.03?

Yes, sorry if that wasn't clear.

@neolit123 (Member) commented:

ok, this is very odd. hopefully someone from SIG node sees this.
if you want to raise attention you could post in the #sig-node channel on k8s slack.

@neolit123 (Member) commented:

/retitle kubelet: pod metrics not visible with Docker 18 (19.03 works)

k8s-ci-robot changed the title from "No longer receiveing pod stats from stats/summary since 1.19.0" to "kubelet: pod metrics not visible with Docker 18 (19.03 works)" Aug 27, 2020
@serathius (Contributor) commented:

/cc @dashpole

@ialidzhikov (Contributor) commented:

I can also confirm that the issue reproduces with [email protected], [email protected] and docker://18.6.3.

$ k -n kube-system top po
W0901 13:13:52.018978   35484 top_pod.go:274] Metrics not available for pod kube-system/calico-kube-controllers-68657c7d94-727n2, age: 2m25.018968s
error: Metrics not available for pod kube-system/calico-kube-controllers-68657c7d94-727n2, age: 2m25.018968s


$ k get no -o wide
NAME                                        STATUS   ROLES    AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                    KERNEL-VERSION    CONTAINER-RUNTIME
ip-10-222-0-22.eu-west-1.compute.internal   Ready    <none>   80s     v1.19.0   10.222.0.22   <none>        Container Linux by CoreOS 2512.3.0 (Oklo)   4.19.123-coreos   docker://18.6.3
ip-10-222-0-37.eu-west-1.compute.internal   Ready    <none>   4m32s   v1.19.0   10.222.0.37   <none>        Container Linux by CoreOS 2512.3.0 (Oklo)   4.19.123-coreos   docker://18.6.3

$ k -n kube-system logs metrics-server-68498c577c-bvtpc
I0901 10:12:21.118498       1 manager.go:95] Scraping metrics from 0 sources
I0901 10:12:21.118580       1 manager.go:148] ScrapeMetrics: time: 908ns, nodes: 0, pods: 0
I0901 10:12:21.221635       1 secure_serving.go:116] Serving securely on [::]:8443
E0901 10:13:19.026582       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/coredns-97dfb4f59-zht6w: no metrics known for pod
E0901 10:13:19.026615       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/coredns-97dfb4f59-x4wfv: no metrics known for pod
I0901 10:13:21.118663       1 manager.go:95] Scraping metrics from 2 sources
I0901 10:13:21.126702       1 manager.go:120] Querying source: kubelet_summary:ip-10-222-0-22.eu-west-1.compute.internal
I0901 10:13:21.130055       1 manager.go:120] Querying source: kubelet_summary:ip-10-222-0-37.eu-west-1.compute.internal
I0901 10:13:21.189014       1 manager.go:148] ScrapeMetrics: time: 70.170272ms, nodes: 2, pods: 0
E0901 10:13:49.538888       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/coredns-97dfb4f59-x4wfv: no metrics known for pod
E0901 10:13:49.538906       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/coredns-97dfb4f59-zht6w: no metrics known for pod
E0901 10:13:51.747200       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/node-problem-detector-7hqhq: no metrics known for pod
E0901 10:13:51.747219       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/csi-driver-node-ps2xk: no metrics known for pod
E0901 10:13:51.747224       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/node-problem-detector-c5bdf: no metrics known for pod
E0901 10:13:51.747229       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/coredns-97dfb4f59-zht6w: no metrics known for pod
E0901 10:13:51.747234       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-typha-vertical-autoscaler-55b85db9c9-9p9px: no metrics known for pod
E0901 10:13:51.747239       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/coredns-97dfb4f59-x4wfv: no metrics known for pod
E0901 10:13:51.747244       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-typha-deploy-7bc68bcd86-9ln87: no metrics known for pod
E0901 10:13:51.747249       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-kube-controllers-68657c7d94-727n2: no metrics known for pod
E0901 10:13:51.747254       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/kube-proxy-2sgdh: no metrics known for pod
E0901 10:13:51.747259       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-node-lkrdd: no metrics known for pod
E0901 10:13:51.747264       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/csi-driver-node-bctws: no metrics known for pod
E0901 10:13:51.747268       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-node-g8bvp: no metrics known for pod
E0901 10:13:51.747272       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/kube-proxy-tghqn: no metrics known for pod
E0901 10:13:51.747278       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-typha-horizontal-autoscaler-86cfb97885-8462k: no metrics known for pod
E0901 10:13:51.747283       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/vpn-shoot-7666fbf977-hnljr: no metrics known for pod
E0901 10:13:51.747287       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/metrics-server-68498c577c-bvtpc: no metrics known for pod
E0901 10:13:51.747291       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-node-vertical-autoscaler-5788bdd9cb-vwjjs: no metrics known for pod
E0901 10:13:53.360448       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/csi-driver-node-bctws: no metrics known for pod
E0901 10:13:53.360463       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-typha-vertical-autoscaler-55b85db9c9-9p9px: no metrics known for pod
E0901 10:13:53.360467       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/coredns-97dfb4f59-x4wfv: no metrics known for pod
E0901 10:13:53.360470       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-typha-deploy-7bc68bcd86-9ln87: no metrics known for pod
E0901 10:13:53.360474       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-kube-controllers-68657c7d94-727n2: no metrics known for pod
E0901 10:13:53.360477       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/kube-proxy-2sgdh: no metrics known for pod
E0901 10:13:53.360480       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-node-lkrdd: no metrics known for pod
E0901 10:13:53.360483       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/kube-proxy-tghqn: no metrics known for pod
E0901 10:13:53.360498       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-node-g8bvp: no metrics known for pod
E0901 10:13:53.360501       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-node-vertical-autoscaler-5788bdd9cb-vwjjs: no metrics known for pod
E0901 10:13:53.360514       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/calico-typha-horizontal-autoscaler-86cfb97885-8462k: no metrics known for pod
E0901 10:13:53.360517       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/vpn-shoot-7666fbf977-hnljr: no metrics known for pod
E0901 10:13:53.360521       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/metrics-server-68498c577c-bvtpc: no metrics known for pod
E0901 10:13:53.360524       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/coredns-97dfb4f59-zht6w: no metrics known for pod
E0901 10:13:53.360527       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/node-problem-detector-7hqhq: no metrics known for pod
E0901 10:13:53.360530       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/csi-driver-node-ps2xk: no metrics known for pod
E0901 10:13:53.360533       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/node-problem-detector-c5bdf: no metrics known for pod
E0901 10:14:20.051957       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/coredns-97dfb4f59-x4wfv: no metrics known for pod
E0901 10:14:20.051978       1 reststorage.go:160] unable to fetch pod metrics for pod kube-system/coredns-97dfb4f59-zht6w: no metrics known for pod
I0901 10:14:21.118610       1 manager.go:95] Scraping metrics from 2 sources
I0901 10:14:21.129795       1 manager.go:120] Querying source: kubelet_summary:ip-10-222-0-37.eu-west-1.compute.internal
I0901 10:14:21.133783       1 manager.go:120] Querying source: kubelet_summary:ip-10-222-0-22.eu-west-1.compute.internal
I0901 10:14:21.146267       1 manager.go:148] ScrapeMetrics: time: 27.627598ms, nodes: 2, pods: 0

/priority important-soon

@dashpole (Contributor) commented:

Can anyone grab metrics from the kubelet on one of those nodes?

@saul-data commented:

I am experiencing the same issue with the new Kubernetes 1.19 and Docker 18.9.9 on DigitalOcean:

(base) ➜  ~ kubectl -n kube-system top po
W1102 21:03:51.720671   71950 top_pod.go:266] Metrics not available for pod kube-system/cilium-2g625, age: 31h31m35.72066s
error: Metrics not available for pod kube-system/cilium-2g625, age: 31h31m35.72066s
(base) ➜  ~ kubectl get no -o wide
NAME              STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP      OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
xxx   Ready    <none>   31h   v1.19.3   10.106.xxx    178.128.xxx   Debian GNU/Linux 10 (buster)   4.19.0-11-amd64   docker://18.9.9
xxx  Ready    <none>   32h   v1.19.3   10.106.xxx    46.101.xxx    Debian GNU/Linux 10 (buster)   4.19.0-11-amd64   docker://18.9.9


ghouscht commented Nov 3, 2020

Same issue here on our cluster running k8s 1.19.3 with Docker 18.9.8. Node metrics still work, but there are no pod metrics:

#> k get --raw /apis/metrics.k8s.io/v1beta1/pods
{"kind":"PodMetricsList","apiVersion":"metrics.k8s.io/v1beta1","metadata":{"selfLink":"/apis/metrics.k8s.io/v1beta1/pods"},"items":[]}

On a different cluster also running k8s 1.19.3 with docker 19.3.12 the pod and node metrics are ok.

Edit:
I had a quick look at the kubelet stats API and, as far as I can tell, the code there seems fine. So this is probably related to the Docker client being updated by commit 8e8e153. As the commit states, this points to the tag 19.03.8 in docker/docker.

This led me to check the kubelet logs and look what I found there:

Nov 03 10:08:21 e1-k8s-alsu120 kubelet[31240]: I1103 10:08:21.689487 31240 factory.go:161] Registration of the docker container factory failed: failed to validate Docker info: failed to detect Docker info: Error response from daemon: client version 1.40 is too new. Maximum supported API version is 1.39

So the kubelet expects a Docker server with API version >= 1.40, which was introduced with Docker 19 (Docker 18 is on API version 1.39). I always thought that Docker 18 was still supported by Kubernetes, but it's probably worth upgrading Docker now... Anyway, I'm not 100% confident that this is caused by the referenced commit, but at least it seems possible to me.

The Kubernetes release notes list which versions of Docker are compatible with that version of Kubernetes.
https://kubernetes.io/docs/setup/production-environment/container-runtimes/#docker

I can't find such info in https://kubernetes.io/docs/setup/release/notes/. Can someone clarify which versions of Docker are supported by k8s 1.19.x?

Edit 2:
I bisected through a few commits and can confirm that 8e8e153 is the cause of this.


dims commented Nov 6, 2020

@JornShen please see #89687 (review)


please try WithAPIVersionNegotiation
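
For context, here is a minimal standalone sketch (not the actual kubelet/cAdvisor code) of what that option changes in the Docker Go client (github.com/docker/docker/client):

// Illustrative only: shows why a client pinned to API 1.40 fails against a
// Docker 18.x daemon (max API 1.39), while API version negotiation works.
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()

	// A client pinned to API 1.40 (the default for the 19.03-level client)
	// gets rejected by an 18.x daemon with:
	// "client version 1.40 is too new. Maximum supported API version is 1.39"
	pinned, err := client.NewClientWithOpts(client.FromEnv, client.WithVersion("1.40"))
	if err == nil {
		if _, err := pinned.Info(ctx); err != nil {
			fmt.Println("pinned client:", err)
		}
	}

	// With negotiation, the client pings the daemon before the first request
	// and downgrades its API version to what the daemon reports, so the same
	// call succeeds on both Docker 18.x and 19.03.
	negotiated, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		fmt.Println("create client:", err)
		return
	}
	info, err := negotiated.Info(ctx)
	if err != nil {
		fmt.Println("negotiated client:", err)
		return
	}
	fmt.Println("negotiated API version:", negotiated.ClientVersion(), "server:", info.ServerVersion)
}

If I understand the discussion correctly, that is the direction the cadvisor fix takes, rather than hard-pinning the client to the newer API version.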


JornShen commented Nov 6, 2020

@dims thanks for the guidance. I see a solution to the problem, inspired by your advice.
I will open a PR to try to fix it.

/assign

@JornShen (Member) commented:

The fix PR google/cadvisor#2714 has been merged.

So do we wait for the cadvisor owners to update the k/k dependency on cadvisor to v0.38 in k8s 1.20, and then cherry-pick that dependency bump to v1.19? @dims

I see there is already an issue, #96287, tracking this.


dims commented Nov 10, 2020

@JornShen yes someone will pick up #96287 soon. @bobbypage are we ready to cut a release of cadvisor?

@bobbypage (Member) commented:

Ack, I will cut a new release of cadvisor shortly.

@JornShen (Member) commented:

@dims @bobbypage all right, thanks!

@JornShen (Member) commented:

@pmyjavec @saul-gush @ialidzhikov @serathius @ghouscht

If you are using a Docker version below 19.x, update to the newest v1.19 release (the problem has been solved there by #96849), or update to a v1.20 alpha version built after 2020/11/14, which includes #96425.

@k8s-ci-robot (Contributor) commented:

@JornShen: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
