Metrics-server failing to scrape nodes due to timeout and DNS lookup errors #14708

fbozic · 2022-12-02T10:43:45Z

/kind bug

I'm trying to set up a new cluster on GCE with metrics-server enabled. I'm facing two issues, which are not related to each other.

The first issue is that metrics-server runs on the worker nodes and cannot scrape the masters. I fixed this by manually adding tcp:10250 to the nodes-to-master firewall rule.
The second issue is that metrics-server cannot resolve some hosts by node name. Which hosts fail to resolve varies on each rerun (delete, create), and sometimes one metrics-server pod can resolve all hosts while the other cannot. I noticed this because running k top no several times gives different results depending on which pod serves the request. I think I've managed to fix it by setting --kubelet-preferred-address-types=InternalIP, based on this comment: Metrics server issue with hostname resolution of kubelet and apiserver unable to communicate with metric-server clusterIP kubernetes-sigs/metrics-server#131. I've checked the dns-controller pod and it looks okay to me (no errors).
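
To tell the two failure modes apart for a single node, something like the following could be run from a pod in the cluster. This is only a sketch in Python and not part of kops or metrics-server: the function name classify_kubelet_endpoint and its return labels are made up for illustration, and it only checks name resolution and a plain TCP connect to port 10250, not the TLS handshake or kubelet authentication.

```python
import socket
from typing import Callable


def classify_kubelet_endpoint(
    host: str,
    port: int = 10250,
    timeout: float = 3.0,
    resolve: Callable[[str], str] = socket.gethostbyname,
    connect: Callable[..., object] = socket.create_connection,
) -> str:
    """Return 'dns', 'timeout', 'unreachable', or 'ok' for host:port.

    The resolve and connect parameters are injectable (hypothetical
    design choice) so the logic can be exercised without a cluster.
    """
    # Step 1: name resolution. A failure here corresponds to the
    # "no such host" errors in the metrics-server logs.
    try:
        ip = resolve(host)
    except socket.gaierror:
        return "dns"

    # Step 2: TCP connect to the kubelet port. A timeout here is what a
    # missing firewall rule for tcp:10250 would typically look like.
    try:
        conn = connect((ip, port), timeout=timeout)
    except socket.timeout:
        return "timeout"
    except OSError:
        return "unreachable"

    conn.close()
    return "ok"
```

Calling classify_kubelet_endpoint("master-europe-west3-a-dlc1") would then report which of the two problems affects that node. Passing the node's InternalIP instead of its name skips the DNS step in practice, which mirrors what --kubelet-preferred-address-types=InternalIP does for metrics-server.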

1. What kops version are you running? The command kops version, will display
this information.
Client version: 1.25.3 (git-v1.25.3)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:29:58Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

GCE

4. What commands did you run? What is the simplest way to reproduce this issue?

kops create -f kops.yaml
kops update cluster --name cluster.k8s.local --yes
kops export kubecfg --admin
kops validate cluster --wait 10m

5. What happened after the commands executed?
Cluster validation passes and all pods are running, but k top no returns values for a subset of nodes only.

6. What did you expect to happen?
k top no returns values for all nodes consistently.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  name: cluster.k8s.local
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudConfig: {}
  cloudProvider: gce
  configBase: gs://fbozic-kops-state-store/cluster.k8s.local
  etcdClusters:
    - cpuRequest: 200m
      etcdMembers:
        - instanceGroup: master-europe-west3-a
          name: a
        - instanceGroup: master-europe-west3-b
          name: b
        - instanceGroup: master-europe-west3-c
          name: c
      memoryRequest: 100Mi
      name: main
    - cpuRequest: 100m
      etcdMembers:
        - instanceGroup: master-europe-west3-a
          name: a
        - instanceGroup: master-europe-west3-b
          name: b
        - instanceGroup: master-europe-west3-c
          name: c
      memoryRequest: 100Mi
      name: events
    - cpuRequest: 100m
      etcdMembers:
        - instanceGroup: master-europe-west3-a
          name: a
        - instanceGroup: master-europe-west3-b
          name: b
        - instanceGroup: master-europe-west3-c
          name: c
      manager:
        env:
          - name: ETCD_AUTO_COMPACTION_MODE
            value: revision
          - name: ETCD_AUTO_COMPACTION_RETENTION
            value: "2500"
      memoryRequest: 100Mi
      name: cilium
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeProxy:
    enabled: false
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
    - 0.0.0.0/0
    - ::/0
  kubernetesVersion: 1.25.4
  masterPublicName: api.cluster.k8s.local
  metricsServer:
    enabled: true
    insecure: true
  networkID: fbozic
  networking:
    cilium:
      enableNodePort: true
      etcdManaged: true
  nonMasqueradeCIDR: 100.64.0.0/10
  project: fbozic-ops
  subnets:
    - cidr: 10.0.32.0/20
      name: europe-west3
      region: europe-west3
      type: Private
  topology:
    dns:
      type: Public
    masters: private
    nodes: private

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: cluster.k8s.local
  name: master-europe-west3-a
spec:
  image: ubuntu-os-cloud/ubuntu-2004-focal-v20221018
  machineType: n1-standard-4
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
    - europe-west3
  zones:
    - europe-west3-a

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: cluster.k8s.local
  name: master-europe-west3-b
spec:
  image: ubuntu-os-cloud/ubuntu-2004-focal-v20221018
  machineType: n1-standard-4
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
    - europe-west3
  zones:
    - europe-west3-b

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: cluster.k8s.local
  name: master-europe-west3-c
spec:
  image: ubuntu-os-cloud/ubuntu-2004-focal-v20221018
  machineType: n1-standard-4
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
    - europe-west3
  zones:
    - europe-west3-c

---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: null
  labels:
    kops.k8s.io/cluster: cluster.k8s.local
  name: nodes-europe-west3
spec:
  image: ubuntu-os-cloud/ubuntu-2004-focal-v20221018
  machineType: n1-standard-4
  maxSize: 6
  minSize: 3
  role: Node
  subnets:
    - europe-west3
  zones:
    - europe-west3-a
    - europe-west3-b
    - europe-west3-c

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

Initial k top no (no masters and one node is missing).

❯ k top no
NAME                         CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%     
nodes-europe-west3-634r      54m          1%          1338Mi          8%          
nodes-europe-west3-jmjn      35m          0%          1290Mi          8%          
master-europe-west3-a-dlc1   <unknown>    <unknown>   <unknown>       <unknown>   
master-europe-west3-b-xh98   <unknown>    <unknown>   <unknown>       <unknown>   
master-europe-west3-c-5b0n   <unknown>    <unknown>   <unknown>       <unknown>   
nodes-europe-west3-5vhx      <unknown>    <unknown>   <unknown>       <unknown>

k top no after the firewall fix (one pod can scrape all hosts, the other cannot scrape two nodes):

❯ date +%c
Fri 02 Dec 2022 11:25:31 AM CET
❯ k top no
NAME                         CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%     
master-europe-west3-b-xh98   206m         5%          2293Mi          15%         
master-europe-west3-c-5b0n   246m         6%          2194Mi          14%         
nodes-europe-west3-634r      49m          1%          1326Mi          8%          
nodes-europe-west3-jmjn      45m          1%          1331Mi          8%          
nodes-europe-west3-5vhx      <unknown>    <unknown>   <unknown>       <unknown>   
master-europe-west3-a-dlc1   <unknown>    <unknown>   <unknown>       <unknown>   
❯ date +%c
Fri 02 Dec 2022 11:25:35 AM CET
❯ k top no
NAME                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
master-europe-west3-a-dlc1   242m         6%     2245Mi          15%       
master-europe-west3-b-xh98   219m         5%     2293Mi          15%       
master-europe-west3-c-5b0n   226m         5%     2194Mi          14%       
nodes-europe-west3-5vhx      51m          1%     1293Mi          8%        
nodes-europe-west3-634r      57m          1%     1328Mi          8%        
nodes-europe-west3-jmjn      48m          1%     1329Mi          8%        
❯ date +%c
Fri 02 Dec 2022 11:25:39 AM CET

After --kubelet-preferred-address-types fix k top no works as expected and there are no errors in logs.

Logs from the metrics-server pods before the --kubelet-preferred-address-types fix. There are two kinds of Failed to scrape node errors: timeouts (context deadline exceeded or i/o timeout) and no such host. I've pasted only the last 50 lines of the log from each pod.
Pod a:

E1202 09:55:09.936739       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:55:09.936809       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-b-xh98"
E1202 09:55:24.937930       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:55:24.937930       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-b-xh98"
E1202 09:55:24.937932       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": dial tcp 10.0.32.5:10250: i/o timeout" node="master-europe-west3-c-5b0n"
E1202 09:55:26.495356       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": dial tcp: lookup master-europe-west3-b-xh98 on 100.64.0.10:53: no such host" node="master-europe-west3-b-xh98"
E1202 09:55:39.937046       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:55:39.938211       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-c-5b0n"
E1202 09:55:41.518207       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": dial tcp: lookup master-europe-west3-c-5b0n on 100.64.0.10:53: no such host" node="master-europe-west3-c-5b0n"
E1202 09:55:54.937643       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-b-xh98"
E1202 09:55:54.937657       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:55:56.472195       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": dial tcp: lookup master-europe-west3-b-xh98 on 100.64.0.10:53: no such host" node="master-europe-west3-b-xh98"
E1202 09:56:09.937864       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp 10.0.32.7:10250: i/o timeout" node="master-europe-west3-a-dlc1"
E1202 09:56:09.937906       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-c-5b0n"
E1202 09:56:24.937906       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-c-5b0n"
E1202 09:56:24.937972       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp 10.0.32.7:10250: i/o timeout" node="master-europe-west3-a-dlc1"
E1202 09:56:24.937972       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-b-xh98"
E1202 09:56:26.456015       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": dial tcp: lookup master-europe-west3-b-xh98 on 100.64.0.10:53: no such host" node="master-europe-west3-b-xh98"
E1202 09:56:39.937383       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:56:39.937391       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-c-5b0n"
E1202 09:56:41.504596       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": dial tcp: lookup master-europe-west3-c-5b0n on 100.64.0.10:53: no such host" node="master-europe-west3-c-5b0n"
E1202 09:56:54.937714       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-b-xh98"
E1202 09:56:54.937714       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp 10.0.32.7:10250: i/o timeout" node="master-europe-west3-a-dlc1"
E1202 09:56:56.487095       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": dial tcp: lookup master-europe-west3-b-xh98 on 100.64.0.10:53: no such host" node="master-europe-west3-b-xh98"
E1202 09:56:56.501547       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": dial tcp: lookup master-europe-west3-c-5b0n on 100.64.0.10:53: no such host" node="master-europe-west3-c-5b0n"
E1202 09:57:09.938403       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:57:11.456769       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": dial tcp: lookup master-europe-west3-c-5b0n on 100.64.0.10:53: no such host" node="master-europe-west3-c-5b0n"
E1202 09:57:11.465082       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": dial tcp: lookup master-europe-west3-b-xh98 on 100.64.0.10:53: no such host" node="master-europe-west3-b-xh98"
E1202 09:57:24.937385       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:57:39.937478       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp 10.0.32.7:10250: i/o timeout" node="master-europe-west3-a-dlc1"
E1202 09:57:39.937574       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-c-5b0n"
E1202 09:57:39.937574       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-b-xh98"
E1202 09:57:41.501771       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": dial tcp: lookup master-europe-west3-b-xh98 on 100.64.0.10:53: no such host" node="master-europe-west3-b-xh98"
E1202 09:57:54.937897       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-c-5b0n"
E1202 09:57:54.937957       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:58:09.937237       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-c-5b0n"
E1202 09:58:09.937375       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-b-xh98"
E1202 09:58:09.938606       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:58:11.475042       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": dial tcp: lookup master-europe-west3-c-5b0n on 100.64.0.10:53: no such host" node="master-europe-west3-c-5b0n"
E1202 09:58:11.495190       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": dial tcp: lookup master-europe-west3-b-xh98 on 100.64.0.10:53: no such host" node="master-europe-west3-b-xh98"
E1202 09:58:24.938687       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:58:39.937680       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": dial tcp 10.0.32.5:10250: i/o timeout" node="master-europe-west3-c-5b0n"
E1202 09:58:39.937681       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:58:39.937800       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-b-xh98"
E1202 09:58:41.507949       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": dial tcp: lookup master-europe-west3-c-5b0n on 100.64.0.10:53: no such host" node="master-europe-west3-c-5b0n"
E1202 09:58:54.937612       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:58:54.937688       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-b-xh98"
E1202 09:59:09.938045       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-b-xh98"
E1202 09:59:09.938159       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-a-dlc1"
E1202 09:59:09.938221       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": dial tcp 10.0.32.5:10250: i/o timeout" node="master-europe-west3-c-5b0n"
E1202 09:59:11.492392       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-b-xh98:10250/metrics/resource\": dial tcp: lookup master-europe-west3-b-xh98 on 100.64.0.10:53: no such host" node="master-europe-west3-b-xh98"
E1202 09:59:24.938333       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-c-5b0n:10250/metrics/resource\": context deadline exceeded" node="master-europe-west3-c-5b0n"

Pod b:

E1202 10:14:28.118341       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:14:43.080964       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:14:43.108814       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:14:58.082350       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:14:58.112073       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:15:13.084207       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:15:13.121976       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:15:28.078819       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:15:28.107244       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:15:43.102142       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:15:43.111225       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:15:58.090505       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:15:58.134285       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:16:13.102149       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:16:13.104975       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:16:28.080763       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:16:28.107115       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:16:43.099951       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:16:43.129289       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:16:58.103055       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:16:58.104737       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:17:13.080889       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:17:13.122221       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:17:28.091874       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:17:28.101353       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:17:43.090045       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:17:43.096981       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:17:58.076076       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:17:58.078543       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:18:13.097178       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:18:13.135848       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:18:28.074794       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:18:28.089889       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:18:43.072307       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:18:43.087606       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:18:58.087551       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:18:58.098949       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:19:13.085428       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:19:13.116151       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:19:28.073051       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:19:28.075806       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:19:43.078198       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:19:43.131486       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:19:58.079025       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:19:58.119685       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:20:13.112296       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:20:13.126524       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:20:28.098628       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E1202 10:20:28.113788       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:20:43.079137       1 scraper.go:140] "Failed to scrape node" err="Get \"https://master-europe-west3-a-dlc1:10250/metrics/resource\": dial tcp: lookup master-europe-west3-a-dlc1 on 100.64.0.10:53: no such host" node="master-europe-west3-a-dlc1"
E1202 10:20:43.121105       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"

9. Anything else we need to know?


hakman · 2022-12-02T12:35:58Z

Cool, thanks again for the debugging effort. Will get to work on a fix soon.
/assign

fbozic · 2022-12-03T22:05:31Z

Hi, thanks for the quick fix.
I see that #14709 fixes the second issue (InternalIP for kubelet). But the first issue, where metrics-server running on the nodes cannot scrape the masters because of a missing firewall rule, is not addressed yet, and this issue has been closed. Can we re-open it until the first issue is fixed?
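The two failure modes discussed in this issue leave different signatures in the metrics-server log: `no such host` for the name-resolution problem, and `i/o timeout` when the kubelet port is not reachable. Below is a minimal Python sketch, assuming the log format shown in this thread, that groups failing nodes by cause. The function name `classify_scrape_errors` and the cause labels are illustrative only; they are not part of metrics-server or kOps.

```python
import re
from collections import defaultdict

# Matches the node="..." field that ends each "Failed to scrape node" line.
NODE_RE = re.compile(r'node="([^"]+)"')


def classify_scrape_errors(log_text):
    """Group nodes from "Failed to scrape node" lines by apparent cause."""
    causes = defaultdict(set)
    for line in log_text.splitlines():
        if "Failed to scrape node" not in line:
            continue
        match = NODE_RE.search(line)
        if match is None:
            continue  # e.g. a truncated log line without the node field
        node = match.group(1)
        if "no such host" in line:
            cause = "dns"      # node name could not be resolved
        elif "i/o timeout" in line:
            cause = "timeout"  # port 10250 unreachable (firewall / security group)
        else:
            cause = "other"
        causes[cause].add(node)
    return {cause: sorted(nodes) for cause, nodes in causes.items()}


# Sample lines in the format reported in this thread (one DNS failure, one timeout).
SAMPLE_LOG = r'''
E1202 10:19:28.073051       1 scraper.go:140] "Failed to scrape node" err="Get \"https://nodes-europe-west3-5vhx:10250/metrics/resource\": dial tcp: lookup nodes-europe-west3-5vhx on 100.64.0.10:53: no such host" node="nodes-europe-west3-5vhx"
E0315 10:08:53.688761 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.49.161.59:10250/metrics/resource\": dial tcp 10.49.161.59:10250: i/o timeout" node="ip-10-49-161-59.eu-west-3.compute.internal"
'''

if __name__ == "__main__":
    for cause, nodes in classify_scrape_errors(SAMPLE_LOG).items():
        print(cause, nodes)
```

Feeding it the output of `kubectl logs` for each metrics-server pod separately may help show whether one replica resolves all hosts while another does not.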

sichiba · 2023-03-15T13:00:18Z

Hello there, I'm having a similar issue. I have these logs:

PS C:\windows\system32> kubectl logs -f metrics-server-58b7f877fc-67txx -n kube-system
I0315 10:08:23.309378 1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0315 10:08:23.681150 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0315 10:08:23.681186 1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0315 10:08:23.681160 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0315 10:08:23.681197 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0315 10:08:23.681161 1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0315 10:08:23.681304 1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0315 10:08:23.681454 1 secure_serving.go:266] Serving securely on [::]:10250
W0315 10:08:23.681489 1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I0315 10:08:23.681525 1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0315 10:08:23.681535 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0315 10:08:23.781829 1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0315 10:08:23.781854 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0315 10:08:23.781842 1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0315 10:08:42.689261 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0315 10:08:52.689153 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
E0315 10:08:53.688761 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.49.161.59:10250/metrics/resource\": dial tcp 10.49.161.59:10250: i/o timeout" node="ip-10-49-161-59.eu-west-3.compute.internal"
E0315 10:08:53.691870 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.49.161.177:10250/metrics/resource\": dial tcp 10.49.161.177:10250: i/o timeout" node="ip-10-49-161-177.eu-west-3.compute.internal"
E0315 10:08:53.692942 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.49.161.214:10250/metrics/resource\": dial tcp 10.49.161.214:10250: i/o timeout" node="ip-10-49-161-214.eu-west-3.compute.internal"
E0315 10:08:53.708202 1 scraper.go:140] "Failed to scrape node" err="Get "https://10.49.161

metrics-server doesn't seem to see the other nodes in the cluster, although they all have the same configuration and all have TCP port 10250 configured in the security group (SG):
PS C:\windows\system32> kubectl top nodes -n kube-system
NAME                                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-49-161-64.eu-west-3.compute.internal    55m          1%     1959Mi          13%
ip-10-49-161-125.eu-west-3.compute.internal
ip-10-49-161-160.eu-west-3.compute.internal
ip-10-49-161-177.eu-west-3.compute.internal
ip-10-49-161-252.eu-west-3.compute.internal
ip-10-49-161-59.eu-west-3.compute.internal

I just want to add that we're running metrics-server 0.6.1 on EKS 1.25. I've already applied all the hacks and workarounds mentioned, such as --metric-resolution, --kubelet-preferred-address-types, and --kubelet-insecure-tls=true, and none of them solved the issue. Can anyone here help?
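To track which nodes are affected across repeated runs, the `kubectl top nodes` output can be checked mechanically. Below is a small Python sketch, assuming the whitespace-separated layout pasted above. The helper name `nodes_without_metrics` is hypothetical. Treating `<unknown>` as a missing value is an assumption about how kubectl may render nodes that have no metrics; in the paste above those columns are simply empty.

```python
def nodes_without_metrics(top_output):
    """Return node names whose row carries no usable metric columns."""
    missing = []
    for line in top_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        values = fields[1:]
        # Assumption: a node without metrics has either no value columns
        # or only "<unknown>" placeholders.
        if not values or all(value == "<unknown>" for value in values):
            missing.append(fields[0])
    return missing


# Output as pasted in this thread.
SAMPLE_TOP = """\
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-10-49-161-64.eu-west-3.compute.internal 55m 1% 1959Mi 13%
ip-10-49-161-125.eu-west-3.compute.internal
ip-10-49-161-160.eu-west-3.compute.internal
ip-10-49-161-177.eu-west-3.compute.internal
ip-10-49-161-252.eu-west-3.compute.internal
ip-10-49-161-59.eu-west-3.compute.internal
"""

if __name__ == "__main__":
    for name in nodes_without_metrics(SAMPLE_TOP):
        print(name)
```

In this sample the only node that reports metrics is ip-10-49-161-64. A pattern like that might indicate that only the kubelet on the node hosting the metrics-server pod is reachable, which points toward a network rule rather than a metrics-server flag.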

mbhattrh23 · 2023-07-24T14:00:30Z

Hi @sichiba, did you figure this out? I am having the same issue on EKS 1.25.

sichiba · 2023-08-09T13:21:07Z

Hi @mbhattrh23, that was actually due to a security group. We figured out that port 10250 must be open in the security groups of the EKS cluster (both control plane and data plane).
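A quick way to confirm this kind of problem is to test whether TCP port 10250 accepts connections from the place where metrics-server runs. The following is a rough Python sketch using only the standard library. The helper names `can_connect` and `unreachable` are made up for illustration, and the check only proves TCP reachability, not that the kubelet would authorize the request.

```python
import socket

KUBELET_PORT = 10250  # kubelet port that appears in the scrape URLs in this thread


def can_connect(host, port=KUBELET_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (packets dropped by a
        # firewall or security group) and name-resolution failures.
        return False


def unreachable(hosts, port=KUBELET_PORT, timeout=3.0):
    """Return the subset of hosts that did not accept a TCP connection."""
    return [host for host in hosts if not can_connect(host, port, timeout)]
```

For example, `unreachable(["10.49.161.59", "10.49.161.177"])` would list the node IPs from the logs above that still time out. The result is only meaningful when run from a pod or host on the same network path as metrics-server, since firewall and security group rules depend on the source of the traffic.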

Pintus04 · 2024-09-16T06:30:43Z

Hi Team,
I'm getting an error from a Kubernetes worker (500: Internal Server Error). How can I resolve this?

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 2, 2022

k8s-ci-robot assigned hakman Dec 2, 2022

hakman added the area/provider/gcp Issues or PRs related to gcp provider label Dec 2, 2022

hakman mentioned this issue Dec 2, 2022

metrics-server: Set preferred address type to InternalIP when non AWS #14709

Merged

k8s-ci-robot closed this as completed in #14709 Dec 3, 2022

hakman reopened this Dec 3, 2022

hakman mentioned this issue Dec 4, 2022

gce: Allow metrics-server to access kubelet API #14722

Merged

k8s-ci-robot closed this as completed in #14722 Dec 4, 2022
