
nfd-topology-updater: can't run normally when there are empty hugepages in some NUMA nodes #1286

Closed
freelizhun opened this issue Jul 31, 2023 · 2 comments
freelizhun commented Jul 31, 2023

What happened:
nfd-topology-updater cannot run normally when some NUMA nodes have empty hugepages, e.g.:

[root@master1 node-feature-discovery]# kubectl -n node-feature-discovery get pods -o wide
NAME                               READY   STATUS             RESTARTS       AGE     IP             NODE      NOMINATED NODE   READINESS GATES
nfd-topology-updater-mwhx2         1/1     Running            0              7m52s   10.119.1.203   node1     <none>           <none>
nfd-topology-updater-rv6kp         0/1     CrashLoopBackOff   6 (117s ago)   7m52s   10.119.0.75    master1   <none>           <none>

[root@master1 node-feature-discovery]# kubectl -n node-feature-discovery logs nfd-topology-updater-rv6kp 
I0731 08:11:40.262277       1 nfd-topology-updater.go:127] "Node Feature Discovery Topology Updater" version="v0.14.0-devel-161-ge0f10a81-dirty" nodeName="master1"
I0731 08:11:40.262395       1 util_unix.go:103] "Using this endpoint is deprecated, please consider using full URL format" endpoint="/host-var/lib/kubelet/pod-resources/kubelet.sock" URL="unix:///host-var/lib/kubelet/pod-resources/kubelet.sock"
I0731 08:11:40.262455       1 component.go:36] [core][Channel #1] Channel created
I0731 08:11:40.262473       1 component.go:36] [core][Channel #1] original dial target is: "/host-var/lib/kubelet/pod-resources/kubelet.sock"
I0731 08:11:40.262506       1 component.go:36] [core][Channel #1] parsed dial target is: {Scheme: Authority: Endpoint:host-var/lib/kubelet/pod-resources/kubelet.sock URL:{Scheme: Opaque: User: Host: Path:/host-var/lib/kubelet/pod-resources/kubelet.sock RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}
I0731 08:11:40.262519       1 component.go:36] [core][Channel #1] fallback to scheme "passthrough"
I0731 08:11:40.262539       1 component.go:36] [core][Channel #1] parsed dial target is: {Scheme:passthrough Authority: Endpoint:/host-var/lib/kubelet/pod-resources/kubelet.sock URL:{Scheme:passthrough Opaque: User: Host: Path://host-var/lib/kubelet/pod-resources/kubelet.sock RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}
I0731 08:11:40.262560       1 component.go:36] [core][Channel #1] Channel authority set to "/host-var/lib/kubelet/pod-resources/kubelet.sock"
I0731 08:11:40.262725       1 component.go:36] [core][Channel #1] Resolver state updated: {
  "Addresses": [
    {
      "Addr": "/host-var/lib/kubelet/pod-resources/kubelet.sock",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Type": 0,
      "Metadata": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
I0731 08:11:40.262790       1 component.go:36] [core][Channel #1] Channel switches to new LB policy "pick_first"
I0731 08:11:40.262825       1 component.go:36] [core][Channel #1 SubChannel #2] Subchannel created
2023/07/31 08:11:40 Connected to '"/host-var/lib/kubelet/pod-resources/kubelet.sock"'!
I0731 08:11:40.262904       1 component.go:36] [core][Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
I0731 08:11:40.262935       1 component.go:36] [core][Channel #1 SubChannel #2] Subchannel picks a new address "/host-var/lib/kubelet/pod-resources/kubelet.sock" to connect
I0731 08:11:40.263074       1 component.go:36] [core][Channel #1] Channel Connectivity change to CONNECTING
I0731 08:11:40.263126       1 nfd-topology-updater.go:294] "configuration file parsed" path="/etc/kubernetes/node-feature-discovery/nfd-topology-updater.conf" config=&{ExcludeList:map[]}
I0731 08:11:40.263148       1 podresourcesscanner.go:53] "watching all namespaces"
I0731 08:11:40.263366       1 component.go:36] [core][Channel #1 SubChannel #2] Subchannel Connectivity change to READY
I0731 08:11:40.263392       1 component.go:36] [core][Channel #1] Channel Connectivity change to READY
E0731 08:11:40.493194       1 main.go:71] "error while running" err="failed to obtain node resource information: open /host-sys/bus/node/devices/node1/hugepages: no such file or directory"
[root@master1 node-feature-discovery]# 
[root@master1 node-feature-discovery]# ls /sys/bus/node/devices/node1
compact  cpu10  cpu11  cpu12  cpu13  cpu14  cpu15  cpu8  cpu9  cpulist  cpumap  distance  meminfo  numastat  power  subsystem  uevent  vmstat
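On an affected host, a quick shell loop (standard sysfs paths assumed) shows which NUMA nodes lack the `hugepages` directory that the updater tries to open; memoryless nodes on this kernel simply do not expose one:

```shell
# Report NUMA nodes whose sysfs entry has no hugepages directory
# (memoryless nodes expose no such directory at all).
for n in /sys/bus/node/devices/node*; do
  [ -d "$n/hugepages" ] || echo "$n has no hugepages directory"
done
```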
[root@master1 node-feature-discovery]# numactl -H
available: 16 nodes (0-15)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32272 MB
node 0 free: 24754 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32730 MB
node 2 free: 27810 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 32730 MB
node 4 free: 28156 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 32730 MB
node 6 free: 30288 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 0 MB
node 7 free: 0 MB
node 8 cpus: 64 65 66 67 68 69 70 71
node 8 size: 32666 MB
node 8 free: 24379 MB
node 9 cpus: 72 73 74 75 76 77 78 79
node 9 size: 0 MB
node 9 free: 0 MB
node 10 cpus: 80 81 82 83 84 85 86 87
node 10 size: 32730 MB
node 10 free: 26705 MB
node 11 cpus: 88 89 90 91 92 93 94 95
node 11 size: 0 MB
node 11 free: 0 MB
node 12 cpus: 96 97 98 99 100 101 102 103
node 12 size: 32707 MB
node 12 free: 27130 MB
node 13 cpus: 104 105 106 107 108 109 110 111
node 13 size: 0 MB
node 13 free: 0 MB
node 14 cpus: 112 113 114 115 116 117 118 119
node 14 size: 31665 MB
node 14 free: 29324 MB
node 15 cpus: 120 121 122 123 124 125 126 127
node 15 size: 0 MB
node 15 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
  0:  10  20  40  30  20  30  50  40  100  100  100  100  100  100  100  100 
  1:  20  10  30  40  50  20  40  50  100  100  100  100  100  100  100  100 
  2:  40  30  10  20  40  50  20  30  100  100  100  100  100  100  100  100 
  3:  30  40  20  10  30  20  40  50  100  100  100  100  100  100  100  100 
  4:  20  50  40  30  10  50  30  20  100  100  100  100  100  100  100  100 
  5:  30  20  50  20  50  10  50  40  100  100  100  100  100  100  100  100 
  6:  50  40  20  40  30  50  10  30  100  100  100  100  100  100  100  100 
  7:  40  50  30  50  20  40  30  10  100  100  100  100  100  100  100  100 
  8:  100  100  100  100  100  100  100  100  10  20  40  30  20  30  50  40 
  9:  100  100  100  100  100  100  100  100  20  10  30  40  50  20  40  50 
 10:  100  100  100  100  100  100  100  100  40  30  10  20  40  50  20  30 
 11:  100  100  100  100  100  100  100  100  30  40  20  10  30  20  40  50 
 12:  100  100  100  100  100  100  100  100  20  50  40  30  10  50  30  20 
 13:  100  100  100  100  100  100  100  100  30  20  50  20  50  10  50  40 
 14:  100  100  100  100  100  100  100  100  50  40  20  40  30  50  10  30 
 15:  100  100  100  100  100  100  100  100  40  50  30  50  20  40  30  10 
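The `numactl` output above shows the pattern behind the crash: half of the NUMA nodes report `size: 0 MB`, and those memoryless nodes have no `hugepages` directory in sysfs, so the updater's resource scan fails with `no such file or directory`. A tolerant reader could instead treat a missing directory as "zero hugepages". The following is a minimal Go sketch of that idea (hypothetical helper, not NFD's actual code; `sysRoot` stands in for the mounted host sysfs):

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// readHugepages lists the hugepage-size subdirectories for one NUMA node.
// A memoryless node has no "hugepages" directory at all, so a missing
// directory is reported as "no hugepages" rather than a fatal error.
func readHugepages(sysRoot string, nodeID int) ([]string, error) {
	dir := filepath.Join(sysRoot, "bus/node/devices",
		fmt.Sprintf("node%d", nodeID), "hugepages")
	entries, err := os.ReadDir(dir)
	if errors.Is(err, fs.ErrNotExist) {
		return nil, nil // node has no memory: zero hugepages, not an error
	}
	if err != nil {
		return nil, err
	}
	var sizes []string
	for _, e := range entries {
		sizes = append(sizes, e.Name()) // e.g. "hugepages-2048kB"
	}
	return sizes, nil
}

func main() {
	sizes, err := readHugepages("/sys", 0)
	fmt.Println(sizes, err)
}
```

With this shape, a node like `node1` above yields an empty result instead of aborting the whole topology update.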

What you expected to happen:
nfd-topology-updater pods should run normally even when some NUMA nodes have empty hugepages.

How to reproduce it (as minimally and precisely as possible):
$ git clone https://github.com/kubernetes-sigs/node-feature-discovery.git
$ cd node-feature-discovery
$ kubectl apply -k deployment/overlays/topologyupdater

Environment:

  • Kubernetes version: v1.24.13
@freelizhun added the kind/bug label on Jul 31, 2023
@freelizhun (Author) commented:

/assign

@freelizhun changed the title from "nfd-topology-updater: can't running normally when there are no memory chip in some numa nodes" to "nfd-topology-updater: can't running normally when there are empty huagepages in some numa nodes" on Jul 31, 2023
marquiz (Contributor) commented Jul 31, 2023:

Thanks @freelizhun for reporting this (and for the fix, too).

@freelizhun changed the title from "nfd-topology-updater: can't running normally when there are empty huagepages in some numa nodes" to "nfd-topology-updater: can't run normally when there are empty huagepages in some numa nodes" on Aug 8, 2023