
nnf-dm + clientmountd cpu usage #117

Open
behlendorf opened this issue Dec 12, 2023 · 6 comments

@behlendorf (Collaborator)

We've observed that the nnf-dm and clientmountd daemons generate a surprising amount of system noise on the computes even when they should be idle. For reference, over the last 2 weeks they've been lightly used yet have racked up ~10 minutes of CPU time each, compared to most other idle system daemons, which report <5 seconds of CPU usage over the same period. The nnf-dm and clientmountd usage is similar across compute nodes.

root       33954       1  0 Nov14 ?        00:09:44 /usr/bin/nnf-dm <...args...>
root       34061       1  0 Nov14 ?        00:10:11 /usr/bin/clientmountd <...args...>
                                           ^^^^^^^^^

Corosync, which is required for gfs2, generates even more noise on the compute nodes. One possible mitigation would be to start/stop the pacemaker service on computes only when a gfs2 filesystem has been requested. This could be done either by Flux when setting up the computes or by the clientmountd, which is already running there.

@bdevcich (Contributor)

We've been investigating this. Early signs point to Go Garbage Collection and HTTP2 Health Checks. Both can be tuned.

@bdevcich (Contributor)

@behlendorf which version of Go are you using to build the daemons?

@behlendorf (Collaborator, Author)

We're building with the RHEL 8.9 version of Go.

# rpm -q golang
golang-1.20.10-1.module+el8.9.0+20382+04f7fe80.x86_64

# go version
go version go1.20.10 linux/amd64

@bdevcich bdevcich moved this from 📋 Open to 🏗 In progress in Issues Dashboard Dec 19, 2023
@bdevcich (Contributor)

bdevcich commented Jan 2, 2024

We have found that we can reduce the CPU usage by approximately 66% by tuning garbage collection and the frequency of the HTTP2 health checks. Compared to your original observation of ~10 minutes of CPU time over 14 days, we were able to get CPU usage down to 3m26s over 12 days (22-Dec to 02-Jan). This can be done by setting the following environment variables in the systemd unit file:

Environment=GOGC=off
Environment=GOMEMLIMIT=20MiB
Environment=GOMAXPROCS=5
Environment=HTTP2_PING_TIMEOUT_SECONDS=60
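For reference, one way these could land on a node is a systemd drop-in rather than an edit to the shipped unit file; the path here is an assumption based on the `Drop-In: .../override.conf` line in the status output below:

```ini
# /etc/systemd/system/clientmountd.service.d/override.conf  (assumed path)
[Service]
Environment=GOGC=off
Environment=GOMEMLIMIT=20MiB
Environment=GOMAXPROCS=5
Environment=HTTP2_PING_TIMEOUT_SECONDS=60
```

A `systemctl daemon-reload` followed by a restart of the service would then pick up the new environment.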

Output of systemctl status with CPU accounting enabled:

[root@x9000c3s0b0n0 ~]# systemctl status clientmountd
● clientmountd.service - Data Workflow Service (DWS) Client Mount Service
   Loaded: loaded (/etc/systemd/system/clientmountd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/clientmountd.service.d
           └─override.conf
   Active: active (running) since Wed 2023-12-20 13:24:23 CST; 1 weeks 5 days ago
 Main PID: 7840 (clientmountd)
    Tasks: 10 (limit: 1646821)
   Memory: 19.0M
      CPU: 3min 25.808s

@bdevcich (Contributor)

bdevcich commented Jan 9, 2024

These environment variables have been checked into master in nnf-deploy for each daemon (i.e. nnf-dm, clientmountd). The variables can be found here:

NearNodeFlash/nnf-deploy#105

@bdevcich (Contributor)

The current solution is to start/stop these daemons at will. Flux will do that: flux-framework/flux-coral2#166
