
nnf-dm + clientmountd cpu usage #117

Open
behlendorf opened this issue Dec 12, 2023 · 6 comments

@behlendorf (Collaborator)

We've observed that the nnf-dm and clientmountd daemons generate a surprising amount of system noise on the computes even when they should be idle. For reference, over the last 2 weeks they've been lightly used yet have racked up ~10 minutes of CPU time each, compared to most other idle system daemons, which report <5 seconds of CPU usage over the same period. The nnf-dm and clientmountd usage is similar across compute nodes.

root       33954       1  0 Nov14 ?        00:09:44 /usr/bin/nnf-dm <...args...>
root       34061       1  0 Nov14 ?        00:10:11 /usr/bin/clientmountd <...args...>
                                           ^^^^^^^^^

Corosync, which is required for gfs2, generates even more noise on the compute nodes. One possible mitigation would be to start/stop the pacemaker service on computes only when a gfs2 filesystem has been requested. This could be done either by Flux when setting up the computes or by the clientmountd, which is already running there.

@bdevcich (Contributor)

We've been investigating this. Early signs point to Go Garbage Collection and HTTP2 Health Checks. Both can be tuned.

@bdevcich (Contributor)

@behlendorf which version of Go are you using to build the daemons?

@behlendorf (Collaborator, Author)

We're building with the RHEL 8.9 version of Go.

# rpm -q golang
golang-1.20.10-1.module+el8.9.0+20382+04f7fe80.x86_64

# go version
go version go1.20.10 linux/amd64

@bdevcich bdevcich moved this from 📋 Open to 🏗 In progress in Issues Dashboard Dec 19, 2023
@bdevcich (Contributor)

bdevcich commented Jan 2, 2024

We have found that we can reduce the CPU usage by approximately 66% by tuning garbage collection and the frequency of the HTTP2 health checks. Compared to your original observation of ~10 minutes of CPU time over 14 days, we were able to get CPU usage down to 3m26s over 12 days (22-Dec to 02-Jan). This can be done by setting the following environment variables in the systemd unit file:

Environment=GOGC=off
Environment=GOMEMLIMIT=20MiB
Environment=GOMAXPROCS=5
Environment=HTTP2_PING_TIMEOUT_SECONDS=60
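For reference, one way these could land on a node is a systemd drop-in rather than an edit to the shipped unit file; the path here is an assumption based on the `Drop-In: .../override.conf` line in the status output below:

```ini
# /etc/systemd/system/clientmountd.service.d/override.conf  (assumed path)
[Service]
Environment=GOGC=off
Environment=GOMEMLIMIT=20MiB
Environment=GOMAXPROCS=5
Environment=HTTP2_PING_TIMEOUT_SECONDS=60
```

A `systemctl daemon-reload` followed by a restart of the service would then pick up the new environment.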

Output of systemctl status with CPU accounting enabled:

[root@x9000c3s0b0n0 ~]# systemctl status clientmountd
● clientmountd.service - Data Workflow Service (DWS) Client Mount Service
   Loaded: loaded (/etc/systemd/system/clientmountd.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/clientmountd.service.d
           └─override.conf
   Active: active (running) since Wed 2023-12-20 13:24:23 CST; 1 weeks 5 days ago
 Main PID: 7840 (clientmountd)
    Tasks: 10 (limit: 1646821)
   Memory: 19.0M
      CPU: 3min 25.808s

@bdevcich (Contributor)

bdevcich commented Jan 9, 2024

These environment variables have been checked into master in nnf-deploy for each daemon (i.e. nnf-dm, clientmountd). The variables can be found here:

NearNodeFlash/nnf-deploy#105

@bdevcich (Contributor)

The current solution is to start/stop these daemons at will. Flux will do that: flux-framework/flux-coral2#166
