Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomad node becoming un-responsoive #7987

Closed
raunak2004 opened this issue May 16, 2020 · 30 comments
Closed

Nomad node becoming un-responsoive #7987

raunak2004 opened this issue May 16, 2020 · 30 comments

Comments

@raunak2004
Copy link

raunak2004 commented May 16, 2020

Nomad node became unresponsive which causes no jobs to be scheduled.
The systemctl status nomad returns active but none of the nomad CLI command return any response. Since the nomad is unresponsive any service dependent on nomad also starts acting up as in this case it is zookeeper
image

image

image

Logs:

May 07 18:35:18 us02pA-zkp1 nomad[7324]: 2020-05-07T18:35:18.953Z [INFO] client.driver_mgr.docker: created container: driver=docker container_id=17ea92a789da5cb775ac62e2039df3b8e2d3ddfc66c48e103516bb9b131b5988
May 07 18:35:18 us02pA-zkp1 nomad[7324]: 2020-05-07T18:35:18.953Z [INFO] client.driver_mgr.docker: created container: driver=docker container_id=12cc065111f1cff4333e617c1b2bcd1db3129a160edaa03a6ca7e746fcfa5a48
May 07 18:35:19 us02pA-zkp1 nomad[7324]: 2020-05-07T18:35:19.393Z [INFO] client.driver_mgr.docker: started container: driver=docker container_id=17ea92a789da5cb775ac62e2039df3b8e2d3ddfc66c48e103516bb9b131b5988
May 07 18:35:19 us02pA-zkp1 nomad[7324]: 2020-05-07T18:35:19.510Z [INFO] client.driver_mgr.docker: started container: driver=docker container_id=12cc065111f1cff4333e617c1b2bcd1db3129a160edaa03a6ca7e746fcfa5a48
May 07 20:03:41 us02pA-zkp1 nomad[7324]: 2020-05-07T20:03:41.525Z [INFO] client.gc: garbage collecting allocation: alloc_id=1e9be7e0-3756-aa5a-3a85-ef0930eb798c reason="forced collection"
May 12 18:00:48 us02pA-zkp1 nomad[7324]: 2020-05-12T18:00:48.295Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=10.102.7.13:4647
May 12 18:00:48 us02pA-zkp1 nomad[7324]: 2020-05-12T18:00:48.295Z [ERROR] client: error querying node allocations: error="rpc error: EOF"
May 12 18:01:06 us02pA-zkp1 nomad[7324]: 2020-05-12T18:01:06.887Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=10.102.7.14:4647
May 12 18:01:06 us02pA-zkp1 nomad[7324]: 2020-05-12T18:01:06.887Z [ERROR] client: error querying node allocations: error="rpc error: EOF"
May 12 18:03:00 us02pA-zkp1 systemd[1]: Stopping "HashiCorp Nomad - Workload orchestrator"...
May 12 18:03:00 us02pA-zkp1 nomad[7324]: ==> Caught signal: interrupt
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.378Z [INFO] agent: requesting shutdown
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.378Z [INFO] client: shutting down
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.386Z [INFO] client.plugin: shutting down plugin manager: plugin-type=device
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.387Z [INFO] client.plugin: plugin manager finished: plugin-type=device
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.387Z [INFO] client.plugin: shutting down plugin manager: plugin-type=driver
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.394Z [INFO] client.plugin: plugin manager finished: plugin-type=driver
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.394Z [INFO] client.plugin: shutting down plugin manager: plugin-type=csi
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.394Z [INFO] client.plugin: plugin manager finished: plugin-type=csi
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.412Z [ERROR] consul.sync: failed deregistering agent check: check_id=_nomad-check-a258ff77935a905d2f76d1a6b5912b7771336321 error="Unexpected response code: 500 (Unknown check "_nomad-check-a258ff77935a905d2f76d1a6b5912b7771336321")"
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.412Z [INFO] agent: shutdown complete
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Failed with result 'exit-code'.
May 12 18:03:00 us02pA-zkp1 systemd[1]: Stopped "HashiCorp Nomad - Workload orchestrator".
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 21694 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 21695 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 21815 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 21855 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 22185 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 22247 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 108874 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 108875 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 108881 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 108998 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 109039 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 109198 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 114847 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 114848 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 114959 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 114996 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: Started "HashiCorp Nomad - Workload orchestrator".
May 12 18:03:00 us02pA-zkp1 nomad[121900]: ==> Loaded configuration from /etc/nomad.d/acl.hcl, /etc/nomad.d/client.json, /etc/nomad.d/nomad.hcl, /etc/nomad.d/nomad_vault_management.hcl, /etc/nomad.d/telemetry.json
May 12 18:03:00 us02pA-zkp1 nomad[121900]: ==> Starting Nomad agent...
May 12 18:03:06 us02pA-zkp1 nomad[121900]: ==> Nomad agent configuration:
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Advertise Addrs: HTTP: *.*.*.*:4646
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Bind Addrs: HTTP: 0.0.0.0:4646
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Client: true
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Log Level: INFO
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Region: global (DC: us02pA)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Server: false
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Version: 0.11.1
May 12 18:03:06 us02pA-zkp1 nomad[121900]: ==> Nomad agent started! Log data will stream in below:
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.473Z [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/plugins
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] client: using state directory: state_dir=/opt/nomad/client
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] client: using alloc directory: alloc_dir=/opt/nomad/alloc
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.476Z [INFO] client.fingerprint_mgr.cgroup: cgroups are available
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.489Z [INFO] client.fingerprint_mgr.consul: consul agent is available
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.512Z [INFO] client.fingerprint_mgr.vault: Vault is available
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.513Z [INFO] client.plugin: starting plugin manager: plugin-type=csi
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.513Z [INFO] client.plugin: starting plugin manager: plugin-type=driver
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.513Z [INFO] client.plugin: starting plugin manager: plugin-type=device
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.529Z [WARN] client.server_mgr: no servers available
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.529Z [INFO] client.consul: discovered following servers: servers=[*.*.*.*:4647, *.*.*.*:4647, *.*.*.*:4647]
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.541Z [INFO] client: node registration complete
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.547Z [INFO] client: started client: node_id=33933e50-8f99-8963-0d1b-00dee71ba6a1
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.558532 [INFO] (runner) creating new runner (dry: false, once: false)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.559114 [INFO] (runner) creating watcher
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.562814 [INFO] (runner) starting
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.559576 [INFO] (runner) creating new runner (dry: false, once: false)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.564766 [INFO] (runner) creating watcher
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.559908 [INFO] (runner) creating new runner (dry: false, once: false)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.561780 [INFO] (runner) creating new runner (dry: false, once: false)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.562629 [INFO] (runner) creating new runner (dry: false, once: false)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.564988 [INFO] (runner) starting
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.565369 [INFO] (runner) creating watcher
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.565447 [INFO] (runner) starting
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.566255 [INFO] (runner) creating watcher
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.566340 [INFO] (runner) starting
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.570125 [INFO] (runner) creating watcher
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.570388 [INFO] (runner) starting
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.607259 [INFO] (runner) creating new runner (dry: false, once: false)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.607379 [INFO] (runner) creating watcher
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.607455 [INFO] (runner) starting
May 12 18:03:15 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:15.716Z [INFO] client: node registration complete
May 12 20:12:05 us02pA-zkp1 nomad[121900]: 2020/05/12 20:12:05.564336 [WARN] (view) kv.block(core/inventory/zk_server_linux): Unexpected response code: 500 (retry attempt 1 after "250ms")
May 12 20:13:23 us02pA-zkp1 nomad[121900]: 2020/05/12 20:13:23.492408 [WARN] (view) kv.block(core/inventory/zk_server_linux): Unexpected response code: 500 (retry attempt 1 after "250ms")
May 12 20:13:23 us02pA-zkp1 nomad[121900]: 2020/05/12 20:13:23.743344 [WARN] (view) kv.block(core/inventory/zk_server_linux): Get "http://*.*.*.*:8500/v1/kv/core/inventory/zk_server_linux?index=37&stale=&wait=60000ms": dial tcp *.*.*.*:8500: connect: connection refused (retry attempt 2 after "500ms")

Configuration used for nomad:

acl {
  enabled = true
  }
{
        "client": {
                "enabled": true,
        "options":{
        "driver.raw_exec.enable":"1"
    },
                "meta": [{ "nomad_version": "0.11.1","linux":"true","zk_server_linux":"true" }]
        }
}
vault {
  enabled = true
  address = "http://vault.service.consul:8200"
}

log_level = "INFO"
data_dir = "/opt/nomad"
log_file = "/opt/nomad/nomad.log"
log_rotate_max_files = 5
log_rotate_bytes = 10000000
disable_update_check = true
datacenter = "us1234"
bind_addr = "0.0.0.0"
advertise {
  http = "*.*.*.*"
  rpc  = "*.*.*.*"
  serf = "*.*.*.*"
}
consul {
  address = "*.*.*.*:8500"
  token   = "7116df5a-921d-e2ab-5e0c-sfsdfsfsfsdf"
}
plugin "docker" {
  config {
    auth {
      config = "/etc/docker/config.json"
    }
  }
}
@tgross
Copy link
Member

tgross commented May 18, 2020

Hi @raunak2004 and thanks for reporting this! So it looks from the systemd logs that this client was restarted? There's a couple things I'm seeing here.

The systemctl status nomad returns active but none of the nomad CLI command return any response. Since the nomad is unresponsive any service dependent on nomad also starts acting up as in this case it is zookeeper

The Nomad CLI should typically be pointed to the servers, not to the clients. Are you talking only about the CLI commands that directly interact with the "stuck" client (like nomad alloc logs, etc.)?

The log-lines with the longer timestamps are from Nomad's embedded consul-template runner. This looks like it can't reach Consul:

nomad[121900]: 2020/05/12 20:12:05.564336 [WARN] (view) kv.block(core/inventory/zk_server_linux): Unexpected response code: 500 (retry attempt 1 after "250ms")

Is your Consul agent on that host showing any logs that might give us another clue?

Probably unrelated, but from the systemd process tree output, this log line suggests that we have a logmon process that didn't get reparented after the restart (but that shouldn't cause Nomad to become unresponsive at all). It might be helpful if you could share your systemd unit file.

nomad.service: Found left-over process 21694 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

(Also, I hope you don't mind, but I wrapped your logs and config in some tags to make it a bit easier to read.)

@svudutala
Copy link

@tgross Posting it for @raunak2004

Here is the systemd file

[Unit]
Description= "HashiCorp Nomad - Workload orchestrator"
Documentation=https://nomadproject.io/docs/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/etc/nomad.d/nomad.hcl
StartLimitBurst=3
StartLimitIntervalSec=10

[Service]
# Service running as client needs root permissions and client runs everywhere
User=root
Group=ws1applications
ExecStart=/opt/nomad/0.11.1/nomad agent -config /etc/nomad.d
LimitNOFILE=65536
KillMode=process
KillSignal=SIGINT
Restart=on-failure
RestartSec=2
TasksMax=infinity

[Install]
WantedBy=multi-user.target

@raunak2004
Copy link
Author

Is your Consul agent on that host showing any logs that might give us another clue?

No there are no logs in consul which would indicate about the issue all the consul servers were healthy when nomad client went unhealthy. As a workaround for now we have restarted the agent.

@tgross
Copy link
Member

tgross commented May 18, 2020

all the consul servers were healthy when nomad client went unhealthy

Ok, but if we saw a 500 error in consul-template I'd expect the local Consul agent to be having an issue. Nothing in the systemd logs for that?

Also, can you describe the failure mode in a little more detail... are you talking only about the CLI commands that directly interact with the "stuck" client (like nomad alloc logs, etc.)?

@idrennanvmware
Copy link
Contributor

@tgross I didn't realize @raunak2004 had already opened this issue. Here's more details

#8085

@rkettelerij
Copy link
Contributor

rkettelerij commented Jun 4, 2020

So we have #8085, #7987 and #8038. Which al look pretty similar (at first glance).

@idrennanvmware
Copy link
Contributor

We just had this happen again today, this time on a windows node.

Powershell shows the service as running - Consul shows the Nomad healthcheck failing.

Restart the service - all returns to normal.

@idrennanvmware
Copy link
Contributor

Screen Shot 2020-06-08 at 9 40 17 AM

@tgross - here's what we see in the healthcheck on Consul when this scenario occurs.

  • Service on the linux/windows node shows a status of "running"
  • Service healthcheck is timing out like the screenshot
  • Metrics and logs stop writing and being emitted

Remediation is restarting the agent

@shoenig shoenig self-assigned this Jun 8, 2020
@shoenig
Copy link
Member

shoenig commented Jun 9, 2020

Next time this happens (to anyone), it would be super helpful to get a stack trace from while the agent is un-responsive.

This can be done by issuing SIGABRT to the Nomad agent, which will then dump the trace into stderr/stdout (or journald logs, etc.).

e.g. pkill -SIGABRT nomad

https://golang.org/pkg/os/signal/#hdr-Default_behavior_of_signals_in_Go_programs

@idrennanvmware
Copy link
Contributor

idrennanvmware commented Jun 9, 2020

@shoenig

So I just checked all our clusters dev and prod and none are currently exhibiting this behavior. As soon as this happens again we will execute the command and paste the results here. I'd be surprised if we don't see this in the next few days - it seems to happen about once a week or so

Thanks!

@idrennanvmware
Copy link
Contributor

idrennanvmware commented Jun 9, 2020

Had it happen. Logs attached :)
hashilogs.txt

@shoenig

You can see the last log at 14:19:30 today then a long period where things looked good (but they weren't) and then around 17:30 we saw the nomad client healthcheck fail with the timeout.

the command you sent restarted the agent and restored functionality

@svudutala
Copy link

Adding screenshots from Consul and Nomad along with logs @idrennanvmware attached on above thread

consul-hc failure

nomad-allocation-failure

@pznamensky
Copy link

In case a log file with the trace for server process could be helpful, I've attached it here: #8038 (comment)

@svudutala
Copy link

svudutala commented Jun 10, 2020

@shoenig Is there any specific nomad telemetry metric that you think would help if we monitor?

On the same lines, we are planning to add alert specifically on Nomad client HTTP check failing from Consul to capture if this happens next time.

@shoenig
Copy link
Member

shoenig commented Jun 15, 2020

Thanks for providing stack traces, @idrennanvmware , @pznamensky , @kung-foo

Interestingly so far I believe none of them include a goroutine in a running state, which is odd to reconcile with the common symptom of 1 core being used at 100%.

I was looking into golang/go#38933 but that seems unlikely as none of the stack traces reference startTemplateThread

One thing I'm poking at currently is that netpoll has changed substantially between go1.12 and go1.14, which correlates with our Nomad v0.10 -> v0.11 upgrade cycle, when symptoms began. I'm curious about this because towards the end of each trace is something like,

Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: goroutine 69909 [IO wait]:
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: internal/poll.runtime_pollWait(0x7f2c51e96c48, 0x72, 0xffffffffffffffff)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         runtime/netpoll.go:203 +0x55
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: internal/poll.(*pollDesc).wait(0xc000ed4d98, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         internal/poll/fd_poll_runtime.go:87 +0x45
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: internal/poll.(*pollDesc).waitRead(...)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         internal/poll/fd_poll_runtime.go:92
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: internal/poll.(*FD).Read(0xc000ed4d80, 0xc000df7000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         internal/poll/fd_unix.go:169 +0x19b
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: net.(*netFD).Read(0xc000ed4d80, 0xc000df7000, 0x1000, 0x1000, 0xc, 0x0, 0x0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         net/fd_unix.go:202 +0x4f
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: net.(*conn).Read(0xc0000d53e8, 0xc000df7000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         net/net.go:184 +0x8e
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: bufio.(*Reader).Read(0xc0008fd9e0, 0xc0010a14f0, 0xc, 0xc, 0x3184500, 0x15, 0xc000533ad0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         bufio/bufio.go:226 +0x24f
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: io.ReadAtLeast(0x37b4d20, 0xc0008fd9e0, 0xc0010a14f0, 0xc, 0xc, 0xc, 0xc, 0x0, 0x0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         io/io.go:310 +0x87
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: io.ReadFull(...)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         io/io.go:329
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).recvLoop(0xc000533ad0, 0x0, 0x0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:458 +0xe3
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).recv(0xc000533ad0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:437 +0x2b
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.newSession
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:113 +0x2a5
Jun 15 09:13:09  nomad[35920]: internal/poll.runtime_pollWait(0x7face04749a8, 0x72, 0xffffffffffffffff)
Jun 15 09:13:09  nomad[35920]:         runtime/netpoll.go:203 +0x55
Jun 15 09:13:09  nomad[35920]: internal/poll.(*pollDesc).wait(0xc00056e198, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
Jun 15 09:13:09  nomad[35920]:         internal/poll/fd_poll_runtime.go:87 +0x45
Jun 15 09:13:09  nomad[35920]: internal/poll.(*pollDesc).waitRead(...)
Jun 15 09:13:09  nomad[35920]:         internal/poll/fd_poll_runtime.go:92
Jun 15 09:13:09  nomad[35920]: internal/poll.(*FD).Read(0xc00056e180, 0xc0006d2000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Jun 15 09:13:09  nomad[35920]:         internal/poll/fd_unix.go:169 +0x19b
Jun 15 09:13:09  nomad[35920]: net.(*netFD).Read(0xc00056e180, 0xc0006d2000, 0x1000, 0x1000, 0x0, 0xa000000000, 0x0)
Jun 15 09:13:09  nomad[35920]:         net/fd_unix.go:202 +0x4f
Jun 15 09:13:09  nomad[35920]: net.(*conn).Read(0xc001170158, 0xc0006d2000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Jun 15 09:13:09  nomad[35920]:         net/net.go:184 +0x8e
Jun 15 09:13:09  nomad[35920]: bufio.(*Reader).Read(0xc000bf5080, 0xc0008de4e0, 0xc, 0xc, 0xc000256e20, 0x0, 0x0)
Jun 15 09:13:09  nomad[35920]:         bufio/bufio.go:226 +0x24f
Jun 15 09:13:09  nomad[35920]: io.ReadAtLeast(0x37b4d20, 0xc000bf5080, 0xc0008de4e0, 0xc, 0xc, 0xc, 0xc, 0x0, 0x0)
Jun 15 09:13:09  nomad[35920]:         io/io.go:310 +0x87
Jun 15 09:13:09  nomad[35920]: io.ReadFull(...)
Jun 15 09:13:09  nomad[35920]:         io/io.go:329
Jun 15 09:13:09  nomad[35920]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).recvLoop(0xc000256dc0, 0x0, 0x0)
Jun 15 09:13:09  nomad[35920]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:458 +0xe3
Jun 15 09:13:09  nomad[35920]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).recv(0xc000256dc0)
Jun 15 09:13:09  nomad[35920]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:437 +0x2b
Jun 15 09:13:09  nomad[35920]: created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.newSession
Jun 15 09:13:09  nomad[35920]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:113 +0x2a5
Jun 10 09:03:54 srv1 nomad: internal/poll.runtime_pollWait(0x7f23c80df4e8, 0x72, 0xffffffffffffffff)
Jun 10 09:03:54 srv1 nomad: runtime/netpoll.go:203 +0x55
Jun 10 09:03:54 srv1 nomad: internal/poll.(*pollDesc).wait(0xc003902298, 0x72, 0x200, 0x205, 0xffffffffffffffff)
Jun 10 09:03:54 srv1 nomad: internal/poll/fd_poll_runtime.go:87 +0x45
Jun 10 09:03:54 srv1 nomad: internal/poll.(*pollDesc).waitRead(...)
Jun 10 09:03:54 srv1 nomad: internal/poll/fd_poll_runtime.go:92
Jun 10 09:03:54 srv1 nomad: internal/poll.(*FD).Read(0xc003902280, 0xc000edd8c0, 0x205, 0x205, 0x0, 0x0, 0x0)
Jun 10 09:03:54 srv1 nomad: internal/poll/fd_unix.go:169 +0x19b

@idrennanvmware
Copy link
Contributor

@shoenig - in our case we don't see the CPU spike that others have reported. The agent just becomes completely unresponsive. Almost like its deadlocked.

@kung-foo
Copy link

kung-foo commented Jun 17, 2020

@shoenig I have a nomad server currently hung (100% cpu, non-responsive). Anything you want from the system before I restart it?

@jrasell
Copy link
Member

jrasell commented Jun 17, 2020

Next time this happens (to anyone), it would be super helpful to get a stack trace from while the agent is un-responsive.

This can be done by issuing SIGABRT to the Nomad agent, which will then dump the trace into stderr/stdout (or journald logs, etc.).

e.g. pkill -SIGABRT nomad

https://golang.org/pkg/os/signal/#hdr-Default_behavior_of_signals_in_Go_programs

@kung-foo please see this previous comment; this information would be extremely useful. Thanks!

@kung-foo
Copy link

@jorgemarey
Copy link
Contributor

@kung-foo that seems to match with what's described in golang/go#38051 (comment)

@kung-foo
Copy link

@jorgemarey agreed. especially given some of the output here: #8163 (comment)

@rkettelerij
Copy link
Contributor

rkettelerij commented Jun 17, 2020

See #8163 (comment) for yet another stacktrace of a stuck Nomad server running a 100% cpu (Nomad 0.11.1) .

@rkettelerij
Copy link
Contributor

rkettelerij commented Jun 18, 2020

See #8163 (comment) for stacktrace of Nomad client (Nomad 0.11.1). We're going to upgrade to 0.11.3 to determine if the Go update has any effect.

@shoenig
Copy link
Member

shoenig commented Jun 23, 2020

At this point we are documenting the necessity for upgrading to Nomad v0.11.3+ to get past the Go runtime bug.

I'll leave this issue open for a little while longer - it would be great to hear from those who have upgraded to Nomad v0.11.3 and whether the hangs have been resolved.

I want give a big THANK YOU to everyone who helped collect information and track this down. Ya'll are awesome!

@idrennanvmware
Copy link
Contributor

@shoenig at this stage we have upgraded all our environments to 0.11.3 - if we experience the issues described here we will report back.

Thanks for the update

@rkettelerij
Copy link
Contributor

We've been running 0.11.3 for 7 days now without any issues.

@rkettelerij
Copy link
Contributor

Still no issues after 14 days of running 0.11.3. IMHO it's safe to say this issue is resolved.

@idrennanvmware
Copy link
Contributor

Agreed - we will have @raunak2004 close out the issue. Thanks everyone for all the thoughts and feedback cycles!

@raunak2004
Copy link
Author

Closing the issue since we haven't seen it with 0.11.3. Thanks everyone.

@github-actions
Copy link

github-actions bot commented Nov 5, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

10 participants