Nomad node becoming unresponsive #7987
Hi @raunak2004, and thanks for reporting this! It looks from the systemd logs like this client was restarted? There are a couple of things I'm seeing here.
The Nomad CLI should typically be pointed to the servers, not to the clients. The log lines with the longer timestamps are from Nomad's embedded consul-template.
Is your Consul agent on that host showing any logs that might give us another clue? Probably unrelated, but from the systemd process tree output, this log line suggests that we have a logmon process that didn't get reparented after the restart (although that shouldn't cause Nomad to become unresponsive at all). It might be helpful if you could share your systemd unit file.
(Also, I hope you don't mind, but I wrapped your logs and config in some tags to make them a bit easier to read.)
@tgross Posting it for @raunak2004. Here is the systemd unit file:
No, there are no logs in Consul that would indicate the issue; all the Consul servers were healthy when the Nomad client went unhealthy. As a workaround for now we have restarted the agent.
OK, but if we saw a 500 error in consul-template I'd expect the local Consul agent to be having an issue. Nothing in the systemd logs for that? Also, can you describe the failure mode in a little more detail: are you talking only about the CLI commands that directly interact with the "stuck" client (like `nomad alloc logs`, etc.)?
@tgross I didn't realize @raunak2004 had already opened this issue. Here are more details:
We just had this happen again today, this time on a Windows node. PowerShell shows the service as running, while Consul shows the Nomad healthcheck failing. Restarting the service returns everything to normal.
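For anyone triaging a node in this state on Linux, here is a rough sketch of distinguishing "the process is alive" from "the agent is actually serving requests". The local address and default HTTP port 4646 are assumptions about your setup:

```shell
# The service manager only reports process liveness; probe Nomad's own
# agent health endpoint to see whether the HTTP API still answers.
systemctl status nomad --no-pager

# A healthy agent responds quickly; a hung one hits the timeout.
curl --max-time 5 -s http://127.0.0.1:4646/v1/agent/health \
  || echo "agent HTTP API is unresponsive"
```

On Windows nodes the same distinction can be made by probing the agent port with curl or Invoke-WebRequest instead of relying on the service status.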
@tgross - here's what we see in the healthcheck on Consul when this scenario occurs.
Remediation is restarting the agent.
Next time this happens (to anyone), it would be super helpful to get a stack trace from while the agent is unresponsive. This can be done by sending the process a signal that makes the Go runtime dump its goroutine stacks; see https://golang.org/pkg/os/signal/#hdr-Default_behavior_of_signals_in_Go_programs
So I just checked all our clusters, dev and prod, and none are currently exhibiting this behavior. As soon as this happens again we will execute the command and paste the results here. I'd be surprised if we don't see this in the next few days; it seems to happen about once a week or so. Thanks!
Had it happen. Logs attached :) You can see the last log at 14:19:30 today, then a long period where things looked good (but they weren't), and then around 17:30 we saw the Nomad client healthcheck fail with the timeout. The command you sent restarted the agent and restored functionality.
Adding screenshots from Consul and Nomad along with the logs @idrennanvmware attached in the thread above.
In case a log file with the trace for the server process could be helpful, I've attached it here: #8038 (comment)
@shoenig Is there any specific Nomad telemetry metric that you think would help if we monitored it? Along the same lines, we are planning to add an alert specifically on the Nomad client HTTP check failing in Consul, to capture this if it happens next time.
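Not an official recommendation, but one sketch of such an alert: poll the local Consul agent's health API for checks that have gone critical and flag any Nomad-related ones. The agent address and the substring match on the check name are assumptions about your registration:

```shell
# List all checks currently in the "critical" state on the local Consul
# agent, then flag any belonging to Nomad (inspect /v1/agent/checks on
# your hosts for the real check identifiers).
curl -s --max-time 5 http://127.0.0.1:8500/v1/health/state/critical \
  | grep -qi nomad \
  && echo "ALERT: a Nomad health check is critical" \
  || echo "no critical Nomad checks reported"
```

A cron job or monitoring agent running this probe would fire before anyone notices scheduling has stopped.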
Thanks for providing stack traces, @idrennanvmware, @pznamensky, @kung-foo. Interestingly, so far I believe none of them include a goroutine in a ... I was looking into golang/go#38933, but that seems unlikely as none of the stack traces reference ... One thing I'm poking at currently is that ...
@shoenig - in our case we don't see the CPU spike that others have reported. The agent just becomes completely unresponsive, almost like it's deadlocked.
@shoenig I have a Nomad server currently hung (100% CPU, unresponsive). Anything you want from the system before I restart it?
@kung-foo please see this previous comment; this information would be extremely useful. Thanks!
More stack traces: https://gist.github.com/kung-foo/d01b17fbf20de9836cea2903da2ac46b
@kung-foo that seems to match what's described in golang/go#38051 (comment)
@jorgemarey Agreed, especially given some of the output here: #8163 (comment)
See #8163 (comment) for yet another stack trace of a stuck Nomad server running at 100% CPU (Nomad 0.11.1).
See #8163 (comment) for a stack trace of a Nomad client (Nomad 0.11.1). We're going to upgrade to 0.11.3 to determine if the Go update has any effect.
At this point we are documenting the necessity of upgrading to Nomad v0.11.3+ to get past the Go runtime bug. I'll leave this issue open for a little while longer; it would be great to hear from those who have upgraded to Nomad v0.11.3 whether the hangs have been resolved. I want to give a big THANK YOU to everyone who helped collect information and track this down. Y'all are awesome!
@shoenig at this stage we have upgraded all our environments to 0.11.3. If we experience the issues described here we will report back. Thanks for the update!
We've been running 0.11.3 for 7 days now without any issues. |
Still no issues after 14 days of running 0.11.3. IMHO it's safe to say this issue is resolved. |
Agreed - we will have @raunak2004 close out the issue. Thanks everyone for all the thoughts and feedback cycles!
Closing the issue since we haven't seen it with 0.11.3. Thanks everyone. |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad node became unresponsive, which causes no jobs to be scheduled.
`systemctl status nomad` returns active, but none of the Nomad CLI commands return any response. Since Nomad is unresponsive, any service dependent on Nomad also starts acting up; in this case it is ZooKeeper.
Logs:
Configuration used for nomad: