process not managed by Nomad killed on restart #23969

yigit-erkoca · 2024-09-16T14:07:27Z

Nomad version

Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85

Operating system and Environment details

Windows 10 Enterprise 2016 LTSB

Issue

Nomad instance is killing another process on the system that is not deployed by nomad.

I see in the application event viewer log the following log from source "Process Exit Monitor"
2024-09-16 23:20:06 The process 'C:\simulator\BoardSimulator.exe' was terminated by the process 'C:\Program Files\Infra\Nomad\nomad.exe' with termination code 1. The creation time for the exiting process was 0x01db07ac4767577e.

Reproduction steps

On system reboot, some applications like simulator executables start by other windows services, as well as nomad instance.
This is a sporadic issue, does not happen at every reboot, but maybe estimated to be once in every ~100.

Expected Result

Nomad not killing/interfering any other process on the system that is not deployed by using nomad jobs.

Actual Result

Nomad is terminating other processes that are running in the system.

Job file (if appropriate)

There is only one job using raw_exec. We are just setting it up.
NomadJob.txt

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

nomad.log
The timestamp of the process termination according to the eventlog: 2024-09-16 23:20:06

The text was updated successfully, but these errors were encountered:

jrasell · 2024-09-20T08:37:30Z

Hi @yigit-erkoca and thanks for raising this issue. Nomad should certainly not be terminating processes it does not control, however, having looked at the client logs nothing stands out initially. I'll mark this for further investigation and add it to our backlog. If you do get any additional information that might be useful, please add it to this issue.

yigit-erkoca · 2024-09-20T09:48:26Z

Hi @yigit-erkoca and thanks for raising this issue. Nomad should certainly not be terminating processes it does not control, however, having looked at the client logs nothing stands out initially. I'll mark this for further investigation and add it to our backlog. If you do get any additional information that might be useful, please add it to this issue.

Hello @jrasell, thank you for your attention! Indeed I'm extending the logs to find out the root cause. I have a setup that I can reproduce this issue quite easily. I'll add more details to the issue and let you know. I appreciate the help.

tgross · 2024-10-15T15:39:16Z

It looks like the relevant portion of the logs is the following, which suggests that the spuriously killed process is getting killed while we're trying to restore the process for the previously-running task.

2024-09-16T23:20:06.267+0200 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=69909b8d-3d22-e823-e33b-ff669048d994 task=infra-cluster-package-server-task type=Received msg="Task received by client" failed=false
2024-09-16T23:20:06.268+0200 [ERROR] client.driver_mgr.raw_exec: failed to reattach to executor: driver=raw_exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=69909b8d-3d22-e823-e33b-ff669048d994/infra-cluster-package-server-task/15a41d28
2024-09-16T23:20:06.268+0200 [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=69909b8d-3d22-e823-e33b-ff669048d994 task=infra-cluster-package-server-task error="failed to reattach to executor: error creating rpc client for executor plugin: Reattachment process not found" task_id=69909b8d-3d22-e823-e33b-ff669048d994/infra-cluster-package-server-task/15a41d28
2024-09-16T23:20:06.268+0200 [TRACE] client.alloc_runner.task_runner: failed to reattach to task; will not run until server is contacted: alloc_id=69909b8d-3d22-e823-e33b-ff669048d994 task=infra-cluster-package-server-task

What we see here is that the client starts up, looks in its local state, and tries to restore handles to the tasks that were running before the host shutdown (ref task_runner.go#L1290-L1321). Of course, none of them are going to be running, so that recovery will fail as expected because we fail to reattach to the executor process (ref driver.go#L288-L297).

We then call DestroyTask which should be returning ErrTaskNotFound (ref driver.go#L472) because the task handle cannot exist in the driver's state at this point. We won't see that error in the logs because we expect it so we explicitly ignore it (ref task_runner.go#L1310-L1312).

While we have saved the PID of the previously-running task in the local state, I don't see any code path where this PID is actually being made available for the driver to kill.

@yigit-erkoca you say you have a reliable reproduction, right? Can you confirm that the PID of the process getting incorrectly killed the same as the PID of the previously-running task or its executor process?

tgross · 2024-10-17T20:32:08Z

I did a little more testing and what's particularly weird about all of this is that the executor process is the one that actually sends the kill signal to the child process (or the Windows equivalent which is TerminateProcess). So if the executor process is gone -- which it must be because we rebooted -- then there's nothing to send a kill signal to another child process. It would definitely be helpful to know whether the PID we're trying to kill is the executor or previously-running task.

yigit-erkoca · 2024-10-18T14:30:06Z

@tgross, thank you for looking into this!

We have indeed a quite reliable reproduction environment. We have reproduced it again today to provide you more information with an example.

Nomad is installed as windows service, and starts normally, starts all the executors, tasks, etc.

All the executors listed below; please note the highlighted instance (which will be the one we will be interested in)

After restarting/rebooting nomad will try to restore the tasks, and one of these occasions if there is any other process with the same PID above "5652" in this case notepad.exe gets the "5652".

Nomad during the restore action, it will kill the notepad.exe. Which can be seen in the windows event viewer below.

I hope this clarifies your question. From this, it looks like the PID is the executor's PID, not the task's PID.

If you need any further information, or things that we can try, please let us we can use our reproduction environment.

tgross · 2024-10-18T14:39:18Z

Thanks @yigit-erkoca, that's super helpful! That cuts the size of the problem in half and points me towards the go-plugin interface. I'll investigate further and report back.

tgross · 2024-10-18T19:04:11Z

I did a little bit of parallel investigation on Linux, and using killsnoop I verified that we never send a kill signal to the executor process PID after host restart (or any other PID, for that matter). So that further narrows this down to some quirk of Windows-specific behavior. That's a little weird, because this code should be mostly shared between them... the differences are in the executors themselves, which are not running at the time this all happens.

tgross · 2024-10-18T19:41:29Z

Looks like I found the culprit, and it totally can happen on other platforms. It's this line in go-plugin: cmd_reattach.go#L31.

We have a PID for the executor and try to reattach
os.FindProcess finds a process for that PID, just not the one we want
We can't connect to it because it's not an executor
go-plugin says, "oops, executor must be broken, let's kill it"

I'm having a chat with colleagues about the best approach to fix this. The obvious fix is to simply delete that line, but I assume there's a reason it exists. 😁 (Edit: this is an extremely old behavior hashicorp/go-plugin#4 which makes it even trickier)

During reattachment, we look to see if the process corresponding to the stored PID is running. If so, we try to connect to that process. If that fails, we kill the process under the presumption it's not working, and return ErrProcessNotFound. But during reattachment we don't know that the PID we have is still valid. Which means that the process we're trying to attach to may have exited and a different process has spawned with the same PID. This results in some unrelated process getting silently killed. This impacts Nomad when running the `rawexec` or `exec` task drivers, because the Nomad agent spawns an "executor" process via go-plugin to control the workloads, and these executors are left running when Nomad exits. If the executors die in the meantime (or the host is rebooted), then we can potentially kill a random process on the host. Because there's no way for go-plugin to know whether the process is a go-plugin server without connecting, this kill is never really safe. Remove it. Ref: hashicorp/nomad#23969 Ref: https://hashicorp.atlassian.net/browse/NET-11233

tgross · 2024-10-21T18:42:03Z

Dependency update is here: #24265

During allocation recovery, the go-plugin reattachment workflow checks to see if the PID we stored in client state is still running. If so, we try to connect to that process. If that fails, we kill the process under the presumption it's not working. But during reattachment we don't know that the PID we have is still valid. Which means that the process we're trying to attach to may have exited and a different process has spawned with the same PID. This results in some unrelated process getting silently killed. Update go-plugin to 1.6.2, which includes a bug fix so we do not kill the process. Fixes: #23969

peter-harmann-tfs · 2024-11-11T13:52:46Z

@tgross Hello. We are running the fixed version, but it seems our processes are still being killed, this time likely from here.

We see "plugin failed to exit gracefully" in nomad logs just as our process ends unexpectedly. We are not able to reliably reproduce the issue at the moment (seems to occur under different circumstances than the original).

tgross · 2024-11-11T14:13:00Z

@peter-harmann-tfs just to confirm, you mean an unrelated process is getting killed? That is, one not managed by Nomad?

peter-harmann-tfs · 2024-11-11T14:27:15Z

@tgross Yes, it is an unrelated process.

Maybe to clarify, I am colleague of yigit-erkoca, so I am talking about the same deployment as the original issue.

tgross · 2024-11-11T14:30:42Z

Ok, thanks. I'll reopen and get this on the queue for another in-depth investigation.

cenk-saldirim · 2024-11-21T16:30:43Z

Hi, I am colleague of yigit-erkoca and peter-harmann-tfs.

We conducted a more detailed investigation regarding this issue, in order to provide steps to reproduce it.
When we correlated the time we identified a non-nomad process being killed with the timestamp in the Nomad logs, we observed the following log messages in sequence: 'plugin failed to exit gracefully' and 'reattached plugin process exited'.

We extended the logs in the reattach() and Kill() functions in the client.go file. This allowed us to log the reattach PID and reattach address information for the process that is being terminated.

The issue is finally reproduced, and observed that it occurs when the following sequence of events takes place:

Nomad has reattach PID and reattach Address information for the executor and try to reattach
The executor process has exited and a different process has spawned with the same PID. Additionally the reattach address is occupied by a different process (It's a rare scenario, but it can happen)
The reattach address is reachable, but no communication can be established (because the process using that reattach address is not an executor)
go-plugin decides to kill the process(which is not managed by nomad) using reattach PID information

I hope the information above is sufficient to identify the root cause of the problem. If more details are needed, we would be happy to provide them.

tgross · 2024-11-21T16:38:37Z

Thanks @cenk-saldirim. That scenario is exactly what hashicorp/go-plugin#320 was supposed to have fixed, which is particularly strange. Also, just a heads up that my colleague @mismithhisler is taking over the investigation from our end.

cenk-saldirim · 2024-11-21T17:31:24Z

Thanks for the information @tgross. I believe, the only difference between previous issue and the current one is the reattach address of the executor process which has exited is actively being used by another process. If that reattach address:port is not actively listening, go-plugin does not attempt to kill any process.

tgross · 2024-11-21T18:15:50Z

Ah! Sorry, I completely missed that... that complicates things quite a bit!

mismithhisler · 2024-11-21T20:03:28Z

Hi @cenk-saldirim, thanks for all the information. Do you have any more logs you could attach here showing this most recent issue? The "plugin failed to exit gracefully" is helpful, but we might be able to narrow this search with more.

Also, can you confirm the clients that are having this issue are only running raw_exec tasks? And not running any external task drivers (i.e. Podman etc)?

mismithhisler · 2024-11-21T22:59:00Z

@cenk-saldirim Just following up to my previous comment to let you know we can reproduce this and are working on a fix. Thanks again for the info you provided.

cenk-saldirim · 2024-11-22T07:55:19Z

Hi @mismithhisler, thank you very much for your finding and glad to hear that you can reproduce this. I can share the logs we collected while observing this behavior, if still needed.

mismithhisler · 2024-11-22T14:46:56Z

I think we should be good on logs for now, thank you though!

yigit-erkoca added the type/bug label Sep 16, 2024

yigit-erkoca changed the title ~~Nomad on Windows killing another process on the system~~ Nomad on Windows (raw_exec) killing another process on the system Sep 17, 2024

This was referenced Sep 17, 2024

nomad kills random processes on restart while restoring old processes #17960

Closed

Additional trace logging for task start, recover, destroy, signal #23974

Closed

tgross added this to Nomad - Community Issues Triage Sep 17, 2024

github-project-automation bot moved this to Needs Triage in Nomad - Community Issues Triage Sep 17, 2024

jrasell added theme/platform-windows theme/client stage/needs-investigation labels Sep 20, 2024

jrasell moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Sep 20, 2024

tgross added the hcc/jira label Oct 3, 2024

tgross mentioned this issue Oct 3, 2024

Windows -> Raw exec = Broken with 1.8.2 #23668

Closed

tgross self-assigned this Oct 10, 2024

tgross moved this from Needs Roadmapping to In Progress in Nomad - Community Issues Triage Oct 10, 2024

tgross added this to the 1.9.x milestone Oct 15, 2024

tgross removed the stage/needs-investigation label Oct 15, 2024

tgross added the stage/waiting-reply label Oct 17, 2024

tgross removed the stage/waiting-reply label Oct 18, 2024

tgross changed the title ~~Nomad on Windows (raw_exec) killing another process on the system~~ process not managed by Nomad killed on restart Oct 18, 2024

tgross removed the theme/platform-windows label Oct 18, 2024

tgross mentioned this issue Oct 18, 2024

reattach: don't kill process on failed reconnection hashicorp/go-plugin#320

Merged

tgross modified the milestones: 1.9.x, 1.9.2 Oct 21, 2024

tgross mentioned this issue Oct 21, 2024

deps: update go-plugin to 1.6.2 #24265

Merged

tgross closed this as completed in #24265 Oct 21, 2024

tgross closed this as completed in 6d17a9b Oct 21, 2024

github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Oct 21, 2024

hc-github-team-nomad-core mentioned this issue Oct 21, 2024

Backport of deps: update go-plugin to 1.6.2 into release/1.9.x #24267

Merged

tgross reopened this Nov 11, 2024

tgross removed their assignment Nov 12, 2024

tgross assigned mismithhisler Nov 21, 2024

mismithhisler linked a pull request Nov 22, 2024 that will close this issue

executor: fail early on reattach if listener is not executor #24538

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

process not managed by Nomad killed on restart #23969

process not managed by Nomad killed on restart #23969

yigit-erkoca commented Sep 16, 2024 •

edited

Loading

jrasell commented Sep 20, 2024

yigit-erkoca commented Sep 20, 2024

tgross commented Oct 15, 2024

tgross commented Oct 17, 2024

yigit-erkoca commented Oct 18, 2024

tgross commented Oct 18, 2024

tgross commented Oct 18, 2024

tgross commented Oct 18, 2024 •

edited

Loading

tgross commented Oct 21, 2024

peter-harmann-tfs commented Nov 11, 2024

tgross commented Nov 11, 2024

peter-harmann-tfs commented Nov 11, 2024

tgross commented Nov 11, 2024

cenk-saldirim commented Nov 21, 2024

tgross commented Nov 21, 2024

cenk-saldirim commented Nov 21, 2024

tgross commented Nov 21, 2024

mismithhisler commented Nov 21, 2024

mismithhisler commented Nov 21, 2024

cenk-saldirim commented Nov 22, 2024

mismithhisler commented Nov 22, 2024

process not managed by Nomad killed on restart #23969

process not managed by Nomad killed on restart #23969

Comments

yigit-erkoca commented Sep 16, 2024 • edited Loading

Nomad version

Operating system and Environment details

Issue

Reproduction steps

Expected Result

Actual Result

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

jrasell commented Sep 20, 2024

yigit-erkoca commented Sep 20, 2024

tgross commented Oct 15, 2024

tgross commented Oct 17, 2024

yigit-erkoca commented Oct 18, 2024

tgross commented Oct 18, 2024

tgross commented Oct 18, 2024

tgross commented Oct 18, 2024 • edited Loading

tgross commented Oct 21, 2024

peter-harmann-tfs commented Nov 11, 2024

tgross commented Nov 11, 2024

peter-harmann-tfs commented Nov 11, 2024

tgross commented Nov 11, 2024

cenk-saldirim commented Nov 21, 2024

tgross commented Nov 21, 2024

cenk-saldirim commented Nov 21, 2024

tgross commented Nov 21, 2024

mismithhisler commented Nov 21, 2024

mismithhisler commented Nov 21, 2024

cenk-saldirim commented Nov 22, 2024

mismithhisler commented Nov 22, 2024

yigit-erkoca commented Sep 16, 2024 •

edited

Loading

tgross commented Oct 18, 2024 •

edited

Loading