Create new process group on process startup. #3572
Hi, thanks for the PR. This looks good, we are going to review this after we have put out our next release. A couple of questions in the short term. First, did you test this patch locally? It passes our CI tests, which is great, but having manual checks here would be helpful. Second, have you thought about this in the context of running on multiple operating systems? Raw exec should be able to run on all of our supported operating systems. Thanks again!
Hi @chelseakomlo,
Thanks for making these improvements! We will re-review and look at merging this after our 0.7.1 release.
We are experiencing a similar issue with child processes getting orphaned by nomad, so we are also very interested in this solution.
Do you mind adding a test in this file? If you don't have time we can probably merge and add one when someone on the team finds time.
It should be noted that in the future using PID namespaces is a preferable alternative for the exec/java drivers, but this is a complementary feature and is the best we can do for raw_exec.
client/driver/executor/executor.go
@@ -440,6 +449,20 @@ func ClientCleanup(ic *dstructs.IsolationConfig, pid int) error {
	return clientCleanup(ic, pid)
}

// Cleanup any still hanging user processes
func (e *UniversalExecutor) cleanupUserLeftovers(proc *os.Process) error {
Ha, while I appreciate the method name I think we should go for something more precise and technical like cleanupChildProcesses
Thanks, @schmichael. Do you happen to know if any documentation exists around using PID namespaces instead of the raw_exec driver? Thanks so much!
@jbesser PID namespaces are much more complex and only relevant for the exec and java drivers. This is a fantastic writeup about them: https://hackernoon.com/the-curious-case-of-pid-namespaces-1ce86b6bc900
Your process group change is great. It's simpler than namespaces, works for raw_exec as well as exec/java, and won't interfere with a future implementation of namespaces.
Let me know if you don't have time to add a test, and I can just merge and add one.
Hi @schmichael, thanks for the comments and approval. I'm a bit tight on time, but as you proposed, we can merge it and I can try to add tests later if none exist by then.
Clean up by sending SIGKILL to the whole process group.
Thanks for the PR @emate! Sorry we couldn't get this into 0.8.0 but now it'll be in 0.8.1
It is probably a good basis for #2117 too. I'm on vacation, so I will try to get a look at it.
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
The problem
Currently, when shutting down a user's job, nomad sends SIGINT to the child process, which should clean up any subprocesses it created. However, if the main user process is not able to clean up within the kill_timeout window, it gets SIGKILLed, which in turn causes all its child processes to be orphaned and reparented to the init process. The user can then lose control over the orphaned processes, which may hang around forever until killed manually.
There are at least two ways to prevent that:
When kill_timeout is reached, the nomad executor sends SIGKILL to the whole process group, which guarantees that every user-script subprocess is destroyed properly. This looks like the more elegant way and keeps the whole setup very simple.
Changes