Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create new process group on process startup. #3572

Merged
merged 4 commits into from
Apr 13, 2018
Merged

Conversation

emate
Copy link
Contributor

@emate emate commented Nov 19, 2017

The problem
Currently, when shutting down user's jobs, nomad sends SIGINT to the child process, which should cleanup any created subprocesses. However, if main user process is not able to cleanup within kill_timeout time window, it gets SIGKILLed which in turn causes all its child processes to be orphaned and reparented by the init process. This can lead to the situation when user loses control over the orphaned processes and they can hang there forever, until manual kill from the user.
There are at least two ways to prevent that:

  1. User should implement timeout mechanism in all of his subprocesses (or at least the first one) and make sure it cleans everything up before being SIGKILLed by nomad executor.
  2. Nomad starts all user processes in separate process group, and - on shutdown - once kill_timeout is reached, nomad executor sends SIGKILL to the whole process group which guarantees that every user-script subprocess are destroyed properly. This looks like more elegant way and keeps the whole setup very simple.

Changes

  • Create new process group upon launching a user command.
  • Clean up by sending SIGKILL to the whole process group.

@emate emate force-pushed the master branch 4 times, most recently from d4bb047 to a7952ec Compare November 20, 2017 21:10
@chelseakomlo
Copy link
Contributor

Hi, thanks for the PR. This looks good, we are going to review this after we have put out our next release.

A couple questions in the short term. First, did you test this patch locally? It passes our CI tests which is great but having manual checks here would be helpful. Second, have you thought about this in the context of running on multiple operating systems? Raw exec should be able to run on all of our supported operating systems.

Thanks again!

@emate
Copy link
Contributor Author

emate commented Nov 22, 2017

Hi @chelseakomlo,
we're currently using our forked version of Nomad with those fixes and it works as expected (AWS, Linux amd64).
I've changed a bit my code to support all unix platforms (BSD, Linux and Darwin supports kill() syscall with negative PID) and make it Windows compatible (plain Process.Kill() when no new process group is created). This should do the trick.

@emate emate force-pushed the master branch 2 times, most recently from 9e909ae to 2ba6aee Compare November 23, 2017 12:26
@chelseakomlo
Copy link
Contributor

Thanks for making these improvements! We will re-review and look at merging this after our 0.7.1 release.

@jbesser
Copy link

jbesser commented Feb 5, 2018

We are experiencing a similar issue with child processes getting orphaned by nomad so we are also very interested in this solution.

Copy link
Member

@schmichael schmichael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind adding a test in this file? If you don't have time we can probably merge and add one when someone on the team finds time.

It should be noted that in the future using PID namespaces is a preferable alternative for exec/java drivers, but this is a complementary feature and is the best we can do for raw_exec.

@@ -440,6 +449,20 @@ func ClientCleanup(ic *dstructs.IsolationConfig, pid int) error {
return clientCleanup(ic, pid)
}

// Cleanup any still hanging user processes
func (e *UniversalExecutor) cleanupUserLeftovers(proc *os.Process) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, while I appreciate the method name I think we should go for something more precise and technical like cleanupChildProcesses

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @schmichael Do you happen to know if any documentation exists around using PID namespaces instead of the raw_exec driver? Thanks so much!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbesser PID namespaces are much more complex and only relevant for the exec and java drivers. This is a fantastic writeup about them: https://hackernoon.com/the-curious-case-of-pid-namespaces-1ce86b6bc900

Your process group change is great. It's simpler than namespaces, works for raw_exec as well as exec/java, and won't interfere with a future implementation of namespaces.

Let me know if you don't have time to add a test, and I can just merge and add one.

@emate
Copy link
Contributor Author

emate commented Feb 7, 2018

Hi @schmichael, thanks for the comments and approval. I'm a bit tight on time, but as you proposed, we can merge it and I can try to add tests in the future if none will exist at the time.

@schmichael schmichael merged commit 1e67780 into hashicorp:master Apr 13, 2018
@schmichael
Copy link
Member

Thanks for the PR @emate! Sorry we couldn't get this into 0.8.0 but now it'll be in 0.8.1

schmichael added a commit that referenced this pull request Apr 13, 2018
@ninoles
Copy link
Contributor

ninoles commented Apr 13, 2018

It is probably a good basis for #2117 too. I'm on vacation, so I will try to get a look at it.

@github-actions
Copy link

github-actions bot commented Mar 6, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants