Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix job support #168

Merged
merged 7 commits into from
Feb 6, 2020
Merged

Fix job support #168

merged 7 commits into from
Feb 6, 2020

Conversation

bgamari
Copy link
Contributor

@bgamari bgamari commented Feb 3, 2020

This fixes the incorrect termination behavior described in #167. The path to get here was a bit meandering.

The "Check exit code of processes using jobs in two stages" commit refactors waitForProcess to ensure that the ProcessHandle's MVar remains reachable throughout the blocking in waitForJobCompletion. It also updates the MVar after job completion and closes the OS handles to allow the process's resources to be freed.

However, I then realized that the Win32 documentation explicitly states that the delivery of job IOCP notifications is not guaranteed. This is strange and appears to contradict the guidance given by this blog post but I have nevertheless observed hung processes that would seem to confirm the lossy behavior.

In light of this, the "Refactor waiting on job objects" commit refactors the job waiting logic to avoid IOCPs entirely. We now rather query the job for a list of its constituent processes, choose one to block on, wait until it has terminated, and iterate until we find that there are no processes left.

Additionally, we give up on trying to choose which process to take the exit code from; the probability that this heuristic works is simply too far from 1 and it otherwise does nothing but contribute non-determinism to the operation.

Previously we would rely on the exit

This fixes a few nasty bugs:

 * System.Process.waitForProcess failed to keep the ProcessHandle's MVar
   alive, potentially resulting in the finalizer being run while
   waitForJobCompletion is executing. This would cause the process
   handles to be closed, resulting in waitForJobCompletion to fail.

 * waitForProcess failed to explicitly close the job, process, and IOCP
   handles.

 * waitForProcess failed to stop delegation of Ctrl-C in processes using
   jobs.
@bgamari bgamari force-pushed the fix-jobs branch 5 times, most recently from d55ca28 to 59ec311 Compare February 4, 2020 04:29
@snoyberg snoyberg requested a review from Mistuke February 4, 2020 04:38
@bgamari
Copy link
Contributor Author

bgamari commented Feb 4, 2020

I believe this is ready for review.

To test this I booted GHC and then used the GHC to boot another GHC, this time with changes in Cabal to ensure that jobs are used for all subprocess creation. Before this patch this test failed; now it seems to reliably pass.

@bgamari bgamari closed this Feb 4, 2020
@bgamari bgamari reopened this Feb 4, 2020
@bgamari bgamari changed the title WIP: Fix job support Fix job support Feb 4, 2020
@bgamari
Copy link
Contributor Author

bgamari commented Feb 4, 2020

Ugh, Appveyor cancelled the job as I accidentally closed the MR. I guess I'll have to push a new commit.

Previously in order to wait on a job object we would create an IO
Completion Port, configure the job object to emit notifications to it
with SetInformationJobObject, and wait for JOB_OBJECT_MSG_EXIT_PROCESS
notifications until all processes have died.

This followed one piece of guidance from Microsoft [1] but according to
Microsoft's own documentation, this cannot work reliably as delivery of
job notifications is not guaranteed [2]. I have seen cases where the
processes hang waiting on job objects so I can only guess that message
loss is indeed possible.

Instead we now take a simpler approach: look at the processes in the
job, if there are none then we are done. If there are still processes,
choose one and wait for it to finish. Iterate until the job is empty.

Credit for this approach goes to Davean Scies.

[1] https://devblogs.microsoft.com/oldnewthing/20130405-00/?p=4743
[2] https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-jobobject_associate_completion_port
Copy link
Contributor

@Mistuke Mistuke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @bgamari, everything I found suggest you need to handle the ERROR_MORE_DATA case.

cbits/runProcess.c Outdated Show resolved Hide resolved
cbits/runProcess.c Show resolved Hide resolved
cbits/runProcess.c Show resolved Hide resolved
@bgamari
Copy link
Contributor Author

bgamari commented Feb 4, 2020

I've pushed an additional patch to implement support for ERROR_MORE_DATA.

Copy link
Contributor

@Mistuke Mistuke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks. This looks good to me!

@snoyberg
Copy link
Collaborator

snoyberg commented Feb 5, 2020

@bgamari Can you indicate whether this should be a major, minor, or patch revision, and add a ChangeLog entry?

@bgamari
Copy link
Contributor Author

bgamari commented Feb 5, 2020

I have added a changelog entry.

I suspect that a minor version bump is appropriate; this definitely fixes a major bug but the types and semantics of the library's exports have not changed.

@snoyberg snoyberg merged commit 97116f3 into haskell:master Feb 6, 2020
@snoyberg
Copy link
Collaborator

snoyberg commented Feb 6, 2020

Thanks!

@Mistuke
Copy link
Contributor

Mistuke commented Feb 6, 2020

this definitely fixes a major bug but the types and semantics of the library's exports have not changed.

But they did change. You dropped a parameter from OpenExtHandle.

bgamari added a commit to bgamari/cabal that referenced this pull request Feb 6, 2020
Many toolchain tools written for POSIX systems rely on the exec system
call. Unfortunately, it is not possible to implement `exec` in a
POSIX-compliant manner on Windows. In particular, the semantics of the
`exec` implementation provided by the widely-used `msvcrt` C runtime
will cause process's waiting on the `exec`'ing process to incorrectly
conclude that the process has successfully terminated when in fact it is
still running in another process.

For this reason, the `process` library exposes the `use_process_jobs`
flag to use a more strict (although still not POSIX-compliant) mechanism
for tracking process completion. This is explained in this comment [2].

Unfortunately, job support in the `process` library is currently quite
broken and was only recently fixed [1]. Consequently, we only enable job
object support for process releases >= 1.6.8.

[1] haskell/process#168
[2] https://github.com/haskell/process/blob/master/System/Process.hs#L399
bgamari added a commit to bgamari/cabal that referenced this pull request Feb 6, 2020
Many toolchain tools written for POSIX systems rely on the exec system
call. Unfortunately, it is not possible to implement `exec` in a
POSIX-compliant manner on Windows. In particular, the semantics of the
`exec` implementation provided by the widely-used `msvcrt` C runtime
will cause process's waiting on the `exec`'ing process to incorrectly
conclude that the process has successfully terminated when in fact it is
still running in another process.

For this reason, the `process` library exposes the `use_process_jobs`
flag to use a more strict (although still not POSIX-compliant) mechanism
for tracking process completion. This is explained in this comment [2].

Unfortunately, job support in the `process` library is currently quite
broken and was only recently fixed [1]. Consequently, we only enable job
object support for process releases >= 1.6.8.

[1] haskell/process#168
[2] https://github.com/haskell/process/blob/master/System/Process.hs#L399
@bgamari
Copy link
Contributor Author

bgamari commented Feb 6, 2020

@Mistuke, AFAIK ProcessHandle__ should not be part of the exported interface of process.

bgamari added a commit to bgamari/cabal that referenced this pull request Feb 7, 2020
Many toolchain tools written for POSIX systems rely on the exec system
call. Unfortunately, it is not possible to implement `exec` in a
POSIX-compliant manner on Windows. In particular, the semantics of the
`exec` implementation provided by the widely-used `msvcrt` C runtime
will cause process's waiting on the `exec`'ing process to incorrectly
conclude that the process has successfully terminated when in fact it is
still running in another process.

For this reason, the `process` library exposes the `use_process_jobs`
flag to use a more strict (although still not POSIX-compliant) mechanism
for tracking process completion. This is explained in this comment [2].

Unfortunately, job support in the `process` library is currently quite
broken and was only recently fixed [1]. Consequently, we only enable job
object support for process releases >= 1.6.8.

[1] haskell/process#168
[2] https://github.com/haskell/process/blob/master/System/Process.hs#L399
bgamari added a commit to bgamari/cabal that referenced this pull request Feb 7, 2020
Many toolchain tools written for POSIX systems rely on the exec system
call. Unfortunately, it is not possible to implement `exec` in a
POSIX-compliant manner on Windows. In particular, the semantics of the
`exec` implementation provided by the widely-used `msvcrt` C runtime
will cause process's waiting on the `exec`'ing process to incorrectly
conclude that the process has successfully terminated when in fact it is
still running in another process.

For this reason, the `process` library exposes the `use_process_jobs`
flag to use a more strict (although still not POSIX-compliant) mechanism
for tracking process completion. This is explained in this comment [2].

Unfortunately, job support in the `process` library is currently quite
broken and was only recently fixed [1]. Consequently, we only enable job
object support for process releases >= 1.6.8.

[1] haskell/process#168
[2] https://github.com/haskell/process/blob/master/System/Process.hs#L399
bgamari added a commit to bgamari/cabal that referenced this pull request Feb 7, 2020
Many toolchain tools written for POSIX systems rely on the exec system
call. Unfortunately, it is not possible to implement `exec` in a
POSIX-compliant manner on Windows. In particular, the semantics of the
`exec` implementation provided by the widely-used `msvcrt` C runtime
will cause process's waiting on the `exec`'ing process to incorrectly
conclude that the process has successfully terminated when in fact it is
still running in another process.

For this reason, the `process` library exposes the `use_process_jobs`
flag to use a more strict (although still not POSIX-compliant) mechanism
for tracking process completion. This is explained in this comment [2].

Unfortunately, job support in the `process` library is currently quite
broken and was only recently fixed [1]. Consequently, we only enable job
object support for process releases >= 1.6.8.

[1] haskell/process#168
[2] https://github.com/haskell/process/blob/master/System/Process.hs#L399
phadej pushed a commit to phadej/cabal that referenced this pull request Feb 8, 2020
Many toolchain tools written for POSIX systems rely on the exec system
call. Unfortunately, it is not possible to implement `exec` in a
POSIX-compliant manner on Windows. In particular, the semantics of the
`exec` implementation provided by the widely-used `msvcrt` C runtime
will cause process's waiting on the `exec`'ing process to incorrectly
conclude that the process has successfully terminated when in fact it is
still running in another process.

For this reason, the `process` library exposes the `use_process_jobs`
flag to use a more strict (although still not POSIX-compliant) mechanism
for tracking process completion. This is explained in this comment [2].

Unfortunately, job support in the `process` library is currently quite
broken and was only recently fixed [1]. Consequently, we only enable job
object support for process releases >= 1.6.8.

[1] haskell/process#168
[2] https://github.com/haskell/process/blob/master/System/Process.hs#L399
@bgamari bgamari mentioned this pull request Feb 10, 2020
phadej pushed a commit to phadej/cabal that referenced this pull request Feb 12, 2020
Many toolchain tools written for POSIX systems rely on the exec system
call. Unfortunately, it is not possible to implement `exec` in a
POSIX-compliant manner on Windows. In particular, the semantics of the
`exec` implementation provided by the widely-used `msvcrt` C runtime
will cause process's waiting on the `exec`'ing process to incorrectly
conclude that the process has successfully terminated when in fact it is
still running in another process.

For this reason, the `process` library exposes the `use_process_jobs`
flag to use a more strict (although still not POSIX-compliant) mechanism
for tracking process completion. This is explained in this comment [2].

Unfortunately, job support in the `process` library is currently quite
broken and was only recently fixed [1]. Consequently, we only enable job
object support for process releases >= 1.6.8.

[1] haskell/process#168
[2] https://github.com/haskell/process/blob/master/System/Process.hs#L399
phadej pushed a commit to phadej/cabal that referenced this pull request Oct 25, 2020
Many toolchain tools written for POSIX systems rely on the exec system
call. Unfortunately, it is not possible to implement `exec` in a
POSIX-compliant manner on Windows. In particular, the semantics of the
`exec` implementation provided by the widely-used `msvcrt` C runtime
will cause process's waiting on the `exec`'ing process to incorrectly
conclude that the process has successfully terminated when in fact it is
still running in another process.

For this reason, the `process` library exposes the `use_process_jobs`
flag to use a more strict (although still not POSIX-compliant) mechanism
for tracking process completion. This is explained in this comment [2].

Unfortunately, job support in the `process` library is currently quite
broken and was only recently fixed [1]. Consequently, we only enable job
object support for process releases >= 1.6.8.

[1] haskell/process#168
[2] https://github.com/haskell/process/blob/master/System/Process.hs#L399
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants