Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock on Windows Server 2012 when passing same pipe for stdout and stderr #76

Open
ezyang opened this issue Nov 18, 2016 · 7 comments

Comments

@ezyang
Copy link
Contributor

ezyang commented Nov 18, 2016

The following Haskell program, built with GHC 8.0.1, deadlocks on Windows Server 2012:

import System.Process
import System.IO
import System.Exit
main = do
    hSetBuffering stderr NoBuffering
    hSetBuffering stdout NoBuffering
    (readIn, writeIn) <- createPipe
    (readOut, writeOut) <- createPipe
    (_,_,_,proch) <- createProcess (proc "python" ["-i"]) {
            std_in = UseHandle readIn,
            std_out = UseHandle writeOut,
            std_err = UseHandle writeOut
        }
    hPutStrLn writeIn "print(42)"
    hPutStrLn writeIn "quit()"
    hClose writeIn
    exit <- waitForProcess proch
    hPutStrLn stderr =<< hGetContents readOut
    hClose readOut
    exitWith exit

You can reproduce the error on AppVeyor with the following repository: https://github.com/ezyang/ghceye (appveyor.yml included).

It does not appear to deadlock on Windows 10 or Linux. I have not attempted to reproduce on a local install of Windows Server 2012. I tried this similar Python code which did not deadlock, which is why I believe it to be a bug in the process library:

import subprocess
import os

(rin, win) = os.pipe()
(rout, wout) = os.pipe()
p = subprocess.Popen(["python", "-i"], stdin=rin, stdout=wout, stderr=wout)
os.write(win, "print(2)")
os.write(win, os.linesep)
os.write(win, "quit()")
os.write(win, os.linesep)
p.wait()
@snoyberg
Copy link
Collaborator

It looks like you've left the opposite ends of the pipe open. Can you try setting close_fds and closing readIn and writeOut and see if that changes behavior? (Just a first stab at the problem.) It is confusing why this would be different from Windows 2012 versus other OSes though.

@ezyang
Copy link
Contributor Author

ezyang commented Nov 18, 2016

It still hangs in this situation (though you are right that close_fds helps avoid deadlocks on Linux in other situations.) See https://ci.appveyor.com/project/ezyang/ghceye corresponding to ezyang/ghceye@799ebb3 (I also removed the passed in stdin because we can deadlock even if it's managed by the process itself.)

Empirically, it also seems to be the case that if you don't feed stdin into the subprocess, the deadlock doesn't occur.

@snoyberg
Copy link
Collaborator

I've added more debugging output, which yields something very interesting: this is blocked on the waitForProcess call, not the hGetContents call.

@snoyberg
Copy link
Collaborator

OK, I accidentally fixed the bug, and believe I understand why your original program failed. If you look at this commit:

snoyberg/ghceye@347cca9

I changed things such that the output is read in a separate thread. My theory is that on Windows Server 2012, the process is not exiting because it is still attempting to write some data to its stdout, but the buffer is full. The buffer will not be emptied until we get hGetContents, which will not happen until waitForProcess completes, causing a deadlock. In my version, the buffer is emptied by a separate thread, avoiding the deadlock.

This looks like valid behavior in the process package to me, and an example of it being difficult to use the process API safely without concurrency.

@ezyang
Copy link
Contributor Author

ezyang commented Nov 18, 2016

Hmm. While that it progress, I think there a few unexplained things:

  1. I noticed you added -threaded to GHC. If I remove -threaded, it deadlocks again. Does this mean that System.Process can only be used safely in a deadlock-free manner by the threaded runtime? If so, this merits a prominent disclaimer on the package. But it seems extremely odd to me that this cannot be made to work in the single-threaded runtime. Is the problem that waitpid() is blocking? Then we should switch to a different mechanism of determining if a process has exited, e.g., catching SIGCHILD and writing a byte into a appropriate pipe which we are selecting on (with an appropriate implementation on Windows.)
  2. Could Windows Server 2012 really have a buffer whose size in the tens of bytes? This seems difficult to believe. Additionally, the claim is inconsistent with (my claim) that the original code works as long as you aren't piping data to stdin (I know this works, because we've been running this kind of code on AppVeyor for quite some time now.)
  3. In the original code sample I extracted this from, I was spawning a thread to read out the output, and I was compiling with the threaded runtime, and it was still deadlocking. Obviously I need to boil that into a minimal test case, but I'm disinclined to believe that this is the full story.

@snoyberg
Copy link
Collaborator

  1. I do believe the problem is that waitpid() is blocking. I think I probably would be in favor of putting a disclaimer on the package saying it should really only be used with the multithreaded runtime. However, given that when I took over maintainership nothing like that existed, I'm concerned that I might be mistaken about my assumptions.
  2. I don't think the buffer is actually that small. My guess (now forced to be a bit more thought out by your observations) is that we're somehow ending up in a situation where the python process refuses to exit because the buffer has data in it. So perhaps instead of "it's still trying to write to the full buffer," the real answer is "the process is refusing to exit because data exists on the buffer." I'm not sure.
  3. You could be right, I'm definitely shooting in the dark here. However, I would say that the minimal case here falls into the undefined behavior category, since on all OSes, with enough output coming from the process, it will deadlock. If you can demonstrate the deadlock in a case of behavior which should be fully defined, that will be even more interesting.

@ezyang
Copy link
Contributor Author

ezyang commented Nov 18, 2016

In the meantime, here are some references for how to implement non-blocking waitpid: https://ldpreload.com/blog/signalfd-is-useless https://github.com/ravinet/mahimahi/blob/master/src/util/event_loop.cc#L25

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants