Race condition in `waitForProcess` #38

NorfairKing · 2021-03-19T10:57:45Z

I'm not sure if this is supposed to go here or in process, but this code fails about 1 iteration in 200:

#!/usr/bin/env stack
-- stack --resolver lts-16.31 script

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import qualified Data.ByteString as SB
import System.IO
import System.Mem
import System.Process.Typed

main :: IO ()
main =
  forM_ [1 .. 10000] $ \i -> do
    withProcessTerm (setStdin createPipe $ setStdout createPipe $ setStderr inherit $ proc "cat" []) $ \ph -> do
      putStrLn $ "Iteration: " <> show i
      let inHandle = getStdin ph
      let outHandle = getStdout ph
      let contents = "Hello world\n"
      SB.hPut inHandle contents
      hFlush inHandle
      output <- SB.hGet outHandle (SB.length contents)
      when (output /= contents) $ fail "failed to read."

I couldn't reproduce the failure without the interaction with the standard streams, so I may very well be doing something wrong, but every 200 iterations or so I see this:

Iteration: 134
Iteration: 135
Iteration: 136
repro.hs: waitForProcess: does not exist (No child processes)

The text was updated successfully, but these errors were encountered:

NorfairKing · 2021-03-19T11:01:06Z

I tried reproducing this problem with process (without typed-process) but could not reproduce the problem like this:

#!/usr/bin/env stack
-- stack --resolver lts-16.31 script

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import qualified Data.ByteString as SB
import System.IO
import System.Mem
import System.Process

main :: IO ()
main =
  forM_ [1 .. 10000] $ \i -> do
    withCreateProcess ((proc "cat" []) {std_in = CreatePipe, std_out = CreatePipe, std_err = Inherit}) $ \(Just inHandle) (Just outHandle) _ ph -> do
      putStrLn $ "Iteration: " <> show i
      let contents = "Hello world\n"
      SB.hPut inHandle contents
      hFlush inHandle
      output <- SB.hGet outHandle (SB.length contents)
      when (output /= contents) $ fail "failed to read."

NorfairKing · 2021-03-19T11:19:18Z

This is pretty easy to turn into a failing test. I wouldn't mind making a PR for that if it would help!

snoyberg · 2021-03-21T06:45:53Z

I'm not sure it's relevant to this test case failing, but the approach you're taking here is, generally, broken. You cannot write to stdin in the same thread that reads from stdout, since that can lead to deadlock on buffers filling up. Unlikely to be relevant here, but worth mentioning.

Reproducing case would be appreciated. I'll see if something jumps out at me in the code.

snoyberg · 2021-03-21T06:47:46Z

Also, it makes sense you can't reproduce with that process example, since it doesn't try to terminate the process afterwards (presumably where the error message is coming from).

snoyberg · 2021-03-21T07:03:33Z

Strangely enough, making this seemingly innocuous change seems to fix the problem on my system, can you confirm?

@@ -768,6 +768,7 @@ startProcess pConfig'@ProcessConfig {..} = liftIO $ do
                       -- then call waitForProcess ourselves
                       Left _ -> do
                           eres <- try $ P.terminateProcess pHandle
+                          P.getProcessExitCode pHandle
                           ec <-
                             case eres of
                               Left e

My theory: there's a subtle race condition in process where it's not fulfilling its responsibilities around waitForProcess being idempotent in some cases. Calling getProcessExitCode first forces the MVar to end up in a valid state, making the later call to waitForProcess succeed reliably.

snoyberg · 2021-03-21T07:54:51Z

The bug is definitely in process, and it's an async exception bug. It's exactly the kind of bug that typed-process is trying to work around. And unfortunately there's no obvious solution to it without breaking the goals of process.

Issue: we want waitForProcess to be interruptible. If it gets interrupted, we never get a chance to see the exit code. But if the waitpid system call succeeds, then the process is removed from the process table and can never be queried again.

In our case, the polling thread in typed-process is calling waitForProcess and gets canceled. It has to be canceled to avoid a separate race condition: terminateProcess getting called on the wrong process! However, because waitForProcess isn't actually async-exception-safe, none of this works.

I'm open to suggestions on best behavior here, but this is probably just going to remain a bug in some case no matter what, with some potential workarounds in place for common situations. The right solution is for process to handle async exceptions correctly, but I'm not sure if that's possible while allowing it to be interruptible.

NorfairKing · 2021-03-22T04:06:03Z

I'm not sure it's relevant to this test case failing, but the approach you're taking here is, generally, broken. You cannot write to stdin in the same thread that reads from stdout, since that can lead to deadlock on buffers filling up. Unlikely to be relevant here, but worth mentioning.

OH, good point! I hope this isn't an issue in this simple case. I just needed some example of interacting with an external process so I figured cat was the simplest example but I'm open to other suggestions.

Also, it makes sense you can't reproduce with that process example, since it doesn't try to terminate the process afterwards (presumably where the error message is coming from).

Wait what? I just had another look at the source and withCreateProcess calls cleanupProcess in a bracket (http://hackage.haskell.org/package/process-1.6.11.0/docs/src/System.Process.html#withCreateProcess) and cleanupProcess calls terminateProcess (http://hackage.haskell.org/package/process-1.6.11.0/docs/src/System.Process.html#cleanupProcess).
Am I missing something?

My theory: there's a subtle race condition in process where it's not fulfilling its responsibilities around waitForProcess being idempotent in some cases.

Yes that seems to match my intuition as well.

The bug is definitely in process, and it's an async exception bug. It's exactly the kind of bug that typed-process is trying to work around. And unfortunately there's no obvious solution to it without breaking the goals of process.
I'm open to suggestions on best behavior here, but this is probably just going to remain a bug in some case no matter what, with some potential workarounds in place for common situations. The right solution is for process to handle async exceptions correctly, but I'm not sure if that's possible while allowing it to be interruptible.

Ah shoot, that's what I was afraid of. To be quite honest, I'm not convinced that I fully grasp what's going on here, but I definitely expected my example to "just work" or at least not crash. (I'm using a fancy withX function afterall..)

Whatever we do to solve this issue, could we please make sure to have a regression test for this behaviour if we decide that this
example shouldn't crash? I'd be happy to write the necessary tests.

(P.S. I still don't understand why my process example doesn't exhibit the problem.)

nh2 · 2021-03-23T03:26:54Z

Side note: This

      output <- SB.hGet outHandle (SB.length contents)
      when (output /= contents) $ fail "failed to read."

need not work. As far as I can tell, hGet reads "up to n" bytes, so output may well be Hello wo if the kernel decides so.

You need SB.hGetContents if you want to loop-read until the end.

Also nice repro!

bitc · 2021-12-23T14:06:26Z

I'm not sure I fully understand the problem, but would this idea work?

When running a process, immediately do an internal forkIO that just calls the waitpid system call. Then this mini-thread will never be interrupted directly with an async exception (and it can be internally interacted with via STM or MVars)

snoyberg · 2021-12-24T03:24:46Z

See my comment above

In our case, the polling thread in typed-process is calling waitForProcess and gets canceled. It has to be canceled to avoid a separate race condition: terminateProcess getting called on the wrong process! However, because waitForProcess isn't actually async-exception-safe, none of this works.

sol · 2023-05-23T12:21:26Z

This is the same as #69.

What's happening here @NorfairKing with your original example is that:

hGet returns with some, but not all of the output (as @nh2 already pointed out)
when (output /= contents) $ fail "failed to read." will result in an exception
We mess up in pCleanup, erroneously calling waitForProcess twice. This results in a second exception does not exist (No child processes), which masks the first one, resulting in the bug.

Or in other words: The bug is not that you get an exception in the first place, the bug is that you get the wrong exception.

A minimal reproducing example is at: #69 (comment)

Exactly how we mess up in pCleanup I described in great detail here: #70 (comment)

sol · 2023-05-23T12:34:30Z

However, because waitForProcess isn't actually async-exception-safe, none of this works.

@snoyberg we essentially have the following code (ignoring the fallback code for the single-threaded runtime):

ec <- unmask $ do
    P.waitForProcess pHandle
    -- Do we have any guarantee that we are not interrupted right here?
atomically $ putTMVar pExitCode ec

Do we have anything that guarantees us that we are not interrupted right after waitForProcess returns? If not, then I think even if waitForProcess were perfectly async-exception-safe, we would still have a bug, as at this point we lost the exit code: We successfully retrieved it from waitForProcess, but we failed to put it into pExitCode. Leaving pExitCode empty.

sol · 2023-05-23T19:22:48Z

If not, then I think even if waitForProcess were perfectly async-exception-safe, we would still have a bug

Actually no, if waitForProcess were async-exception-safe and idempotent then the current code would probably work (modulo handling of delegate_ctlc, which I think would still be broken). Somehow this slipped my attention. I naively assumed that waitForProcess behaves like waitpid here.

So, as I understand it, we have two issues:

waitForProcess is not well behaved when it gets interrupted. It is documented to be idempotent, but it loses that property when subjected to async-exceptions.
Just by looking at the type of pExitCode :: TMVar ExitCode I think it's evident that we have another bug. With this type waitExitCode can only ever return a value, never propagate an exception. Now, what value should waitExitCode return when there is no value? When waitForProcess fails with an exception (as it does when delegate_ctlc = True on ctrl-c)? It will diverge, a.k.a. deadlock.

For (1) is there an important reason that we have to interrupt waitForProcess in the first place? If yes, for what exact reason? Can we name it?

Otherwise, how about simply not to interrupt waitForProcess and leave all of the heavy lifting to async:

terminate the process with terminateProcess at which point waitForProcess will return on its own
use Async.wait waitingThread to get the exit code

For (2), we could change the type of pExitCode to TMVar (Either SomeException ExitCode) to allow waitExitCode to propagate exceptions (specifically UserInterupt when delegate_ctlc = True) instead of deadlocking. But then again, async already provides exactly that functionality. So how about relying on async yet again?

Ensure that waitForProcess is never called more than once (fixes #69) #70 addresses only (1)
Don't deadlock on delegate_ctlc (fixes #73) #74 addresses both (2) on top of Ensure that waitForProcess is never called more than once (fixes #69) #70

NorfairKing pushed a commit to NorfairKing/sydtest that referenced this issue Mar 19, 2021

turn of typed-process tests because of fpco/typed-process#38

935ff18

snoyberg mentioned this issue Mar 21, 2021

WIP: make waitForProcess async safe haskell/process#204

Open

NorfairKing pushed a commit to NorfairKing/typed-process that referenced this issue Mar 22, 2021

Regression test for fpco#38

18e3ae8

NorfairKing mentioned this issue Mar 22, 2021

Regression test 38 #39

Open

bitc mentioned this issue Dec 19, 2021

Bug: async exceptions and readProcess_ (waitForProcess: does not exist) #47

Open

robx mentioned this issue Jan 29, 2022

waitForProcess: does not exist (No child processes) haskell/process#226

Closed

TristanCacqueray mentioned this issue Jan 25, 2023

Nested scope stuck on Data.ByteString.hGetLine awkward-squad/ki#19

Closed

qaristote mentioned this issue Feb 1, 2023

Backends.Process: the test-suite may fail randomly tweag/smtlib-backends#15

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race condition in `waitForProcess` #38

Race condition in `waitForProcess` #38

NorfairKing commented Mar 19, 2021

NorfairKing commented Mar 19, 2021

NorfairKing commented Mar 19, 2021

snoyberg commented Mar 21, 2021

snoyberg commented Mar 21, 2021

snoyberg commented Mar 21, 2021

snoyberg commented Mar 21, 2021

NorfairKing commented Mar 22, 2021 •

edited

Loading

nh2 commented Mar 23, 2021

bitc commented Dec 23, 2021

snoyberg commented Dec 24, 2021

sol commented May 23, 2023

sol commented May 23, 2023 •

edited

Loading

sol commented May 23, 2023

Race condition in waitForProcess #38

Race condition in waitForProcess #38

Comments

NorfairKing commented Mar 19, 2021

NorfairKing commented Mar 19, 2021

NorfairKing commented Mar 19, 2021

snoyberg commented Mar 21, 2021

snoyberg commented Mar 21, 2021

snoyberg commented Mar 21, 2021

snoyberg commented Mar 21, 2021

NorfairKing commented Mar 22, 2021 • edited Loading

nh2 commented Mar 23, 2021

bitc commented Dec 23, 2021

snoyberg commented Dec 24, 2021

sol commented May 23, 2023

sol commented May 23, 2023 • edited Loading

sol commented May 23, 2023

Race condition in `waitForProcess` #38

Race condition in `waitForProcess` #38

NorfairKing commented Mar 22, 2021 •

edited

Loading

sol commented May 23, 2023 •

edited

Loading