Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MacOS: posix_spawnp: invalid argument (Bad file descriptor) #251

Closed
psibi opened this issue Jun 16, 2022 · 23 comments · Fixed by #257
Closed

MacOS: posix_spawnp: invalid argument (Bad file descriptor) #251

psibi opened this issue Jun 16, 2022 · 23 comments · Fixed by #257

Comments

@psibi
Copy link
Contributor

psibi commented Jun 16, 2022

Using the process shipped with GHC 9.2.3 (and even the latest master), I have been able to reproduce a bug while running integration tests for Stack. Following points about it:

  • Reproducible only on MacOS. They work fine on both Windows and Linux.
  • It's trivial to reproduce it on Mac with the stack integration test suite. I tried to get a minimal example, but my attempts have been futile.

For reproducing it via Mac, use this branch:

/Users/ec2-user/.stack/programs/x86_64-osx/ghc-9.2.3/bin/ghc-pkg-9.2.3: startProcess: posix_spawnp: invalid argument (Bad file descriptor)
Main.hs: Exited with exit code: ExitFailure 1
CallStack (from HasCallStack):
  error, called at /Users/ec2-user/stack/test/integration/lib/StackTest.hs:63:34 in main:StackTest
  stack, called at /Users/ec2-user/stack/test/integration/tests/script-extra-dep/Main.hs:4:8 in main:Main

End of log for script-extra-dep
Failed tests:
- script-extra-dep - ExitFailure 1

A quick way to reproduce it doing something like this stack-integration-test -m 111-custom-snapshot.

Note that the above branch is sprinkled with lots of logs. It was done to help debugging. Let me know if you want me to remove it. I have also sprinkled some logs on my process fork and it indicates to me that the bug happens when creating the ghc-pkg process. Also the log indicates that I get the exit status code of 9.

I invoked dtruss using the following command:

sudo dtruss ~/.local/bin/stack-integration-test -m 111-custom-snapshot 

This is the dtruss logs for the above command: https://gist.github.com/psibi/0c3c89dd2b90012d7d9f3a64ceffb73a

@psibi
Copy link
Contributor Author

psibi commented Jun 16, 2022

There is some more info about the bug here: commercialhaskell/stack#5763 (comment)

@bgamari
Copy link
Contributor

bgamari commented Jun 16, 2022

Thanks for opening this!

Unfortunately, the dtruss output isn't terribly helpful as it exhibits a much different failure mode from what you describe here:

...
Using runghc located at /Users/ec2-user/.stack/programs/x86_64-osx/ghc-9.2.3/bin/runghc
runghc 9.2.3
Initializing/updating the original Pantry store
Checking for project config at: /Users/ec2-user/stack/stack.yaml
Loading project config file stack.yaml
You are not the owner of '/Users/ec2-user/.stack/'. Aborting to protect file permissions.
Retry with '--allow-different-user' to disable this precaution.
stack-integration-test: Received ExitFailure 1 when running
Raw command: /Users/ec2-user/.local/bin/stack update

I suspect the sudo is at fault here. Could you try again with --allow-different-user? Also, do add -f to your dtruss invocation to ensure that forked processes are followed.

@psibi
Copy link
Contributor Author

psibi commented Jun 16, 2022

Could you try again with --allow-different-user?

Not sure what you mean by that since stack accepts --allow-different-user. Here I'm reproducing the issue with the executable named stack-integration-test which is different than stack and doesn't accept that flag:

ec2-user@ip-10-0-4-6 stack % ~/.local/bin/stack-integration-test --help
Initiating Stack integration test running
Stack integration tests

Usage: stack-integration-test [-s|--speed SPEED] [-m|--match STRING]

Available options:
  -h,--help                Show this help text

Dtruss logs for the following command:

sudo dtruss -f ~/.local/bin/stack-integration-test -m 111-  

Log link: https://gist.github.com/psibi/dc56d065590253f033e1de6a0a820c4d

@bgamari
Copy link
Contributor

bgamari commented Jun 16, 2022

Not sure what you mean by that since stack accepts --allow-different-user

@psibi, you may need to either:

  • modify stack-integration-test to pass --allow-different-user to stack
  • temporarily chown root -R /Users/ec2-user/.stack/ to circumvent the check, or
  • convince Darwin to allow you to execute dtruss as an unprivileged user

Dtruss logs for the following command:

It looks like this failed for yet another reason:

stack-integration-test: Executable named runghc not found on path: ["/usr/local/bin","/usr/bin","/bin","/usr/sbin","/sbin"]

Are you sure PATH is set properly?

@psibi
Copy link
Contributor Author

psibi commented Jun 17, 2022

@bgamari I did these things:

  • Configured stack to use --allow-different-user
  • temporarily changed ownership also to root.

I cannot allow Darwin to execute dtruss as an unprivileged user since it seems it's not possible to do in AWS EC2's Mac instance. This is the command I ran:

sudo dtruss -f ~/.local/bin/stack-integration-test -m 111-custom-snapshot

Log link: https://gist.github.com/psibi/8c65361bab44fda64523e20d45790dc0

@bgamari
Copy link
Contributor

bgamari commented Jun 23, 2022

Unfortunately it appears that dtruss isn't capturing calls to posix_spawnp, likely because it is not a system call. I suppose I will need to try to reproduce this locally. Any hints for doing so, @psibi ?

@psibi
Copy link
Contributor Author

psibi commented Jun 24, 2022

@bgamari It's straightforward to reproduce it in MacOS with the stack codebase. You can follow these steps:

Once you have the executable ready, executing something like this will reproduce the bug:

$ stack-integration-test -m 111-custom-snapshot

The above command will run the integration test of 111-custom-snapshot and that will lead straight into this bug.

@bgamari
Copy link
Contributor

bgamari commented Jun 24, 2022

Build the branch and create the executable stack-integration-test out of it.

Specifically how should I go about building it such that it links against my process tree? Apologies for the silly question; I have relatively little experience with stack.

@psibi
Copy link
Contributor Author

psibi commented Jun 25, 2022

@bgamari Sorry, I should been more clear. These are the steps:

stack install --flag=stack:integration-tests stack
  • And then you can reproduce the issue with the stack-integration-test binary:
stack-integration-test -m 111-custom-snapshot

Let me know if you are stuck or need any more help. :-) (I don't have a personal Mac and I did my entire testing using Amazon's EC2 Mac instance. I can create a new instance and help you further, if any steps are unclear)

@bgamari
Copy link
Contributor

bgamari commented Jun 25, 2022

Thanks @psibi! Very helpful.

@mpickering
Copy link
Contributor

@psibi I can't build with your instructions. Can you please help?

process                      > [1 of 2] Compiling Main             ( /private/var/folders/zm/mbjlbmnj1bg9gnpsf60mwq7c0000gq/T/stack-31b596dfe164f4b4/process-1.6.14.0/Setup.hs, /private/var/folders/zm/mbjlbmnj1bg9gnpsf60mwq7c0000gq/T/stack-31b596dfe164f4b4/process-1.6.14.0/.stack-work/dist/x86_64-osx/Cabal-3.6.3.0/setup/Main.o )
process                      > [2 of 2] Compiling StackSetupShim   ( /Users/matt/.stack/setup-exe-src/setup-shim-mPHDZzAJ.hs, /private/var/folders/zm/mbjlbmnj1bg9gnpsf60mwq7c0000gq/T/stack-31b596dfe164f4b4/process-1.6.14.0/.stack-work/dist/x86_64-osx/Cabal-3.6.3.0/setup/StackSetupShim.o )
process                      > Linking /private/var/folders/zm/mbjlbmnj1bg9gnpsf60mwq7c0000gq/T/stack-31b596dfe164f4b4/process-1.6.14.0/.stack-work/dist/x86_64-osx/Cabal-3.6.3.0/setup/setup ...
process                      > Configuring process-1.6.14.0...
process                      > Warning: The 'build-type' is 'Configure' but there is no 'configure' script.
process                      > You probably need to run 'autoreconf -i' to generate it.
process                      > setup: configure script not found.

@mpickering
Copy link
Contributor

Solved by adding autoreconf to my path.

@mpickering
Copy link
Contributor

Seems that the error is due to using (setStdin closed pc) in sinkProcessStdout if you change it to use setStdin nullStream pc then it works.

@psibi
Copy link
Contributor Author

psibi commented Jul 29, 2022

@mpickering You mean in the Stack codebase ?

Because the same codebase is working for both Linux & Windows OS (and the previous process version used to work for Mac too).

@psibi
Copy link
Contributor Author

psibi commented Jul 29, 2022

Also, just to add context: I currently have an workaround for this in Mac: commercialhaskell/stack#5763 (comment)

But ideally it's good to avoid that workaround.

@mpickering
Copy link
Contributor

What I suggested is a workaround and highlights the cause of the issue. Your workaround also works for the same reason, because sinkProcessStderrStdout doesn't try to close the StdIn handle.

@mpickering
Copy link
Contributor

In particular I can't reproduce this directly but I think the error is something to do with nested posix_spawnp, ie one Haskell executable calling another haskell executable, calling another executable.

@mpickering
Copy link
Contributor

Here's at least one test which does something different on Linux/Mac but I'm not sure it's the same issue.

https://gist.github.com/bd27f3db19ddbad480c8eae212058103

If you compile both of these files then run ./T2, it fails on OSX for me with TestProcess: /Users/matt/.ghcup/bin/ghc-pkg: createProcess: posix_spawnp: failed (Undefined error: 0) but works on linux (NixOS).

@robx
Copy link
Contributor

robx commented Aug 2, 2022

Here's at least one test which does something different on Linux/Mac but I'm not sure it's the same issue.

I tried reproducing this on a C level; I'm not getting quite the same errors as reported here so it's probably something slightly different, but the program below (compile with cc spawn.c, then run e.g. ./a.out echo hello) works on Linux, while it fails on macOS with posix_spawnp: No such file or directory. So on macOS asking for an already-closed file descriptor to be closed fails.

spawn.c
#include <stdio.h>
#include <stdlib.h>
#include <spawn.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv) {
	pid_t pid;
	posix_spawn_file_actions_t fa;
	
	if (argc < 2) {
		exit(0);
	}

        if (close(STDIN_FILENO) != 0) {
		perror("close");
		exit(1);
	}

	if (posix_spawn_file_actions_init(&fa) != 0) {
		perror("posix_spawn_file_actions_init");
		exit(1);
	}
	if (posix_spawn_file_actions_addclose(&fa, STDIN_FILENO) != 0) {
		perror("posix_spawn_file_actions_addclose");
		exit(1);
	}
	if (posix_spawnp(&pid, argv[1], &fa, NULL, &argv[1], NULL) != 0) {
		perror("posix_spawnp");
		exit(1);
	}
	waitpid(pid, NULL, 0);
}

@bgamari
Copy link
Contributor

bgamari commented Aug 4, 2022

Thanks for narrowing this down, @mpickering. The problem indeed appears to be that posix_spawnp is failing due to close being called on an already-closed fd. It's a bit unclear what the specification intends here. It would make sense if addclose would "eat" errors since this would provide a race-free way of ensuring a subprocess executes with a closed fd. However, POSIX 1-2008 says of posix_spawn_file_actions_addclose:

It shall not be considered an error for the fildes argument passed to these functions to specify a file descriptor for which the specified operation could not be performed at the time of the call. Any such error will be detected when the associated file actions object is later used during a posix_spawn() or posix_spawnp() operation.

Which suggests that the posix_spawnp should fail if an addclose action is added on a closed fd. Some people recommend avoiding addclose for this and other reasons, instead advocating use of O_CLOEXEC. However, this seems extremely racy in a multithreaded program (e.g. what if two threads attempt to spawn a subprocess, one with stdin open and the other with it closed).

@bgamari
Copy link
Contributor

bgamari commented Aug 4, 2022

The semantics suggested above are quite unfortunate since addclose throwing an error on a closed fd precludes our one means of reliably starting a subprocess with a closed stdin/stdout/stderr. This seems like quite an oversight on the part of the Open Group and I can only imagine that this is why glibc and musl rather opted to implement the non-erroring semantics, despite the fact that POSIX specifies otherwise.

We can, however, work around this by using the fact that posix_spawn_file_actions_addopen will close the fd being opened if it is already open. For instance, if we want to ensure that stdin is closed we first addopen(stdin, "/dev/null") (serving to close the inherited stdin, if it exists, and ensuring that there is a valid handle to close) and then call addclose(stdin). Given that this adds a bit of spawn overhead, we should probably guard this on a configure check.

bgamari added a commit to bgamari/process that referenced this issue Aug 4, 2022
Previously to spawn a process with a closed standard handle, we
would use `posix_spawn_file_action_addclose`. However, it turns out that
POSIX specifies that `spawnp()` may fail if `addclose()` is used on an
fd that is already closed. While glibc and musl appear to ignore this
aspect of the specification, Darwin indeed follows it leading to haskell#251.

This behavior is rather unfortunate as
`posix_spawn_file_action_addclose` is a convenient way to close a handle
in a subprocess in a race-free manner (e.g. unlike `O_CLOEXEC`, which is
global). To avoid haskell#251 we must first use
`posix_spawn_file_action_addopen` on the fd (e.g. opening `/dev/null`)
to be closed to ensure that it is valid, which has the side-effect of
closing the inherited fd. We can then safely use
`posix_spawn_file_action_addclose` to close the fd.

Fixes haskell#251.
@bgamari bgamari mentioned this issue Aug 4, 2022
@bgamari
Copy link
Contributor

bgamari commented Aug 4, 2022

@mpickering has confirmed that #257 fixes the Stack reproducer.

@psibi
Copy link
Contributor Author

psibi commented Aug 5, 2022

Thanks everyone for fixing this!

bgamari added a commit to bgamari/process that referenced this issue Nov 1, 2022
It turns out that this test is subtly broken. In particular, the
test will fail if any file is opened in the subprocess before the child
is run since the closed fd 0 may be reused for the new file.
This tends to happen in the threaded RTS due to the event manager's
control pipe (see GHC #22395). Unfortunately, it's not really clear how
else haskell#251 can reliably be tested.
Mistuke pushed a commit to Mistuke/process that referenced this issue Mar 11, 2023
Previously to spawn a process with a closed standard handle, we
would use `posix_spawn_file_action_addclose`. However, it turns out that
POSIX specifies that `spawnp()` may fail if `addclose()` is used on an
fd that is already closed. While glibc and musl appear to ignore this
aspect of the specification, Darwin indeed follows it leading to haskell#251.

This behavior is rather unfortunate as
`posix_spawn_file_action_addclose` is a convenient way to close a handle
in a subprocess in a race-free manner (e.g. unlike `O_CLOEXEC`, which is
global). To avoid haskell#251 we must first use
`posix_spawn_file_action_addopen` on the fd (e.g. opening `/dev/null`)
to be closed to ensure that it is valid, which has the side-effect of
closing the inherited fd. We can then safely use
`posix_spawn_file_action_addclose` to close the fd.

Fixes haskell#251.
Mistuke pushed a commit to Mistuke/process that referenced this issue Mar 11, 2023
It turns out that this test is subtly broken. In particular, the
test will fail if any file is opened in the subprocess before the child
is run since the closed fd 0 may be reused for the new file.
This tends to happen in the threaded RTS due to the event manager's
control pipe (see GHC #22395). Unfortunately, it's not really clear how
else haskell#251 can reliably be tested.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants