wreck permission denied feeds into dangling ROUTER DEALER and PAIR sockets and death #1468

Closed
trws opened this issue Apr 14, 2018 · 14 comments

@trws
Member

trws commented Apr 14, 2018

I have no idea why this happened, but this is the death output from a 1500-node run that died this morning. External evidence makes me think that one or more nodes hung, causing literally thousands of jobs to sit in either submitted or runrequest for hours, then unhung and scheduled all of those in about 5 seconds, printed this, and died an unceremonious death. I have the full (but rather voluminous) logfile if anyone wants it.

broker: entering event loop
broker: entering event loop
initial load complete, reloading sched
sched reload complete, preparing environment
37086:owner
59021:owner
25332:owner
37084:owner
59751:owner
ENV READY
launching shell without PTY
echo $FLUX_URI
local:///var/tmp/flux-lOf9SM
wrexecd exec: Permission denied
E: (flux-broker) 18-04-14 10:36:21 dangling 'ROUTER' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/overlay.c:478
E: (flux-broker) 18-04-14 10:36:21 dangling 'DEALER' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/overlay.c:549
E: (flux-broker) 18-04-14 10:36:21 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:21 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:21 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:21 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:21 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:21 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:21 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:21 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:21 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:21 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:21 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:21 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:21 dangling sockets: cannot terminate ZMQ safely
wrexecd exec: Permission denied
E: (flux-broker) 18-04-14 10:36:23 dangling 'ROUTER' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/overlay.c:478
E: (flux-broker) 18-04-14 10:36:23 dangling 'DEALER' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/overlay.c:549
E: (flux-broker) 18-04-14 10:36:23 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:23 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:23 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:23 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:23 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:23 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:23 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:23 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:23 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:23 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:23 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:23 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:23 dangling sockets: cannot terminate ZMQ safely
wrexecd exec: Permission denied
E: (flux-broker) 18-04-14 10:36:24 dangling 'ROUTER' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/overlay.c:478
E: (flux-broker) 18-04-14 10:36:24 dangling 'DEALER' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/overlay.c:549
E: (flux-broker) 18-04-14 10:36:24 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:24 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:24 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:24 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:24 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:24 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:24 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:24 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:24 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:24 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:24 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:24 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:24 dangling sockets: cannot terminate ZMQ safely
wrexecd exec: Permission denied
E: (flux-broker) 18-04-14 10:36:25 dangling 'ROUTER' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/overlay.c:478
E: (flux-broker) 18-04-14 10:36:25 dangling 'DEALER' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/overlay.c:549
E: (flux-broker) 18-04-14 10:36:25 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:25 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:25 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:25 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:25 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:25 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:25 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:25 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:25 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:25 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:25 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:25 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:25 dangling sockets: cannot terminate ZMQ safely
wrexecd exec: Permission denied
E: (flux-broker) 18-04-14 10:36:30 dangling 'ROUTER' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/overlay.c:478
E: (flux-broker) 18-04-14 10:36:30 dangling 'DEALER' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/overlay.c:549
E: (flux-broker) 18-04-14 10:36:30 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:30 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:30 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:30 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:30 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:30 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:30 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:30 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:30 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:30 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:30 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/broker/module.c:571
E: (flux-broker) 18-04-14 10:36:30 dangling 'PAIR' socket created at /g/g12/scogland/projects/flux/flux-core/src/connectors/shmem/shmem.c:259
E: (flux-broker) 18-04-14 10:36:30 dangling sockets: cannot terminate ZMQ safely
[mpiexec@sierra503] control_cb (pm/pmiserv/pmiserv_cb.c:200): assert (!closed) failed
[mpiexec@sierra503] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@sierra503] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@sierra503] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
@garlick
Member

garlick commented Apr 15, 2018

Ah maybe the job module needs to call _exit() instead of exit() upon exec failure to avoid triggering czmq's atexit handlers in the forked child?

Hard to know what would have caused the exec failure in the first place.
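The difference between exit() and _exit() after a failed exec can be seen in a small standalone sketch. This is not the job module code: `cleanup_handler` is a stand-in for a library atexit handler such as czmq's, `handler_ran_in_child` is a hypothetical helper, and `/nonexistent/wrexecd` is a made-up path chosen only so that exec fails.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of a pipe; only ever set in the forked child. */
static int handler_fd = -1;

/* Stand-in for a library atexit handler (hypothetical, not czmq's code). */
static void cleanup_handler(void)
{
    if (handler_fd >= 0) {
        ssize_t rc = write(handler_fd, "x", 1);
        (void)rc;
    }
}

/* Fork, fail an exec on purpose, then leave the child via exit() or _exit().
 * Returns 1 if the atexit handler ran in the child, 0 if not, -1 on error. */
int handler_ran_in_child(int use_underscore_exit)
{
    static int registered = 0;
    int pfd[2];
    char c;

    if (!registered) {
        if (atexit(cleanup_handler) != 0)
            return -1;
        registered = 1;
    }
    if (pipe(pfd) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }
    if (pid == 0) {
        close(pfd[0]);
        handler_fd = pfd[1];
        /* Hypothetical path, chosen so that exec fails. */
        execl("/nonexistent/wrexecd", "wrexecd", (char *)NULL);
        if (use_underscore_exit)
            _exit(127);   /* skips atexit handlers */
        exit(127);        /* runs atexit handlers inherited from the parent */
    }

    close(pfd[1]);
    ssize_t n = read(pfd[0], &c, 1);   /* 0 means EOF: handler wrote nothing */
    close(pfd[0]);
    if (waitpid(pid, NULL, 0) < 0)
        return -1;
    return n < 0 ? -1 : (n == 1);
}
```

Calling `handler_ran_in_child(0)` exercises the exit() path and `handler_ran_in_child(1)` the _exit() path; only the former lets the inherited handler run in the child.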

@trws
Member Author

trws commented Apr 15, 2018

That might help? The only thing here I think we might be able to address is figuring out why an exec failure of wrexecd became an instance failure rather than a failed job. It might be the atexit somehow polluting the overlay or something else, but if we can get it to at least avoid taking the instance with it, that would be really good.

@grondo
Contributor

grondo commented Apr 15, 2018

> Ah maybe the job module needs to call _exit() instead of exit() upon exec failure to avoid triggering czmq's atexit handlers in the forked child?

Good thought @garlick. I was able to reproduce the dangling '*' socket errors in a builddir by removing exec permissions on the built wrexecd libtool script. Changing to _exit (255) silences these errors. Unfortunately, the job module does not collect the child's exit status, so the job still gets stuck in runrequest or reserved state.
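For reference, collecting the child's status is what would let a parent notice the failed exec at all. A minimal sketch, assuming a plain fork/exec/waitpid pattern rather than whatever the job module actually does; `run_and_collect` is a hypothetical helper, and the 255 exit code just mirrors the `_exit (255)` mentioned above.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork and exec argv[0], then reap the child.
 * Returns the child's exit code, 128 + signal number if it was killed by a
 * signal, or -1 if fork/waitpid failed. An exec failure shows up as 255. */
int run_and_collect(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execv(argv[0], argv);
        _exit(255);               /* exec failed; skip atexit handlers */
    }
    /* Retry if a signal interrupts the wait. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

One limitation of this approach is that 255 from a failed exec cannot be told apart from a program that itself exits with 255; a close-on-exec pipe carrying errno back to the parent is a common way around that.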

@garlick
Member

garlick commented Apr 15, 2018

@trws: running the czmq atexit handler in the process forked from the job module (broker thread) would likely be fatal to the broker's overlay sockets, or cause a crash since closing zmq sockets in one thread while another thread is using them is (I believe) not allowed.

@grondo
Contributor

grondo commented Apr 15, 2018

I wasn't able to reproduce the instance failure when reproducing this error. Could be because the instance wasn't busy?

@grondo
Contributor

grondo commented Apr 15, 2018

If zeromq sockets can't survive a fork/destroy, I'm actually surprised zmq doesn't install its own pthread_atfork handlers. The job module closes all fds before exec, so I'm struggling to figure out how erroneous destruction of the stale zmq sockets in the forked child could affect the parent broker process.

@trws
Member Author

trws commented Apr 15, 2018

This particular failure happened right in the middle of a major flood, since a ton of messages got unstuck all together. I could easily see it being an inconsistent state caused by that if the child were allowed to continue after the exec, maybe relying on FD_CLOEXEC rather than atfork?

This might actually be a good place to try a switch to using posix_spawn (we discussed it briefly way back when in #384) rather than explicit fork/exec, it should (hopefully), help with a bunch of built-in checking for failure conditions and cleanup.
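As a rough illustration of what the posix_spawn route looks like, here is a sketch under stated assumptions: `spawn_and_wait` is a hypothetical helper, no file actions or spawn attributes are used, and the caller's environment is passed through. POSIX allows an exec failure to be reported either as a nonzero return from posix_spawn() or as a child that exits with status 127, so the sketch treats both as failure.

```c
#include <errno.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Spawn argv[0] with posix_spawn() and wait for it.
 * Returns the child's exit code (0-255), or -1 if the spawn failed or the
 * child did not exit normally. *spawn_errno receives posix_spawn()'s
 * return value (0 on success, an errno value otherwise). */
int spawn_and_wait(char *const argv[], int *spawn_errno)
{
    pid_t pid;
    int status;
    int rc = posix_spawn(&pid, argv[0], NULL, NULL, argv, environ);

    *spawn_errno = rc;
    if (rc != 0)
        return -1;                /* failure reported directly to the caller */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    /* Some implementations report an exec failure here as exit code 127. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would check for a nonzero result in either form, for example treating both -1 and 127 as "wrexecd never started".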

@garlick
Member

garlick commented Apr 15, 2018 via email

@garlick
Member

garlick commented Apr 15, 2018

> I'm struggling to figure out how erroneous destruction of the stale zmq sockets in the forked child could affect the parent broker process.

I need to review how that stuff works, but all the UNIX fds backing zmq sockets for broker and modules are managed by one zmq service thread, so if the mechanism allows the forked thread to somehow signal the service thread, then teardown could still happen.

@grondo
Contributor

grondo commented Apr 16, 2018

> This might actually be a good place to try a switch to using posix_spawn (we discussed it briefly way back when in #384) rather than explicit fork/exec, it should (hopefully), help with a bunch of built-in checking for failure conditions and cleanup.

I'd actually be willing to experiment with this. However, I think first the job module should be converted to using local cmb.exec to launch the wrexecds, so that failure of the child (failed exec, non-zero exit code) can at least be noted by the job module with a potential for cleanup. Then we'd have to look at converting the subprocess library to optionally use vfork(2), which could be a pain for the current version. (Turns out posix_spawn isn't fully available on all distros now according to @garlick, though it is available on RHEL7)

When we rewrite the subprocess library, we'll keep an optional use of vfork in mind, specifically for launching subprocesses from the broker. vfork won't work for our other use cases (mainly job shell) since we'll need to use the nicer fork/exec abstraction to complete work in the parent after the fork, but before the child calls exec (e.g. moving between cgroups). However, it is assumed the job shell will have much smaller RSS so the hit of COW with fork is probably negligible there.

Also, maybe by the time we revisit this, we could run modules as processes and the broker.exec service could be moved to a separate process, again alleviating the need for vfork there as well.

@garlick
Member

garlick commented Apr 16, 2018

> (Turns out posix_spawn isn't fully available on all distros now according to @garlick, though it is available on RHEL7)

Oops, I was wrong about that. On my Ubuntu 16.04.4 LTS desktop there was no man page for it. I read from a googled man page that glibc 2.2 or newer was required, and then misread the local glibc version. Turns out I have glibc 2.23 and /usr/include/spawn.h exists, so probably my system is OK.

@grondo
Contributor

grondo commented Apr 16, 2018

No problem. posix_spawn is fairly limiting since it was apparently designed to allow execution on MMU-less systems, but also has a kind of crazy interface. It is apparently implemented via vfork() in glibc anyway, so we could just use that when/if we do this experiment.
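If the experiment goes the direct vfork() route, the shape would be roughly as follows. This is only a sketch, not the subprocess library: `vfork_exec_wait` is a hypothetical helper, and the 127 exit code is an arbitrary choice. The key constraint is that the vfork child borrows the parent's address space, so it should do nothing except exec or _exit().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* ask glibc to declare vfork() */
#endif
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Launch argv[0] via vfork()+execv() and wait for it.
 * Returns the child's exit code, or -1 on error or abnormal termination.
 * An exec failure shows up as exit code 127. */
int vfork_exec_wait(char *const argv[])
{
    int status;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: only exec or _exit() here. No exit(), no stdio,
         * no writes to the parent's variables. */
        execv(argv[0], argv);
        _exit(127);
    }
    /* The parent resumes once the child has called exec or _exit(). */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the child cannot safely run arbitrary code before exec, anything like closing descriptors would have to be arranged in the parent beforehand, for example by marking them FD_CLOEXEC.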

@trws
Member Author

trws commented Apr 17, 2018

vfork or clone would be good, if only for performance reasons. You're right, the posix_spawn interface itself isn't as useful as I recalled. It's a very large difference from fork, performance-wise and in some other ways, on BSD derivatives, but evidently not on Linux.

grondo added a commit to grondo/flux-core that referenced this issue Feb 5, 2019
The wreck exec system is worthless, remove it along with associated
commands, tests, and support code.

Since libjsc doesn't work without wreck, it is removed as well.

Fixes flux-framework#1984
Closes flux-framework#1947
Closes flux-framework#1618
Closes flux-framework#1595
Closes flux-framework#1593
Closes flux-framework#1468
Closes flux-framework#1438
Closes flux-framework#1419
Closes flux-framework#1410
Closes flux-framework#915
Closes flux-framework#894
Closes flux-framework#866
Closes flux-framework#833
Closes flux-framework#774
Closes flux-framework#772
Closes flux-framework#335
Closes flux-framework#249
grondo added a commit to grondo/flux-core that referenced this issue Feb 5, 2019
The wreck exec system is worthless, remove it along with associated
commands, tests, and support code.

Since libjsc doesn't work without wreck, it is removed as well.

Fixes flux-framework#1984

Closes flux-framework#1947
Closes flux-framework#1618
Closes flux-framework#1595
Closes flux-framework#1593
Closes flux-framework#1534
Closes flux-framework#1468
Closes flux-framework#1443
Closes flux-framework#1438
Closes flux-framework#1419
Closes flux-framework#1410
Closes flux-framework#1407
Closes flux-framework#1393
Closes flux-framework#915
Closes flux-framework#894
Closes flux-framework#866
Closes flux-framework#833
Closes flux-framework#774
Closes flux-framework#772
Closes flux-framework#335
Closes flux-framework#249
grondo added a commit to grondo/flux-core that referenced this issue Feb 9, 2019
The wreck exec system is worthless, remove it along with associated
commands, tests, and support code.

Since libjsc doesn't work without wreck, it is removed as well.

Fixes flux-framework#1984

Closes flux-framework#1947
Closes flux-framework#1618
Closes flux-framework#1595
Closes flux-framework#1593
Closes flux-framework#1534
Closes flux-framework#1468
Closes flux-framework#1443
Closes flux-framework#1438
Closes flux-framework#1419
Closes flux-framework#1410
Closes flux-framework#1407
Closes flux-framework#1393
Closes flux-framework#915
Closes flux-framework#894
Closes flux-framework#866
Closes flux-framework#833
Closes flux-framework#774
Closes flux-framework#772
Closes flux-framework#335
Closes flux-framework#249
@grondo
Contributor

grondo commented Feb 13, 2019

closed by #1988

@grondo grondo closed this as completed Feb 13, 2019
chu11 pushed a commit to chu11/flux-core that referenced this issue Feb 13, 2019
The wreck exec system is worthless, remove it along with associated
commands, tests, and support code.

Since libjsc doesn't work without wreck, it is removed as well.

Fixes flux-framework#1984
Closes flux-framework#1947
Closes flux-framework#1618
Closes flux-framework#1595
Closes flux-framework#1593
Closes flux-framework#1468
Closes flux-framework#1438
Closes flux-framework#1419
Closes flux-framework#1410
Closes flux-framework#915
Closes flux-framework#894
Closes flux-framework#866
Closes flux-framework#833
Closes flux-framework#774
Closes flux-framework#772
Closes flux-framework#335
Closes flux-framework#249