wreck permission denied feeds into dangling ROUTER DEALER and PAIR sockets and death #1468
Comments
Ah maybe the job module needs to call Hard to know what would have caused the exec failure in the first place. |
That might help? The only thing here I think we might be able to address is figuring out why an exec failure of wrexecd became an instance failure rather than a failed job. It might be the atexit handler somehow polluting the overlay, or something else, but if we can at least keep it from taking the instance down with it, that would be really good. |
Good thought @garlick. I was able to reproduce the |
@trws: running the czmq atexit handler in the process forked from the job module (broker thread) would likely be fatal to the broker's overlay sockets, or cause a crash since closing zmq sockets in one thread while another thread is using them is (I believe) not allowed. |
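The concern above is that `atexit` handlers inherited from the parent run in the forked child when its exec fails. A minimal sketch of one common way to avoid that, assuming a plain fork/exec: call `_exit()` instead of `exit()` on the failure path, since `_exit()` does not run `atexit` handlers. The helper `run_child` is hypothetical and is not the job module's actual code.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, try to exec argv[0], and return the
 * child's exit code (or -1 if fork/wait fails or the child was killed). */
int run_child(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        /* Only reached if exec failed (e.g. EACCES or ENOENT).
         * _exit() terminates the child without running atexit handlers
         * that were registered in the parent before the fork. */
        _exit(127);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this shape, a permission-denied exec shows up in the parent as an ordinary exit code of 127, which the caller could report as a failed job.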
I wasn't able to reproduce the instance failure when reproducing this error. Could be because the instance wasn't busy? |
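If zeromq sockets can't survive a fork/destroy, I'm actually surprised zmq doesn't install its own pthread_atfork handlers. The job module closes all fds before exec, so I'm struggling to figure out how erroneous destruction of the stale zmq sockets in the forked child could affect the parent broker process. |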
This particular failure happened right in the middle of a major flood, since a ton of messages got unstuck all together. I could easily see it being an inconsistent state caused by that, if the child were allowed to continue after the exec. Maybe relying on FD_CLOEXEC rather than atfork? This might actually be a good place to try a switch to posix_spawn (we discussed it briefly way back when in #384) rather than explicit fork/exec; it should (hopefully) help with a bunch of built-in checking for failure conditions and cleanup. |
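For reference, a minimal sketch of what a `posix_spawn`-based launch could look like. The helper name `spawn_and_wait` is hypothetical, and this is not taken from the flux-core source. How exec failures are reported depends on the libc: `posix_spawnp` may return an errno value directly, or the child may exit with status 127, so a caller would need to handle both.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: spawn argv[0] (searched in PATH) and wait for it.
 * Returns the child's exit code, or -1 if the spawn itself failed. */
int spawn_and_wait(char *const argv[])
{
    pid_t pid;
    /* NULL file actions and attributes: inherit the parent's defaults. */
    int rc = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    if (rc != 0)
        return -1;              /* rc is an errno value such as EACCES */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    /* Some implementations report an exec failure here as exit code 127. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because no user code runs in the child between the fork and the exec, handlers registered in the parent have no opportunity to touch inherited state there.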
Oh good point! Sorry, I may have been hasty in my assumptions about the fatality.
…On Sun, Apr 15, 2018, 9:19 AM Mark Grondona ***@***.***> wrote:
If zeromq sockets can't survive a fork/destroy, I'm actually surprised zmq
doesn't install its own pthread_atfork handlers. The job module closes
all fds before exec, so I'm struggling to figure out how erroneous
destruction of the stale zmq sockets in the forked child could affect the
parent broker process.
|
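To illustrate the `pthread_atfork` mechanism mentioned in the quoted comment, here is a minimal sketch of a child-side handler. The names `install_fork_guard` and `fork_guard_ran_in_child` are hypothetical, and the handler only sets a flag; what a real library would need to do to its inherited sockets at that point is an open question in this thread.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set to 1 only in the child, by the handler below. */
static volatile int in_forked_child = 0;

/* Runs in the child immediately after fork() returns there. */
static void child_after_fork(void)
{
    in_forked_child = 1;
}

/* Register the child-side handler. Returns 0 on success. */
int install_fork_guard(void)
{
    return pthread_atfork(NULL, NULL, child_after_fork);
}

/* Fork once and report, via the child's exit code, whether the handler
 * ran in the child. Returns 1 if it ran, 0 if not, -1 on error. */
int fork_guard_ran_in_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(in_forked_child ? 1 : 0);
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The flag stays 0 in the parent, so code that destroys sockets could check it and skip teardown when it finds itself in a forked copy.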
I need to review how that stuff works, but all the UNIX fds backing zmq sockets for broker and modules are managed by one zmq service thread, so if the mechanism allows the forked thread to somehow signal the service thread, then teardown could still happen. |
I'd actually be willing to experiment with this. However, I think first the When we rewrite the subprocess library, we'll keep an optional use of Also, maybe by the time we revisit this, we could run modules as processes and the broker.exec service could be moved to a separate process, again alleviating the need for |
Oops, I was wrong about that. On my Ubuntu 16.04.4 LTS desktop there was no man page for it. I read from a googled man page that glibc 2.2 or newer was required, and then misread the local glibc version. Turns out I have glibc 2.23 and /usr/include/spawn.h exists, so probably my system is OK. |
No problem. |
vfork or clone would be good, if only for performance reasons. You're right, the posix_spawn interface itself isn't as useful as I recalled. On BSD derivatives it differs a lot from fork, performance-wise and in some other ways, but evidently not on Linux. |
The wreck exec system is worthless, remove it along with associated commands, tests, and support code. Since libjsc doesn't work without wreck, it is removed as well. Fixes flux-framework#1984 Closes flux-framework#1947 Closes flux-framework#1618 Closes flux-framework#1595 Closes flux-framework#1593 Closes flux-framework#1468 Closes flux-framework#1438 Closes flux-framework#1419 Closes flux-framework#1410 Closes flux-framework#915 Closes flux-framework#894 Closes flux-framework#866 Closes flux-framework#833 Closes flux-framework#774 Closes flux-framework#772 Closes flux-framework#335 Closes flux-framework#249
The wreck exec system is worthless, remove it along with associated commands, tests, and support code. Since libjsc doesn't work without wreck, it is removed as well. Fixes flux-framework#1984 Closes flux-framework#1947 Closes flux-framework#1618 Closes flux-framework#1595 Closes flux-framework#1593 Closes flux-framework#1534 Closes flux-framework#1468 Closes flux-framework#1443 Closes flux-framework#1438 Closes flux-framework#1419 Closes flux-framework#1410 Closes flux-framework#1407 Closes flux-framework#1393 Closes flux-framework#915 Closes flux-framework#894 Closes flux-framework#866 Closes flux-framework#833 Closes flux-framework#774 Closes flux-framework#772 Closes flux-framework#335 Closes flux-framework#249
closed by #1988 |
I have no idea why this happened, but this is the death output from a 1500 node run that died this morning. External evidence makes me think that one or more nodes hung, causing literally thousands of jobs to sit in either submitted or runrequest for hours, then unhung and scheduled all of those in about 5 seconds, printed this, and died an unceremonious death. I have the full (but rather voluminous) logfile if anyone wants it.