-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bindings/python and libev: work around future leak #2563
Conversation
Update vendored libev to current upstream. This version asserts when epoll_ctl() fails with EINVAL which partially addresses flux-framework#2554. ev_epoll.c:134: epoll_modify: Assertion `("libev: I/O watcher with invalid fd found in epoll_ctl", errno != EBADF && errno != ELOOP && errno != EINVAL)' failed. 4.27 Thu Jun 27 22:43:44 CEST 2019 - linux aio backend almost complete rewritten to work around its limitations. - epoll backend now mandatory for linux aio backend. - fail assertions more aggressively on invalid fd's detected in the event loop, do not just silently fd_kill in case of user error. - ev_io_start/ev_io_stop now verify the watcher fd using a syscall when EV_VERIFY is 2 or higher. 4.26 (EV only) - update to libecb 0x00010006. - new experimental linux aio backend (linux 4.18+). - removed redundant 0-ptr check in ev_once. - updated/extended ev_set_allocator documentation. - replaced EMPTY2 macro by array_needsize_noinit. - minor code cleanups. - epoll backend now uses epoll_create1 also after fork.
Problem: ev_flux ignores events on its internal io watcher, so when libev raises EV_ERROR on it because something went wrong internally (like flux-framework#2554), nothing happens. The internal io watcher is only used for its side effect of unblocking the reactor when pollevents edge triggers from "no events" to "some events". The prep/check/idle watchers do the heavy lifting. Add a check for pending EV_ERROR events on the io watcher in the check callback. If found, notify the user via the ev_flux watcher callback. Note: in libev 4.25, a failure of epoll_ctl() would internally call fd_kill(), which would call ev_feed_event() to raise EV_ERROR on the watcher. With this patch, EV_ERROR is caught; however, only after a one minute delay, which has not been explained. libev 4.27 turns that error into an assertion. Since other error paths in libev call fd_kill(), this change may be useful for other as yet unseen problems.
Problem: ev_zmq ignores events on its internal io watcher, so when libev raises EV_ERROR on it because something went wrong internally, nothing happens. The internal io watcher is only used for its side effect of unblocking the reactor when pollevents edge triggers from "no events" to "some events". The prep/check/idle watchers do the heavy lifting. Add a check for pending EV_ERROR events on the io watcher in the check callback. If found, notify the user via the ev_zmq watcher callback. This tracks a similar change in libflux/ev_flux.c, based on experience with an internal failure documented in flux-framework#2554.
Allow future.pimpl._clear() to be called from other parts of the binding by dropping an underscore from its prefix. The _clear method has some extra protections to ensure the C destructor is only called once, thus future.pimpl._clear() is the preferred way to explicitly dispose of a future. This is a prereq for explicit cleanup of futures in the job() class. Originally proposed in flux-framework#2553.
Problem: job.submit() and job.wait() both seem to leak futures. The futures in these "synchronous" methods go out of scope and thus should be automatically destroyed, but due to a circular reference alluded to by @SteVwonder in flux-framework#2549, they persist. As a workaround, explicitly call _clear() on the futures in these methods. Credit goes to @andre-merzky for proposing the first version of this patch in flux-framework#2553, based on a suggestion by @grondo, with changes proposed by @SteVwonder. Group effort! Fixes flux-framework#2549.
LGTM! Thanks for running this down and putting it all together! Boy, the "general notes about linux aio" in the new I tried recreating the original failure with a hacked up
|
Yeah, I liked my message better, but figured it was best to stay synced with upstream. If it becomes a problem we can always go back and tweak the message to be flux specific. |
For sure! Linking it here for the entertainment of others :-) |
Ok, well this seems like a good candidate to go in immediately. Merging! |
Thanks! |
Thanks all! :-) |
It turns out that updating libev from 4.25 to 4.27 promotes the
epoll_ctl()
failure discussed in #2554 to an assertion failure:so this PR updates the vendored copy of libev.
Additionally, it seems that our
ev_flux
andev_zmq
composite watchers mask errors thrown by libev. They both embed io watchers, but use them for the side effect of unblocking the reactor so prep/check/idle callbacks can do the real event management. However, when libev has a file descriptor problem on an io watcher, it callsev_feed_event()
to make an EV_ERROR event pending on the watcher. Our composite watchers weren't checking for that, so I added a trivial fix for that and verified that in libev 4.25, this also caught theepoll_ctl()
EINVAL failure.Caveat: I don't understand it (and I poked at it a lot) but the 60 second pause we observed is still here even with the EV_ERROR event being handled. As far as I can tell, the loop blocks with the EV_ERROR event pending on the fd for this period. Then we handle it. (Before we didn't handle it and the check watcher ran when the loop unblocked, which allowed the future to be fulfilled) . I don't know where the timeout is coming from and I don't know why the pending EV_ERROR doesn't unblock the loop immediately.
Finally, pulled in the python binding workaround for the future leak in
job.submit()
and nowjob.wait()
proposed by @andre-merzky, with @SteVwonder's proposed change. This is a temporary workaround until the circular reference pointed out by @SteVwonder in #2549 is resolved. I verified the fix withThe exit code 1 results are expected - the point is it doesn't hang (libev 4.25) or assert (libev 4.27).