
bindings/python and libev: work around future leak #2563

Merged
merged 5 commits into flux-framework:master on Dec 2, 2019

Conversation

@garlick (Member) commented Dec 2, 2019

It turns out that updating libev from 4.25 to 4.27 promotes the epoll_ctl() failure discussed in #2554 to an assertion failure:

ev_epoll.c:134: epoll_modify: Assertion `("libev: I/O watcher with invalid fd found in epoll_ctl", errno != EBADF && errno != ELOOP && errno != EINVAL)' failed.

so this PR updates the vendored copy of libev.

Additionally, it seems that our ev_flux and ev_zmq composite watchers mask errors raised by libev. They both embed io watchers, but use them only for the side effect of unblocking the reactor so the prep/check/idle callbacks can do the real event management. However, when libev has a file descriptor problem on an io watcher, it calls ev_feed_event() to make an EV_ERROR event pending on that watcher. Our composite watchers weren't checking for that, so I added a trivial fix and verified that, in libev 4.25, it also catches the epoll_ctl() EINVAL failure.
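
For illustration, here is a minimal sketch of that approach. The composite-watcher struct layout and the callback names are invented for the example (the real ev_flux.c/ev_zmq.c code differs); only the libev calls (ev_is_pending()/ev_clear_pending() and EV_ERROR) are the actual library API. The check callback looks for a pending EV_ERROR on the embedded io watcher and forwards it to the user's callback instead of ignoring it:

#include <stddef.h>
#include <ev.h>

/* Simplified stand-in for a composite watcher like ev_flux/ev_zmq. */
struct composite {
    ev_io io_w;            /* embedded io watcher: used only to unblock the loop */
    ev_prepare prepare_w;
    ev_check check_w;
    ev_idle idle_w;
    void (*cb) (struct ev_loop *loop, struct composite *w, int revents);
};

static void check_cb (struct ev_loop *loop, ev_check *w, int revents)
{
    struct composite *cw = (struct composite *)((char *)w
                         - offsetof (struct composite, check_w));

    /* If libev raised EV_ERROR on the embedded io watcher (e.g. because an
     * internal epoll_ctl() failure led to fd_kill()), surface it through the
     * user's callback instead of silently dropping it.  ev_clear_pending()
     * fetches and clears the pending revents; events on this watcher are
     * otherwise ignored, so clearing them here is harmless.
     */
    if (ev_clear_pending (loop, &cw->io_w) & EV_ERROR) {
        cw->cb (loop, cw, EV_ERROR);
        return;
    }

    /* ... normal prep/check/idle event management would happen here ... */
}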

Caveat: I don't understand it (and I poked at it a lot), but the 60 second pause we observed is still there even with the EV_ERROR event being handled. As far as I can tell, the loop blocks with the EV_ERROR event pending on the fd for this period, and only then do we handle it. (Previously we didn't handle it, and the check watcher ran when the loop unblocked, which allowed the future to be fulfilled.) I don't know where the timeout is coming from, and I don't know why the pending EV_ERROR doesn't unblock the loop immediately.

Finally, I pulled in the Python binding workaround for the future leak in job.submit() (and now job.wait()) proposed by @andre-merzky, with @SteVwonder's proposed change. This is a temporary workaround until the circular reference pointed out by @SteVwonder in #2549 is resolved. I verified the fix with:

$ flux python t/job-manager/submit-waitany.py 512
submit: 778295050240 /bin/true
submit: 778781589504 /bin/true
submit: 779184242688 /bin/true
submit: 779620450304 /bin/true
[snip]
wait: 1659065335808 Error: task(s) exited with exit code 1
wait: 1659870642176 Error: task(s) exited with exit code 1
wait: 1661078601728 Error: task(s) exited with exit code 1
$

The exit code 1 results are expected; the point is that it doesn't hang (libev 4.25) or assert (libev 4.27).

garlick and others added 5 commits December 1, 2019 17:32
Update vendored libev to current upstream.

This version asserts when epoll_ctl() fails with EINVAL
which partially addresses flux-framework#2554.

ev_epoll.c:134: epoll_modify: Assertion `("libev: I/O watcher with invalid fd found in epoll_ctl", errno != EBADF && errno != ELOOP && errno != EINVAL)' failed.

4.27 Thu Jun 27 22:43:44 CEST 2019
  - linux aio backend almost complete rewritten to work around its
    limitations.
  - epoll backend now mandatory for linux aio backend.
  - fail assertions more aggressively on invalid fd's detected
    in the event loop, do not just silently fd_kill in case of
    user error.
  - ev_io_start/ev_io_stop now verify the watcher fd using
    a syscall when EV_VERIFY is 2 or higher.

4.26 (EV only)
  - update to libecb 0x00010006.
  - new experimental linux aio backend (linux 4.18+).
  - removed redundant 0-ptr check in ev_once.
  - updated/extended ev_set_allocator documentation.
  - replaced EMPTY2 macro by array_needsize_noinit.
  - minor code cleanups.
  - epoll backend now uses epoll_create1 also after fork.

Problem: ev_flux ignores events on its internal io watcher,
so when libev raises EV_ERROR on it because something went wrong
internally (like flux-framework#2554), nothing happens.

The internal io watcher is only used for its side effect of
unblocking the reactor when pollevents edge triggers from
"no events" to "some events".  The prep/check/idle watchers
do the heavy lifting.

Add a check for pending EV_ERROR events on the io watcher in
the check callback.  If found, notify the user via the ev_flux
watcher callback.

Note: in libev 4.25, a failure of epoll_ctl() would internally
call fd_kill(), which would call ev_feed_event() to raise
EV_ERROR on the watcher.  With this patch, EV_ERROR is caught,
though only after a one minute delay that has not been
explained.  libev 4.27 turns that error into an assertion.

Since other error paths in libev call fd_kill(), this change may
be useful for other, as yet unseen problems.

Problem: ev_zmq ignores events on its internal io watcher,
so when libev raises EV_ERROR on it because something went wrong
internally, nothing happens.

The internal io watcher is only used for its side effect of
unblocking the reactor when pollevents edge triggers from
"no events" to "some events".  The prep/check/idle watchers
do the heavy lifting.

Add a check for pending EV_ERROR events on the io watcher in
the check callback.  If found, notify the user via the ev_zmq
watcher callback.

This tracks a similar change in libflux/ev_flux.c, based on
experience with an internal failure documented in flux-framework#2554.

Allow future.pimpl._clear() to be called from other parts
of the binding by dropping an underscore from its prefix.

The _clear method has some extra protections to ensure the C
destructor is only called once, thus future.pimpl._clear()
is the preferred way to explicitly dispose of a future.

This is a prereq for explicit cleanup of futures in the
job() class.

Originally proposed in flux-framework#2553.

Problem: job.submit() and job.wait() both seem to
leak futures.

The futures in these "synchronous" methods go out
of scope and thus should be automatically destroyed,
but due to a circular reference alluded to by
@SteVwonder in flux-framework#2549, they persist.

As a workaround, explicitly call _clear() on the
futures in these methods.

Credit goes to @andre-merzky for proposing the first
version of this patch in flux-framework#2553, based on a suggestion
by @grondo, with changes proposed by @SteVwonder.
Group effort!

Fixes flux-framework#2549.

@grondo (Contributor) commented Dec 2, 2019

LGTM! Thanks for running this down and putting it all together!

Boy, the "general notes about linux aio" in the new ev_linuxaio.c are entertaining and enlightening!

I tried recreating the original failure with a hacked-up flux-job submit, and we do hit the assertion failure once epoll_ctl returns EINVAL. However, the actual abort message leaves something to be desired (though it is unique enough that we'll likely remember to tell users what to look for):

494: 11069405790208
495: 11070294982656
496: 11070848630784
497: 11071653937152
498: 11072090144768
499: 11072895451136
lt-flux-job: ev_epoll.c:134: epoll_modify: Assertion `("libev: I/O watcher with invalid fd found in epoll_ctl", (*__errno_location ()) != 9 && (*__errno_location ()) != 40 && (*__errno_location ()) != 22)' failed.
Aborted (core dumped)

@garlick (Member, Author) commented Dec 2, 2019

Yeah, I liked my message better, but figured it was best to stay synced with upstream. If it becomes a problem we can always go back and tweak the message to be flux specific.

@garlick (Member, Author) commented Dec 2, 2019

> Boy, the "general notes about linux aio" in the new ev_linuxaio.c are entertaining and enlightening!

For sure! Linking it here for the entertainment of others :-)

@grondo (Contributor) commented Dec 2, 2019

> Yeah, I liked my message better, but figured it was best to stay synced with upstream. If it becomes a problem we can always go back and tweak the message to be flux specific.

Ok, well this seems like a good candidate to go in immediately. Merging!

@grondo merged commit 8b3476e into flux-framework:master on Dec 2, 2019

@garlick (Member, Author) commented Dec 2, 2019

Thanks!

@garlick deleted the plug_future_leak branch on December 2, 2019 at 19:51

@andre-merzky (Contributor) commented:

Thanks all! :-)
