Hang when submitting many jobs via Python #2549
Is this hang happening with the script you included in #2548? https://gist.github.com/andre-merzky/a6f1eb33dc5c55c51438e041a0349ae7 If so, can you include the |
@dongahn: no worries, happy to do some testing... When I interrupt the hanging script, I see this backtrace:
I use flux-core master @ 2e299e2 and flux-sched master @ b54ff48. I did not try to use standalone commands ( Let me know if you need more details or want me to run something! Thanks, Andre.
PS.: the script uses some helper from our |
I can reproduce this with a modified version of your script. In my testing, I added a print for each submitted job with timing information to get a sense of what was going on. Interestingly, the test always gets to 499 jobs before appearing to stop. However, this is not a hang but the job submission just seems to slow way down:
There is almost exactly 60s between the job submissions after 😕 |
I found it was pretty simple to modify However, on a hunch I removed the This leads me to believe there may be two issues here:
Hey @grondo, you are right: adding an explicit future destroy solves this :-) I did not attempt to track down the behavior of the reactor, that's over my head right now... |
FWIW, I believe this fixes the underlying circular reference:
diff --git a/src/bindings/python/flux/wrapper.py b/src/bindings/python/flux/wrapper.py
index 4a58e7f12..7a328f075 100644
--- a/src/bindings/python/flux/wrapper.py
+++ b/src/bindings/python/flux/wrapper.py
@@ -18,6 +18,7 @@ import os
import errno
import inspect
import six
+import weakref
class MissingFunctionError(Exception):
@@ -318,15 +319,15 @@ class Wrapper(WrapperBase):
setattr(self, name, fun)
return fun
- new_fun = self.check_wrap(fun, name)
- new_method = six.create_bound_method(new_fun, self)
+ new_fun = self.check_wrap(fun, name)
+ new_method = six.create_bound_method(new_fun, weakref.proxy(self))
# Store the wrapper function into the instance
# to prevent a second lookup
setattr(self, name, new_method)
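The effect of swapping `self` for `weakref.proxy(self)` can be reproduced without Flux. The following is a minimal sketch, not the Flux wrapper itself: `Handle` and `freed_without_gc` are hypothetical names, `types.MethodType` stands in for `six.create_bound_method`, and the result assumes CPython's reference counting.

```python
import gc
import types
import weakref


class Handle:
    """Hypothetical stand-in for a wrapper that caches bound methods on itself."""

    def __init__(self, use_proxy):
        def method(self):
            return "called"

        target = weakref.proxy(self) if use_proxy else self
        # Caching a bound method on the instance: with a strong `self` this
        # forms the cycle  instance -> __dict__ -> bound method -> instance.
        self.cached = types.MethodType(method, target)


def freed_without_gc(use_proxy):
    """Return True if the object is reclaimed by reference counting alone."""
    gc.disable()  # keep the cycle collector out of the picture
    try:
        obj = Handle(use_proxy)
        ref = weakref.ref(obj)
        del obj
        return ref() is None
    finally:
        gc.enable()
        gc.collect()  # clean up any cycle left behind by the strong variant


print(freed_without_gc(use_proxy=False))
print(freed_without_gc(use_proxy=True))
```

With a strong reference the object outlives `del` until the cycle collector runs; with the proxy it is released as soon as the last outside reference goes away. One trade-off of this approach is that a proxy-bound method raises `ReferenceError` if it is called after its owner has been freed.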
I can confirm that this does fix the problem! Edit: Except now
Problem: job.submit() and job.wait() both seem to leak futures. The futures in these "synchronous" methods go out of scope and thus should be automatically destroyed, but due to a circular reference alluded to by @SteVwonder in flux-framework#2549, they persist. As a workaround, explicitly call _clear() on the futures in these methods. Credit goes to @andre-merzky for proposing the first version of this patch in flux-framework#2553, based on a suggestion by @grondo, with changes proposed by @SteVwonder. Group effort! Fixes flux-framework#2549.
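The workaround described in this commit message follows a general pattern: a synchronous helper releases its future explicitly instead of relying on the object going out of scope. A rough sketch of that pattern, assuming a hypothetical `FakeFuture` class and `submit_sync` helper (only the `_clear()` name comes from the message above; nothing here is the actual Flux API):

```python
class FakeFuture:
    """Hypothetical stand-in for a future that owns a native handle."""

    live_handles = 0  # counts handles that have not been released yet

    def __init__(self, value):
        self._value = value
        self._handle = object()  # placeholder for a native resource
        FakeFuture.live_handles += 1

    def get(self):
        return self._value

    def _clear(self):
        # Release the handle exactly once; later calls are no-ops.
        if self._handle is not None:
            self._handle = None
            FakeFuture.live_handles -= 1


def submit_sync(value):
    """Synchronous helper: create a future, read it, always release it."""
    future = FakeFuture(value)
    try:
        return future.get()
    finally:
        future._clear()  # explicit cleanup, independent of refcounting


for i in range(1000):
    submit_sync(i)
print(FakeFuture.live_handles)
```

Putting the cleanup in `finally` means the handle is released even when `get()` raises, so the number of live handles stays bounded regardless of whether a reference cycle keeps the Python object alive.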
Per @andre-merzky's comment in #2548 (opening as a new, separate issue):
@dongahn's response: