
Hang when submitting many jobs via Python #2549

Closed
SteVwonder opened this issue Nov 21, 2019 · 8 comments

Comments

@SteVwonder
Member

Per @andre-merzky's comment in #2548 (opening as a new, separate issue):

The example seems to hang if I try to submit >500 tasks in one go - but submitting 256 tasks twice works. Any idea what's up there? What is the recommended way to obtain a debug trace? Sorry if I missed that in the docs. This test was running on a local linux box, FWIW, and used PMI-1.

@dongahn's response:

Sorry about that. The new execution system and graph scheduler are brand new compared to our old execution system and scheduler, so this could be a performance bug somewhere. What are your tasks? If you describe how to reproduce this with standalone Flux commands, we should be able to get to the bottom of it. Performance optimization and hardening are our main focus once we complete our system-instance work. Which flux-core and flux-sched versions did you use? Current master? We recently added some stdout/stderr performance bug fixes, so you definitely want to use that. My guess is that these are MPI programs that use Flux's PMI-1?

@SteVwonder
Member Author

The example seems to hang...

What are your tasks?

Is this hang happening with the script you included in #2548? https://gist.github.com/andre-merzky/a6f1eb33dc5c55c51438e041a0349ae7

If so, can you include the spec.json that you are submitting?

@andre-merzky
Contributor

@dongahn : no worries, happy to do some testing...
@SteVwonder : Yes, that's the one. If I change njobs in line 97 to, say, 512, the submission eventually stalls. The spec.json is derived from one of the examples or tests I picked up in flux-core's source tree; I can't remember where exactly. It runs a single-core /bin/date, so no MPI.
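For reference, a jobspec of that shape (single-core /bin/date) can be generated roughly like this; this is a sketch following the V1 jobspec layout, not necessarily the exact spec.json from the gist:

```python
import json

# Illustrative V1 jobspec for a single-core /bin/date task; not necessarily
# identical to the spec.json referenced above.
jobspec = {
    "version": 1,
    "resources": [
        {
            "type": "slot",
            "count": 1,
            "label": "task",
            "with": [{"type": "core", "count": 1}],
        }
    ],
    "tasks": [
        {
            "command": ["/bin/date"],
            "slot": "task",
            "count": {"per_slot": 1},
        }
    ],
    "attributes": {"system": {"duration": 0}},
}

with open("spec.json", "w") as f:
    json.dump(jobspec, f, indent=2)
```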

When I interrupt the hanging script, I see this backtrace:

Traceback (most recent call last):
  File "./rp_flux.py", line 113, in <module>
    use_flux()
  File "./rp_flux.py", line 89, in use_flux
    jobid = job.submit(h, jobspec)
  File "/home/merzky/projects/flux/flux-core/../install/lib/flux/python2.7/flux/job.py", line 54, in submit
    return submit_get_id(future)
  File "/home/merzky/projects/flux/flux-core/../install/lib/flux/python2.7/flux/util.py", line 27, in func_wrapper
    return func(calling_obj, *args, **kwargs)
  File "/home/merzky/projects/flux/flux-core/../install/lib/flux/python2.7/flux/job.py", line 46, in submit_get_id
    future.wait_for()  # ensure the future is fulfilled
  File "/home/merzky/projects/flux/flux-core/../install/lib/flux/python2.7/flux/future.py", line 130, in wait_for
    self.pimpl.wait_for(timeout)
  File "/home/merzky/projects/flux/flux-core/../install/lib/flux/python2.7/flux/util.py", line 27, in func_wrapper
    return func(calling_obj, *args, **kwargs)
  File "/home/merzky/projects/flux/flux-core/../install/lib/flux/python2.7/flux/wrapper.py", line 193, in __call__
    result = self.fun(*args)
KeyboardInterrupt

I use flux-core master @ 2e299e2 and flux-sched master @ b54ff48. I did not try standalone commands (flux submit ...), only submission via the script above.

Let me know if you need more details or want me to run something!

Thanks, Andre.

@andre-merzky
Contributor

andre-merzky commented Nov 21, 2019

PS: the script uses some helpers from our radical.utils package; either install it from PyPI, or remove all calls to the reporter and profiler, and also drop the comment line in the JSON when switching to Python's default json reader. The script behavior remains unchanged either way. Sorry for the trouble...

@grondo
Contributor

grondo commented Nov 22, 2019

I can reproduce this with a modified version of your script. In my testing, I added a print for each submitted job with timing information to get a sense of what was going on. Interestingly, the test always gets to 499 jobs before appearing to stop. However, this is not a hang; job submission just slows way down after that point:

[snip]
16.411s: 491: 281773342720
16.444s: 492: 282494763008
16.470s: 493: 283031633920
16.500s: 494: 283568504832
16.543s: 495: 283971158016
16.576s: 496: 284810018816
16.626s: 497: 285229449216
16.660s: 498: 286068310016
16.690s: 499: 286655512576
76.493s: 500: 287158829056
136.296s: 501: 1297398890496
196.053s: 502: 2299887878144

There is almost exactly 60s between job submissions after job 499. Suspicious?
Other than that I have no idea what is going on, except to say that we are certainly able to submit more than 500 jobs from a single process without this kind of delay, e.g. using t/ingest/submitbench. So it is either something specific to Python, or perhaps to this mode of submitting jobs in a loop and calling flux_future_wait_for() on each RPC...

😕
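For reference, the modified loop behind the timing above was essentially of this form; a sketch rather than the exact test script, assuming the flux Python bindings and a spec.json like the one discussed earlier:

```python
import time

import flux
from flux import job

h = flux.Flux()                      # connect to the enclosing Flux instance
with open("spec.json") as fp:
    jobspec = fp.read()              # single-core /bin/date jobspec

start = time.time()
for i in range(512):
    jobid = job.submit(h, jobspec)   # synchronous submit, as in the script
    print("%.3fs: %d: %d" % (time.time() - start, i, jobid))
```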

@grondo
Contributor

grondo commented Nov 22, 2019

I found it was pretty simple to modify flux job submit to submit jobs in a loop if requested, and made that change to simulate what the Python script is doing. It had no problem submitting 512, 1024, or 2048 jobs.

However, on a hunch I removed the flux_future_destroy (f) call in the submit loop (i.e. leaked every future returned by flux_job_submit()), and after that, flux job submit -r 512 spec.json hangs after... 499 jobs.

This leads me to believe there may be two issues here:

  1. The Python code is not calling the destructor for the flux_future_t created internally in job.submit() when it goes out of scope (sorry if this should be obvious, I don't know Python very well).

  2. Leaving ~500 futures active causes some kind of problem in the "now" reactor used internally by flux_future_wait_for(). Maybe a matchtag issue? Though I don't know why the futures would eventually be fulfilled after ~60s, so maybe I'm way off the mark there...

@andre-merzky
Contributor

andre-merzky commented Nov 22, 2019

Hey @grondo, you are right: adding an explicit future destroy solves this :-)

I did not attempt to track down the behavior of the reactor, that's over my head right now...

@SteVwonder
Member Author

FWIW, I believe this fixes the underlying circular reference:

diff --git a/src/bindings/python/flux/wrapper.py b/src/bindings/python/flux/wrapper.py
index 4a58e7f12..7a328f075 100644
--- a/src/bindings/python/flux/wrapper.py
+++ b/src/bindings/python/flux/wrapper.py
@@ -18,6 +18,7 @@ import os
 import errno
 import inspect
 import six
+import weakref


 class MissingFunctionError(Exception):
@@ -318,15 +319,15 @@ class Wrapper(WrapperBase):
             setattr(self, name, fun)
             return fun

-        new_fun = self.check_wrap(fun, name)
-        new_method = six.create_bound_method(new_fun, self)
+        new_fun = self.check_wrap(fun, name)
+        new_method = six.create_bound_method(new_fun, weakref.proxy(self))

         # Store the wrapper function into the instance
         # to prevent a second lookup
         setattr(self, name, new_method)
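To illustrate why this matters, here is a generic sketch (not Flux code): caching a bound method on the instance creates a reference cycle, so the object is only reclaimed when the cyclic garbage collector runs; binding the method to a weakref.proxy, as in the patch above, avoids the cycle and lets plain reference counting free it as soon as it goes out of scope.

```python
import gc
import weakref
from types import MethodType

class Strong:
    def __init__(self):
        # The bound method holds a strong reference back to self,
        # creating a cycle: self -> self.cached -> self.
        self.cached = MethodType(Strong.ping, self)
    def ping(self):
        return "pong"
    def __del__(self):
        print("Strong destroyed")

class Weak:
    def __init__(self):
        # Bind to a weakref.proxy instead (as the patch does): no cycle,
        # so reference counting frees the instance promptly.
        self.cached = MethodType(Weak.ping, weakref.proxy(self))
    def ping(self):
        return "pong"
    def __del__(self):
        print("Weak destroyed")

s = Strong()
del s            # nothing printed: the cycle keeps the instance alive
gc.collect()     # "Strong destroyed" appears only when the cyclic GC runs

w = Weak()
del w            # "Weak destroyed" is printed immediately
```

That appears to match issue (1) above: the Python future objects went out of scope, but their destructors, and hence the underlying flux_future_destroy(), did not run promptly.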

@garlick
Member

garlick commented Nov 24, 2019

I can confirm that this does fix the problem!

Edit: Except now t/python/t0012-futures.py fails with

TAP version 13
ok 1 __main__.TestHandle.test_01_rpc_get
PASS: python/t0012-futures.py 1 __main__.TestHandle.test_01_rpc_get
2019-11-24T22:19:16.900612Z broker.err[0]: Run level 2 Segmentation fault (rc=139) 0.5s
flux-start: 0 (pid 673) exited with rc=139
ok 2 __main__.TestHandle.test_02_future_wait_for
PASS: python/t0012-futures.py 2 __main__.TestHandle.test_02_future_wait_for
ERROR: python/t0012-futures.py - missing test plan

garlick added a commit to garlick/flux-core that referenced this issue Dec 2, 2019
Problem: job.submit() and job.wait() both seem to
leak futures.

The futures in these "synchronous" methods go out
of scope and thus should be automatically destroyed,
but due to a circular reference alluded to by
@SteVwonder in flux-framework#2549, they persist.

As a workaround, explicitly call _clear() on the
futures in these methods.

Credit goes to @andre-merzky for proposing the first
version of this patch in flux-framework#2553, based on a suggestion
by @grondo, with changes proposed by @SteVwonder.
Group effort!

Fixes flux-framework#2549.
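The shape of that workaround is roughly the following; an illustrative sketch only, where submit_get_id() and wait_for() come from the traceback earlier in the thread, _clear() is the helper named in the commit message, and the id extraction is elided:

```python
def submit_get_id(future):
    future.wait_for()   # ensure the future is fulfilled
    jobid = ...         # extract the job id from the fulfilled future (elided)
    future._clear()     # workaround: release the underlying flux_future_t now,
                        # rather than waiting on the cycle-delayed destructor
    return jobid
```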
grondo closed this as completed in 011e9ea on Dec 2, 2019