large job launch with output redirect does not work [well] #1406

Closed
trws opened this issue Mar 31, 2018 · 66 comments

@trws
Member

trws commented Mar 31, 2018

flux submit -N 400 -O host-test.out hostname has the following timing output:

    ID       NTASKS     STARTING      RUNNING     COMPLETE        TOTAL
     1          400       0.556s       4.015m      11.961m      15.976m

After running for 16 minutes, the job left host-test.out empty. Using flux wreck attach shows output from all of the nodes (and did after about 6 minutes), but nothing went into the output file at all.

dmesg output: (skipping tons of sched spew, sched did all of this in half a second)

2018-03-31T19:16:06.333604Z sched.info[0]: hostname: sierra1271, digest: 74FFAD13DE186FD3843342F3A8C5ACCEC60EE704
2018-03-31T19:16:06.333616Z sched.info[0]: broker found, rank: 400
2018-03-31T19:16:06.400859Z sched.debug[0]: job (1) assigned new state: allocated
2018-03-31T19:16:06.401206Z sched.debug[0]: Allocated 400 node(s) for job 1
2018-03-31T19:16:06.402166Z sched.debug[0]: attempting job 1 state change from selected to allocated
2018-03-31T19:16:06.405195Z sched.debug[0]: job (1) assigned new state: runrequest
2018-03-31T19:16:06.405241Z sched.debug[0]: job 1 runrequest
2018-03-31T19:16:06.405369Z sched.debug[0]: attempting job 1 state change from allocated to runrequest
2018-03-31T19:16:06.906813Z sched.debug[0]: attempting job 1 state change from runrequest to starting
2018-03-31T19:16:07.086183Z broker.debug[0]: content purge: 69 entries
2018-03-31T19:19:09.086267Z broker.debug[0]: content purge: 1 entries
2018-03-31T19:20:07.827169Z broker.debug[0]: content flush begin
2018-03-31T19:20:07.829555Z broker.debug[0]: content flush +128 (dirty=303 pending=256)
2018-03-31T19:20:07.851207Z broker.debug[0]: content flush begin
2018-03-31T19:20:07.852156Z broker.debug[0]: content flush +47 (dirty=175 pending=175)
2018-03-31T19:20:07.955077Z kvs.debug[0]: aggregated 14 transactions (28 ops)
2018-03-31T19:20:07.999758Z kvs.debug[0]: aggregated 27 transactions (54 ops)
2018-03-31T19:20:08.044097Z kvs.debug[0]: aggregated 46 transactions (92 ops)
2018-03-31T19:20:08.113061Z aggregator.info[0]: push: lwj.0.0.1.exit_status: count=64 fwd_count=0 total=400
2018-03-31T19:20:08.204213Z aggregator.info[0]: push: lwj.0.0.1.exit_status: count=145 fwd_count=0 total=400
2018-03-31T19:20:08.210242Z aggregator.info[0]: push: lwj.0.0.1.exit_status: count=399 fwd_count=0 total=400
2018-03-31T19:20:08.142086Z kvs.debug[0]: aggregated 116 transactions (232 ops)
2018-03-31T19:20:08.302531Z broker.debug[0]: content flush begin
2018-03-31T19:20:08.303382Z broker.debug[0]: content flush +42 (dirty=170 pending=170)
2018-03-31T19:20:08.292487Z kvs.debug[0]: aggregated 195 transactions (390 ops)
2018-03-31T19:20:09.086168Z broker.debug[0]: content purge: 58 entries
2018-03-31T19:21:07.430061Z sched.debug[0]: attempting job 1 state change from starting to running
2018-03-31T19:21:07.430103Z sched.debug[0]: check callback about to schedule jobs.
2018-03-31T19:21:07.521910Z aggregator.info[0]: push: lwj.0.0.1.exit_status: count=400 fwd_count=0 total=400
2018-03-31T19:21:07.521931Z aggregator.info[0]: sink: lwj.0.0.1.exit_status: count=400 total=400
trws changed the title from "large job launch does not work [well" to "large job launch does not work [well]" Mar 31, 2018
@grondo
Contributor

grondo commented Mar 31, 2018

For this test you might try -o stdio-delay-commit, though 400 lines of output shouldn't take that long anyway.

Does -O output-file work for smaller jobs? Is the real issue here "redirecting output of large jobs doesn't work"?

If live redirect of output to a file doesn't work, we could always work around by saving output from jobs after they've exited.
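
(A minimal sketch of that workaround, using the jobid and filename from this report; flux wreck attach is the same command used above.)

$ flux submit -N 400 hostname              # no -O: skip live redirection
$ flux wreck attach 1 > host-test.out      # once the job has exited, dump its output from the KVS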

@trws
Member Author

trws commented Mar 31, 2018 via email

@trws
Member Author

trws commented Mar 31, 2018

Just realized I had forgotten to post this: it is an 800-node instance. Running the same thing with mpiexec set to rsh to all the target nodes simultaneously from a single source node takes 6 seconds for the same list of nodes.

@grondo
Contributor

grondo commented Mar 31, 2018

Yeah, unfortunately launching with state in the KVS is going to be slower than direct rsh no matter what we do (up to a certain size, of course). I'm worried that a lot of extra KVS commits may have slipped into the launch of high rank-count jobs, since we haven't had a chance to try that in a long time. (Last I checked it was nowhere near this bad, though.)

@grondo
Contributor

grondo commented Mar 31, 2018

Might be useful to also grab the lwj. dmesg lines from all nodes. It is surprising that it took 4m to get to the "running" state (barrier/fence after all wrexecds have finished starting all tasks).
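
(One hedged way to do that, assuming flux exec fans the command out to every broker rank by default and that these lines sit in each rank's local ring buffer -- adjust the rank selection as needed.)

$ flux exec flux dmesg | grep 'lwj\.' > lwj-dmesg.out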

@trws
Member Author

trws commented Mar 31, 2018

I'd be happy to do that if I knew how... What I posted was all that came out of running dmesg; is there a way to grab the rest of it? The instance is still up. Also, I tried once with both delay commit and no-pmi turned on and the time was almost identical. Oddly, the master broker is almost completely idle the whole time too; there's something funny going on.

@grondo
Contributor

grondo commented Mar 31, 2018

Oh, @trws, sorry to be so chatty, but can you try flux-wreckrun --immediate with the same configuration and verify that the rank.N.cores information makes sense for this job? I just tried a test oversubscribing on ipa with 400 brokers, and flux wreckrun -v -N 400 -n400 -w complete ran in <2s:

$ flux wreckrun -v -N 400 -n 400 -w complete hostname
wreckrun: 0.012s: Registered jobid 4
wreckrun: 0.014s: State = reserved
wreckrun: 0.015s: job.submit: Function not implemented
wreckrun: Allocating 400 tasks across 400 available nodes..
wreckrun: tasks per node: node[0-399]: 1
wreckrun: 0.085s: Sending run event
wreckrun: 1.463s: State = starting
wreckrun: 1.559s: State = running
wreckrun: 1.559s: State = complete

Of course, I think brokers on shared nodes use a different topology that might make this test a lot faster but @garlick might have to comment on that.

@trws
Member Author

trws commented Mar 31, 2018

I think I can do that. This version still has the bug where using wreckrun kills sched, but I think I can just reload sched.

@grondo
Contributor

grondo commented Mar 31, 2018

I agree there must be something strange going on. Sorry about this!
(And unfortunately I feel really bad -- surely it is my bug)

@trws
Member Author

trws commented Mar 31, 2018

Ok, something seriously strange is going on, now completely for sure:

splash:hwloc$ flux wreckrun -v -N 400 -n 400 -w complete hostname
wreckrun: 0.013s: Registered jobid 5
wreckrun: 0.014s: State = reserved
wreckrun: Allocating 400 tasks across 884 available nodes..
wreckrun: tasks per node: node[0-399]: 1
wreckrun: 0.150s: Sending run event
wreckrun: 1.523s: State = starting
wreckrun: 3.142s: State = running
wreckrun: 3.142s: State = complete

@trws
Member Author

trws commented Mar 31, 2018

I'm trying the same wreckrun with IO allowed; I'll send the output when (if..?) it finishes.

@grondo
Contributor

grondo commented Mar 31, 2018

I sadly got an assertion error from zuuid_new() when trying to process output from 400 tasks (memory error?). Hopefully it doesn't hit you.

@trws
Member Author

trws commented Mar 31, 2018

Could be a memory error, but sierra nodes have 256 GB each, so that shouldn't be an issue. Will see.

@trws
Member Author

trws commented Mar 31, 2018

Wow... just to be completely sure, I ran the wreckrun version redirecting to files: the whole thing completed in two seconds, and all commands ran to completion. The one set to retrieve output is still running.

@grondo
Contributor

grondo commented Mar 31, 2018

Just to understand

  • flux wreckrun -I -N400 -w complete hostname works
  • flux wreckrun -I -N400 hostname works
  • flux wreckrun -I -N400 -O hostfile.output hostname works?
  • flux submit -N400 -O hostfile.output runs slow

Maybe something sched sets in the rank.N.cores dirs is causing slowness or triggering a bug?
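
(A hedged way to spot-check that, assuming the per-rank info lives under the lwj.0.0.1.* keys seen in the dmesg output above; the exact key names here are illustrative.)

$ flux kvs dir lwj.0.0.1.rank.0           # list what sched/wrexecd recorded for rank 0
$ flux kvs get lwj.0.0.1.rank.0.cores     # should match the expected core count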

@dongahn
Member

dongahn commented Mar 31, 2018

Maybe something sched sets in the rank.N.cores dirs is causing slowness

It would be good to verify this. If this is the issue, the patch I posted could be worth a try.

@trws
Member Author

trws commented Mar 31, 2018

What does -I do?

Otherwise: -w complete works.
The second item and below run slowly enough that I haven't been able to complete any of them in the time we've been talking.

@dongahn
Member

dongahn commented Mar 31, 2018

Seems it is more likely IO? It could be that a few commit sites in the execution service are a scalability bottleneck. Maybe redirecting to files could be a valid workaround for the weekend production runs, though.

@grondo
Contributor

grondo commented Mar 31, 2018

-I runs without invoking the scheduler (I'm assuming you're running off current master)

Seems it is more likely IO?

The same I/O is happening with and without -w complete; the only difference is that something isn't trying to read the I/O in real time from the kvs. The reader in the -O output case could be stuck, which would explain the "hang". If this is the case, then just dump the I/O to a file after the job is complete.

@grondo
Contributor

grondo commented Mar 31, 2018

i.e., I should say it appears that reading I/O has a bug causing the slowness, not the writing of I/O nor the KVS commits.

@trws
Member Author

trws commented Mar 31, 2018

This is a bit behind while I test the current master to make sure nothing regressed before making it production.

Yeah, it pretty much has to be I/O. Or something caused by reading it maybe? This is the output from running without -w:

splash:hwloc$ flux wreckrun -v -N 400 -n 400 hostname
wreckrun: 0.012s: Registered jobid 6
wreckrun: 0.013s: State = reserved
wreckrun: Allocating 400 tasks across 884 available nodes..
wreckrun: tasks per node: node[0-399]: 1
wreckrun: 0.153s: Sending run event
wreckrun: 1149.840s: State = starting
wreckrun: 1154.960s: State = running
wreckrun: 1154.981s: State = complete

Note that it ran for another 100 seconds after printing that it was complete.

@dongahn
Member

dongahn commented Mar 31, 2018

@trws: sched master doesn't have that N.cores fix. If you are still using hwloc with one node containing only one core, that should be okay. Otherwise you may want to try the N.cores fix.

@grondo
Contributor

grondo commented Mar 31, 2018

I wonder if the output reader is being woken up a lot and eating up process time, so that other reactor callbacks are starved. Does flux wreck ls show a slow time as well for this particular run?

@dongahn
Member

dongahn commented Mar 31, 2018

#1400 (comment)

@dongahn
Member

dongahn commented Mar 31, 2018

I won't be able to respond for a while. In transit to LA.

@trws
Member Author

trws commented Mar 31, 2018

It shows 14 minutes for everything except the ones with the -w argument. Oddly that's two minutes faster than the submitted version.

@dongahn
Member

dongahn commented Mar 31, 2018

There is always scheduling cost with the submit version...

@trws
Member Author

trws commented Mar 31, 2018

True, but the surprise is that sched sent the run request in less than half a second.

@grondo
Contributor

grondo commented Mar 31, 2018

I'm baffled as to how use of -w alone would change the timing here. That just disables reading of output/err and opening of the single stdin kz file in the kvs, and only in the flux-wreckrun frontend. It should have had little effect on the running job.

For now it might be best to avoid redirecting output within a job until we figure this out. (There are workarounds since the output is stored in the KVS). My availability is going to be spotty for the next week.

Unfortunately I can't reproduce on 400 brokers oversubscribed on 4 nodes of ipa.

@trws
Member Author

trws commented Apr 1, 2018

Funny thought, but do kvs directory watch callbacks get invoked for subdirectories? Also, did the atomic-append operation ever happen? Either of those should make it possible to have one reader regardless of the number of tasks. (Probably a do-later item, but since we're in there now it came to mind.)

@garlick
Member

garlick commented Apr 1, 2018

Got it thanks!

Then it does sound like the high-value fix is to change kz so that its internal KVS watch callback calls flux_kvs_lookup() on the new data, and then a continuation of the lookup calls the kz_ready_f callback when the data is available. kz_get() would fetch data cached in the kz handle and should return EWOULDBLOCK when opened in non-blocking mode and no more data is cached.

Sound right?

@garlick
Member

garlick commented Apr 1, 2018

@trws, yes on both, though that would require the watcher to have some knowledge of how the ranks are organized in a job directory. Something to ponder.

@grondo
Contributor

grondo commented Apr 1, 2018

Sound right?

Yes, anything that avoids these watchers sleeping in reactor callbacks. I think all the Lua callbacks expect is to be handed one chunk of output at a time.

Eventually we plan to do I/O reduction, in which case there probably will be a single reader for all output, which will help. As we write the new execution system, removal of all N:N patterns should be a goal. (Besides output files, we also have per-task directories, which should go away in the replacement.)

Sorry to leave you with this issue @garlick

@garlick
Member

garlick commented Apr 2, 2018

@trws from #1411:

This certainly helps, but I'm sorry to say that it doesn't seem as though this fixed it completely. Without this patch, a 180-node, single-task-per-node job would take minutes; now a 200-node job completes in less than two seconds. Unfortunately the 400-node job still takes 5 minutes to start, and more than 10 to run with output redirected. Without redirection, about 3 seconds. There's some kind of degenerate case we haven't managed to squash, maybe setting up all of those watches?

@garlick
Member

garlick commented Apr 2, 2018

Sorry, I'm getting confused. Could you define "with output redirected"? Why is that the slow case?

@grondo
Contributor

grondo commented Apr 2, 2018

@garlick, output redirection is slow because it attaches all the kz watchers to the reactor of the first wrexecd in the job.

@trws
Member Author

trws commented Apr 2, 2018

Using '-O' with submit or wreckrun, or more generally adding the output key to the jobdesc, causes either the wreckrun command or the main wreck daemon to register a watch for every output and error stream. Without that option, these commands all run quite quickly and well, despite the same amount of output going into the kvs.

This is output I got from perf record on a wreckrun of 300 processes, each with its own node:

      - 98.73% luaD_precall
         - 87.91% l_iowatcher_add
            - 87.73% kz_set_ready_cb
               - 87.73% flux_kvs_watch_dir
                  - 86.63% kvs_watch_rpc_get
                     - 86.58% flux_rpc_get_unpack
                        - 86.46% flux_future_get
                           - flux_future_wait_for
                              + 55.82% flux_dispatch_requeue

It looks like the majority of the time in wreckrun for this one was actually spent in just setting up the watches in the first place. Do we need the synchronous get in the watch_dir function?

@garlick
Member

garlick commented Apr 2, 2018

Do we need the synchronous get in the watch_dir function?

Hmm, yes, in fact we could have it return a future like everything else. Let me see about that.

So is the "go fast" option then flux wreckrun --wait-until=completed?

@trws
Member Author

trws commented Apr 2, 2018

Yes, or submit without '-O' also works fast.

@grondo
Contributor

grondo commented Apr 2, 2018

So is the "go fast" option then flux wreckrun --wait-until=completed

Yes, or submit without '-O' also works fast.

Anything trying to read the stderr/out from a large job appears to be slow, even flux wreck attach ID for a large job (though in my 4K task tests it completed much faster than @trws's 400 task example).

When testing with flux wreck attach there does appear to be a large startup cost. The program doesn't go back into the reactor loop until it has added watchers for every task stdout/err, so during this time messages for the kz watchers we've already added are probably coming in and getting requeued.

Anything that speeds up this initial registration and gets us into the reactor faster will help, but also keep in mind how much time we want to spend improving this.
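
(For what it's worth, a hedged way to see that startup cost in isolation is to time an attach of an already-completed large job; jobid 6 is just the example from earlier in this thread.)

$ time flux wreck attach 6 > /dev/null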

@grondo
Contributor

grondo commented Apr 2, 2018

In case it wasn't clear before: when using submit with -O, the first wrexecd of the job takes over the output reading (and writing it to a file). So perhaps things are worse there because kz reading starves the wrexecd of time to process other work related to job management.

@garlick
Member

garlick commented Apr 2, 2018

Fixing the watch interface (at least for kz) seems like a high-value, yet fairly contained, effort.

Yeah, at some point it makes sense to do the I/O reduction idea and stop propping up kz. Even if we get rid of all the synchronous RPCs that are blocking the reactor, I shudder at how much processing takes place in the KVS module for each watched key.

Thanks for the clarification guys!

@grondo
Contributor

grondo commented Apr 2, 2018

Yeah, sorry @garlick, I didn't mean to discourage or minimize your excellent efforts here. My point was more about avoiding the use of -O file for jobs if it is painful. I'm a bit worried that @trws's case is so slow even at 400 tasks. In my testing on ipa I was getting output from 4K tasks in ~3m, not 10m for a job 10x smaller!

@trws
Member Author

trws commented Apr 2, 2018

I get the impression that the individual latency from having all of them on different nodes is contributing to it being this bad. Possibly also from having one broker per task.

@grondo
Contributor

grondo commented Apr 2, 2018

I get the impression that the individual latency from having all of them on different nodes is contributing to it being this bad. Possibly also from having one broker per task.

I might not be understanding your statement, but with -O file all kz watchers run on the first wrexecd of the job, just as in wreckrun they are all running in the wreckrun process. I did try 1 task per "broker" on ipa and couldn't reproduce the extreme results in the sierra case:

$ flux getattr size
400
$ time flux wreckrun -N400 hostname > /dev/null

real	0m5.498s
user	0m2.201s
sys	0m2.281s

@trws
Member Author

trws commented Apr 2, 2018

That pretty much means it has to be the nodes, then. I know all the watchers are on one node, but the kvs is being hammered by all of the nodes, so anything blocking will have more messages to requeue before it finds a match. There may be more to it than that, but it seems there's something specific about it being individual nodes.

@grondo
Contributor

grondo commented Apr 2, 2018

@trws true, and I think brokers that share a node wire up differently (forgot that for a minute).

However, what baffles me is that if we don't install the kvs watchers, everything is fine... i.e., writing to the kvs is not a problem, and that is the only thing done from multiple nodes. (I'm probably missing something simple, though.)

@garlick
Member

garlick commented Apr 2, 2018

The re-queueing only occurs in the flux_t handle/endpoint, not in the broker. The amount of re-queueing would be proportional to the number of messages arriving at that endpoint, not passing through its broker. (Forget for a moment that modules are threads of the broker - they only communicate with the broker through messages.)

If that seemed like a non-sequitur maybe I'm not understanding your point @trws.

@trws
Member Author

trws commented Apr 2, 2018

I see what you mean, @garlick, but the more traffic there is and the higher its latency, the more the queues will fill up. I'm expecting that all of the kvs messages, including the watch and subsequent get, go to the kvs endpoint, don't they?

@garlick
Member

garlick commented Apr 3, 2018

Yes, all true. Sorry if I misunderstood your earlier description.

@garlick
Member

garlick commented Apr 3, 2018

Rereading this, I'm still not sure we are communicating. If you're around today, @trws, let's have a chat. I want to make sure I'm working on the highest-priority problem.

@grondo
Contributor

grondo commented Apr 13, 2018

What's the status of this issue? I think things should have greatly improved since @garlick's work in libkz; however, I don't have a good place to test. I could never reproduce this issue on up to 512 brokers oversubscribed over just a few nodes.

@garlick
Member

garlick commented Apr 13, 2018

I think this is solved (or at least adequately worked around) by the combination of deferring the libkz kvs_watch() calls, your KZ_FLAGS_NOFOLLOW flag, and flux wreck attach --no-follow. Let's close, and we can open new issues for any other loose ends.
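
(For reference, the post-mortem dump then looks something like this; the jobid and filename are just the examples from this report.)

$ flux wreck attach --no-follow 1 > host-test.out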

garlick closed this as completed Apr 13, 2018
@trws
Member Author

trws commented Apr 13, 2018 via email

@grondo
Contributor

grondo commented Apr 13, 2018

Thanks for the feedback, @trws! I think we know that 2*ntasks "streams" per job isn't going to work, and we have plans to fix this for scale with a new I/O scheme. Hopefully the workarounds are tolerable for the splash use case, and sorry about the issues!

@trws
Member Author

trws commented Apr 13, 2018 via email
