large job launch with output redirect does not work [well] #1406

Closed
trws opened this issue Mar 31, 2018 · 66 comments

@trws
Member

trws commented Mar 31, 2018

flux submit -N 400 -O host-test.out hostname has the following timing output:

    ID       NTASKS     STARTING      RUNNING     COMPLETE        TOTAL
     1          400       0.556s       4.015m      11.961m      15.976m

After running for 16 minutes, the job left host-test.out empty. Using flux wreck attach shows output from all of the nodes (and did after about 6 minutes), but nothing went into the output file at all.

dmesg output: (skipping tons of sched spew, sched did all of this in half a second)

2018-03-31T19:16:06.333604Z sched.info[0]: hostname: sierra1271, digest: 74FFAD13DE186FD3843342F3A8C5ACCEC60EE704
2018-03-31T19:16:06.333616Z sched.info[0]: broker found, rank: 400
2018-03-31T19:16:06.400859Z sched.debug[0]: job (1) assigned new state: allocated
2018-03-31T19:16:06.401206Z sched.debug[0]: Allocated 400 node(s) for job 1
2018-03-31T19:16:06.402166Z sched.debug[0]: attempting job 1 state change from selected to allocated
2018-03-31T19:16:06.405195Z sched.debug[0]: job (1) assigned new state: runrequest
2018-03-31T19:16:06.405241Z sched.debug[0]: job 1 runrequest
2018-03-31T19:16:06.405369Z sched.debug[0]: attempting job 1 state change from allocated to runrequest
2018-03-31T19:16:06.906813Z sched.debug[0]: attempting job 1 state change from runrequest to starting
2018-03-31T19:16:07.086183Z broker.debug[0]: content purge: 69 entries
2018-03-31T19:19:09.086267Z broker.debug[0]: content purge: 1 entries
2018-03-31T19:20:07.827169Z broker.debug[0]: content flush begin
2018-03-31T19:20:07.829555Z broker.debug[0]: content flush +128 (dirty=303 pending=256)
2018-03-31T19:20:07.851207Z broker.debug[0]: content flush begin
2018-03-31T19:20:07.852156Z broker.debug[0]: content flush +47 (dirty=175 pending=175)
2018-03-31T19:20:07.955077Z kvs.debug[0]: aggregated 14 transactions (28 ops)
2018-03-31T19:20:07.999758Z kvs.debug[0]: aggregated 27 transactions (54 ops)
2018-03-31T19:20:08.044097Z kvs.debug[0]: aggregated 46 transactions (92 ops)
2018-03-31T19:20:08.113061Z aggregator.info[0]: push: lwj.0.0.1.exit_status: count=64 fwd_count=0 total=400
2018-03-31T19:20:08.204213Z aggregator.info[0]: push: lwj.0.0.1.exit_status: count=145 fwd_count=0 total=400
2018-03-31T19:20:08.210242Z aggregator.info[0]: push: lwj.0.0.1.exit_status: count=399 fwd_count=0 total=400
2018-03-31T19:20:08.142086Z kvs.debug[0]: aggregated 116 transactions (232 ops)
2018-03-31T19:20:08.302531Z broker.debug[0]: content flush begin
2018-03-31T19:20:08.303382Z broker.debug[0]: content flush +42 (dirty=170 pending=170)
2018-03-31T19:20:08.292487Z kvs.debug[0]: aggregated 195 transactions (390 ops)
2018-03-31T19:20:09.086168Z broker.debug[0]: content purge: 58 entries
2018-03-31T19:21:07.430061Z sched.debug[0]: attempting job 1 state change from starting to running
2018-03-31T19:21:07.430103Z sched.debug[0]: check callback about to schedule jobs.
2018-03-31T19:21:07.521910Z aggregator.info[0]: push: lwj.0.0.1.exit_status: count=400 fwd_count=0 total=400
2018-03-31T19:21:07.521931Z aggregator.info[0]: sink: lwj.0.0.1.exit_status: count=400 total=400
trws changed the title from "large job launch does not work [well" to "large job launch does not work [well]" Mar 31, 2018
@grondo
Contributor

grondo commented Mar 31, 2018

For this test you might try -o stdio-delay-commit, though 400 lines of output shouldn't take that long anyway.

Does -O output-file work for smaller jobs? Is the real issue here "redirecting output of large jobs doesn't work"?

If live redirect of output to a file doesn't work, we could always work around by saving output from jobs after they've exited.
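
(A minimal sketch of that workaround, using the jobid and filename from this report; flux wreck attach is the same command used above.)

$ flux submit -N 400 hostname              # no -O: skip live redirection
$ flux wreck attach 1 > host-test.out      # once the job has exited, dump its output from the KVS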

@trws
Member Author

trws commented Mar 31, 2018 via email

@trws
Member Author

trws commented Mar 31, 2018

Just realized I had forgotten to post this: it is an 800-node instance. Running the same thing with mpiexec set to rsh to all the target nodes simultaneously from a single source node takes 6 seconds for the same list of nodes.

@grondo
Contributor

grondo commented Mar 31, 2018

Yeah, unfortunately launching with state in the KVS is going to be slower than direct rsh no matter what we do (up to a certain size, of course). I'm worried that a lot of extra KVS commits may have slipped into the launch of high rank-count jobs, since we haven't had a chance to try that in a long time. (Last I checked it was nowhere near this bad, though.)

@grondo
Contributor

grondo commented Mar 31, 2018

Might be useful to also grab the lwj. dmesg lines from all nodes. It is surprising that it took 4m to get to the "running" state (barrier/fence after all wrexecds have finished starting all tasks).
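
(One hedged way to do that, assuming flux exec fans the command out to every broker rank by default and that these lines sit in each rank's local ring buffer -- adjust the rank selection as needed.)

$ flux exec flux dmesg | grep 'lwj\.' > lwj-dmesg.out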

@trws
Member Author

trws commented Mar 31, 2018

I'd be happy to do that if I knew how... What I posted was all that came out of running dmesg; is there a way to grab the rest of it? The instance is still up. Also, I tried once with both delay commit and no-pmi turned on and the time was almost identical. Oddly, the master broker is almost completely idle the whole time too; there's something funny going on.

@grondo
Contributor

grondo commented Mar 31, 2018

Oh, @trws, sorry to be so chatty, but can you try flux-wreckrun --immediate with the same configuration and verify that the rank.N.cores information makes sense for this job? I just tried a test oversubscribing on ipa with 400 brokers, and flux wreckrun -v -N 400 -n400 -w complete ran in <2s:

$ flux wreckrun -v -N 400 -n 400 -w complete hostname
wreckrun: 0.012s: Registered jobid 4
wreckrun: 0.014s: State = reserved
wreckrun: 0.015s: job.submit: Function not implemented
wreckrun: Allocating 400 tasks across 400 available nodes..
wreckrun: tasks per node: node[0-399]: 1
wreckrun: 0.085s: Sending run event
wreckrun: 1.463s: State = starting
wreckrun: 1.559s: State = running
wreckrun: 1.559s: State = complete

Of course, I think brokers on shared nodes use a different topology that might make this test a lot faster but @garlick might have to comment on that.

@trws
Member Author

trws commented Mar 31, 2018

I think I can do that. This version still has the bug where using wreckrun kills sched, but I think I can just reload sched.

@grondo
Contributor

grondo commented Mar 31, 2018

I agree there must be something strange going on. Sorry about this!
(And unfortunately I feel really bad -- surely it is my bug)

@trws
Member Author

trws commented Mar 31, 2018

Ok, something seriously strange is going on, now completely for sure:

splash:hwloc$ flux wreckrun -v -N 400 -n 400 -w complete hostname
wreckrun: 0.013s: Registered jobid 5
wreckrun: 0.014s: State = reserved
wreckrun: Allocating 400 tasks across 884 available nodes..
wreckrun: tasks per node: node[0-399]: 1
wreckrun: 0.150s: Sending run event
wreckrun: 1.523s: State = starting
wreckrun: 3.142s: State = running
wreckrun: 3.142s: State = complete

@trws
Member Author

trws commented Mar 31, 2018

I'm trying the same wreckrun with IO allowed; I'll send the output when (if..?) it finishes.

@grondo
Contributor

grondo commented Mar 31, 2018

I sadly got an assertion error from zuuid_new() when trying to process output from 400 tasks (memory error?). Hopefully it doesn't hit you.

@trws
Member Author

trws commented Mar 31, 2018

Could be a memory error, but sierra nodes have 256 GB each, so that shouldn't be an issue. Will see.

@trws
Member Author

trws commented Mar 31, 2018

Wow... just to be completely sure, I ran the wreckrun version redirecting to files: the whole thing completed in two seconds, and all commands ran to completion. The one set to retrieve output is still running.

@grondo
Contributor

grondo commented Mar 31, 2018

Just to understand

  • flux wreckrun -I -N400 -w complete hostname works
  • flux wreckrun -I -N400 hostname works
  • flux wreckrun -I -N400 -O hostfile.output hostname works?
  • flux submit -N400 -O hostfile.output runs slow

Maybe something sched sets in the rank.N.cores dirs is causing slowness or triggering a bug?
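
(A hedged way to spot-check that, assuming the per-rank info lives under the lwj.0.0.1.* keys seen in the dmesg output above; the exact key names here are illustrative.)

$ flux kvs dir lwj.0.0.1.rank.0           # list what sched/wrexecd recorded for rank 0
$ flux kvs get lwj.0.0.1.rank.0.cores     # should match the expected core count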

@dongahn
Member

dongahn commented Mar 31, 2018

Maybe something sched sets in the rank.N.cores dirs is causing slowness

It would be good to verify this. If this is the issue, the patch I posted could be worth a try.

@trws
Member Author

trws commented Mar 31, 2018

What does -I do?

Otherwise: -w complete works.
The second item and below run slowly enough that I haven't been able to complete any of them in the time we've been talking.

@dongahn
Member

dongahn commented Mar 31, 2018

Seems it is more likely IO? It could be that a few commit sites in the execution service are a scalability bottleneck. Maybe redirecting to files could be a valid workaround for the weekend production runs, though.

@grondo
Contributor

grondo commented Mar 31, 2018

-I runs without invoking the scheduler (I'm assuming you're running off current master)

Seems it is more likely IO?

The same I/O is happening with and without -w complete; the only difference is that something isn't trying to read the I/O in real time from the kvs. The reader in the -O output case could be stuck, which would explain the "hang". If this is the case, then just dump the I/O to a file after the job is complete.

@grondo
Contributor

grondo commented Mar 31, 2018

i.e., I should say it appears that reading I/O has a bug causing the slowness, not the writing of I/O nor the KVS commits.

@trws
Member Author

trws commented Mar 31, 2018

This is a bit behind while I test the current master to make sure nothing regressed before making it production.

Yeah, it pretty much has to be I/O. Or something caused by reading it maybe? This is the output from running without -w:

splash:hwloc$ flux wreckrun -v -N 400 -n 400 hostname
wreckrun: 0.012s: Registered jobid 6
wreckrun: 0.013s: State = reserved
wreckrun: Allocating 400 tasks across 884 available nodes..
wreckrun: tasks per node: node[0-399]: 1
wreckrun: 0.153s: Sending run event
wreckrun: 1149.840s: State = starting
wreckrun: 1154.960s: State = running
wreckrun: 1154.981s: State = complete

Note that it ran for another 100 seconds after printing that it was complete.

@dongahn
Member

dongahn commented Mar 31, 2018

@trws: sched master doesn't have that N.cores fix. If you are still using hwloc with one node containing only one core, that should be okay. Otherwise you may want to try the N.cores fix.

@grondo
Contributor

grondo commented Mar 31, 2018

I wonder if the output reader is being woken up a lot and eating up process time, so that other reactor callbacks are starved. Does flux wreck ls show a slow time as well for this particular run?

@dongahn
Member

dongahn commented Mar 31, 2018

#1400 (comment)

@dongahn
Member

dongahn commented Mar 31, 2018

I won't be able to respond for a while. In transit to LA.

@trws
Member Author

trws commented Mar 31, 2018

It shows 14 minutes for everything except the ones with the -w argument. Oddly that's two minutes faster than the submitted version.

@dongahn
Member

dongahn commented Mar 31, 2018

There is always scheduling cost with the submit version...

@trws
Member Author

trws commented Mar 31, 2018

True, but the surprise is that sched sent the run request in less than half a second.

@grondo
Contributor

grondo commented Mar 31, 2018

I'm baffled as to how use of -w alone would change the timing here. That just disables reading of output/err and opening of the single stdin kz file in the kvs, and only in the flux-wreckrun frontend. It should have had little effect on the running job.

For now it might be best to avoid redirecting output within a job until we figure this out. (There are workarounds since the output is stored in the KVS). My availability is going to be spotty for the next week.

Unfortunately I can't reproduce on 400 brokers oversubscribed on 4 nodes of ipa.

@trws
Member Author

trws commented Apr 1, 2018

Funny thought, but do kvs directory watch callbacks get invoked for subdirectories? Also, did the atomic-append operation ever happen? Either of those should make it possible to have one reader regardless of the number of tasks. (Probably a do-later item, but since we're in there now it came to mind.)

@garlick
Member

garlick commented Apr 1, 2018

Got it thanks!

Then it does sound like the high-value fix is to change kz so that its internal KVS watch callback calls flux_kvs_lookup() on the new data, and then a continuation of the lookup calls the kz_ready_f callback when the data is available. kz_get() would fetch data cached in the kz handle and should return EWOULDBLOCK when opened in non-blocking mode and no more data is cached.

Sound right?

@garlick
Member

garlick commented Apr 1, 2018

@trws, yes on both, though that would require the watcher to have some knowledge of how the ranks are organized in a job directory. Something to ponder.

@grondo
Contributor

grondo commented Apr 1, 2018

Sound right?

Yes, anything that avoids these watchers sleeping in reactor callbacks. I think all the Lua callbacks expect is to be handed one chunk of output at a time.

Eventually we plan to do I/O reduction, in which case there probably will be a single reader for all output, which will help. As we write the new execution system, removal of all N:N patterns should be a goal. (Besides output files, we also have per-task directories, which should go away in the replacement.)

Sorry to leave you with this issue @garlick

@garlick
Member

garlick commented Apr 2, 2018

@trws from #1411:

This certainly helps, but I'm sorry to say that it doesn't seem as though this fixed it completely. Without this patch, a 180-node, single-task-per-node job would take minutes; now a 200-node job completes in less than two seconds. Unfortunately the 400-node job still takes 5 minutes to start, and more than 10 to run with output redirected. Without redirection, about 3 seconds. There's some kind of degenerate case we haven't managed to squash, maybe setting up all of those watches?

@garlick
Member

garlick commented Apr 2, 2018

Sorry, I'm getting confused. Could you define "with output redirected"? Why is that the slow case?

@grondo
Contributor

grondo commented Apr 2, 2018

@garlick, output redirection is slow because it attaches all the kz watchers to the reactor of the first wrexecd in the job.

@trws
Member Author

trws commented Apr 2, 2018

Using '-O' with submit or wreckrun, or more generally adding the output key to the jobdesc, causes either the wreckrun command or the main wreck daemon to register a watch for every output and error stream. Without that option, these commands all run quite quickly and well, despite the same amount of output going into the kvs.

This is output I got from perf record on a wreckrun of 300 processes, each with its own node:

      - 98.73% luaD_precall
         - 87.91% l_iowatcher_add
            - 87.73% kz_set_ready_cb
               - 87.73% flux_kvs_watch_dir
                  - 86.63% kvs_watch_rpc_get
                     - 86.58% flux_rpc_get_unpack
                        - 86.46% flux_future_get
                           - flux_future_wait_for
                              + 55.82% flux_dispatch_requeue

It looks like the majority of the time in wreckrun for this one was actually spent in just setting up the watches in the first place. Do we need the synchronous get in the watch_dir function?

@garlick
Member

garlick commented Apr 2, 2018

Do we need the synchronous get in the watch_dir function?

Hmm, yes, in fact we could have it return a future like everything else. Let me see about that.

So is the "go fast" option then flux wreckrun --wait-until=completed?

@trws
Member Author

trws commented Apr 2, 2018

Yes, or submit without '-O' also works fast.

@grondo
Contributor

grondo commented Apr 2, 2018

So is the "go fast" option then flux wreckrun --wait-until=completed

Yes, or submit without '-O' also works fast.

Anything trying to read the stderr/out from a large job appears to be slow, even flux wreck attach ID for a large job (though in my 4K task tests it completed much faster than @trws's 400 task example).

When testing with flux wreck attach there does appear to be a large startup cost. The program doesn't go back into the reactor loop until it has added watchers for every task stdout/err, so during this time messages for the kz watchers we've already added are probably coming in and getting requeued.

Anything that speeds up this initial registration and gets us into the reactor faster will help, but also keep in mind how much time we want to spend improving this.
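
(For what it's worth, a hedged way to see that startup cost in isolation is to time an attach of an already-completed large job; jobid 6 is just the example from earlier in this thread.)

$ time flux wreck attach 6 > /dev/null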

@grondo
Contributor

grondo commented Apr 2, 2018

In case it wasn't clear before: when using submit with -O, the first wrexecd of the job takes over the output reading (and writing it to a file). So perhaps things are worse there because kz reading starves the wrexecd of time to process other work related to job management.

@garlick
Member

garlick commented Apr 2, 2018

Fixing the watch interface (at least for kz) seems like a high-value, yet fairly contained, effort.

Yeah, at some point it makes sense to do the I/O reduction idea and stop propping up kz. Even if we get rid of all the synchronous RPCs that are blocking the reactor, I shudder at how much processing takes place in the KVS module for each watched key.

Thanks for the clarification guys!

@grondo
Contributor

grondo commented Apr 2, 2018

Yeah, sorry @garlick, I didn't mean to discourage or minimize your excellent efforts here. My point was more about avoiding the use of -O file for jobs if it is painful. I'm a bit worried that @trws's case is so slow even at 400 tasks. In my testing on ipa I was getting output from 4K tasks in ~3m, not 10m for a job 10x smaller!

@trws
Member Author

trws commented Apr 2, 2018

I get the impression that the individual latency from having all of them on different nodes is contributing to it being this bad. Possibly also from having one broker per task.

@grondo
Contributor

grondo commented Apr 2, 2018

I get the impression that the individual latency from having all of them on different nodes is contributing to it being this bad. Possibly also from having one broker per task.

I might not be understanding your statement, but with -O file all kz watchers run on the first wrexecd of the job, just as in wreckrun they are all running in the wreckrun process. I did try 1 task per "broker" on ipa and couldn't reproduce the extreme results in the sierra case:

$ flux getattr size
400
$ time flux wreckrun -N400 hostname > /dev/null

real	0m5.498s
user	0m2.201s
sys	0m2.281s

@trws
Member Author

trws commented Apr 2, 2018

That pretty much means it has to be the nodes, then. I know all the watchers are on one node, but the kvs is being hammered by all of the nodes, so anything blocking will have more messages to requeue before it finds a match. There may be more to it than that, but it seems there's something specific about it being individual nodes.

@grondo
Contributor

grondo commented Apr 2, 2018

@trws true, and I think brokers that share a node wire up differently (forgot that for a minute).

However, what baffles me is that if we don't install the kvs watchers, everything is fine... i.e., writing to the kvs is not a problem, and that is the only thing done from multiple nodes. (I'm probably missing something simple, though.)

@garlick
Member

garlick commented Apr 2, 2018

The re-queueing only occurs in the flux_t handle/endpoint, not in the broker. The amount of re-queueing would be proportional to the number of messages arriving at that endpoint, not passing through its broker. (Forget for a moment that modules are threads of the broker - they only communicate with the broker through messages.)

If that seemed like a non-sequitur maybe I'm not understanding your point @trws.

@trws
Member Author

trws commented Apr 2, 2018

I see what you mean, @garlick, but the more traffic there is and the higher its latency, the more the queues will fill up. I'm expecting that all of the kvs messages, including the watch and subsequent get, go to the kvs endpoint, don't they?

@garlick
Member

garlick commented Apr 3, 2018

Yes, all true. Sorry if I misunderstood your earlier description.

@garlick
Member

garlick commented Apr 3, 2018

Rereading this, I'm still not sure we are communicating. If you're around today, @trws, let's have a chat. I want to make sure I'm working on the highest-priority problem.

@grondo
Contributor

grondo commented Apr 13, 2018

What's the status of this issue? I think things should have greatly improved since @garlick's work in libkz; however, I don't have a good place to test. I could never reproduce this issue on up to 512 brokers oversubscribed over just a few nodes.

@garlick
Member

garlick commented Apr 13, 2018

I think this is solved (or at least adequately worked around) by the combination of deferring the libkz kvs_watch() calls, your KZ_FLAGS_NOFOLLOW flag, and flux wreck attach --no-follow. Let's close, and we can open new issues for any other loose ends.
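
(For reference, the post-mortem dump then looks something like this; the jobid and filename are just the examples from this report.)

$ flux wreck attach --no-follow 1 > host-test.out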

garlick closed this as completed Apr 13, 2018
@trws
Member Author

trws commented Apr 13, 2018 via email

@grondo
Contributor

grondo commented Apr 13, 2018

Thanks for the feedback, @trws! I think we know that 2*ntasks "streams" per job isn't going to work, and we have plans to fix this for scale with a new I/O scheme. Hopefully the workarounds are tolerable for the splash use case, and sorry about the issues!

@trws
Member Author

trws commented Apr 13, 2018 via email
