large job launch with output redirect does not work [well] #1406
If live redirect of output to a file doesn't work, we could always work around by saving output from jobs after they've exited. |
It does work for small jobs. I can try delay commit, but the issues are
that it didn’t get output and also that it took 16 minutes.
Especially given that all of the brokers are already up, and that doing
it with tools that don’t have pre-existing connections takes 6 seconds,
we should really figure out why it takes so long.
…On 31 Mar 2018, at 12:55, Mark Grondona wrote:
For this test you might try `-o stdio-delay-commit`, though 400 lines
of output shouldn't take that long anyway.
Does `-O output-file` work for smaller jobs? Is the real issue here
"redirecting output of large jobs doesn't work"?
|
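For concreteness, the two variants discussed in the quoted reply would look roughly like the sketch below. This is only an illustration: the node count and file name are taken from the report at the end of this thread, and it is an assumption here that `-o` job options are accepted by `flux submit` as well as `flux wreckrun`.

```
# Reported slow case: redirect the output of a 400-node job to a file.
flux submit -N 400 -O host-test.out hostname

# Suggested experiment: the same launch with per-line stdio commits delayed.
# (Assumes -o job options are accepted by submit as well as wreckrun.)
flux submit -N 400 -O host-test.out -o stdio-delay-commit hostname
```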
Just realized I had forgotten to post this: it is an 800-node instance. Running the same thing using mpiexec set to rsh to all the target nodes simultaneously from a single source node takes 6 seconds for the same list of nodes. |
Yeah, unfortunately launching with state in the kvs is going to be slower than rsh direct no matter what we do (up to a certain size of course). I'm worried that a lot of extra kvs commits may have slipped into launch of high rank-count jobs, since we haven't had a chance to try that in a long time. (last I checked it was nowhere near this bad though) |
Might be useful to also grab the |
I'd be happy to do that if I knew how... What I posted was all that came out of running dmesg; is there a way to grab the rest of it? The instance is still up. Also, I tried once with both delay commit and no-pmi turned on, and the time was almost identical. Oddly, the master broker is almost completely idle the whole time too; there's something funny going on. |
Oh, @trws, sorry to be so chatty, but can you try
Of course, I think brokers on shared nodes use a different topology that might make this test a lot faster but @garlick might have to comment on that. |
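The exact command suggested here was lost in extraction, so the sketch below is only a generic illustration of the oversubscribed setup mentioned later in the thread (many brokers packed onto a few nodes); `flux start --size` is assumed to be available for bringing up such a test instance.

```
# Illustration only (not the elided suggestion above): start a throwaway
# instance with 400 brokers oversubscribed on the local node, then run the
# reproduction from the shell it drops you into.
flux start --size=400
```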
I think I can do that; this version still has the "using wreckrun kills sched" bug, but I think I can just reload sched. |
I agree there must be something strange going on. Sorry about this! |
Ok, something seriously strange is going on, now completely for sure:
|
I'm trying the same wreckrun with IO allowed, will send the output when (if..?) it finishes. |
I sadly got an assertion error from |
Could be a memory error, but Sierra nodes have 256 GB each, so that shouldn't be an issue. Will see. |
Wow... just to be completely sure, I ran the wreckrun version redirecting to files: the whole thing completed in two seconds, and all commands ran to completion. The one set to retrieve output is still running. |
Just to understand
Maybe something sched sets in the |
Would be good to verify this. If this is the issue, the patch I posted could be worth a try. |
What does `-I` do? Otherwise, `-w complete` works. |
Seems it is more likely I/O? It could be that a few commit sites in the execution service are a scalability bottleneck. Maybe redirecting to files could be a valid workaround for the weekend production runs, though. |
The same I/O is happening with and without |
i.e., I should say it appears that reading I/O has a bug causing the slowness, not writing I/O or the KVS commits. |
This is a bit behind while I test the current master to make sure nothing regressed before making it production. Yeah, it pretty much has to be I/O. Or something caused by reading it maybe? This is the output from running without -w:
Note that it ran for another 100 seconds after printing that it was complete. |
@trws: sched master doesn't have that N.cores fix. If you are still using hwloc with 1 node containing only one core, that should be okay. Otherwise you may want to try the N.cores fix. |
I wonder if the output reader is being woken up a lot and eating up process time so other reactor callbacks are starved. Does |
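One rough way to check for that kind of wakeup churn, not taken from this thread and using only generic tools: the process name below ("wrexecd") is an assumption based on the discussion above, and would need to be adjusted to whatever actually hosts the output reader.

```
# Hypothetical check: is a single reader process burning CPU on wakeups?
pid=$(pgrep -f wrexecd | head -n1)
top -b -H -n1 -p "$pid"   # one batch-mode snapshot of per-thread CPU usage
strace -c -f -p "$pid"    # interrupt with Ctrl-C to print syscall counts
```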
I won't be able to respond for a while. In transit to LA. |
It shows 14 minutes for everything except the ones with the |
There is always a scheduling cost with the submit version... |
True, but the surprise is because sched sent the run request in less than half a second. |
I'm baffled how use of For now it might be best to avoid redirecting output within a job until we figure this out. (There are workarounds since the output is stored in the KVS). My availability is going to be spotty for the next week. Unfortunately I can't reproduce on 400 brokers oversubscribed on 4 nodes of ipa. |
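Since the output ends up in the KVS either way, the workaround amounts to skipping live redirection and collecting the output after the job has exited. A minimal sketch, with the caveats that the job id shown is a placeholder and that `flux wreck attach` accepting a job id is assumed from its use elsewhere in this thread:

```
# Workaround sketch: submit without -O, then collect output once the job is done.
flux submit -N 400 hostname
flux wreck attach 42 > host-test.out   # "42" is a placeholder job id
```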
Funny thought, but do kvs directory watch callbacks get invoked for subdirectories? Also, did the atomic append operation ever happen? Either of those should make it possible to have one reader regardless of the number of tasks. (probably a do-later item, but since we're in there now it came to mind) |
Got it thanks! Then it does sound like the high value fix is to change kz so that its internal KVS watch callback calls Sound right? |
@trws, yes on both, though that would require the watcher to have some knowledge of how the ranks are organized in a job directory. Something to ponder. |
Yes, anything that avoids these watchers causing sleeps in reactor callbacks. I think all the Lua callbacks expect is one chunk of output to be handed to them. Eventually we plan to do I/O reduction, in which case there will probably be a single reader for all output, which will help. As we write the new execution system, removal of all N:N patterns should be a goal. (Besides output files, we also have per-task directories, which should go away in the replacement.) Sorry to leave you with this issue, @garlick |
|
Sorry, I'm getting confused. Could you define "with output redirected"? Why is that the slow case? |
@garlick, output redirected is slow because it attaches all kz watchers to the reactor of the first wrexecd in the job. |
When using '-O' on submit or wreckrun, or generically adding the output key to the jobdesc, either the wreckrun command or the main wreck daemon registers a watch for every output and error stream. Without that option, these commands all run quite quickly and well, despite the same amount of output going into the kvs. This is output I got from perf record on a wreckrun of 300 processes, each on its own node:
It looks like the majority of the time in wreckrun for this one was actually spent in just setting up the watches in the first place. Do we need the synchronous get in the watch_dir function? |
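For reference, a profile like the one referred to above can be gathered with stock perf tooling; the pgrep pattern below is a guess at how the wreckrun process shows up and is not taken from this thread.

```
# Attach perf to the running wreckrun and record call graphs for 30 seconds.
pid=$(pgrep -f wreckrun | head -n1)   # process-name pattern is an assumption
perf record -g -p "$pid" -- sleep 30
perf report --stdio | head -n 50
```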
Hmm, yes, in fact we could have it return a future like everything else. Let me see about that. So is the "go fast" option then |
Yes, or submit without '-O' also works fast. |
Anything trying to read the stderr/out from a large job appears to be slow. Even When testing with Anything that speeds up this initial registration and gets us into the reactor faster will help, but also keep in mind how much time we want to spend improving this |
In case it wasn't clear from before, when using submit with |
Fixing the watch interface (at least for kz) seems like a high-value, yet fairly contained, effort. Yeah, at some point it makes sense to do the I/O reduction idea and stop propping up kz. Even if we get rid of all the synchronous RPCs that are blocking the reactor, I shudder at how much processing takes place in the KVS module for each watched key. Thanks for the clarification, guys! |
Yeah, sorry @garlick I didn't mean to discourage or minimize your excellent efforts here. My point was more about avoiding the use of |
I get the impression that the individual latency from having all of them on different nodes is contributing to it being this bad. Possibly also from having one broker per task. |
I might not be understanding your statement, but with
|
That pretty much means it has to be nodes then. I know all the watchers are on one node, but the kvs is being hammered by all of the nodes, so anything blocking will have more messages to requeue before it finds a match. There may be more to it than that, but there's something specific about it being individual nodes it seems. |
@trws true, and I think brokers that share a node wire up differently (forgot that for a minute). However, what baffles me is that if we don't install the kvs watchers, everything is fine ... i.e., writing to the kvs is not a problem, and that is the only thing done from multiple nodes. (probably missing something simple though) |
The re-queueing only occurs in the flux_t handle/endpoint, not in the broker. The amount of re-queueing would be proportional to the number of messages arriving at that endpoint, not passing through its broker. (Forget for a moment that modules are threads of the broker - they only communicate with the broker through messages.) If that seemed like a non sequitur, maybe I'm not understanding your point, @trws. |
I see what you mean @garlick, but the more traffic there is and the higher its latency, the more the queues will fill up. All of the kvs messages, including the watch and subsequent get, go to the kvs endpoint, don't they? |
Yes, all true. Sorry if I misunderstood your earlier description. |
Rereading this I still am not sure we are communicating. If you're around today @trws let's have a chat. I want to make sure I'm working on the highest priority problem. |
What's the status of this issue? I think things should have greatly improved since @garlick's work in libkz, however I don't have a good place to test. I could never reproduce this issue on up to 512 brokers oversubscribed over just a few nodes. |
I think this is solved (or at least adequately worked around) by the combination of deferring the libkz |
It’s a little better, but only a little. The fundamental problem with
this one is that getting all the watchers set up causes some really nasty
feedback with job launch that can make the two together take a very long
time, and sometimes lose output among other things. It only seems to
happen at truly large scales though, so it’s a bit hard to test.
…On 13 Apr 2018, at 9:57, Mark Grondona wrote:
What's the status of this issue? I think things should have greatly
improved since @garlick's work in libkz, however I don't have a good
place to test. I could never reproduce this issue on up to 512 brokers
oversubscribed over just a few nodes.
|
Thanks for the feedback @trws! I think we know that the 2*ntasks "streams" per job isn't going to work and have plans to fix for scale with a new I/O scheme. Hopefully the workarounds are tolerable for the splash use case and sorry about the issues! |
No worries, we’ve gotten it to where we’ve worked around this for
splash.
…On 13 Apr 2018, at 14:59, Mark Grondona wrote:
Thanks for the feedback @trws! I think we know that the 2*ntasks
"streams" per job isn't going to work and have plans to fix for scale
with a new I/O scheme. Hopefully the workarounds are tolerable for the
splash use case and sorry about the issues!
|
`flux submit -N 400 -O host-test.out hostname` has the following timing output:

After spending 16 minutes running, the host-test.out file is empty. Using `flux wreck attach` shows output from all of the nodes (and did after about 6 minutes), but nothing went into the output file at all.

dmesg output: (skipping tons of sched spew, sched did all of this in half a second)