
lightweight flux-run processes #327

Closed
trws opened this issue Aug 13, 2015 · 14 comments

trws (Member) commented Aug 13, 2015

As part of a discussion on current job scheduler issues, the memory use of the srun command itself came up as a problem, which I had not expected. Since we have per-node services that provide the actual management and communication, it might be worth designing the srun equivalent to be a lightweight passthrough to that functionality and intentionally trimming it down, or perhaps setting everything up and then exec'ing into a cheaper passthrough process at the end. Socket exhaustion would still be an issue, and I'm not sure there is a way around that, whereas memory use should be relatively easy to cut down if we target it.

grondo (Contributor) commented Aug 13, 2015

I'm not sure I understand what is meant by "passthrough process" here.

We don't have the same design issues in flux as with srun (connect all IO back through a single process), so I don't think we'll have any of the same memory issues.

In fact, in the parallel launch prototype we have now, the "front end" (flux-wreckrun) is not even really necessary since parallel launch is actually a distributed application built on top of kvs...

(Or did I totally miss the gist of this issue?)

trws (Member, author) commented Aug 13, 2015

What we have is a great percentage of the way to making this happen; the point is more to keep tabs on the weight of whatever the user has connected to their terminal. The cited figure was users wanting "thousands of fruns" running on a node simultaneously. What I was thinking of when I wrote "passthrough process" was having the wreckrun successor do all the necessary setup for a job, then reduce itself to just the file descriptors connected to the broker and exec into a super-lightweight process that just keeps hold of the terminal and passes IO through as needed. This is not to say that the current version is heavy (it takes about 3 MB where srun is closer to 20, IIRC), but that's still an unfortunately large amount if there are to be a huge number of them.
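
(As a rough illustration of that setup-then-exec handoff, a minimal C sketch; the "flux-passthrough" helper and the fd-passing convention are hypothetical, not an existing flux-core interface:)

/* Sketch: after job setup, keep only the broker fd and the terminal,
 * and replace this process image with a tiny passthrough binary. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void exec_passthrough (int broker_fd)
{
    char fdstr[16];
    snprintf (fdstr, sizeof (fdstr), "%d", broker_fd);

    /* The heavyweight image (reactor, watchers, etc.) is replaced wholesale;
     * only broker_fd and the terminal fds survive the exec. */
    char *argv[] = { "flux-passthrough", fdstr, NULL };
    execvp ("flux-passthrough", argv);

    perror ("execvp flux-passthrough");  /* reached only if exec fails */
    exit (1);
}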

grondo (Contributor) commented Aug 13, 2015

Ah, ok. That is a good point.

Do users really want "thousands of flux run" processes, all connected to their terminal, though? I think this is based on an assumption that the srun equivalent will be required, whereas it really is not.

My guess is that the real use case here is "thousands of simultaneous programs" (to use our terminology). In this case I think we should target a system where flux run processes do not persist, and therefore the memory usage of an individual process becomes moot. We can then provide a tool or gateway to monitor the thousands of running programs with a top-like or other console based utility, and this is what would be connected to their terminal?

Near term I'd say #259 (backing store for kvs master) is going to be the real memory issue. Unless we move away from keeping stdio streams in kvs then likely 1000 jobs (simultaneous or not) will exhaust memory on the master node.

Not to say that we should ignore the memory usage of flux run, though; we could likely do a little better than we are now. But with at least a couple of threads (zeromq and main), a flux reactor, and then the various watchers, we might not be able to pare it down significantly.
(I'm not sure how you'd get more lightweight than what is there now, I mean)

trws (Member, author) commented Aug 13, 2015

I thought the same thing when John first mentioned this (that users shouldn't want to do this anyway), but from what he says, the instruction from LC has been to use sruns in place of batches and API calls to try to keep the simultaneous submission volume down. As a result, he was predicting that users will continue to just spam the system with tons of flux run jobs even if there is a better way, purely from inertia.

The backing store is definitely the bigger issue at the moment, along with garbage collection for the KVS (#258), but this could be an interesting trick if we can pull it off. I was thinking that what actually remains after job setup need not even be a full multi-threaded process. The logic is already almost all in the broker, so if launching concludes with two or three sockets/FDs/whatever that can be fed into an epoll or similar and read/fed until a "done" signal is reached, the remaining process could be pared down to nearly nothing. I would bet the terminal-connected process itself could be done in under 100 KB if we were serious about it, but that would probably take an alternative libc, explicitly dropping the stack-frame size, -Os, and a little re-architecting of the actual run service to offer a compatible interface. The approach may not be a great fit for this, but it's what came to mind.
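
(A minimal sketch of the epoll loop such a pared-down process might run, assuming a single inherited broker fd and treating EOF as the "done" signal; illustrative only, not flux-core code:)

/* Shuttle bytes between the terminal and the broker connection until
 * either side closes, which stands in for the "done" signal here. */
#include <sys/epoll.h>
#include <unistd.h>

static int run_passthrough (int broker_fd)
{
    int ep = epoll_create1 (0);
    struct epoll_event ev = { .events = EPOLLIN };
    char buf[4096];

    ev.data.fd = STDIN_FILENO;
    epoll_ctl (ep, EPOLL_CTL_ADD, STDIN_FILENO, &ev);
    ev.data.fd = broker_fd;
    epoll_ctl (ep, EPOLL_CTL_ADD, broker_fd, &ev);

    for (;;) {
        struct epoll_event out;
        if (epoll_wait (ep, &out, 1, -1) <= 0)
            break;
        int src = out.data.fd;
        int dst = (src == broker_fd) ? STDOUT_FILENO : broker_fd;
        ssize_t n = read (src, buf, sizeof (buf));
        if (n <= 0 || write (dst, buf, n) < 0)
            break;                       /* EOF or error: we're done */
    }
    close (ep);
    return 0;
}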

grondo (Contributor) commented Aug 13, 2015

Yeah, I can see how what you describe would significantly reduce memory footprint.
You would basically be moving the tiny bit of advanced functionality from the flux run process into a module in the broker, and then you'd need a way for the itty-bitty run to connect a set of simple fds directly to the module. (It could also be that flux run in default mode would satisfy your requirements.) The module would need to be a little more complex because it would have to multiplex the 1000s of jobs, and you might just be moving memory usage from one place to another. But overall it does sound like a neat and interesting idea.
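
(For illustration of where that memory would move, a hypothetical per-client structure on the broker-module side; none of these names exist in flux-core:)

/* In the broker module, each attached lightweight client costs roughly
 * one fd plus a small struct, all multiplexed on a single epoll set. */
#include <sys/epoll.h>
#include <stdint.h>
#include <stddef.h>

struct run_client {
    int fd;                  /* connection to the itty-bitty run process */
    int64_t jobid;           /* job whose IO/status this client follows */
    unsigned int io_open:1;  /* still forwarding stdio? */
};

struct run_module {
    int epfd;                /* one epoll set for all clients */
    struct run_client *clients;
    size_t nclients;
};

/* Register a newly attached client with the multiplexer. */
static int run_module_add_client (struct run_module *m, struct run_client *c)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.ptr = c };
    return epoll_ctl (m->epfd, EPOLL_CTL_ADD, c->fd, &ev);
}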

Oh, I also forgot that users of the flux local connector don't actually use zeromq, so they don't have that zeromq thread necessarily.

Did you get any information on how users use this style of launching 1000s of sruns? I mean if they are all connected to the terminal then they must be running them in the background?

trws (Member, author) commented Aug 13, 2015

I only got relatively high-level information. As you say, they must be running them in the background, but users may still depend on the process existing as a handle. The one thing I'm not sure I've mentioned is that this sounds like the preferred method for waiting for jobs to finish before enqueueing more. Rather than pounding away at squeue or sinfo by polling, a number of services, ATS for example, just run one srun per job and hold onto it to detect completion. John mentioned having some test cases that he would be willing to share with us and even make public, but he hasn't gotten around to sending them along yet. As soon as I hear back I'll post more information.

garlick (Member) commented Aug 14, 2015

Data point: on my Ubuntu VM with 1 GB of RAM, I was able to get just short of 1000 flux-pings running against a size=3 session before things went sideways. top was showing a resident set size of around 1.6 MB per process.

#!/bin/bash
# start ~1000 backgrounded flux-ping clients against rank 0, staggered by 100ms
for i in `seq 0 999`; do
    flux ping 0 &
    sleep 0.1
done

I think if we actually worked on it we could shrink this down.

As you mentioned the KVS footprint will probably be the next big issue, but solvable.

This is a good exercise I think.

garlick (Member) commented Aug 14, 2015

Actually, 1000 seems to be a magic number of some kind. I get an assertion in zuuid in the broker right after starting the 1000th client. Hmm...

trws (Member, author) commented Aug 14, 2015

That may be where you run out of file descriptors or process/thread contexts; the default per-user limit is only 1024 FDs, though it's a little suspicious that your user would have exactly 24 descriptors already in use... We're likely to hit this limit well before the memory limit on LC systems, but many of them set the user-requestable maximum significantly higher anyway. I usually keep my limits on FDs and processes in the tens of thousands, and file-system watches in the millions; running out of these never goes well...

grondo (Contributor) commented Aug 14, 2015

FD limits are set very high on LC systems (on the order of 10-16K per process) because of srun/mvapich limitations (used to be 2 fds for every task), though the soft limit is still 1024 -- but that is per process. I'm not aware of a per-user limit on file descriptors.

I do think you're right, though, that @garlick may be hitting the fd limit in the broker, if the local connector results in a file descriptor per connection. This might be a good test case for error handling too (set the fd limit low and make sure errors are handled cleanly).

@garlick want to try bumping ulimit -n in broker and see if that has an impact (or do we want to track this in a different issue)? If we're worried about fd scalability here (probably not a real issue), then we might want to investigate an alternate, high throughput connector that uses messages instead of an fd per client. (Just a wild idea.)
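
(For reference, a small sketch of what bumping the fd limit amounts to in code: raise the soft RLIMIT_NOFILE to the hard limit before accepting connections. Illustrative only, not an existing flux-core call site:)

/* Raise the soft fd limit to the hard limit so a single broker process
 * can hold more client connections before hitting EMFILE. */
#include <sys/resource.h>
#include <stdio.h>

static int raise_nofile_limit (void)
{
    struct rlimit rl;
    if (getrlimit (RLIMIT_NOFILE, &rl) < 0) {
        perror ("getrlimit");
        return -1;
    }
    rl.rlim_cur = rl.rlim_max;   /* soft limit -> hard limit */
    if (setrlimit (RLIMIT_NOFILE, &rl) < 0) {
        perror ("setrlimit");
        return -1;
    }
    return 0;
}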

grondo (Contributor) commented Aug 14, 2015

Just because it is easy, I tried this with flux start -N4 on one of our systems. I was easily able to run 2048 flux ping processes. At 4096, the flux cmd driver started getting EAGAIN from fork. I bumped ulimit -n up to the hard limit (of 32K, I think) and was able to run without error. I successfully ran up to 8192 simultaneous ping processes without incident, and didn't even come close to exhausting memory -- something to investigate later: the memory usage on the node didn't go up linearly with the number of flux-ping processes, unless I'm being dumb.

(flux-1120087.0-0) grondo@hype345:~/git/flux-core.git$ pgrep lt-flux-ping | wc -l 
8192
(flux-1120087.0-0) grondo@hype345:~/git/flux-core.git$ cat /proc/meminfo
MemTotal:       32644432 kB
MemFree:        26685028 kB
Buffers:               0 kB
Cached:           167420 kB
SwapCached:            0 kB
Active:          2756460 kB
Inactive:         108936 kB
Active(anon):    2734824 kB
Inactive(anon):    69864 kB
Active(file):      21636 kB
Inactive(file):    39072 kB

Most impressive is that the broker stayed a lean 9 MB after many, many of these runs.

trws (Member, author) commented Aug 14, 2015

Wow, that's impressively low usage. Maybe this will just be a non-issue.

grondo (Contributor) commented Aug 14, 2015

Well, that's just flux ping ;-)

I tried flux exec next, and while I was able to run 4096 processes, things started to fall apart a bit over multiple runs; I'm pretty sure I've got a memory leak somewhere in that code (and perhaps an fd leak as well).

(flux-1120088.0-0) grondo@hype345:~/git/flux-core.git$ for i in `seq 1 4096`; do (src/cmd/flux exec -n sh -c 'sleep 60; hostname' &); done
(flux-1120088.0-0) grondo@hype345:~/git/flux-core.git$ pgrep flux-exec | wc -l
4096
(flux-1120088.0-0) grondo@hype345:~/git/flux-core.git$ cat /proc/meminfo
MemTotal:       32644432 kB
MemFree:        28676172 kB
Buffers:               0 kB
Cached:           161524 kB
SwapCached:            0 kB
Active:          2247388 kB
Inactive:         107904 kB
Active(anon):    2230660 kB
Inactive(anon):    70156 kB

I was also able to cause some segfaults in the subprocess management code under very high load, so there is some work to be done here. :-(

garlick (Member) commented Dec 28, 2016

It seems like this discussion didn't boil down to any specific issues to solve, so closing.

garlick closed this as completed Dec 28, 2016