lightweight flux-run processes #327
Comments
I'm not sure I understand what is meant by "passthrough process" here. We don't have the same design issues in flux as with srun (connect all IO back through a single process), so I don't think we'll have any of the same memory issues. In fact, in the parallel launch prototype we have now, the "front end" (flux-wreckrun) is not even really necessary, since parallel launch is actually a distributed application built on top of kvs... (Or did I totally miss the gist of this issue?)
What we have gets us a great percentage of the way to making this happen; the point is more to keep tabs on the weight of whatever process the user has connected to their terminal. The cited figure was users wanting to have thousands of sruns.
Ah, ok. That is a good point. Do users really want thousands of sruns, though? My guess is that the real use case here is "thousands of simultaneous programs" (to use our terminology). In this case I think we should target a system where the terminal-connected front end stays lightweight.

Near term I'd say #259 (backing store for kvs master) is going to be the real memory issue. Unless we move away from keeping stdio streams in kvs, 1000 jobs (simultaneous or not) will likely exhaust memory on the master node. Not to say that we should ignore the memory usage of the front-end process.
I thought the same thing when John first mentioned this, that users shouldn't want to do this anyway, but from what he says the instruction from LC has been to use srun this way. The backing store is definitely the bigger issue at the moment, along with garbage collection for the KVS (#258), but this could be an interesting trick if we can pull it off.

I was thinking that the process that actually remains after job setup need not even be a full multi-threaded process. The logic is already almost all in the broker, so if launching concludes with two or three sockets/FDs/whatever that can be fed into an epoll or similar and read/fed until a "done" signal is reached, it could be pared down to nearly nothing. I would bet it could be done in under 100k (for the terminal-connected process itself) if we were serious about it, but that would probably take an alternative libc, explicitly dropping the stack-frame size, and so on.
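Not something flux provides today, but a minimal sketch of what that pared-down passthrough loop could look like, assuming job setup leaves behind only the user's tty fd and an already-connected broker fd (the function names and the "broker closes its stream when done" convention are assumptions for illustration):

```c
/* Sketch only: after job setup, all that remains connected to the
 * user's terminal is a tiny epoll loop shuttling bytes between the
 * tty and the broker connection until the broker indicates "done"
 * (modeled here as the broker closing its end). */
#include <stdbool.h>
#include <sys/epoll.h>
#include <unistd.h>

static void shuttle (int from, int to)
{
    char buf[4096];
    ssize_t n = read (from, buf, sizeof (buf));
    if (n > 0)
        (void)write (to, buf, n);
}

int passthrough (int tty_fd, int broker_fd)
{
    struct epoll_event ev = { .events = EPOLLIN };
    int ep = epoll_create1 (0);
    bool done = false;

    ev.data.fd = tty_fd;
    epoll_ctl (ep, EPOLL_CTL_ADD, tty_fd, &ev);
    ev.data.fd = broker_fd;
    epoll_ctl (ep, EPOLL_CTL_ADD, broker_fd, &ev);

    while (!done) {
        struct epoll_event events[2];
        int n = epoll_wait (ep, events, 2, -1);
        if (n < 0)
            break;                              /* interrupted or error */
        for (int i = 0; i < n; i++) {
            if (events[i].events & (EPOLLHUP | EPOLLERR))
                done = true;                    /* treat EOF as "done" */
            else if (events[i].data.fd == tty_fd)
                shuttle (tty_fd, broker_fd);    /* stdin -> job */
            else
                shuttle (broker_fd, tty_fd);    /* job output -> terminal */
        }
    }
    close (ep);
    return 0;
}
```

Nothing in a loop like this needs threads, zeromq, or the kvs client code, which is what makes an under-100k target at least conceivable.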
Yeah, I can see how what you describe would significantly reduce the memory footprint. Oh, I also forgot that users of the flux local connector don't actually use zeromq, so they don't necessarily have that zeromq thread. Did you get any information on how users use this style of launching 1000s of sruns? I mean, if they are all connected to the terminal, then they must be running them in the background?
I only got relatively high-level information. As you say, they must be running them in the background, but users may still depend on the process existing as a handle. The one thing I'm not sure I've mentioned is that it sounds like this is the preferred method for waiting for jobs to finish before enqueueing more. Rather than using squeue or sinfo and pounding away by polling, a number of services, ATS for example, just run one srun per job and hold onto it to detect completion. John mentioned having some test cases that he would be willing to share with us and even make public, but he hasn't gotten to sending them along yet. As soon as I hear back I'll post some more information.
Data point: on my Ubuntu VM with 1 GB of RAM, I was able to get just short of 1000 flux-pings running against a size=3 session before things went sideways. top was showing a resident set size of around 1.6 MB per process.
I think if we actually worked on it we could shrink this down. As you mentioned, the KVS footprint will probably be the next big issue, but it's solvable. This is a good exercise, I think.
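For anyone repeating the measurement, a small Linux-only sketch of pulling the same RSS number programmatically from /proc rather than reading the RES column in top (nothing flux-specific here, just procfs):

```c
/* Hypothetical helper: report a process's resident set size by reading
 * the VmRSS field from /proc/<pid>/status (Linux-specific). */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

long rss_kb (pid_t pid)
{
    char path[64], line[256];
    long kb = -1;

    snprintf (path, sizeof (path), "/proc/%d/status", (int)pid);
    FILE *fp = fopen (path, "r");
    if (!fp)
        return -1;
    while (fgets (line, sizeof (line), fp)) {
        if (strncmp (line, "VmRSS:", 6) == 0) {
            sscanf (line + 6, "%ld", &kb);   /* value is reported in kB */
            break;
        }
    }
    fclose (fp);
    return kb;
}
```

Sampling this across all the flux-ping pids would make the ~1.6 MB figure easy to track while shrinking it.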
Actually, 1000 seems to be a magic number of some kind. I get an assertion in zuuid in the broker right after starting the 1000th client. Hmm...
That may be where you run out of file descriptors or process/thread contexts; the default per-user limit is only 1024 FDs, though it's a little suspicious that your user would have exactly 24 descriptors already in use... We're likely to hit this limit well before the memory limit on LC systems, but many of them set the user-requestable maximum significantly higher anyway. I usually keep limits on FDs and processes in the tens of thousands, and file-system watches in the millions; running out of these never goes well...
FD limits are set very high on LC systems (on the order of 10-16K per process) because of srun/mvapich limitations (it used to be 2 fds for every task), though the soft limit is still 1024 -- but that is per process. I'm not aware of a per-user limit on file descriptors. I do think you're right, though, that @garlick may be hitting the fd limit in the broker if the local connector results in a file descriptor per connection. This might be a good test case for error handling too. @garlick, want to try bumping the fd limit?
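For the record, a sketch of what bumping the descriptor limit from inside a test program could look like, assuming the standard getrlimit/setrlimit interface (the shell equivalent is simply raising ulimit -n before starting the clients):

```c
/* Sketch: raise this process's open-file soft limit to its hard limit
 * before spawning many clients, roughly equivalent to `ulimit -n <hard>`
 * in the shell.  Children inherit the raised limit. */
#include <stdio.h>
#include <sys/resource.h>

int raise_nofile_limit (void)
{
    struct rlimit rl;

    if (getrlimit (RLIMIT_NOFILE, &rl) < 0) {
        perror ("getrlimit");
        return -1;
    }
    rl.rlim_cur = rl.rlim_max;            /* soft limit -> hard limit */
    if (setrlimit (RLIMIT_NOFILE, &rl) < 0) {
        perror ("setrlimit");
        return -1;
    }
    printf ("open file limit now %llu\n", (unsigned long long)rl.rlim_cur);
    return 0;
}
```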
Just because it is easy, I tried this with
Most impressive is that the broker stayed at a lean 9 MB after many, many of these runs.
Wow, that's impressively low usage. Maybe this will just be a non-issue.
Well, that's just I tried
I was also able to cause some segfaults in the subprocess management code under very high load, so there's some work to be done here. :-(
It seems like this discussion didn't boil down to any specific issues to solve, so closing.
As part of a discussion on current job scheduler issues, the memory use of the srun command itself came up as a problem, which I had not expected. Since we have per-node services that provide the actual management and communication, it might be worth designing the srun equivalent as a lightweight passthrough to that functionality and intentionally trimming it down, or perhaps setting everything up and then exec'ing into a cheaper passthrough process at the end. Socket exhaustion would still be an issue, and I'm not sure there is a way around that, whereas memory use should be relatively easy to cut down if we target it.
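To make the "exec into a cheaper passthrough" idea concrete, here is a rough sketch; flux-attach-lite and the FLUX_ATTACH_FD environment variable are invented names for illustration, not anything that exists in flux today:

```c
/* Hypothetical sketch of "set everything up, then exec into a cheaper
 * passthrough": after the heavyweight front end has submitted the job,
 * it replaces itself with a tiny helper, handing over only the fd of
 * the already-open broker connection.  "flux-attach-lite" and the
 * FLUX_ATTACH_FD convention are invented for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void exec_passthrough (int broker_fd, const char *jobid)
{
    char fdstr[16];

    snprintf (fdstr, sizeof (fdstr), "%d", broker_fd);
    setenv ("FLUX_ATTACH_FD", fdstr, 1);   /* fd survives exec unless CLOEXEC is set */

    /* Replace this (large) process image with the minimal one; on
     * success nothing below ever runs, and the user's shell still sees
     * a single child process as its handle on the job. */
    execlp ("flux-attach-lite", "flux-attach-lite", jobid, (char *)NULL);

    perror ("execlp flux-attach-lite");
    exit (1);
}
```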