Add totalview_jobid symbol into flux-job #3110
Comments
Tagging @lee218llnl and petertea. |
Maybe we can just add Tools are so used to only be able to launch one daemon per compute node (rank in our case), it seems this should suffice. Even when they need to launch multiple daemons, they first launch one "super"-daemon and launch the rest. |
Also, once jobs are being contained, using Long term we had planned to use the solution proposed in #2298. (Note that a As it happens, I just did a proof of concept implementation and I think this is doable, so good timing. |
By containment, do you mean cgroups? At what level will the cgroup be imposed? The implicit flux instance launched by flux mini batch will be contained, correct? What about the parallel jobs running inside that instance? Will they be contained too? I will look at #2298.
In terms of creating a parallel task track, what is important for me is the bulk launch interface itself at this point. I can try that interface (whether it is flux exec or something else) to test STAT under single user flux without cgroup. Then, when the containment-capable bulk launching comes, I can retest. W/ the exec service in the job shell, would the user interface still be flux exec or something else? |
cgroup, and/or namespace. E.g. when using a polyinstantiated
I think that is TBD. It may depend on whether it makes sense to make BTW, for single user instance
|
Ah... Now I'm making the connection. This work actually has two use cases then! 1) job listing of nested instances; 2) tool bulk launching support.
Let me see if I can make some progress with this. Then, when the new interface comes, I can swap. My guess is this should unblock me; we will see. |
I changed the issue title to "Add totalview_jobid symbol into flux-job". This variable is not part of the MPIR debug interface, but resource managers like SLURM use it to let a debugging tool fetch the target jobid directly from the launcher's address space. Both STAT and TotalView depend on this variable for their bulk launching. Please see LLNL/LaunchMON#50 (comment) |
Yes! |
Tools like STAT and TotalView need to be able to launch their tool daemons and co-locate them with the target MPI processes.
For TotalView, I was able to get around this by using its serial launching mode (using ssh or rsh to launch one daemon per compute node). But I don't think I can get around this for STAT, which only supports bulk launching.
I can add the totalview_jobid symbol into flux job (it gets filled with the jobid); the tools then extract it and use it to expand their launching string.
But we currently don't have a way to co-locate tool daemons with the processes, given the jobid.
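As a rough illustration of the expansion step described above, here is a minimal Python sketch. The placeholder syntax (`%j` for the jobid, `%d` for the daemon path), the `expand_launch_string` helper, and the `some-bulk-launcher` command are all hypothetical and chosen only for illustration; the actual template format used by a given tool, and the eventual Flux bulk-launch command, may differ.

```python
def expand_launch_string(template: str, jobid: str, daemon: str) -> str:
    """Substitute hypothetical placeholders in a tool launch template.

    %j -> jobid the tool read from the launcher's totalview_jobid symbol
    %d -> path to the tool daemon executable
    """
    return template.replace("%j", jobid).replace("%d", daemon)


# "some-bulk-launcher" is a stand-in, not a real Flux command.
template = "some-bulk-launcher --jobid=%j %d"
print(expand_launch_string(template, "1234567", "/usr/bin/tool_daemon"))
```

The point of the sketch is only that the jobid is the one piece of information the tool cannot know ahead of time, which is why it has to be readable from the launcher.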
BTW I think tool launching should bypass scheduling so flux exec or similar seems to make more sense as the bulk launcher.
flux exec won't take a JOBID as its input, though, correct? We probably don't have to worry about the scalability of flux exec just yet.
Maybe I can turn the hostname list into a rank list and use flux exec ...
In general, it feels like some minimal support within flux-core can make tool launching easier.
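The hostname-to-rank idea could be sketched roughly as follows in Python. This is an assumption-laden illustration, not an existing flux-core facility: `hosts_to_ranks` and `flux_exec_command` are hypothetical helpers, the `rank_hosts` list (hostname of each broker rank, indexed by rank) is assumed to be obtainable from the instance by some means, and the sketch only builds the command line rather than running it. It relies on the `-r` rank option of flux exec.

```python
import shlex


def hosts_to_ranks(rank_hosts, target_hosts):
    """Return the broker ranks whose hostname appears in target_hosts.

    rank_hosts: list where index i is the hostname of broker rank i
                (assumed input; how to obtain it is left open here).
    """
    wanted = set(target_hosts)
    missing = wanted - set(rank_hosts)
    if missing:
        raise ValueError(f"no rank found for hosts: {sorted(missing)}")
    return [rank for rank, host in enumerate(rank_hosts) if host in wanted]


def flux_exec_command(ranks, daemon_argv):
    """Build a flux exec argv that targets the given ranks."""
    rank_list = ",".join(str(r) for r in ranks)
    return ["flux", "exec", "-r", rank_list, *daemon_argv]


rank_hosts = ["node1", "node2", "node3", "node4"]
ranks = hosts_to_ranks(rank_hosts, ["node2", "node4"])
print(shlex.join(flux_exec_command(ranks, ["/usr/bin/tool_daemon"])))
```

If more than one broker runs on a node, this mapping returns every rank on that host, so a tool that expects exactly one daemon per node would need to pick one rank per hostname.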
@grondo: any ideas?