
Add totalview_jobid symbol into flux-job #3110

Closed
dongahn opened this issue Aug 4, 2020 · 9 comments

dongahn commented Aug 4, 2020

Tools like STAT and TotalView need to be able to launch and co-locate their tool daemons with the target MPI processes.

For TotalView, I was able to get around this by using its serial launching mode (using ssh or rsh to launch a daemon on each compute node). But I don't think I can get around this for STAT, which only supports bulk launching.

I can add the totalview_jobid symbol into flux job (which gets filled in with the jobid); the tools then extract it and use it to expand their launch strings.
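
Roughly, the flow would look like the sketch below. Everything here is illustrative: the launch string, the %J placeholder, and tool_daemon are made-up names, not an existing interface.

```sh
# Sketch only: LAUNCH_STRING, %J, and tool_daemon are hypothetical.
JOBID="ƒ7LvkmLT"                            # value the tool reads from totalview_jobid
LAUNCH_STRING='flux exec -r 0-3 tool_daemon --jobid=%J'
eval "${LAUNCH_STRING//%J/$JOBID}"          # expand %J, then bulk-launch the daemons
```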

But we currently don't have a way to co-locate tool daemons with the processes, given the jobid.

BTW, I think tool launching should bypass scheduling, so flux exec or something similar seems to make more sense as the bulk launcher. flux exec won't take a JOBID as its input though, correct? We probably don't have to worry about the scalability of flux exec just yet.

Maybe I can turn the hostname list into a rank list and use flux exec...

In general, it feels like some minimal support within flux-core can make tool launching easier.

@grondo: any ideas?


dongahn commented Aug 4, 2020

Tagging @lee218llnl and petertea.


dongahn commented Aug 4, 2020

Maybe we can just add a --jobid=<JOBID> option to flux exec? flux exec can fetch the rank set from the R of JOBID and convert it into --rank.

Tools are so used to only being able to launch one daemon per compute node (per rank, in our case) that this should suffice. Even when they need to launch multiple daemons, they first launch one "super"-daemon, which launches the rest.
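
For illustration, a minimal sketch of what such a --jobid option might do under the hood, in today's terms (the option itself is hypothetical; tool_daemon is a placeholder):

```sh
# Hypothetical expansion of `flux exec --jobid=$JOBID tool_daemon`:
JOBID="ƒ7LvkmLT"                           # placeholder jobid
RANKS=$(flux jobs -no '{ranks}' "$JOBID")  # fetch the job's rank set, e.g. "2-5"
flux exec --rank="$RANKS" tool_daemon      # one daemon per broker rank
```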


grondo commented Aug 4, 2020

flux exec is only accessible by the instance owner, so this would not work for multi-user jobs in general -- though perhaps in most cases the job being debugged would be running within a single-user instance (batch job), so maybe this is not a deal breaker.

Also, once jobs are being contained, using rsh or flux exec to launch debugger servers or tool daemons may not work unless there is a method to "enter" the container of the job.

Long term, we had planned to use the solution proposed in #2298. (Note that a flux exec --jobid=JOBID solution is proposed there as well.) The difference is that the exec server must run in the job shell so that 1) guests can gain exec access, and 2) the spawned subprocesses are launched in the same container as the job shell.

As it happens, I just did a proof of concept implementation and I think this is doable, so good timing.


dongahn commented Aug 5, 2020

> Also, once jobs are being contained, using rsh or flux exec to launch debugger servers or tool daemons may not work unless there is a method to "enter" the container of the job.

By containment, do you mean cgroups? At what level will the cgroup be imposed? The implicit Flux instance launched by flux mini batch will be contained, correct? How about the parallel jobs running inside that instance? Will they be contained too? I will look at #2298.

> As it happens, I just did a proof of concept implementation and I think this is doable, so good timing.

In terms of pursuing a parallel track, what is important for me at this point is the bulk launch interface itself. I can try that interface (whether it is flux exec or something else) to test STAT under a single-user Flux instance without cgroups. Then, when containment-capable bulk launching comes, I can retest.

With the exec service in the job shell, would the user interface still be flux exec, or something else?


grondo commented Aug 5, 2020

> By containment, do you mean cgroups?

cgroups and/or namespaces. E.g., when using a polyinstantiated /tmp, a login session to a node via ssh or rsh may not be able to see the /tmp used by the job, so the local FLUX_URI may not be available.

> With the exec service in the job shell, would the user interface still be flux exec, or something else?

I think that is TBD. It may depend on whether it makes sense to make flux exec job-aware, or whether it makes more sense to have a flux job rexec (the amount of code refactoring required might come into play).
My first inclination would be to use flux exec --jobid=JOBID as you described above.

BTW, for a single-user instance, flux exec would work for now. Try something like this:

```
ƒ(s=95,builddir) grondo@fluke2:~/git/flux-core.git$ flux exec -r `flux jobs -no {ranks} ƒ7LvkmLT` hostname
fluke12
fluke13
fluke11
fluke14
ƒ(s=95,builddir) grondo@fluke2:~/git/flux-core.git$ flux exec -r `flux jobs -no {ranks} ƒ6xb1b35` hostname
fluke9
fluke8
fluke7
fluke10
```


dongahn commented Aug 5, 2020

> cgroups and/or namespaces. E.g., when using a polyinstantiated /tmp, a login session to a node via ssh or rsh may not be able to see the /tmp used by the job, so the local FLUX_URI may not be available.

Ah... now I'm making the connection. This work actually has two use cases, then: 1) job listing for nested instances; 2) tool bulk launching support.

> BTW, for a single-user instance, flux exec would work for now. Try something like this:

Let me see if I can make some progress with this. Then, when the new interface comes, I can swap.

My guess is this should unblock me; we will see.

dongahn changed the title from "Bulk launch support for tool daemons" to "Add totalview_jobid symbol into flux-job" on Aug 5, 2020

dongahn commented Aug 5, 2020

I changed the issue title to "Add totalview_jobid symbol into flux-job".

This variable is not part of the MPIR debug interface, but it is used in RMs like SLURM to allow a debugging tool to fetch the target jobid directly from the address space of the launcher. Both STAT and TotalView depend on this variable for their bulk launching.

Please see LLNL/LaunchMON#50 (comment)
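
To illustrate the convention (a sketch, not a prescribed interface; LAUNCHER_PID stands in for the pid of the flux job launcher process, and gdb stands in for the tool's symbol-reading machinery):

```sh
# Read the jobid out of the launcher's address space, SLURM-style:
gdb --batch -p "$LAUNCHER_PID" -ex 'printf "%s\n", totalview_jobid'
```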


grondo commented Aug 31, 2020

@dongahn, can this issue be closed after the merge of #3130?


dongahn commented Aug 31, 2020

Yes!
