
Debugger support for both v0.11 fork and the new execution system #2163

Closed
dongahn opened this issue May 16, 2019 · 15 comments
@dongahn
Member

dongahn commented May 16, 2019

Per our discussion on the mailing list, this ticket is created to scope the initial effort for our debuggers. For backward compatibility with TotalView, DDT, and STAT, we will need a few hooks:

  • Support for launch mode (where the application is launched under the control of a debugger)
    We will need an "ELF binary executable" that supports the MPIR debug interface. Since flux-wreckrun is a script, it cannot implement this interface itself. One option is to create an ELF binary executable wrapper. Another option would be to add a new command like flux-debug to support our tools, e.g.: totalview --args flux debug wreckrun -N 1 app

  • Support for attach mode (where users attach a debugger to the PID of the launch command)
    We need to decide which process will interface with our debuggers. Whatever we do, this should support applications launched with either flux wreckrun or flux submit. Expanding on the flux-debug idea, one option would be to allow users to start a process with flux-debug jobid. Then users can run totalview <pid of flux-debug process>. This addresses a key usability issue we currently have: users must find the node where the srun or jsrun process is running and log in to that node to attach. With this, they could attach from any node from which they can run a flux command for the instance.

  • Sync support
    We need to make sure wreck supports a sync operation so that the debuggers can form a barrier with the MPI starter process and each and every application process. This probably has not been implemented in wreck? @grondo?

  • MPIR_partial_attach_ok support
    Essentially, this means that MPI processes which are not part of the process set the debugger attaches to must be able to run past the barrier once the debugger releases the MPI starter process from it.

  • Nested instance support
    Last but not least, how to support debugging of nested instances should be discussed in depth and designed. This is also closely related to our need for hierarchical job status queries. We need to make it easier for users to find which Flux instance the target job is running in and to attach our debuggers to it.

@dongahn
Member Author

dongahn commented May 16, 2019

@lee218llnl is tagged.

@grondo
Contributor

grondo commented May 16, 2019

> Expanding on the flux-debug idea, one option would be to allow users to start a process with flux-debug jobid. Then users can run totalview <pid of flux-debug process>. This can address one key usability issue we currently have.

This is a great solution to a long-standing problem. The flux debug command (or perhaps flux job debug in the new system) could offer two subcommands, run and attach, matching your two examples above. The attach case could be made quite user friendly by emitting its pid as output, so that this would work: totalview $(flux job debug attach ID).

> We need to make sure the wreck supports sync operation so that the debuggers can form a barrier with the MPI starter process and each and every application process. This probably has not been implemented in wreck? @grondo?

If -o stop-children-in-exec is used on wreckrun or submit command, then wrexecd will set PTRACE_TRACEME on all tasks and issue the "sync" event once all tasks have been stopped in the exec(2) call.

@dongahn
Member Author

dongahn commented May 16, 2019

Yes, I like totalview $(flux job debug attach ID)!

One question on totalview --args flux debug wreckrun -N 1 app or similar.

In order for this to work, flux would have to exec the flux-debug executable (or similar) which implements the MPIR debug interface (no fork and exec). I assume this is the case. Can someone confirm? There is also a usability issue here, since debuggers like totalview stop on each exec event and pop up a prompt asking "do you want to stop on this event?" But I recently had RWS add support to automate this using a regular-expression system, so we should be able to streamline this by letting totalview detect this as an "uninteresting" exec from a resource manager and automatically continue it.

STAT/LaunchMON doesn't yet know how to handle exec events, so we need to work on this as well. (A long-standing issue on my plate...)

For MPIR_partial_attach_ok support, what will be required to continue those ptrace-stopped processes that are not attached by the debugger? I guess an extra event issued from the flux-debug command right after it returns from MPIR_Breakpoint?

@grondo
Contributor

grondo commented May 16, 2019

> flux would have to exec the flux-debug executable (or similar) which implements MPIR debug interface. (no fork and exec). I assume this would be the case. Can someone confirm?

Yes -- the flux executable calls exec() without fork() on subcommands (except, of course, the "builtin" commands).

@grondo
Contributor

grondo commented May 16, 2019

> For MPIR_partial_attach_ok support, what will be required to continue those ptrace-stopped processes that are not attached by the debugger? I guess an extra event issued from the flux-debug command right after it returns from MPIR_Breakpoint?

Yeah, there are multiple ways to handle this, I think. An event seems reasonable; or, after debugger attach, perhaps a global SIGCONT could be issued (if that doesn't affect the processes attached by the debugger).

@dongahn
Member Author

dongahn commented May 16, 2019

I think global SIGCONT is okay. I remember I had issues with this (when SLURM changed its behavior to send the additional SIGCONT), but I think I fixed it for STAT/LaunchMON. Let me check. I remember totalview was okay with this.

But flux-debug still needs to tell the execution system that the debugger attach has completed (so that the execution system can send the SIGCONT). It seems an event from flux-debug is needed regardless?

@grondo
Contributor

grondo commented May 16, 2019

> It seems an event from flux-debug is needed regardless?

Yes; for the wreck system, a signal is an event anyway. The new execution system may have a better method for signal delivery; we'll have to think about that.

@dongahn
Member Author

dongahn commented May 16, 2019

> I think global SIGCONT is okay. I remember I had issues with this (when SLURM changed its behavior to send the additional SIGCONT), but I think I fixed it for STAT/LaunchMON. Let me check. I remember totalview was okay with this.

Ok, it is LLNL/LaunchMON#16. The global SIGCONT should be fine.

@dongahn
Member Author

dongahn commented May 16, 2019

> Yes, for wreck system a signal is an event anyway. The new execution system may have a better method for signal delivery, we'll have to think about that.

Ok. I don't think I understand the new execution system well enough yet to comment. When you say a signal, are you referring to a UNIX signal?

@grondo
Contributor

grondo commented May 16, 2019

> When you say a signal, are you referring to a UNIX signal?

Yes, sorry. In the old wreck system, sending a signal to a job is via an event. We haven't designed how signals will be delivered to jobs in the new execution system, i.e. the job shell, so that is a bit up in the air.

@dongahn
Member Author

dongahn commented May 16, 2019

> Yes, sorry. In the old wreck system, sending a signal to a job is via an event. We haven't designed how signals will be delivered to jobs in the new execution system, i.e. the job shell, so that is a bit up in the air.

Ah, makes sense. Thanks.

@grondo
Contributor

grondo commented May 20, 2019

@dongahn: I added a v0.11 specific issue at flux-framework/flux-core-v0.11#12

@dongahn
Member Author

dongahn commented Jun 1, 2019

OK. The PR for the v0.11 solution has been posted here.

@dongahn
Member Author

dongahn commented Jun 8, 2019

When would be a good time to add similar support to the new execution system? I will add this to my plan.

@grondo
Contributor

grondo commented Mar 27, 2020

@dongahn, can this be closed now that MPIR support has been merged into the new exec system?

@dongahn dongahn closed this as completed Mar 27, 2020