Debugger support for both v0.11 fork and the new execution system #2163
Comments
@lee218llnl is tagged.
This is a great solution to a long-standing problem. The
If
Yes, I like One question on In order for this to work STAT/LaunchMON doesn't yet know how to handle For
Yes -- the
Yeah, there would be multiple ways to handle this I think. An event seems reasonable, or after debugger attach perhaps a global
I think global But still
Yes, for the wreck system a signal is an event anyway. The new execution system may have a better method for signal delivery; we'll have to think about that.
Ok, it is LLNL/LaunchMON#16. The global
Ok. I don't think I understand the new execution system well enough yet to comment. When you say a signal, are you referring to a UNIX signal?
Yes, sorry. In the old wreck system, sending a signal to a job is done via an event. We haven't designed how signals will be delivered to jobs in the new execution system, i.e. the job shell, so that is a bit up in the air.
Ah. Makes sense. Thanks.
@dongahn: I added a v0.11-specific issue at flux-framework/flux-core-v0.11#12
OK. The PR for the v0.11 solution has been posted here.
When would be a good time to add similar support for the new execution system? I will add this to my plan.
@dongahn, can this be closed now that MPIR support has been merged into the new exec system?
Per our discussion on our mailing list, this ticket is created to begin scoping the initial effort for our debuggers. For backward-compatibility support for TotalView, DDT, and STAT, we will need a few hooks:
Support for launch mode (where the application is launched under the control of a debugger)
We will need an "ELF binary executable" that supports the MPIR debug interface. `flux-wreckrun`, being a script, won't be able to implement this interface. One option is to create an ELF binary executable wrapper. Another option would be to add a new command like `flux-debug` to support our tools, e.g. `totalview --args flux debug wreckrun -N 1 app`.
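For concreteness, below is a minimal sketch of the MPIR process-acquisition symbols such an ELF front end (a wrapper binary or a hypothetical `flux-debug`) would have to export. It follows the generic MPIR interface consumed by TotalView/DDT/STAT and says nothing about how flux would actually populate the table:

```c
/* Sketch of the MPIR process-acquisition symbols an MPIR-capable
 * launcher front end would export.  Names follow the de facto MPIR
 * interface used by TotalView/DDT/STAT. */

typedef struct {
    char *host_name;        /* node on which this MPI task runs      */
    char *executable_name;  /* full path to the application binary   */
    int   pid;              /* pid of the MPI task on that node      */
} MPIR_PROCDESC;

/* Symbols the debugger looks up in the launcher executable: */
MPIR_PROCDESC *MPIR_proctable      = 0;  /* filled in once tasks exist   */
int            MPIR_proctable_size = 0;
volatile int   MPIR_debug_state    = 0;  /* 1 == MPIR_DEBUG_SPAWNED      */
int            MPIR_being_debugged = 0;  /* set by the debugger          */
int            MPIR_partial_attach_ok;   /* presence alone is meaningful */

/* The debugger plants a breakpoint here; the launcher calls it after
 * MPIR_proctable has been published. */
void MPIR_Breakpoint(void) { /* intentionally empty */ }
```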
Support for attach mode (where users attach a debugger to the PID of the launch command)
We need to decide which process will interface with our debuggers. No matter what we do, this should support applications that were launched with either `flux wreckrun` or `flux submit`. Expanding on the `flux-debug` idea, one option would be to allow users to start a process with `flux-debug jobid` and then run `totalview <pid of flux-debug process>`. This can address one key usability issue we currently have: today, users have to find where the srun or jsrun process is running and log on to that node to do the attach. With this, they could attach from any node from which they can start a flux command for the instance.
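As a rough illustration only, a hypothetical attach-mode front end might look like the sketch below; `lookup_job_tasks()` is a placeholder for whatever job-info query flux ends up providing, not an existing API:

```c
/* Hypothetical "flux-debug JOBID" front end for attach mode: build the
 * MPIR proctable for an already-running job, then wait so that a user
 * can run "totalview <pid of this process>" and let the debugger
 * harvest the table.  lookup_job_tasks() is a placeholder. */
#include <stdio.h>
#include <unistd.h>

typedef struct {
    char *host_name;
    char *executable_name;
    int   pid;
} MPIR_PROCDESC;

MPIR_PROCDESC *MPIR_proctable      = 0;
int            MPIR_proctable_size = 0;
int            MPIR_being_debugged = 0;

/* Placeholder: would query the flux instance for the job's task placement. */
static int lookup_job_tasks(const char *jobid,
                            MPIR_PROCDESC **table, int *size)
{
    (void)jobid; (void)table; (void)size;
    return -1;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "Usage: flux-debug JOBID\n");
        return 1;
    }
    if (lookup_job_tasks(argv[1], &MPIR_proctable, &MPIR_proctable_size) < 0)
        return 1;
    pause();   /* keep this PID alive for the debugger to attach to */
    return 0;
}
```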
Sync support
We need to make sure that wreck supports a sync operation so that the debuggers can form a barrier with the MPI starter process and each and every application process. This probably has not been implemented in wreck? @grondo?
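To make the sync requirement concrete, here is a sketch of the two halves of the barrier as generic MPIR behavior (not what wreck currently does): the starter publishes the proctable and stops in MPIR_Breakpoint(), while each application process spins on MPIR_debug_gate until the debugger opens it.

```c
/* Generic MPIR-style sync barrier, sketched for illustration.  The two
 * halves live in different processes (starter vs. MPI task). */
#include <unistd.h>

/* --- starter (launcher) side --------------------------------------- */
volatile int MPIR_debug_state = 0;       /* 1 == MPIR_DEBUG_SPAWNED      */

void MPIR_Breakpoint(void) { }           /* debugger breakpoints here    */

void starter_sync(void)
{
    MPIR_debug_state = 1;                /* proctable is now valid       */
    MPIR_Breakpoint();                   /* debugger attaches to tasks,
                                            then continues the starter   */
}

/* --- application (MPI task) side ----------------------------------- */
volatile int MPIR_debug_gate = 0;        /* debugger sets this to 1      */

void task_wait_for_debugger(void)
{
    while (MPIR_debug_gate == 0)         /* hold every task at startup   */
        usleep(1000);
}
```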
MPIR_partial_attach_ok support
Essentially this means that the MPI processes that are not part of the process set to which we want to attach the debugger need to be able to run out of the barrier once the debugger releases the MPI starter process from the barrier.
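A sketch of what this implies on the starter side, assuming placeholder helpers (`debugger_attached()`, `release_task()`) that are not existing flux or wreck interfaces:

```c
/* With the MPIR_partial_attach_ok symbol exported, the debugger only has
 * to release the starter from MPIR_Breakpoint(); the starter must then
 * let every task the debugger did NOT attach to run past its startup
 * gate.  The helpers below are placeholders, not flux/wreck interfaces. */
#define NTASKS 4

int MPIR_partial_attach_ok;              /* presence of the symbol is the flag */

static int  debugger_attached(int rank) { return rank == 0; /* placeholder */ }
static void release_task(int rank)      { (void)rank; /* open task's gate  */ }

void starter_after_breakpoint(void)
{
    int rank;
    for (rank = 0; rank < NTASKS; rank++)
        if (!debugger_attached(rank))
            release_task(rank);          /* non-attached tasks leave the barrier */
}
```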
Nested instance support
Last but not least, how to support debugging of nested instances should be discussed in depth and designed. This is also closely related to our needs for hierarchical job status queries. We need to make it easier for users to find which flux instance the target job is running under and to attach our debuggers to it.