
Debugger support for both v0.11 fork and the new execution system #2163

Closed
dongahn opened this issue May 16, 2019 · 15 comments
@dongahn
Member

dongahn commented May 16, 2019

Per our discussion on the mailing list, this ticket is created to scope the initial effort for our debuggers. For backward compatibility with TotalView, DDT, and STAT, we will need a few hooks:

  • Support for launch mode (where the application is launched under the control of a debugger)
    We will need an "ELF binary executable" that supports the MPIR debug interface. Since flux-wreckrun is a script, it cannot implement this interface itself. One option is to create an ELF binary executable wrapper. Another option would be to add a new command like flux-debug to support our tools, e.g.: totalview --args flux debug wreckrun -N 1 app

  • Support for attach mode (where users attach a debugger to the PID of the launch command)
    We need to decide which process will interface with our debuggers. Whatever we do, this should support applications launched with either flux wreckrun or flux submit. Expanding on the flux-debug idea, one option would be to allow users to start a process with flux-debug jobid. Then users can run totalview <pid of flux-debug process>. This addresses a key usability issue we currently have: users must find the node where the srun or jsrun process is running and log in to that node to attach. With this, they could attach from any node from which they can run a flux command for the instance.

  • Sync support
    We need to make sure wreck supports a sync operation so that the debuggers can form a barrier with the MPI starter process and each and every application process. This probably has not been implemented in wreck? @grondo?

  • MPIR_partial_attach_ok support
    Essentially, this means that MPI processes which are not part of the process set the debugger attaches to must be able to run past the barrier once the debugger releases the MPI starter process from it.

  • Nested instance support
    Last but not least, how to support debugging of nested instances should be discussed in depth and designed. This is also closely related to our need for hierarchical job status queries. We need to make it easier for users to find which Flux instance the target job is running in and to attach our debuggers to it.

@dongahn
Member Author

dongahn commented May 16, 2019

@lee218llnl is tagged.

@grondo
Contributor

grondo commented May 16, 2019

> Expanding on the flux-debug idea, one option would be to allow users to start a process with flux-debug jobid. Then users can run totalview <pid of flux-debug process>. This can address one key usability issue we currently have.

This is a great solution to a long-standing problem. The flux debug command (or perhaps flux job debug in the new system) could offer two subcommands, run and attach, matching your two examples above. The attach case could be made quite user friendly by emitting its pid as output, so that this would work: totalview $(flux job debug attach ID).

> We need to make sure the wreck supports sync operation so that the debuggers can form a barrier with the MPI starter process and each and every application process. This probably has not been implemented in wreck? @grondo?

If -o stop-children-in-exec is used on wreckrun or submit command, then wrexecd will set PTRACE_TRACEME on all tasks and issue the "sync" event once all tasks have been stopped in the exec(2) call.

@dongahn
Member Author

dongahn commented May 16, 2019

Yes, I like totalview $(flux job debug attach ID)!

One question on totalview --args flux debug wreckrun -N 1 app or similar.

In order for this to work, flux would have to exec the flux-debug executable (or similar) which implements the MPIR debug interface (no fork and exec). I assume this is the case. Can someone confirm? There is also a usability issue here, since debuggers like totalview stop on each exec event and pop up a prompt asking "do you want to stop on this event?" But I recently had RWS add support to automate this using a regular-expression system, so we should be able to streamline this by letting totalview detect this as an "uninteresting" exec from a resource manager and automatically continue it.

STAT/LaunchMON doesn't yet know how to handle exec events, so we need to work on this as well. (A long-standing issue on my plate...)

For MPIR_partial_attach_ok support, what will be required to continue those ptrace-stopped processes that are not attached by the debugger? I guess an extra event issued from the flux-debug command right after it returns from MPIR_Breakpoint?

@grondo
Contributor

grondo commented May 16, 2019

> flux would have to exec the flux-debug executable (or similar) which implements MPIR debug interface. (no fork and exec). I assume this would be the case. Can someone confirm?

Yes -- the flux executable calls exec() without fork() on subcommands (except, of course, the "builtin" commands).

@grondo
Contributor

grondo commented May 16, 2019

> For MPIR_partial_attach_ok support, what will be required to continue those ptrace-stopped processes that are not attached by the debugger? I guess an extra event issued from the flux-debug command right after it returns from MPIR_Breakpoint?

Yeah, there are multiple ways to handle this, I think. An event seems reasonable; or, after debugger attach, perhaps a global SIGCONT could be issued (if that doesn't affect the processes attached by the debugger).

@dongahn
Member Author

dongahn commented May 16, 2019

I think global SIGCONT is okay. I remember I had issues with this (when SLURM changed its behavior to send the additional SIGCONT), but I think I fixed it for STAT/LaunchMON. Let me check. I remember totalview was okay with this.

But flux-debug still needs to tell the execution system that the debugger attach has completed (so that the execution system can send the SIGCONT). It seems an event from flux-debug is needed regardless?

@grondo
Contributor

grondo commented May 16, 2019

> It seems an event from flux-debug is needed regardless?

Yes; for the wreck system, a signal is an event anyway. The new execution system may have a better method for signal delivery; we'll have to think about that.

@dongahn
Member Author

dongahn commented May 16, 2019

> I think global SIGCONT is okay. I remember I had issues with this (when SLURM changed its behavior to send the additional SIGCONT), but I think I fixed it for STAT/LaunchMON. Let me check. I remember totalview was okay with this.

Ok, it is LLNL/LaunchMON#16. The global SIGCONT should be fine.

@dongahn
Member Author

dongahn commented May 16, 2019

> Yes, for wreck system a signal is an event anyway. The new execution system may have a better method for signal delivery, we'll have to think about that.

Ok. I don't think I understand the new execution system well enough yet to comment. When you say a signal, are you referring to a UNIX signal?

@grondo
Contributor

grondo commented May 16, 2019

> When you say a signal, are you referring to a UNIX signal?

Yes, sorry. In the old wreck system, sending a signal to a job is via an event. We haven't designed how signals will be delivered to jobs in the new execution system, i.e. the job shell, so that is a bit up in the air.

@dongahn
Member Author

dongahn commented May 16, 2019

> Yes, sorry. In the old wreck system, sending a signal to a job is via an event. We haven't designed how signals will be delivered to jobs in the new execution system, i.e. the job shell, so that is a bit up in the air.

Ah, makes sense. Thanks.

@grondo
Contributor

grondo commented May 20, 2019

@dongahn: I added a v0.11 specific issue at flux-framework/flux-core-v0.11#12

@dongahn
Member Author

dongahn commented Jun 1, 2019

OK. The PR for the v0.11 solution has been posted here.

@dongahn
Member Author

dongahn commented Jun 8, 2019

When would be a good time to add similar support to the new execution system? I will add this to my plan.

@grondo
Contributor

grondo commented Mar 27, 2020

@dongahn, can this be closed now that MPIR support has been merged into the new exec system?

@dongahn dongahn closed this as completed Mar 27, 2020