Parallel MPI Debugging with tmux-mpi (Python and C!)
Having trouble with something not working in Firedrake in parallel and want to debug? `tmux-mpi` (available here) is a really handy tool that lets you launch a `tmux` session for a chosen number of MPI ranks: you get a `tmux` "window" for each MPI rank which tracks the process running on that rank. This is super useful for debugging.

This wiki note assumes you are aware of `tmux`: it's a handy tool for creating terminal sessions with tabs and windows entirely within a terminal window (the `tmux` "windows" are not actual windows; they're all accessed within a single terminal. It's really cool!). `tmux` sessions can be detached from and reattached to, e.g. for running processes on a remote computer where you don't want to maintain an SSH session. For more see the `tmux` documentation.

From here it is assumed you have installed `tmux-mpi` and checked it works.
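One quick way to check is to run a trivial MPI program under `tmux-mpi` and attach to the session. Below is a minimal sketch; the filename `hello.py` and the use of `mpi4py` (which a Firedrake environment provides) are assumptions for illustration:

```python
# hello.py - hypothetical smoke test for tmux-mpi
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"Hello from rank {comm.rank} of {comm.size}")
comm.Barrier()  # make all ranks meet here so every window shows output
input("Press enter to exit")  # keep the process (and its tmux window) alive
```

Running `tmux-mpi 3 $(which python) hello.py` and then attaching with `tmux attach -t tmux-mpi` should show one window per rank, each printing its own rank number.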
We will consider a simple Python script called `test.py`:

```python
import sys

import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

if PETSc.COMM_WORLD.rank == 0:
    PETSc.Vec().create(comm=PETSc.COMM_SELF).view()
```
If you try to run this you will get an error:

```
$ python test.py
Vec Object: 1 MPI processes
  type not yet set
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 50176059) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=50176059
:
system msg for write_line failure : Bad file descriptor
```
or in parallel:

```
$ mpiexec -n 3 python test.py
Vec Object: 1 MPI processes
  type not yet set
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 50176059) - process 0
```
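Incidentally, the crash comes from calling `view()` on a `Vec` whose type was never set, so PETSc ends up calling a null function pointer. For reference, a minimal sketch of a fixed version (the size and type chosen here are arbitrary, for illustration only):

```python
import sys

import petsc4py
petsc4py.init(sys.argv)
from petsc4py import PETSc

if PETSc.COMM_WORLD.rank == 0:
    vec = PETSc.Vec().create(comm=PETSc.COMM_SELF)
    vec.setSizes(4)                   # give the Vec a size...
    vec.setType(PETSc.Vec.Type.SEQ)   # ...and a concrete type before viewing
    vec.view()                        # now prints the Vec instead of crashing
```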
Debugging this in a Python debugger is really straightforward and works on most platforms. To debug the `mpiexec -n 3 python test.py` example:
- Install `pdb++` (see https://pypi.org/project/pdbpp/), which is much better than the standard `pdb`.
- Start a `tmux` session with a `tmux` "window" for each MPI rank, each of which opens in the debugger, with `$ tmux-mpi 3 $(which python) -m pdb test.py`.
- Open a new terminal on the same computer and attach to the `tmux` session with `tmux attach -t tmux-mpi`. You should see that `pdb++` has stopped at the initial import:
```
[2] > /Users/rwh10/firedrake/src/firedrake/test.py(1)<module>()
-> import sys
(Pdb++)
```
Different "windows" can be jumped between with the shortcut ctrl-b n
and you can then step through the program on each rank.
Note you need to do this manually - you can't expect a step on one rank to step on another rank!
If you don't want to jump straight into the debugger, but instead want to run through to a `pdb` breakpoint (set either with `breakpoint()` in recent Python versions, or with `import pdb; pdb.set_trace()` in general), then you can omit the `-m pdb` and simply run `$ tmux-mpi 3 $(which python) test.py`.
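For example, to drop into the debugger on a single rank only, you might guard the breakpoint by rank (a minimal sketch; the choice of rank 0 is arbitrary):

```python
from petsc4py import PETSc

if PETSc.COMM_WORLD.rank == 0:
    breakpoint()  # only rank 0 stops; its tmux window shows the pdb prompt
```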
This requires you to be on a platform where your debugger is `gdb`, since MPICH (which Firedrake uses by default) does not play nicely with `lldb` and will give very confusing results. In practice this means that you cannot easily debug MPI C code on macOS. NOTE: this may no longer be true; see Debugging C kernels with lldb on MacOS.
To debug the `mpiexec -n 3 python test.py` example in a C debugger:
- Run `$ tmux-mpi 3 gdb --ex run --args $(which python) test.py`. This will create a `tmux` session with 3 `tmux` "windows", each running `test.py` in `gdb`.
- Open a new terminal on the same computer and attach to the `tmux` session with `tmux attach -t tmux-mpi`. The program will have stopped where the segfault happened on rank 0, whilst the other ranks continue to run. You'll be able to see this by switching windows with `ctrl-b n` until you see:
```
...
[New Thread 0x7fffa4073700 (LWP 22242)]
[New Thread 0x7fffa1872700 (LWP 22243)]
Vec Object: 1 MPI processes
  type not yet set

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb)
```
The other ranks can be broken into at any time with an interrupt signal (`ctrl-c`) to see where they are. This is particularly useful when trying to debug hanging programs where no rank processes have actually errored.
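To see this in action you could deliberately deadlock a script and inspect it under `gdb` as above. A minimal sketch of such a hang, assuming `mpi4py` (which a Firedrake environment provides) and the hypothetical filename `hang.py`:

```python
# hang.py - deliberately deadlocks across all ranks
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.rank != 0:
    # ranks 1+ wait here forever; rank 0 skips the barrier and instead
    # blocks in the collective MPI_Finalize at interpreter exit
    comm.Barrier()
```

Running `tmux-mpi 3 gdb --ex run --args $(which python) hang.py`, attaching, and hitting `ctrl-c` in each window then shows where every rank is stuck, even though nothing has errored.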
Note that PETSc has a `-start_in_debugger` option, which causes the program on all ranks to start in the debugger in multiple xterm windows. For this example, run `$ mpiexec -n 3 $(which python) test.py -start_in_debugger` (on macOS you might first need to do something like `export PETSC_OPTIONS='-on_error_attach_debugger lldb -debug_terminal "xterm -e"'`).
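This works here only because `test.py` passes `sys.argv` to `petsc4py.init`, so PETSc sees the command-line option. If a script does not do this, one alternative (a sketch, not the only way) is to inject the option into the argument list at initialisation:

```python
import sys

import petsc4py
# append the debugger option to the arguments PETSc parses at init time
petsc4py.init(sys.argv + ["-start_in_debugger"])
from petsc4py import PETSc
```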
- Reuben Nixon-Hill May 2021