
data movement: mpirun segfault #212

Closed

bdevcich opened this issue Sep 23, 2024 · 6 comments · Fixed by NearNodeFlash/nnf-mfu#23
A segfault sometimes occurs during the dm-system-test. I don't yet know how often it happens.

This is from a nightly run of the dm-system-test: https://github.com/NearNodeFlash/nnf-integration-test/actions/runs/10987568344/job/30503475490

not ok 6 copy-in-copy-out: file5-gfs2 in 28106ms
# (from function `test_copy_in_copy_out' in test file copy-in-copy-out.bats, line 115)
#   `#DW copy_out source=$src destination=$dest profile=no-xattr" \' failed
# 17.898s: job.exception type=exception severity=0 DWS/Rabbit interactions failed: workflow hit an error: DW Directive 1: Internal error: data movement operation failed during 'DataIn', message: signal: segmentation fault (core dumped): [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Walking /lus/global/actions-runner/dm-system-test/008d303a/src
# [2024-09-23T04:55:33] Walked 9 items in 0.002 secs (3836.136 items/sec) ...
# [2024-09-23T04:55:33] Walked 9 items in 0.002 seconds (3668.561 items/sec)
# [2024-09-23T04:55:33] Copying to /mnt/nnf/9898dbc3-7626-4196-b8dc-4f77f1be8d9e-0/rabbit-node-2-0
# [2024-09-23T04:55:33] Items: 9
# [2024-09-23T04:55:33]   Directories: 4
# [2024-09-23T04:55:33]   Files: 5
# [2024-09-23T04:55:33]   Links: 0
# [2024-09-23T04:55:33] Data: 60.000 B (12.000 B per file)
# [2024-09-23T04:55:33] Creating 4 directories
# [2024-09-23T04:55:33] Original directory exists, skip the creation: `/mnt/nnf/9898dbc3-7626-4196-b8dc-4f77f1be8d9e-0/rabbit-node-2-0' (errno=17 File exists)
# [2024-09-23T04:55:33] Creating 5 files.
# [2024-09-23T04:55:33] Copying data.
# [2024-09-23T04:55:33] Copy data: 60.000 B (60 bytes)
# [2024-09-23T04:55:33] Copy rate: 20.875 KiB/s (60 bytes in 0.003 seconds)
# [2024-09-23T04:55:33] Syncing data to disk.
# [2024-09-23T04:55:33] Sync completed in 0.055 seconds.
# [2024-09-23T04:55:33] Fixing permissions.
# [2024-09-23T04:55:33] Updated 9 items in 0.000 seconds (34591.038 items/sec)
# [2024-09-23T04:55:33] Syncing directory updates to disk.
# [2024-09-23T04:55:33] Sync completed in 0.001 seconds.
# [2024-09-23T04:55:33] Started: Sep-23-2024,04:55:33
# [2024-09-23T04:55:33] Completed: Sep-23-2024,04:55:33
# [2024-09-23T04:55:33] Seconds: 0.063
# [2024-09-23T04:55:33] Items: 9
# [2024-09-23T04:55:33]   Directories: 4
# [2024-09-23T04:55:33]   Files: 5
# [2024-09-23T04:55:33]   Links: 0
# [2024-09-23T04:55:33] Data: 60.000 B (60 bytes)
# [2024-09-23T04:55:33] Rate: 947.832 B/s (060 bytes in 0.063 seconds)
# --------------------------------------------------------------------------
# A system call failed during shared memory initialization that should
# not have.  It is likely that your MPI job will now either abort or
# experience performance degradation.
#
#   Local host:  nnf-dm-worker-xj45m
#   System call: unlink(2) /dev/shm/vader_segment.nnf-dm-worker-xj45m.8d600001.7
#   Error:       Operation not permitted (errno 1)
# --------------------------------------------------------------------------
# [nnf-dm-worker-xj45m:00480] *** Process received signal ***
# [nnf-dm-worker-xj45m:00480] Signal: Segmentation fault (11)
# [nnf-dm-worker-xj45m:00480] Signal code: Address not mapped (1)
# [nnf-dm-worker-xj45m:00480] Failing at address: (nil)
# [nnf-dm-worker-xj45m:00480] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x38d60)[0x7f8fa6a2ad60]
# [nnf-dm-worker-xj45m:00480] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x15887e)[0x7f8fa6b4a87e]
# [nnf-dm-worker-xj45m:00480] [ 2] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(+0x2c641)[0x7f8fa6ce0641]
# [nnf-dm-worker-xj45m:00480] [ 3] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_show_help_recv+0x167)[0x7f8fa6ce0ac7]
# [nnf-dm-worker-xj45m:00480] [ 4] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_rml_base_process_msg+0x3e1)[0x7f8fa6d35ac1]
# [nnf-dm-worker-xj45m:00480] [ 5] /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7(+0x1f1ff)[0x7f8fa6be51ff]
# [nnf-dm-worker-xj45m:00480] [ 6] /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7(event_base_loop+0x52f)[0x7f8fa6be5a9f]
# [nnf-dm-worker-xj45m:00480] [ 7] mpirun(+0x175c)[0x55e72654c75c]
# [nnf-dm-worker-xj45m:00480] [ 8] mpirun(+0x1245)[0x55e72654c245]
# [nnf-dm-worker-xj45m:00480] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7f8fa6a15d0a]
# [nnf-dm-worker-xj45m:00480] [10] mpirun(+0x116a)[0x55e72654c16a]
# [nnf-dm-worker-xj45m:00480] *** End of error message ***
#

It appears that the dcp operation succeeds, but we then get the following error message. This message appears somewhat frequently and, for the most part, does not result in a non-zero exit code (from what I have seen).

However, I wonder if this is what is causing the segfault here.

# --------------------------------------------------------------------------
# A system call failed during shared memory initialization that should
# not have.  It is likely that your MPI job will now either abort or
# experience performance degradation.
#
#   Local host:  nnf-dm-worker-xj45m
#   System call: unlink(2) /dev/shm/vader_segment.nnf-dm-worker-xj45m.8d600001.7
#   Error:       Operation not permitted (errno 1)
# --------------------------------------------------------------------------
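If the vader shared-memory cleanup really is the trigger, there are a couple of mitigations that might be worth trying. These are untested sketches: both MCA parameters are standard Open MPI 4.x knobs, but whether they avoid this particular segfault is an assumption, and SRC/DEST are placeholders.

# Untested: relocate the vader backing files out of /dev/shm
$ mpirun --allow-run-as-root --tag-output --hostfile /tmp/hostfile \
    --mca btl_vader_backing_directory /tmp \
    dcp --xattrs none --progress 1 SRC DEST

# Untested: disable the vader (shared-memory) BTL entirely, at some
# performance cost
$ mpirun --allow-run-as-root --tag-output --hostfile /tmp/hostfile \
    --mca btl ^vader \
    dcp --xattrs none --progress 1 SRC DEST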

Here is the backtrace:

(gdb) bt full
#0  0x00007f8fa6b4a87e in __strspn_sse42 (s=0x55e727c680d0 "help-opal-shmem-mmap.txt", a=<optimized out>)
    at ../sysdeps/x86_64/multiarch/strspn-c.c:141
        value = {<optimized out>, <optimized out>}
        index = <optimized out>
        cflag = 0
        aligned = 0xd0 <error: Cannot access memory at address 0xd0>
        mask = {<optimized out>, <optimized out>}
        offset = 0
#1  0x00007f8fa6ce0641 in ?? () from /usr/lib/x86_64-linux-gnu/libopen-rte.so.40
No symbol table info available.
#2  0x00007f8fa6ce0ac7 in orte_show_help_recv () from /usr/lib/x86_64-linux-gnu/libopen-rte.so.40
No symbol table info available.
#3  0x00007f8fa6d35ac1 in orte_rml_base_process_msg () from /usr/lib/x86_64-linux-gnu/libopen-rte.so.40
No symbol table info available.
#4  0x00007f8fa6be51ff in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7
No symbol table info available.
#5  0x00007f8fa6be5a9f in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7
No symbol table info available.
#6  0x000055e72654c75c in orterun (argc=13, argv=0x7ffdbf045b38) at orterun.c:178
        launchst = {status = 0, active = true, jdata = 0x0}
        completest = {status = 0, active = true, jdata = 0x0}
#7  0x000055e72654c245 in main (argc=13, argv=0x7ffdbf045b38) at main.c:13
No locals.
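(Aside: the "??" frames in libopen-rte and libevent_core could probably be resolved by installing the distro's debug symbols. The package names below are an assumption for a Debian-based image; Debian publishes auto-built *-dbgsym packages in the debian-debug archive.)

# Assumed Debian bullseye dbgsym packages; names may differ per release:
$ echo 'deb http://deb.debian.org/debian-debug bullseye-debug main' >> /etc/apt/sources.list
$ apt-get update && apt-get install -y openmpi-bin-dbgsym libopenmpi3-dbgsym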
bdevcich commented Sep 30, 2024

Upgrading to openmpi 4.1.6 suppresses the message:

# --------------------------------------------------------------------------
# A system call failed during shared memory initialization that should
# not have.  It is likely that your MPI job will now either abort or
# experience performance degradation.
#
#   Local host:  nnf-dm-worker-xj45m
#   System call: unlink(2) /dev/shm/vader_segment.nnf-dm-worker-xj45m.8d600001.7
#   Error:       Operation not permitted (errno 1)
# --------------------------------------------------------------------------

However, we now get this on every data movement, which is more frequent than the original warning:

    [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501

Here is line 501: https://github.com/open-mpi/ompi/blob/v4.1.x/orte/util/show_help.c#L501, which is part of orte_show_help_recv() in the original backtrace above.

So it doesn't appear that this fixes the root cause.

bdevcich commented

I think this is some flavor of open-mpi/ompi#9905.

However, we're getting an Operation not permitted (errno 1) rather than a No such file or directory (errno 2).
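For what it's worth, the EPERM makes sense if the segment files outlive the privilege drop. This is my reading of the Unix semantics, not something confirmed upstream: /dev/shm is a sticky-bit directory, so once dcp has switched to the non-root UID it can no longer unlink files that were created while the process was still root.

# /dev/shm is normally world-writable with the sticky bit set:
$ stat -c '%a %U' /dev/shm
1777 root

# With the sticky bit set on the directory, unlink(2) succeeds only for
# the owner of the file (or of the directory). The vader_segment files
# were created as root before the "Set User ID to 1064" calls logged
# above, so the post-setuid cleanup gets EPERM (errno 1) instead of
# ENOENT (errno 2).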

bdevcich commented Oct 2, 2024

What I've surmised is that this issue goes away when you remove the --uid and --gid command-line options from dcp.

With --uid and --gid:

$ mpirun --allow-run-as-root --tag-output --hostfile /tmp/hostfile dcp --xattrs none --progress 1 --uid 1060 --gid 100 /tmp/testfile /lus/global/Blake/orte_test
...
[1,0]<stdout>:[2024-10-02T16:21:31] Data: 953.674 MiB (1000000000 bytes)
[1,0]<stdout>:[2024-10-02T16:21:31] Rate: 100.416 MiB/s (1000000000 bytes in 9.497 seconds)
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501

If you remove --uid 1060 --gid 100, this error goes away:

root@nnf-dm-worker-jzl9t:/localdisk/dumps# mpirun --allow-run-as-root --tag-output --hostfile /tmp/hostfile dcp --progress 1 --xattrs none /tmp/testfile /lus/global/Blake/orte_test
...
[1,0]<stdout>:[2024-10-02T16:24:36] Data: 953.674 MiB (1000000000 bytes)
[1,0]<stdout>:[2024-10-02T16:24:36] Rate: 103.163 MiB/s (1000000000 bytes in 9.244 seconds)
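A quick way to confirm the cleanup failure after a --uid/--gid run would be to look for orphaned segment files on the worker. The glob below is inferred from the segment naming in the log above, not something I've run here:

# Any segments that could not be unlinked should still be sitting in
# /dev/shm on the worker, owned by root:
$ ls -l /dev/shm/vader_segment.*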

bdevcich commented Oct 2, 2024

I think this is a bug in dcp with how I implemented the --uid and --gid options.

I've tried to work around this using setpriv in multiple ways, but there are issues with both.

First, using setpriv mpirun dcp works and is attractive because it means we can drop the dreaded --allow-run-as-root flag from mpirun and run the whole thing as the normal user (a sketch of this invocation follows below). However, this does not work for lustre-to-lustre data movement: in that case, the mpirun command is initiated from the k8s-worker nodes and has to ssh to the rabbit nodes. We'd need ssh keys for every user, or we'd have to open up security so that the ssh connection can be established.
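For reference, here is a minimal sketch of that first approach, assuming util-linux setpriv. The exact flags and IDs are illustrative, not what nnf-dm runs today:

# Drop to the target user before mpirun starts, so nothing runs as root
# and --allow-run-as-root can go away. --reuid/--regid/--init-groups are
# standard util-linux setpriv options; 1060/100 mirror the earlier example.
$ setpriv --reuid 1060 --regid 100 --init-groups \
    mpirun --tag-output --hostfile /tmp/hostfile \
    dcp --xattrs none --progress 1 /tmp/testfile /lus/global/Blake/orte_test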

Second, using mpirun --allow-run-as-root setpriv dcp causes other issues in dcp:

[nnf-dm-worker-jzl9t:00111] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1849
[nnf-dm-worker-jzl9t:00118] OPAL ERROR: Unreachable in file pmix3x_client.c at line 111
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[nnf-dm-worker-jzl9t:00118] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

I assume this is because mpirun is running as root while dcp is then trying to do MPI things as a different user.
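That would line up with how the runtime rendezvous works: the root-owned orted creates its session directory and PMIx sockets as root, and a dcp that has switched UIDs can no longer connect to them. One way to check (the session-directory naming is Open MPI 4.x convention; inspecting it here is my suggestion, not something from the failing run):

# The PMIx/ORTE rendezvous sockets live under a per-node session
# directory; if it is root-owned, a de-privileged dcp cannot reach them:
$ ls -ld /tmp/ompi.*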

bdevcich commented Oct 2, 2024

I created an issue with dcp to discuss whether there are any potential fixes there. Right now, I can think of two solutions:

  1. Use setpriv mpirun dcp to run the whole thing as non-root and open up the security between the nnf-dm-controller pod running on the k8s-worker nodes and the rabbit to allow ssh to connect as the non-root user
  2. Figure out how to get dcp to only do the file operations as non-root but do all the "MPI stuff" as root

bdevcich added a commit to NearNodeFlash/nnf-mfu that referenced this issue Oct 3, 2024

Upgrade openmpi from 4.1.0 to 4.1.6 and include the UID/GID fixes from
dcp.

This resolves NearNodeFlash/NearNodeFlash.github.io#212.

Signed-off-by: Blake Devcich <[email protected]>