
data movement: mpirun segfault #212

Closed

bdevcich opened this issue Sep 23, 2024 · 6 comments · Fixed by NearNodeFlash/nnf-mfu#23
A segfault sometimes occurs during the dm-system-test. I don't yet know how often it happens.

This is from a nightly run of the dm-system-test: https://github.com/NearNodeFlash/nnf-integration-test/actions/runs/10987568344/job/30503475490

not ok 6 copy-in-copy-out: file5-gfs2 in 28106ms
# (from function `test_copy_in_copy_out' in test file copy-in-copy-out.bats, line 115)
#   `#DW copy_out source=$src destination=$dest profile=no-xattr" \' failed
# 17.898s: job.exception type=exception severity=0 DWS/Rabbit interactions failed: workflow hit an error: DW Directive 1: Internal error: data movement operation failed during 'DataIn', message: signal: segmentation fault (core dumped): [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Set Group ID to 100
# [2024-09-23T04:55:33] Set User ID to 1064
# [2024-09-23T04:55:33] Walking /lus/global/actions-runner/dm-system-test/008d303a/src
# [2024-09-23T04:55:33] Walked 9 items in 0.002 secs (3836.136 items/sec) ...
# [2024-09-23T04:55:33] Walked 9 items in 0.002 seconds (3668.561 items/sec)
# [2024-09-23T04:55:33] Copying to /mnt/nnf/9898dbc3-7626-4196-b8dc-4f77f1be8d9e-0/rabbit-node-2-0
# [2024-09-23T04:55:33] Items: 9
# [2024-09-23T04:55:33]   Directories: 4
# [2024-09-23T04:55:33]   Files: 5
# [2024-09-23T04:55:33]   Links: 0
# [2024-09-23T04:55:33] Data: 60.000 B (12.000 B per file)
# [2024-09-23T04:55:33] Creating 4 directories
# [2024-09-23T04:55:33] Original directory exists, skip the creation: `/mnt/nnf/9898dbc3-7626-4196-b8dc-4f77f1be8d9e-0/rabbit-node-2-0' (errno=17 File exists)
# [2024-09-23T04:55:33] Creating 5 files.
# [2024-09-23T04:55:33] Copying data.
# [2024-09-23T04:55:33] Copy data: 60.000 B (60 bytes)
# [2024-09-23T04:55:33] Copy rate: 20.875 KiB/s (60 bytes in 0.003 seconds)
# [2024-09-23T04:55:33] Syncing data to disk.
# [2024-09-23T04:55:33] Sync completed in 0.055 seconds.
# [2024-09-23T04:55:33] Fixing permissions.
# [2024-09-23T04:55:33] Updated 9 items in 0.000 seconds (34591.038 items/sec)
# [2024-09-23T04:55:33] Syncing directory updates to disk.
# [2024-09-23T04:55:33] Sync completed in 0.001 seconds.
# [2024-09-23T04:55:33] Started: Sep-23-2024,04:55:33
# [2024-09-23T04:55:33] Completed: Sep-23-2024,04:55:33
# [2024-09-23T04:55:33] Seconds: 0.063
# [2024-09-23T04:55:33] Items: 9
# [2024-09-23T04:55:33]   Directories: 4
# [2024-09-23T04:55:33]   Files: 5
# [2024-09-23T04:55:33]   Links: 0
# [2024-09-23T04:55:33] Data: 60.000 B (60 bytes)
# [2024-09-23T04:55:33] Rate: 947.832 B/s (060 bytes in 0.063 seconds)
# --------------------------------------------------------------------------
# A system call failed during shared memory initialization that should
# not have.  It is likely that your MPI job will now either abort or
# experience performance degradation.
#
#   Local host:  nnf-dm-worker-xj45m
#   System call: unlink(2) /dev/shm/vader_segment.nnf-dm-worker-xj45m.8d600001.7
#   Error:       Operation not permitted (errno 1)
# --------------------------------------------------------------------------
# [nnf-dm-worker-xj45m:00480] *** Process received signal ***
# [nnf-dm-worker-xj45m:00480] Signal: Segmentation fault (11)
# [nnf-dm-worker-xj45m:00480] Signal code: Address not mapped (1)
# [nnf-dm-worker-xj45m:00480] Failing at address: (nil)
# [nnf-dm-worker-xj45m:00480] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x38d60)[0x7f8fa6a2ad60]
# [nnf-dm-worker-xj45m:00480] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x15887e)[0x7f8fa6b4a87e]
# [nnf-dm-worker-xj45m:00480] [ 2] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(+0x2c641)[0x7f8fa6ce0641]
# [nnf-dm-worker-xj45m:00480] [ 3] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_show_help_recv+0x167)[0x7f8fa6ce0ac7]
# [nnf-dm-worker-xj45m:00480] [ 4] /usr/lib/x86_64-linux-gnu/libopen-rte.so.40(orte_rml_base_process_msg+0x3e1)[0x7f8fa6d35ac1]
# [nnf-dm-worker-xj45m:00480] [ 5] /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7(+0x1f1ff)[0x7f8fa6be51ff]
# [nnf-dm-worker-xj45m:00480] [ 6] /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7(event_base_loop+0x52f)[0x7f8fa6be5a9f]
# [nnf-dm-worker-xj45m:00480] [ 7] mpirun(+0x175c)[0x55e72654c75c]
# [nnf-dm-worker-xj45m:00480] [ 8] mpirun(+0x1245)[0x55e72654c245]
# [nnf-dm-worker-xj45m:00480] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7f8fa6a15d0a]
# [nnf-dm-worker-xj45m:00480] [10] mpirun(+0x116a)[0x55e72654c16a]
# [nnf-dm-worker-xj45m:00480] *** End of error message ***
#

It appears that the dcp operation succeeds, but we then get the following error message. This message appears somewhat frequently and, for the most part, does not result in a non-zero exit code (from what I have seen).

However, I wonder if this is what is causing the segfault here.

# --------------------------------------------------------------------------
# A system call failed during shared memory initialization that should
# not have.  It is likely that your MPI job will now either abort or
# experience performance degradation.
#
#   Local host:  nnf-dm-worker-xj45m
#   System call: unlink(2) /dev/shm/vader_segment.nnf-dm-worker-xj45m.8d600001.7
#   Error:       Operation not permitted (errno 1)
# --------------------------------------------------------------------------
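If the vader shared-memory cleanup really is the trigger, there are a couple of mitigations that might be worth trying. These are untested sketches: both MCA parameters are standard Open MPI 4.x knobs, but whether they avoid this particular segfault is an assumption, and SRC/DEST are placeholders.

# Untested: relocate the vader backing files out of /dev/shm
$ mpirun --allow-run-as-root --tag-output --hostfile /tmp/hostfile \
    --mca btl_vader_backing_directory /tmp \
    dcp --xattrs none --progress 1 SRC DEST

# Untested: disable the vader (shared-memory) BTL entirely, at some
# performance cost
$ mpirun --allow-run-as-root --tag-output --hostfile /tmp/hostfile \
    --mca btl ^vader \
    dcp --xattrs none --progress 1 SRC DEST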

Here is the backtrace:

(gdb) bt full
#0  0x00007f8fa6b4a87e in __strspn_sse42 (s=0x55e727c680d0 "help-opal-shmem-mmap.txt", a=<optimized out>)
    at ../sysdeps/x86_64/multiarch/strspn-c.c:141
        value = {<optimized out>, <optimized out>}
        index = <optimized out>
        cflag = 0
        aligned = 0xd0 <error: Cannot access memory at address 0xd0>
        mask = {<optimized out>, <optimized out>}
        offset = 0
#1  0x00007f8fa6ce0641 in ?? () from /usr/lib/x86_64-linux-gnu/libopen-rte.so.40
No symbol table info available.
#2  0x00007f8fa6ce0ac7 in orte_show_help_recv () from /usr/lib/x86_64-linux-gnu/libopen-rte.so.40
No symbol table info available.
#3  0x00007f8fa6d35ac1 in orte_rml_base_process_msg () from /usr/lib/x86_64-linux-gnu/libopen-rte.so.40
No symbol table info available.
#4  0x00007f8fa6be51ff in ?? () from /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7
No symbol table info available.
#5  0x00007f8fa6be5a9f in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent_core-2.1.so.7
No symbol table info available.
#6  0x000055e72654c75c in orterun (argc=13, argv=0x7ffdbf045b38) at orterun.c:178
        launchst = {status = 0, active = true, jdata = 0x0}
        completest = {status = 0, active = true, jdata = 0x0}
#7  0x000055e72654c245 in main (argc=13, argv=0x7ffdbf045b38) at main.c:13
No locals.
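(Aside: the "??" frames in libopen-rte and libevent_core could probably be resolved by installing the distro's debug symbols. The package names below are an assumption for a Debian-based image; Debian publishes auto-built *-dbgsym packages in the debian-debug archive.)

# Assumed Debian bullseye dbgsym packages; names may differ per release:
$ echo 'deb http://deb.debian.org/debian-debug bullseye-debug main' >> /etc/apt/sources.list
$ apt-get update && apt-get install -y openmpi-bin-dbgsym libopenmpi3-dbgsym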
bdevcich commented Sep 30, 2024

Upgrading to openmpi 4.1.6 suppresses the message:

# --------------------------------------------------------------------------
# A system call failed during shared memory initialization that should
# not have.  It is likely that your MPI job will now either abort or
# experience performance degradation.
#
#   Local host:  nnf-dm-worker-xj45m
#   System call: unlink(2) /dev/shm/vader_segment.nnf-dm-worker-xj45m.8d600001.7
#   Error:       Operation not permitted (errno 1)
# --------------------------------------------------------------------------

However, we now get this on every data movement, which is more frequent than the original warning:

    [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
    [nnf-dm-controller-manager-67cbcc74c5-s8bg6:00394] [[16880,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501

Here is line 501: https://github.com/open-mpi/ompi/blob/v4.1.x/orte/util/show_help.c#L501, which is part of orte_show_help_recv() in the original backtrace above.

So it doesn't appear that this fixes the root cause.

bdevcich commented

I think this is some flavor of open-mpi/ompi#9905.

However, we're getting an Operation not permitted (errno 1) rather than a No such file or directory (errno 2).
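For what it's worth, the EPERM makes sense if the segment files outlive the privilege drop. This is my reading of the Unix semantics, not something confirmed upstream: /dev/shm is a sticky-bit directory, so once dcp has switched to the non-root UID it can no longer unlink files that were created while the process was still root.

# /dev/shm is normally world-writable with the sticky bit set:
$ stat -c '%a %U' /dev/shm
1777 root

# With the sticky bit set on the directory, unlink(2) succeeds only for
# the owner of the file (or of the directory). The vader_segment files
# were created as root before the "Set User ID to 1064" calls logged
# above, so the post-setuid cleanup gets EPERM (errno 1) instead of
# ENOENT (errno 2).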

bdevcich commented Oct 2, 2024

What I've surmised is that this issue goes away when you remove the --uid and --gid command-line options from dcp.

With --uid and --gid:

$ mpirun --allow-run-as-root --tag-output --hostfile /tmp/hostfile dcp --xattrs none --progress 1 --uid 1060 --gid 100 /tmp/testfile /lus/global/Blake/orte_test
...
[1,0]<stdout>:[2024-10-02T16:21:31] Data: 953.674 MiB (1000000000 bytes)
[1,0]<stdout>:[2024-10-02T16:21:31] Rate: 100.416 MiB/s (1000000000 bytes in 9.497 seconds)
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
[nnf-dm-worker-jzl9t:00329] [[22351,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501

If you remove --uid 1060 --gid 100, this error goes away:

root@nnf-dm-worker-jzl9t:/localdisk/dumps# mpirun --allow-run-as-root --tag-output --hostfile /tmp/hostfile dcp --progress 1 --xattrs none /tmp/testfile /lus/global/Blake/orte_test
...
[1,0]<stdout>:[2024-10-02T16:24:36] Data: 953.674 MiB (1000000000 bytes)
[1,0]<stdout>:[2024-10-02T16:24:36] Rate: 103.163 MiB/s (1000000000 bytes in 9.244 seconds)
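A quick way to confirm the cleanup failure after a --uid/--gid run would be to look for orphaned segment files on the worker. The glob below is inferred from the segment naming in the log above, not something I've run here:

# Any segments that could not be unlinked should still be sitting in
# /dev/shm on the worker, owned by root:
$ ls -l /dev/shm/vader_segment.*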

bdevcich commented Oct 2, 2024

I think this is a bug in dcp with how I implemented the --uid and --gid options.

I've tried to work around this using setpriv in multiple ways, but there are issues with both.

First, using setpriv mpirun dcp works and is attractive because it means we can drop the dreaded --allow-run-as-root flag from mpirun and run the whole thing as the normal user (a sketch of this invocation follows below). However, this does not work for lustre-to-lustre data movement: in that case, the mpirun command is initiated from the k8s-worker nodes and has to ssh to the rabbit nodes. We'd need ssh keys for every user, or we'd have to open up security so that the ssh connection can be established.
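For reference, here is a minimal sketch of that first approach, assuming util-linux setpriv. The exact flags and IDs are illustrative, not what nnf-dm runs today:

# Drop to the target user before mpirun starts, so nothing runs as root
# and --allow-run-as-root can go away. --reuid/--regid/--init-groups are
# standard util-linux setpriv options; 1060/100 mirror the earlier example.
$ setpriv --reuid 1060 --regid 100 --init-groups \
    mpirun --tag-output --hostfile /tmp/hostfile \
    dcp --xattrs none --progress 1 /tmp/testfile /lus/global/Blake/orte_test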

Second, using mpirun --allow-run-as-root setpriv dcp causes other issues in dcp:

[nnf-dm-worker-jzl9t:00111] PMIX ERROR: UNREACHABLE in file ptl_tcp_component.c at line 1849
[nnf-dm-worker-jzl9t:00118] OPAL ERROR: Unreachable in file pmix3x_client.c at line 111
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[nnf-dm-worker-jzl9t:00118] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

I assume this is because mpirun is running as root while dcp is then trying to do MPI things as a different user.
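That would line up with how the runtime rendezvous works: the root-owned orted creates its session directory and PMIx sockets as root, and a dcp that has switched UIDs can no longer connect to them. One way to check (the session-directory naming is Open MPI 4.x convention; inspecting it here is my suggestion, not something from the failing run):

# The PMIx/ORTE rendezvous sockets live under a per-node session
# directory; if it is root-owned, a de-privileged dcp cannot reach them:
$ ls -ld /tmp/ompi.*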

bdevcich commented Oct 2, 2024

I created an issue with dcp to discuss whether there are any potential fixes there. Right now, I can think of two solutions:

  1. Use setpriv mpirun dcp to run the whole thing as non-root and open up the security between the nnf-dm-controller pod running on the k8s-worker nodes and the rabbit to allow ssh to connect as the non-root user
  2. Figure out how to get dcp to only do the file operations as non-root but do all the "MPI stuff" as root

bdevcich added a commit to NearNodeFlash/nnf-mfu that referenced this issue Oct 3, 2024

Upgrade openmpi from 4.1.0 to 4.1.6 and include the UID/GID fixes from
dcp.

This resolves NearNodeFlash/NearNodeFlash.github.io#212.

Signed-off-by: Blake Devcich <[email protected]>