data movement: mpirun segfault #212
Upgrading to openmpi 4.1.6 suppresses the message:
However, we now get this on every data movement, which is more frequent than the original warning:
Here is line 501: https://github.com/open-mpi/ompi/blob/v4.1.x/orte/util/show_help.c#L501, which is part of So it doesn't appear that this fixes anything on the root cause.
I think this is some flavor of open-mpi/ompi#9905. However, we're getting an
What I've surmised is that this issue goes away when you remove the With
If you remove
I think this is a bug in dcp with how I implemented the I've tried to work around this using First, using Second, using
I assume this is because of running
I created an issue with dcp to discuss whether there are any potential fixes there. Right now, I can think of two solutions:
Upgrade openmpi from 4.1.0 to 4.1.0 and include the UID/GID fixes from dcp. This resolves NearNodeFlash/NearNodeFlash.github.io#212. Signed-off-by: Blake Devcich <[email protected]>
Upgrade openmpi from 4.1.0 to 4.1.6 and include the UID/GID fixes from dcp. This resolves NearNodeFlash/NearNodeFlash.github.io#212. Signed-off-by: Blake Devcich <[email protected]>
A segfault sometimes occurs during the dm-system-test. At this time, I don't know at what rate this occurs.
This is from a nightly run of the dm-system-test: https://github.com/NearNodeFlash/nnf-integration-test/actions/runs/10987568344/job/30503475490
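One way to put a number on the rate would be to rerun the test in a loop and count the runs that die from SIGSEGV. The following is a minimal sketch, not part of the nnf test tooling: `classify` and `segfault_rate` are hypothetical helper names, the command passed in is a placeholder for however the dm-system-test is actually invoked, and it assumes a crash surfaces either as a negative return code (killed by a signal) or as the shell-style 128 + signal number.

```python
import signal
import subprocess
import sys


def classify(returncode):
    """Map a process return code to 'ok', 'segfault', or 'error'."""
    if returncode == 0:
        return "ok"
    # Assumption: subprocess reports death-by-signal as -signum, while a
    # launcher or shell may instead report it as 128 + signum.
    if returncode in (-signal.SIGSEGV, 128 + signal.SIGSEGV):
        return "segfault"
    return "error"


def segfault_rate(cmd, runs):
    """Run cmd `runs` times; return (fraction of segfaults, per-outcome counts)."""
    counts = {"ok": 0, "segfault": 0, "error": 0}
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        counts[classify(result.returncode)] += 1
    return counts["segfault"] / runs, counts


if __name__ == "__main__":
    # Placeholder command; substitute the real test invocation here.
    rate, counts = segfault_rate([sys.executable, "-c", "pass"], runs=5)
    print(f"segfault rate: {rate:.2%}, outcomes: {counts}")
```

Keeping the "error" bucket separate from "segfault" matters here, since the error message described in this issue often appears without a non-zero exit code and should not be conflated with the crash itself.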
It appears that the dcp operation is successful, but we then get the following error message. This message appears somewhat frequently and, for the most part, does not result in a non-zero error code (from what I have seen). However, I wonder if this is what is causing the segfault here.
Here is the backtrace: