Check MPI error codes #2385
Related to #2058?
cpp/dolfinx/common/MPI.cpp
Outdated
return size;
}
//-----------------------------------------------------------------------------
void dolfinx::MPI::assert_and_throw(MPI_Comm comm, int error_code)
I don't think the function name is quite right. 'assert' might imply that the check is skipped in non-debug mode, and the function doesn't 'throw' an exception; it just aborts.
cpp/dolfinx/common/MPI.h
Outdated
@@ -74,6 +74,13 @@ int rank(MPI_Comm comm);
/// communicator
int size(MPI_Comm comm);

/// @brief Checks whether an error code returned by an MPI
/// function is equal to MPI_SUCCESS. If the check fails then
/// throw a runtime error.
'runtime error' is a bit vague. Does it just 'abort'?
Yes, I need to review documentation and function name.
I was throwing a runtime error before, but it can cause a deadlock when MPI_Abort fails.
This can happen when the communicator is NULL (reusing a communicator that has been freed) followed by a barrier with MPI_COMM_WORLD.
So I think a sensible solution is to forcibly abort the execution while still printing the error message.
In the following code, MPI_Allreduce fails silently (but an error code is actually returned).
This PR adds a function to check whether an error code returned by an MPI function is equal to MPI_SUCCESS. If the check fails then it prints a useful error message and aborts.
Fixes #2058.