Outermost mpi::Scope does not restore prior communicator #186
Comments
Also looping in @MayeulDestouches, who brought this to my attention.
Thanks for bringing this to my attention. I will try to reproduce it, create a test, and fix it ASAP.
The following reproducer doesn't reproduce it for me... Could you double check?

    #include <iostream>
    #include "eckit/runtime/Main.h"
    #include "eckit/mpi/Comm.h"
    #include "atlas/parallel/mpi/mpi.h"

    int main(int argc, char* argv[]) {
        eckit::Main::initialise(argc, argv);
        int irank = eckit::mpi::comm().rank();

        // Print the name of the current default communicator (rank 0 only)
        auto print_default_comm_name = [&] {
            if (irank == 0) {
                std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
            }
        };

        // Create two named communicators by splitting the default one
        eckit::mpi::comm().split(irank, "comm1");
        eckit::mpi::comm().split(irank, "comm2");

        eckit::mpi::setCommDefault("comm1");
        print_default_comm_name();
        {
            // Temporarily switch the default to comm2 for this block
            atlas::mpi::Scope scope("comm2");
            print_default_comm_name();
        }
        // The default is expected to be comm1 again here
        print_default_comm_name();
        return 0;
    }

Output:
@wdeconinck Thanks for taking a look. I confirm I see the same (= correct) output from your code snippet. Now to figure out the difference between this case and what JEDI is doing... I'll post an update when I know more.
Thanks @fmahebert and @wdeconinck for investigating this. I've tried playing with Willem's code snippet, but I'm unable to reproduce the issue either. Apparently we've missed something about what triggers the error...
This code snippet more closely mirrors what's being done in the particular JEDI scenario that @MayeulDestouches and I were testing:
And the output for this is,
(where the last line was expected to be comm2...)
Something about constructing the StructuredColumns appears to be resetting the default MPI comm. We had guessed it was the
@wdeconinck Can you confirm whether you see this same output and/or whether that matches your expectation?
I'm wondering if for our use-case we might be better off using the
Thank you for the new reproducer. I could simplify it a bit:

    #include <iostream>
    #include "eckit/runtime/Main.h"
    #include "eckit/mpi/Comm.h"
    #include "atlas/parallel/mpi/mpi.h"

    int main(int argc, char* argv[]) {
        eckit::Main::initialise(argc, argv);
        int irank = eckit::mpi::comm().rank();

        // Print the name of the current default communicator (rank 0 only)
        auto print_default_comm_name = [&] {
            if (irank == 0) {
                std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
            }
        };

        // Create two named communicators by splitting the default one
        eckit::mpi::comm().split(irank, "comm1");
        eckit::mpi::comm().split(irank, "comm2");

        // Reduced stand-in: only opens and closes an outermost Scope on comm2
        auto make_structured_columns = [&] {
            atlas::mpi::Scope scope("comm2");
        };

        // Case 1: the default is comm1 before the Scope
        eckit::mpi::setCommDefault("comm1");
        print_default_comm_name();
        make_structured_columns();
        print_default_comm_name();

        // Case 2: the default is comm2 before the Scope
        eckit::mpi::setCommDefault("comm2");
        print_default_comm_name();
        make_structured_columns();
        print_default_comm_name();
        return 0;
    }

This is definitely unintended behaviour which I want to fix.
What happened?
The outermost atlas mpi::Scope does not restore the prior default eckit MPI communicator when it goes out of scope. See code snippet below.
I suspect the problem is this:
- Scope destructor calls CommStack::pop
- pop decrements size_ but then uses name(), which evaluates stack_[size_-1];
- for the outermost Scope, this is no longer a valid index
Perhaps one solution would be to push the current eckit default communicator when creating an outermost Scope object? Basically, if size_ == 0, then push the current default THEN push the new name?
What are the steps to reproduce the bug?
This code...
prints out...
Whereas I would expect the last output to be default eckit comm = comm1.
Version
0.36
Platform (OS and architecture)
Linux x86_64
Relevant log output
No response
Accompanying data
No response
Organisation
JCSDA
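For illustration only, the idea proposed under "What happened?" (push the current default before pushing the new name when the stack is empty) can be sketched in a self-contained form that does not depend on eckit or atlas. Everything in the `sketch` namespace is hypothetical: `default_comm()` stands in for the eckit default communicator name, and `CommStack` and `Scope` are simplified models rather than the actual atlas classes, whose internals may differ. This is a rough sketch of the suggestion, not the fix adopted by the maintainers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical stand-in for the name of eckit's default communicator.
inline std::string& default_comm() {
    static std::string name = "world";
    return name;
}

// Hypothetical stack of communicator names, loosely modelled on the
// CommStack described in the issue (not the actual atlas class).
class CommStack {
public:
    static CommStack& instance() {
        static CommStack stack;
        return stack;
    }

    void push(const std::string& name) {
        // Proposed idea: if the stack is empty, save the current default
        // first, so the outermost pop has something valid to restore.
        if (stack_.empty()) {
            stack_.push_back(default_comm());
        }
        stack_.push_back(name);
        default_comm() = name;
    }

    void pop() {
        // Invariant: the saved default plus at least one scoped name.
        assert(stack_.size() >= 2);
        stack_.pop_back();
        default_comm() = stack_.back();  // always a valid element
        // Once only the saved default remains, the outermost scope has ended.
        if (stack_.size() == 1) {
            stack_.clear();
        }
    }

    std::size_t size() const { return stack_.size(); }

private:
    std::vector<std::string> stack_;
};

// RAII guard in the spirit of atlas::mpi::Scope (simplified model).
class Scope {
public:
    explicit Scope(const std::string& name) { CommStack::instance().push(name); }
    ~Scope() { CommStack::instance().pop(); }
    Scope(const Scope&) = delete;
    Scope& operator=(const Scope&) = delete;
};

}  // namespace sketch
```

With this model, setting `sketch::default_comm()` to `comm1`, opening a `sketch::Scope` on `comm2`, and letting it go out of scope leaves the default at `comm1` again, which is the behaviour the issue expects. Nested scopes restore the enclosing scope's name in the same way.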