Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outermost mpi::Scope does not restore prior communicator #186

Open
fmahebert opened this issue Apr 17, 2024 · 8 comments
Open

Outermost mpi::Scope does not restore prior communicator #186

fmahebert opened this issue Apr 17, 2024 · 8 comments

Comments

@fmahebert
Copy link
Contributor

fmahebert commented Apr 17, 2024

What happened?

The outermost atlas mpi::Scope does not restore the prior default eckit MPI communicator when it goes out of scope.

See code snippet below.

I suspect the problem is this:

  • the Scope destructor calls CommStack::pop
  • the pop decrements size_ but then uses name(), which evaluates stack_[size_-1];
  • when destructing the outermost Scope, this is no longer a valid indexing

Perhaps one solution would be to push the current eckit default communicator when creating an outermost Scope object? Basically, if size_ == 0, then push the current default THEN push the new name?

What are the steps to reproduce the bug?

This code...

eckit::mpi::addComm("comm1", 123);  // as if from fortran
eckit::mpi::addComm("comm2", 124);  // as if from fortran
eckit::mpi::setCommDefault("comm1");
std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
{
  atlas::mpi::Scope scope("comm2");
  std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
}
std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;

prints out...

default eckit comm = comm1
default eckit comm = comm2
default eckit comm = world

Whereas I would expect the last output to be default eckit comm = comm1.

Version

0.36

Platform (OS and architecture)

Linux x86_64

Relevant log output

No response

Accompanying data

No response

Organisation

JCSDA

@fmahebert
Copy link
Contributor Author

Looping in also @MayeulDestouches who brought this to my attention.

@wdeconinck
Copy link
Member

Thanks for bringing this to my attention. I will try to reproduce, create a test and fix asap.

@wdeconinck
Copy link
Member

Following reproducer doesn't reproduce it for me... Could you double check?

#include <iostream>

#include "eckit/runtime/Main.h"
#include "eckit/mpi/Comm.h"

#include "atlas/parallel/mpi/mpi.h"

int main(int argc, char* argv[]) {
    eckit::Main::initialise(argc,argv);
    int irank = eckit::mpi::comm().rank();
    auto print_default_comm_name = [&] {
       if (irank == 0 ) {
           std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
       }
    };
    eckit::mpi::comm().split(irank, "comm1");
    eckit::mpi::comm().split(irank, "comm2");
    eckit::mpi::setCommDefault("comm1");

    print_default_comm_name();
    {
        atlas::mpi::Scope scope("comm2");
        print_default_comm_name();
    }
    print_default_comm_name();
    return 0;
}

Output:

default eckit comm = comm1
default eckit comm = comm2
default eckit comm = comm1

@fmahebert
Copy link
Contributor Author

@wdeconinck Thanks for taking a look. I confirm I see the same (= correct) output from your code snippet. Now to figure out the difference between this case and what JEDI is doing... I'll post an update when I know more.

@MayeulDestouches
Copy link

Thanks @fmahebert and @wdeconinck for investigating this. I've tried playing with Willem's code snippet, but I'm unable to reproduce the issue either. Apparently we've missed something on what triggers the error...

@fmahebert
Copy link
Contributor Author

This code snippet more closely mirrors what's being done in the particular JEDI scenario that @MayeulDestouches and I were testing:

#include <iostream>

#include "eckit/runtime/Main.h"
#include "eckit/mpi/Comm.h"

#include "atlas/grid/Grid.h"
#include "atlas/grid/Partitioner.h"
#include "atlas/functionspace/StructuredColumns.h"
#include "atlas/parallel/mpi/mpi.h"

int main(int argc, char* argv[]) {
    eckit::Main::initialise(argc,argv);

    int irank = eckit::mpi::comm().rank();
    auto print_default_comm_name = [&] {
       if (irank == 0 ) {
           std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
       }
    };

    eckit::mpi::comm().split(irank, "comm1");
    eckit::mpi::comm().split(irank, "comm2");

    auto make_structured_columns = [] {
        const atlas::Grid grid("F15");
        const atlas::grid::Partitioner partitioner("equal_regions");
        const atlas::functionspace::StructuredColumns sc(grid, partitioner);
    };

    eckit::mpi::setCommDefault("comm1");
    print_default_comm_name();
    make_structured_columns();
    print_default_comm_name();

    eckit::mpi::setCommDefault("comm2");
    print_default_comm_name();
    make_structured_columns();
    print_default_comm_name();

    return 0;
}

And the output for this is,

default eckit comm = comm1
default eckit comm = comm1
default eckit comm = comm2
default eckit comm = comm1

(where the last line was expected to be comm2...)

Something about constructing the StructuredColumns appears to be resetting the default MPI comm. We had guessed it was the mpi::Scope in the StructuredColumns constructor, but that may have been too hasty of a guess.

@wdeconinck Can you confirm whether you see this same output and/or whether that matches your expectation?

@fmahebert
Copy link
Contributor Author

I'm wondering if for our use-case we might be better off using the mpi_comm config option for constructing the atlas objects, and not using the setCommDefault. @wdeconinck do you have a recommendation?

@wdeconinck
Copy link
Member

Thank you for the new reproducer. I could simplify it a bit:

#include <iostream>

#include "eckit/runtime/Main.h"
#include "eckit/mpi/Comm.h"

#include "atlas/parallel/mpi/mpi.h"

int main(int argc, char* argv[]) {
    eckit::Main::initialise(argc,argv);
    int irank = eckit::mpi::comm().rank();
    auto print_default_comm_name = [&] {
       if (irank == 0 ) {
           std::cout << "default eckit comm = " << eckit::mpi::comm().name() << std::endl;
       }
    };
    eckit::mpi::comm().split(irank, "comm1");
    eckit::mpi::comm().split(irank, "comm2");

    auto make_structured_columns = [&] {
       atlas::mpi::Scope scope("comm2");
   };

    eckit::mpi::setCommDefault("comm1");
    print_default_comm_name();
    make_structured_columns();
    print_default_comm_name();

    eckit::mpi::setCommDefault("comm2");
    print_default_comm_name();
    make_structured_columns();
    print_default_comm_name();
 
    return 0;
}

This is definitely unintended behaviour which I want to fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants