
Stability of Cray MPI plugin #109

Open
mattaezell opened this issue Nov 1, 2023 · 11 comments

@mattaezell

The readme notes: "The plugins and scripts in flux-coral2 are being actively developed and are not yet stable." Is the Cray MPI part more stable? Supporting Cray MPI is of broad interest (outside of CORAL-2), so I'm curious whether that code makes sense to "graduate" into flux-core?

@garlick
Member

garlick commented Nov 1, 2023

Hi Matt -

In general I think we're trying to avoid saddling flux-core with the weirdness that comes along for each advanced technology system, based on lessons learned with the Slurm code base over the years.

Note that there is still some pending work on support for Cray MPICH in the shasta stack:

* [libpals: improve port-distribution mechanism #28](https://github.com/flux-framework/flux-coral2/issues/28)
* [MPI: Integrate with HPE's CXI library for allocating VNIs #24](https://github.com/flux-framework/flux-coral2/issues/24)

And also note that we don't yet have this stack running in production, although we certainly have early adopters porting codes and running small jobs and such.

I believe Cray MPICH can also bootstrap with the "normal" libpmi2.so support offered by flux-core. In the shasta stack, it's not the way Cray wanted to go. Instead, Cray provides their own PMI implementation, which we have to bootstrap instead of directly bootstrapping Cray MPICH. (I apologize if this is old news to you - I know Oak Ridge has a long history with Cray!)

I think the CORAL-2 team is pretty focused on getting the rabbit support working right now for El Cap, so those MPI issues are on the back burner. The El Cap rollout demands are another reason why flux-coral2 is best kept in its own repo - it may need to change quickly and we don't want to have to push through a flux-core tag for every little thing that comes up on that schedule.

@mattaezell
Author

> In general I think we're trying to avoid saddling flux-core with the weirdness that comes along for each advanced technology system, based on lessons learned with the Slurm code base over the years.

Understood. I consider Cray MPI support a little more generic than ATS, but I get the point.

> Note that there is still some pending work on support for Cray MPICH in the shasta stack:
>
> * [libpals: improve port-distribution mechanism #28](https://github.com/flux-framework/flux-coral2/issues/28)

I don't think this is an issue with flux-under-slurm since 2 different jobs can overlap their ports

> * [MPI: Integrate with HPE's CXI library for allocating VNIs #24](https://github.com/flux-framework/flux-coral2/issues/24)

VNIs will be a problem. With flux-under-slurm all the flux jobs (steps here) in a Slurm job will share the Slurm-provided VNI. This can be problematic for concurrent steps, as there will be a conflict with the PID_BASE. Since we don't have a "global" flux running, we wouldn't have an arbiter to pass out VNIs even if we had a privileged way of doing it.

> And also note that we don't yet have this stack running in production, although we certainly have early adopters porting codes and running small jobs and such.

This would just be experimental to support some workloads that want to run more concurrent steps than slurmctld can sensibly handle.

> I believe Cray MPICH can also bootstrap with the "normal" libpmi2.so support offered by flux-core. In the shasta stack, it's not the way Cray wanted to go. Instead, Cray provides their own PMI implementation, which we have to bootstrap instead of directly bootstrapping Cray MPICH. (I apologize if this is old news to you - I know Oak Ridge has a long history with Cray!)

Ah. I tried to flux run a binary using Cray mpich and it just hung. I'll play around with options to see if I can get it to work, as that would be preferred to pulling in this plugin if I don't need it.

> I think the CORAL-2 team is pretty focused on getting the rabbit support working right now for El Cap, so those MPI issues are on the back burner. The El Cap rollout demands are another reason why flux-coral2 is best kept in its own repo - it may need to change quickly and we don't want to have to push through a flux-core tag for every little thing that comes up on that schedule.

Understood. Thanks for the info so far.

@garlick
Member

garlick commented Nov 1, 2023

> VNIs will be a problem. With flux-under-slurm all the flux jobs (steps here) in a Slurm job will share the Slurm-provided VNI. This can be problematic for concurrent steps, as there will be a conflict with the PID_BASE. Since we don't have a "global" flux running, we wouldn't have an arbiter to pass out VNIs even if we had a privileged way of doing it.

Hmm, is it possible to disable VNIs for flux jobs (taking the place of slurm job steps) to get around this temporarily? Or is this a complete show stopper right now?

> Ah. I tried to flux run a binary using Cray mpich and it just hung. I'll play around with options to see if I can get it to work, as that would be preferred to pulling in this plugin if I don't need it.

I can do a little testing on our end to see how far I get with that. You might need to specify -opmi=simple,libpmi2 on the flux run command line. -overbose=2 on a 2-task job is sometimes useful for getting a PMI trace. Finally, one potential source of problems could be environment variables set by Slurm "leaking through" to the MPI programs started by flux and confusing them.
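
(As an aside, not from the original discussion: a minimal sketch of checking for and scrubbing leaked Slurm variables, assuming a flux-core recent enough to have flux run's --env-remove option; ./mpihi stands in for whatever Cray MPICH binary is being tested.)

```
# See which SLURM_* variables leak into the job environment under flux-under-slurm
flux run -n1 printenv | grep '^SLURM_'

# Scrub them for the MPI run so the program only sees flux's environment
# (./mpihi is a placeholder binary; --env-remove takes a glob pattern)
flux run -N2 -n4 --env-remove='SLURM_*' ./mpihi
```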

@garlick
Member

garlick commented Nov 1, 2023

Oops, I'm confusing my PMI client and server options. The above -opmi=simple,libpmi2 option is not going to help.

The goal should be to get the MPI program to find flux's libpmi2.so before Cray's, so you might have to set LD_LIBRARY_PATH to point to flux's libdir. For example, on my test system:

```
LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/flux flux run -n2 flux pmi -v --method=libpmi2 barrier
libpmi2: using /usr/lib/aarch64-linux-gnu/flux/libpmi2.so
libpmi2: initialize: rank=0 size=2 name=ƒciddyLYSZu: success
libpmi2: using /usr/lib/aarch64-linux-gnu/flux/libpmi2.so
libpmi2: initialize: rank=1 size=2 name=ƒciddyLYSZu: success
libpmi2: barrier: success
libpmi2: barrier: success
libpmi2: barrier: success
libpmi2: barrier: success
ƒciddyLYSZu: completed pmi barrier on 2 tasks in 0.000s.
libpmi2: finalize: success
libpmi2: finalize: success
```

flux pmi is a test client that is sometimes a useful stand-in for MPI when trying to verify simple things.
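
(Another aside, not from the thread: a hedged way to confirm which libpmi2.so the dynamic linker actually resolves, using standard ldd / LD_DEBUG tooling; the flux libdir path and ./mpihi binary are placeholders.)

```
# Static view: which libpmi2.so would the linker resolve for the binary?
# (Cray MPICH may dlopen PMI at run time, in which case ldd can come up empty)
LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/flux ldd ./mpihi | grep -i pmi

# Runtime view: trace shared library resolution while a task runs under flux
LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/flux LD_DEBUG=libs \
    flux run -n1 ./mpihi 2>&1 | grep libpmi2
```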

@mattaezell
Author

> Hmm, is it possible to disable VNIs for flux jobs (taking the place of slurm job steps) to get around this temporarily? Or is this a complete show stopper right now?

I think it's an edge case that's only a problem for multi-node jobs sharing the same node. Most use cases are either small (sub-node, so many jobs on a node) or large (one job spans multiple nodes and fills each of them up).

> The goal should be to get the MPI program to find flux's libpmi2.so before Cray's, so you might have to set LD_LIBRARY_PATH to point to flux's libdir.

I reinstalled without the coral2 plugins, and that seemed to work:

```
ezy@borg041:~> export LD_LIBRARY_PATH=~/flux/install/lib/flux:$LD_LIBRARY_PATH
ezy@borg041:~> ~/flux/install/bin/flux run -N2 -n4 ~/mpi_hello/mpihi
Hello World from 0 of 4
Hello World from 1 of 4
Hello World from 3 of 4
Hello World from 2 of 4
ezy@borg041:~>
```

I don't know if Cray's PMI does anything special that flux's doesn't, but I'll head down this path. Thanks!

@garlick
Member

garlick commented Nov 1, 2023

> I think it's an edge case that's only a problem for multi-node jobs sharing the same node. Most use cases are either small (sub-node, so many jobs on a node) or large (one job spans multiple nodes and fills each of them up).

Makes sense - thanks.

I also did basically the same experiment you just did with a hello world program on one of our precursor systems and it worked ok.

Let us know if you run into more problems.

@trws
Member

trws commented May 17, 2024

Is this still an issue? I just realized we have a known potential issue: not assigning VNIs could cause problems specifically when enough ranks of a job share a node.

@garlick
Member

garlick commented May 17, 2024

You mean is #24 (VNI support) still an issue? Still on the back burner AFAIK.

@trws
Member

trws commented May 17, 2024

I was actually wondering whether we still see issues with multi-rank/multi-node jobs, because the VNI issue could cause that by exhausting resources on the NIC.

@garlick
Member

garlick commented May 17, 2024

I am not aware of that ever being a problem, or I have forgotten. It'd be good to open a separate, new issue if it is a problem or is likely to become one.

@trws
Member

trws commented May 17, 2024

We talked about it when I came back from ADAC: the NICs use the VNI to separate resources, so if too many things run without setting one on the NIC, it can lead to resource exhaustion. The MPI subdivides the range it's given, so normally that's OK, but if we're not bootstrapping the MPI normally, or if a user runs more than one multi-node job across the same set of nodes (or a single shared node, I suppose), then it could be an issue. I would have sworn we made an issue for it at the time, but my internet is failing right now and I'm just hoping I can post this. 😬 I will either find it or make a new one.
