Stability of Cray MPI plugin #109
Hi Matt - In general I think we're trying to avoid saddling flux-core with the weirdness that comes along for each advanced technology system, based on lessons learned with the Slurm code base over the years. Note that there is still some pending work on support for Cray MPICH in the Shasta stack:
And also note that we don't yet have this stack running in production, although we certainly have early adopters porting codes and running small jobs and such. I believe Cray MPICH can also bootstrap with the "normal"
I think the CORAL-2 team is pretty focused on getting the rabbit support working right now for El Cap, so those MPI issues are on the back burner. The El Cap rollout demands are another reason why flux-coral2 is best kept in its own repo - it may need to change quickly, and we don't want to have to push through a flux-core tag for every little thing that comes up on that schedule.
Understood. I consider Cray MPI support a little more generic than ATS, but I get the point.
I don't think this is an issue with flux-under-slurm, since two different jobs can overlap their ports.
VNIs will be a problem. With flux-under-slurm all the flux jobs (steps here) in a Slurm job will share the Slurm-provided VNI. This can be problematic for concurrent steps, as there will be a conflict with the PID_BASE. Since we don't have a "global" flux running, we wouldn't have an arbiter to pass out VNIs even if we had a privileged way of doing it.
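(For anyone reproducing this: a quick way to see what VNI settings Slurm hands a step is to dump the step's environment. The `SLINGSHOT_*` variable names below come from Slurm's `hpe_slingshot` switch plugin and are an assumption on my part - verify against your site's configuration.)

```shell
# Sketch: inspect the Slingshot/VNI settings Slurm provides to a step.
# The SLINGSHOT_* names are from Slurm's hpe_slingshot switch plugin
# (an assumption here -- check your site's Slurm configuration).
srun -N1 -n1 env | grep -E '^SLINGSHOT_(VNIS|DEVICES|SVC_IDS)=' || \
  echo "no Slingshot VNI variables set for this step"
```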
This would just be experimental to support some workloads that want to run more concurrent steps than slurmctld can sensibly handle.
Ah. I tried to
Understood. Thanks for the info so far.
Hmm, is it possible to disable VNIs for flux jobs (taking the place of slurm job steps) to get around this temporarily? Or is this a complete show stopper right now?
I can do a little testing on our end to see how far I get here with that. You might need to specify
Oops, I'm confusing my PMI client and server options. The above
The goal should be to get the MPI program to find flux's
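In case it helps others landing here: the usual trick is to put Flux's PMI compatibility library ahead of Cray's on the dynamic linker path, so the MPI program bootstraps through Flux. A minimal sketch, assuming Flux installs its compat `libpmi2.so` under `<prefix>/lib/flux` (the `/opt/flux` prefix below is hypothetical - substitute your own install prefix):

```shell
# Sketch: let Cray MPICH find Flux's PMI compatibility library first.
# /opt/flux is a hypothetical install prefix -- substitute your own.
FLUX_PREFIX=/opt/flux
export LD_LIBRARY_PATH="$FLUX_PREFIX/lib/flux${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"   # now begins with /opt/flux/lib/flux
# then launch under Flux, e.g.:
#   flux mini run -N2 -n4 ./mpi_hello
```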
I think it's an edge case that's only a problem for multi-node jobs sharing the same node. Most use cases are either small (sub-node, so many jobs on a node) or large (one job spans multiple nodes and fills each of them up).
I reinstalled without the coral2 plugins, and that seemed to work:
I don't know if Cray's PMI does anything special that flux's doesn't, but I'll head down this path. Thanks!
Makes sense - thanks. I also did basically the same experiment you just did with a hello world program on one of our precursor systems and it worked ok. Let us know if you run into more problems.
Is this still an issue? I just realized we have a known potential issue in not assigning VNIs, which could cause problems specifically when enough ranks of a job share a node.
You mean is #24 (VNI support) still an issue? Still on the back burner AFAIK.
I actually was wondering if we still see issues with multi-rank/multi-node, because the VNI issue could cause that by exhausting resources on the NIC.
I am not aware of that ever being a problem, or I have forgotten. It'd be good to have a separate, new issue on that if it is a problem or is likely to become one.
We talked about it when I came back from ADAC: the NICs use the VNI to separate resources, so if too many things run without setting one on the NIC, it can lead to resource exhaustion. The MPI subdivides the range it's given, so normally that's ok, but if we're not bootstrapping the MPI normally, or if a user runs more than one multi-node job across the same set of nodes (or a single shared node, I suppose), then it could be an issue. I would have sworn we made an issue for it at the time, but my internet is failing right now and I'm just hoping I can post this. 😬 Will either find it or make a new one.
The readme notes:
The plugins and scripts in flux-coral2 are being actively developed and are not yet stable.
Is the Cray MPI part more stable? Supporting Cray MPI is widely interesting (outside of CORAL-2), so I'm curious whether that code makes sense to "graduate" into flux-core?