MPI: Integrate with HPE's CXI library for allocating VNIs #24

Open · jameshcorbett opened this issue Aug 23, 2022 · 5 comments

jameshcorbett commented Aug 23, 2022

Per the latest batch of emails with Cray, it looks like the Shasta APIs can be used à la carte by the WLM.

APIs that seem like ones we would use:

  • HMS's Hardware Inventory API (combined with the ClusterStor Inventory API)
  • The ATOM node health check API
  • The CXI library for allocating/setting up VNIs (requires root)

Update: based on the latest emails and discussions, we will not be using the first two interfaces; node health checks will be run only through the default TOSS4 utility (nodediag?).

jameshcorbett commented

Slingshot requires the use of VNIs (think of VLANs). If you use the same VNI for everything, eventually you exhaust endpoints on the switches. Slurm will be using one VNI per job step.

For Flux:

  • Day 0: pre-allocate X VNIs per sub-instance; launches within that sub-instance then go round-robin across those VNIs (see the sketch after this list)
  • Day 1: allow users to request extra VNIs per sub-instance
  • Day 2: a bespoke setuid binary that, when run by the top-level instance, can do anything, but when run as a user is limited to whatever the top level constrained it to
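
To make the Day 0 scheme concrete, here is a minimal sketch of the per-sub-instance round-robin, assuming each sub-instance is handed one contiguous block of VNIs at creation time. The struct and function names are illustrative only, not part of any existing Flux or CXI interface.

    #include <stdint.h>

    /* One pre-allocated block of VNIs granted to a sub-instance (the
     * "X VNIs" above).  All names here are hypothetical. */
    struct vni_pool {
        uint16_t base;   /* first VNI in the block */
        uint16_t count;  /* number of VNIs in the block */
        uint32_t next;   /* launch counter */
    };

    /* Assign a VNI to the next launch in this sub-instance, round-robin. */
    static uint16_t vni_pool_next (struct vni_pool *pool)
    {
        return pool->base + (uint16_t)(pool->next++ % pool->count);
    }

With count VNIs in the block, two launches only share a VNI once more than count of them run concurrently in the sub-instance, which is the trade-off the Day 0 over-allocation accepts.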

garlick commented Aug 23, 2022

garlick commented Mar 8, 2023

VNI tagging was brought up again recently in a (not public) TOSS issue: https://lc.llnl.gov/jira/browse/TOSS-5932

This statement from the issue seemed like a good description of the problem:

As part of the changes that implement VNI tagging on the HPE Slingshot NIC, as of Slingshot 2.0.1 the default CXI Service has been disabled. This means that deployments must implement additional host-side configuration (using the job scheduler plug-ins, for example) to implement VNI tagging, or explicitly re-enable the default service to have applications operate as in previous releases. (This also means that CXI diagnostics need to pass the VNI information on the command line.) HPE recommends fully implementing VNI tagging to isolate RDMA traffic and protect against memory writes from nodes not known to be part of the job. Refer to Section 8.3 of the HPE Slingshot Operations Guide - Customer for more information.

trws commented May 17, 2024

I don't see the context elsewhere, or another issue, so I'll add it here. We need to implement VNI assignment at least locally on each node. We don't need to deal with switch reconfiguration, which is the part with performance and interface concerns, but if we do nothing it is still possible to exhaust resources on the NIC. From what I understand, there are two parts to this.

  1. Actually set up a range of VNIs on the NIC. I think Slurm does this per job step, but for us it makes much more sense to set up a range per system-level job that over-allocates.
  2. Use the appropriate environment variable to set the offset into the VNI range the current job should use. This doesn't require any privilege, so we can do it at every level below the system level.

In principle, at least as a start, I think we could do just (2) and it would work, but it wouldn't provide any protection against inappropriate cross-job/cross-user RDMA.
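
As a rough illustration of the two parts, here is a sketch in C. Part 1 assumes libcxi's service interface as exercised by Slurm's switch/hpe_slingshot plugin (cxil_open_device() / cxil_alloc_svc()); the struct fields and header path are written from memory of that plugin and should be verified against HPE's CXI API documentation. Part 2 exports the environment variables the Slurm plugin uses (SLINGSHOT_VNIS, SLINGSHOT_SVC_IDS); whether Flux would reuse those names is an assumption.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <libcxi/libcxi.h>  /* HPE CXI userspace library (assumed path) */

    /* Part 1 (privileged): create a CXI service on the local NIC that
     * admits a block of VNIs and restricts it to the job owner's UID.
     * Returns the service id (> 0) or a negative errno. */
    int setup_vni_service (unsigned int dev_id, const uint16_t *vnis,
                           int nvnis, uid_t uid)
    {
        struct cxil_dev *dev;
        struct cxi_svc_desc desc;
        struct cxi_svc_fail_info fail;
        int svc_id, i, rc;

        rc = cxil_open_device (dev_id, &dev);
        if (rc)
            return rc;
        memset (&desc, 0, sizeof (desc));
        memset (&fail, 0, sizeof (fail));
        desc.restricted_vnis = 1;       /* only the listed VNIs are usable */
        desc.num_vld_vnis = nvnis;
        for (i = 0; i < nvnis; i++)
            desc.vnis[i] = vnis[i];
        desc.restricted_members = 1;    /* only this UID may use the service */
        desc.members[0].type = CXI_SVC_MEMBER_UID;
        desc.members[0].svc_member.uid = uid;
        desc.enable = 1;
        svc_id = cxil_alloc_svc (dev, &desc, &fail);
        cxil_close_device (dev);
        return svc_id;
    }

    /* Part 2 (unprivileged): point a launch at its slot in the VNI range
     * via the environment; this can happen at every instance level below
     * the system level. */
    void export_vni (uint16_t vni, int svc_id)
    {
        char buf[16];

        snprintf (buf, sizeof (buf), "%u", (unsigned)vni);
        setenv ("SLINGSHOT_VNIS", buf, 1);   /* variable names: assumption */
        snprintf (buf, sizeof (buf), "%d", svc_id);
        setenv ("SLINGSHOT_SVC_IDS", buf, 1);
    }

Doing only (2) corresponds to skipping setup_vni_service() and relying on the (re-enabled) default service, which is why it would work but would provide no cross-user isolation.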

garlick commented Jul 25, 2024

In the scheduler meeting yesterday we discussed moving this forward:

  • doing a high-level design with HPE, now that the Flux architecture is probably better understood by them
  • lining up any documentation required, such as the CXI API docs
  • deciding on the scope: can we definitively say we won't need to talk to the fabric manager, for example?

HPE referred us to their Slurm Slingshot plugin (a source pointer was posted in an earlier comment). Possibly also interesting at this early stage are the Slurm config options for Slingshot, documented here. There are some Slingshot-related srun options as well.
