MPI: Integrate with HPE's CXI library for allocating VNIs #24

Open · jameshcorbett opened this issue Aug 23, 2022 · 5 comments

jameshcorbett commented Aug 23, 2022

Per the latest batch of emails with Cray, it looks like the Shasta APIs can be used à la carte by the WLM.

APIs that seem like ones we would use:

  • HMS's Hardware Inventory API (combined with the ClusterStor Inventory API)
  • The ATOM node health check API
  • The CXI library for allocating/setting up VNIs (requires root)

Update: based on the latest emails and discussions, we will not be using the first two interfaces; node health checks will be run only through the default TOSS4 utility (nodediag?).

jameshcorbett commented

Slingshot requires the use of VNIs (think of VLANs). If you use the same VNI for everything, eventually you exhaust endpoints on the switches. Slurm will be using one VNI per job step.

For Flux:

  • Day 0: pre-allocate X VNIs per sub-instance; launches within that sub-instance then go round-robin across those VNIs (see the sketch after this list)
  • Day 1: allow users to request extra VNIs per sub-instance
  • Day 2: a bespoke setuid binary that, when run by the top-level instance, can do anything, but when run as a user is limited to whatever the top level constrained it to
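
To make the Day 0 scheme concrete, here is a minimal sketch of the per-sub-instance round-robin, assuming each sub-instance is handed one contiguous block of VNIs at creation time. The struct and function names are illustrative only, not part of any existing Flux or CXI interface.

    #include <stdint.h>

    /* One pre-allocated block of VNIs granted to a sub-instance (the
     * "X VNIs" above).  All names here are hypothetical. */
    struct vni_pool {
        uint16_t base;   /* first VNI in the block */
        uint16_t count;  /* number of VNIs in the block */
        uint32_t next;   /* launch counter */
    };

    /* Assign a VNI to the next launch in this sub-instance, round-robin. */
    static uint16_t vni_pool_next (struct vni_pool *pool)
    {
        return pool->base + (uint16_t)(pool->next++ % pool->count);
    }

With count VNIs in the block, two launches only share a VNI once more than count of them run concurrently in the sub-instance, which is the trade-off the Day 0 over-allocation accepts.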

garlick commented Aug 23, 2022

garlick commented Mar 8, 2023

VNI tagging was brought up again recently in a (not public) TOSS issue: https://lc.llnl.gov/jira/browse/TOSS-5932

This statement from the issue seemed like a good description of the problem:

As part of the changes that implement VNI tagging on the HPE Slingshot NIC, as of Slingshot 2.0.1 the default CXI Service has been disabled. This means that deployments must implement additional host-side configuration (using the job scheduler plug-ins, for example) to implement VNI tagging, or explicitly re-enable the default service to have applications operate as in previous releases. (This also means that CXI diagnostics need to pass the VNI information on the command line.) HPE recommends fully implementing VNI tagging to isolate RDMA traffic and protect against memory writes from nodes not known to be part of the job. Refer to Section 8.3 of the HPE Slingshot Operations Guide - Customer for more information.

trws commented May 17, 2024

I don't see the context elsewhere, or another issue, so I'll add it here. We need to implement VNI assignment at least locally on each node. We don't need to deal with switch reconfiguration, which is the part with performance and interface concerns, but if we do nothing it is still possible to exhaust resources on the NIC. From what I understand, there are two parts to this.

  1. Actually set up a range of VNIs on the NIC. I think Slurm does this per job step, but for us it makes much more sense to set up a range per system-level job that over-allocates.
  2. Use the appropriate environment variable to set the offset into the VNI range the current job should use. This doesn't require any privilege, so we can do it at every level below the system level.

In principle, at least as a start, I think we could do just (2) and it would work, but it wouldn't provide any protection against inappropriate cross-job/cross-user RDMA.
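
As a rough illustration of the two parts, here is a sketch in C. Part 1 assumes libcxi's service interface as exercised by Slurm's switch/hpe_slingshot plugin (cxil_open_device() / cxil_alloc_svc()); the struct fields and header path are written from memory of that plugin and should be verified against HPE's CXI API documentation. Part 2 exports the environment variables the Slurm plugin uses (SLINGSHOT_VNIS, SLINGSHOT_SVC_IDS); whether Flux would reuse those names is an assumption.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <libcxi/libcxi.h>  /* HPE CXI userspace library (assumed path) */

    /* Part 1 (privileged): create a CXI service on the local NIC that
     * admits a block of VNIs and restricts it to the job owner's UID.
     * Returns the service id (> 0) or a negative errno. */
    int setup_vni_service (unsigned int dev_id, const uint16_t *vnis,
                           int nvnis, uid_t uid)
    {
        struct cxil_dev *dev;
        struct cxi_svc_desc desc;
        struct cxi_svc_fail_info fail;
        int svc_id, i, rc;

        rc = cxil_open_device (dev_id, &dev);
        if (rc)
            return rc;
        memset (&desc, 0, sizeof (desc));
        memset (&fail, 0, sizeof (fail));
        desc.restricted_vnis = 1;       /* only the listed VNIs are usable */
        desc.num_vld_vnis = nvnis;
        for (i = 0; i < nvnis; i++)
            desc.vnis[i] = vnis[i];
        desc.restricted_members = 1;    /* only this UID may use the service */
        desc.members[0].type = CXI_SVC_MEMBER_UID;
        desc.members[0].svc_member.uid = uid;
        desc.enable = 1;
        svc_id = cxil_alloc_svc (dev, &desc, &fail);
        cxil_close_device (dev);
        return svc_id;
    }

    /* Part 2 (unprivileged): point a launch at its slot in the VNI range
     * via the environment; this can happen at every instance level below
     * the system level. */
    void export_vni (uint16_t vni, int svc_id)
    {
        char buf[16];

        snprintf (buf, sizeof (buf), "%u", (unsigned)vni);
        setenv ("SLINGSHOT_VNIS", buf, 1);   /* variable names: assumption */
        snprintf (buf, sizeof (buf), "%d", svc_id);
        setenv ("SLINGSHOT_SVC_IDS", buf, 1);
    }

Doing only (2) corresponds to skipping setup_vni_service() and relying on the (re-enabled) default service, which is why it would work but would provide no cross-user isolation.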

garlick commented Jul 25, 2024

In the scheduler meeting yesterday we discussed moving this forward:

  • doing a high-level design with HPE, now that the Flux architecture is probably better understood by them
  • lining up any documentation required, such as the CXI API docs
  • deciding on the scope: can we definitively say we won't need to talk to the fabric manager, for example?

HPE referred us to their Slurm Slingshot plugin (a source pointer was posted in an earlier comment). Possibly also interesting at this early stage are the Slurm config options for Slingshot, documented here. There are some Slingshot-related srun options as well.
