Replies: 14 comments
-
Note: Still a Work In Progress. There are some more details that I want to add about job level information and the behavior of put/get.
-
Thanks @jjhursey for putting this together. As you mention, MPI has this use-case. Presumably SHMEM does as well. Are there other popular libraries that we can list as having this use-case?
-
I updated the description with a few of the specific job/process level information items. I'll take the WIP tag off this so we can have a discussion and further expand on this.
-
For a list of the job/process level information, see section 11.1.3 (PMIx_server_register_nspace). Document Link Suggestion from the teleconf:
-
There are lots of other use cases for this functionality, e.g. tools (I/O middleware, performance tools, etc.), programming model runtimes (MPI of course and others), and probably anything that doesn't rely on MPI and wants to bootstrap communication. This is kind of a meta point: Do we want to write all of these use cases with respect to how it is done in the current version of PMIx (i.e. naming the API calls to use), or do we want to simply map out the general functionality needed by the use case (independent of what PMIx does)? I thought we were doing the latter, since in some cases we may be describing use cases for which there is no interface yet.
-
The goal of the use case is to present a description of a problem with guidance on how it might be solved in PMIx. The author can describe it in a broad sense or with specifics about interfaces they think would be helpful from PMIx. The PMIx community then can help the author link it to existing PMIx interfaces that might address the use case. If new interfaces are needed then they can be worked out. The idea was to have a low bar for suggesting use cases and engaging with the community (no need to be a PMIx expert). Then the PMIx community can engage to help see how it fits into what is currently defined by PMIx, and what might need to be defined still.
-
Notes from the meeting yesterday relevant to this issue:
I'm not sure exactly what we want the final form of the use-cases to look like, but I think the most useful would be both of what you describe: the general functionality/use-case and the specifics for each sub use-case (what does MPI require, what does a debugger require, what does a kubelet (or other cloud technology) require, etc). If a particular sub use-case is exactly the same as another (which may be the case for MPI and SHMEM, idk), then that can be called out too.
-
Updated the original post with a snapshot of the use-case from our Google Drive drafts folder: https://drive.google.com/open?id=1eN7aBxyzPD0a_GJFq1KH2ZHpoONj76op
-
PMIx performance tool:
-
The above is an example of business card exchange.
-
I updated the Google Drive version of this use case to make a clearer distinction of the role of The new text starts in the paragraph starting with:
Let me know what you think.
-
Thanks @jjhursey! I left a comment in the google drive version too, but for anyone having trouble accessing that, my question was w.r.t. this sentence:
What is the "runtime environment" in this scenario? I presume the PMIx server, but I'm not sure.
-
I updated the google drive version to clarify that it is the RM that it is talking about.
-
The v5.0.x PR #328 included this use case. The issue will remain open for further discussion on this topic.
-
Brief Description
Multi-process communication libraries, such as MPI, need to establish communication channels between a set of those processes. Each process needs to share connectivity information (a.k.a. Business Cards) with all other processes before communication channels can be established. The runtime environment must provide a mechanism for the efficient exchange of this connectivity information. Additional information about the current state of the job (e.g., number of processes globally and locally) and about how the process was started (e.g., process binding) is also helpful.
Use Case Details
Note: The Instant-On wire-up mechanism is a separate, related use case.
Multi-process communication libraries, such as MPI, need to establish communication channels between a set of those processes. Each process needs to share connectivity information (a.k.a. Business Cards) with all other processes before communication channels can be established. This connectivity information may take the form of one or more unique strings that allow a different process to establish a communication channel with the originator.
Each process provides its business card to PMIx via one or more PMIx_Put operations to store the tuple of {UID, key, value}. The UID is the unique name for this process in the PMIx universe (i.e., namespace and rank). The key is a unique key that other processes can reference generically (note that since the UID is also associated with the key, there is no need to make the key uniquely named per process). The value is the string representation of the connectivity information.
Some business card information is meant for remote processes (e.g., TCP or InfiniBand addresses) while other information is meant only for local processes (e.g., shared memory information). As such, a scope should be associated with the PMIx_Put operation to differentiate this intention.
The PMIx_Put operations may be cached local to the process. Once all PMIx_Put operations have been called, each process should call PMIx_Commit to push those values to the local PMIx server. Note that in a multi-library configuration each library may PMIx_Put then PMIx_Commit values, so there may be multiple PMIx_Commit calls before a Business Card Exchange is activated.
After calling PMIx_Commit, a process can activate the Business Card Exchange collective operation by calling PMIx_Fence. The PMIx_Fence operation is collective over the set of processes specified in the argument set. That allows for the collective to span a subset of a namespace or multiple namespaces. After the completion of the PMIx_Fence operation, the data PMIx_Put by other processes is available to the local process through a call to PMIx_Get, which returns the key/value pairs necessary to establish the connection(s) with the other processes.
The PMIx_Fence operation must have a "Synchronize Only" mode that works as a barrier operation. This is helpful if the communication library requires a synchronization before leaving initialization or starting finalization, for example.
The PMIx_Fence operation should have a "Sparse" mode in addition to a "Full" mode for the data exchange. The "Full" mode will fully exchange all Business Card information to all other processes. This is helpful for tightly communicating applications. The "Sparse" mode will dynamically pull the connectivity information on-demand from inside of PMIx_Get (if it is not already available locally). This is helpful for sparsely communicating applications. Since the PMIx library cannot infer which mode is best for an application, the caller must specify which mode works best for their application.
The PMIx_Fence operation should have an option for the end user to specify which mode they desire for this operation.
Additional information about the current state of the job (e.g., number of processes globally and locally) and about how the process was started (e.g., process binding) is also helpful. This "job level" information must be available immediately after PMIx_Init without the need for any explicit synchronization.
The number of processes globally in the namespace and this process's rank within that namespace are important to know before establishing the Business Card information to best allocate resources.
The number of processes local to the node and this process's local rank are important to know before establishing the Business Card information to help the caller determine the scope of the put operation. For example, the caller can designate a leader to set up a shared memory segment of the proper size before putting that information into the locally scoped Business Card information.
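As a rough illustration of the shared memory example above, the sketch below shows how a library might use the node-local size and local rank to pick a leader and size a segment. Everything here is hypothetical: the function name, the "lowest local rank leads" convention, and the fixed per-process slot size are assumptions made for illustration and are not defined by PMIx.

```python
def plan_shared_segment(local_rank, local_size, bytes_per_proc=4096):
    """Decide who creates the node-local segment and how large it must be.

    Assumptions (not from PMIx): local rank 0 is the leader, and every
    local process gets one fixed-size slot in the segment.
    """
    if not 0 <= local_rank < local_size:
        raise ValueError("local_rank must be in [0, local_size)")
    return {
        "is_leader": local_rank == 0,                  # leader creates the segment
        "segment_bytes": local_size * bytes_per_proc,  # room for every local peer
        "my_offset": local_rank * bytes_per_proc,      # this process's slot
    }


plan = plan_shared_segment(local_rank=2, local_size=4)
print(plan)
```

Only the leader would then publish the segment name with a node-local scope; the other local processes would read it back rather than create their own.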
The number of processes local to a remote node is also helpful to know before establishing the Business Card information. This information is useful to pre-establish local resources before that remote node starts to initiate a connection or to determine the number of connections that need to be advertised in the Business Card when it is sent out.
Note that some of the job level information may change over the course of the job in a dynamic application.
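The put, commit, fence, and get flow described in this section can be modeled with a small, single-threaded toy. This is not the PMIx API: ToyServer, ToyProc, fence, get, and the scope labels are hypothetical stand-ins, chosen only to show how staging, scope, and the "Full" versus "Sparse" exchange modes relate to each other. It assumes one server per node and cannot show the barrier behavior of a real collective.

```python
LOCAL, REMOTE, GLOBAL = "local", "remote", "global"  # toy scope labels


class ToyServer:
    """Stand-in for a per-node server holding committed business cards."""

    def __init__(self):
        self.committed = {}  # (rank, key) -> (scope, value) from local procs
        self.cache = {}      # (rank, key) -> value copied from other nodes
        self.fetches = 0     # number of on-demand remote lookups


class ToyProc:
    def __init__(self, rank, server):
        self.rank = rank
        self.server = server
        self.staged = {}     # puts are cached here until commit()

    def put(self, scope, key, value):
        self.staged[key] = (scope, value)

    def commit(self):
        for key, entry in self.staged.items():
            self.server.committed[(self.rank, key)] = entry
        self.staged.clear()


def visible(scope, same_node):
    """LOCAL data stays on the node, REMOTE data only leaves it, GLOBAL does both."""
    if scope == GLOBAL:
        return True
    return (scope == LOCAL) == same_node


def fence(servers, collect_data):
    """Full mode (collect_data=True) copies every exportable card to every server.

    With collect_data=False this toy does nothing; a real fence would still
    synchronize the participants, which a single-threaded sketch cannot show.
    """
    if not collect_data:
        return
    for src in servers:
        for dst in servers:
            if src is dst:
                continue
            for name, (scope, value) in src.committed.items():
                if visible(scope, same_node=False):
                    dst.cache[name] = value


def get(proc, servers, rank, key):
    """Look up the card `key` published by `rank`, pulling it on demand if needed."""
    name = (rank, key)
    home = proc.server
    if name in home.committed:
        scope, value = home.committed[name]
        if visible(scope, same_node=True):
            return value
        raise KeyError(name)
    if name not in home.cache:
        home.fetches += 1  # sparse path: ask the other servers
        for other in servers:
            scope, value = other.committed.get(name, (None, None))
            if other is not home and scope is not None and visible(scope, same_node=False):
                home.cache[name] = value
    return home.cache[name]  # raises KeyError if nobody exported it


node_a, node_b = ToyServer(), ToyServer()
servers = [node_a, node_b]
p0, p1, p2 = ToyProc(0, node_a), ToyProc(1, node_a), ToyProc(2, node_b)
for p in (p0, p1, p2):
    p.put(REMOTE, "tcp.addr", f"tcp://host:{5000 + p.rank}")
    p.put(LOCAL, "shm.name", f"/seg-{p.rank}")
    p.commit()

fence(servers, collect_data=False)       # sparse: nothing is moved up front
print(get(p0, servers, 2, "tcp.addr"))   # pulled on demand from the other node
print(get(p0, servers, 1, "shm.name"))   # node-local card from a local peer
```

In the sparse run the first lookup of a remote card costs one fetch and later lookups hit the cache; calling fence with collect_data=True instead would fill every cache up front, so no lookup needs a fetch.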
Interfaces
Keys
The following job level information is useful to have before establishing Business Card information:
PMIX_NODE_LIST: List of nodes in the job
PMIX_NUM_NODES: Number of nodes in the job
PMIX_NODEID: Node ID where this process is located
PMIX_JOB_SIZE: Number of processes globally in the job
PMIX_PROC_MAP: Mapping of processes to nodes in this job
PMIX_LOCAL_PEERS: List of local processes on this node in this job
PMIX_LOCAL_SIZE: Number of processes local to this node
For each process this information is also useful (note that any one process may want to access this list of information about any other process in the system):
PMIX_RANK: My global rank in the job
PMIX_LOCAL_RANK: My local rank on this node for this job
PMIX_GLOBAL_RANK: My global rank across all namespaces
PMIX_LOCALITY_STRING: Process binding on this node
PMIX_HOSTNAME: Hostname associated with this process (useful for queries about remote processes)
There are other keys that are helpful to have before a synchronization point; this is not meant to be a comprehensive list.
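To make the relationship between these keys concrete, the sketch below derives several of the listed values from a single process-to-node mapping. It is an illustration under loud assumptions: the plain dictionary stands in for whatever encoding a real PMIX_PROC_MAP value uses, the lowercase result names are hypothetical, and numbering local ranks by ascending global rank is a convention chosen for the example rather than something stated in the use case.

```python
def job_info(proc_map, my_rank):
    """Derive job-level values for `my_rank` from a node -> ranks mapping.

    `proc_map` is a plain dict (an assumption; not the real key encoding).
    """
    nodes = list(proc_map)
    my_node = next(node for node, ranks in proc_map.items() if my_rank in ranks)
    peers = sorted(proc_map[my_node])
    return {
        "node_list": nodes,                                          # cf. PMIX_NODE_LIST
        "num_nodes": len(nodes),                                     # cf. PMIX_NUM_NODES
        "node_id": nodes.index(my_node),                             # cf. PMIX_NODEID
        "job_size": sum(len(ranks) for ranks in proc_map.values()),  # cf. PMIX_JOB_SIZE
        "local_peers": peers,                                        # cf. PMIX_LOCAL_PEERS
        "local_size": len(peers),                                    # cf. PMIX_LOCAL_SIZE
        "local_rank": peers.index(my_rank),                          # cf. PMIX_LOCAL_RANK
    }


info = job_info({"node0": [0, 1], "node1": [2, 3, 4]}, my_rank=3)
print(info["local_rank"], info["local_size"], info["job_size"])
```

Because all of these values follow from the launch-time mapping, they can be known without any exchange between processes, which is why the use case asks for them to be available right after initialization.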