Show MPI connectivity map during MPI_INIT #30

Closed
ompiteam opened this issue Oct 1, 2014 · 29 comments
ompiteam commented Oct 1, 2014

It has long been discussed, and I swear there was a ticket about this
at some point but I can't find it now. So I'm filing a new one --
close this as a dupe if someone can find an older one.


OMPI currently uses a negative ACK system to indicate if high-speed
networks are not used for MPI communications. For example, if you
have the openib BTL available but it can't find any active ports in a
given MPI process, it'll display a warning message.

But some users want a *positive* acknowledgement of what networks
are being used for MPI communications (this can also help with
regression testing, per a thread on the MTT mailing list). HP MPI
offers this feature, for example. It would be nice to have a simple
MCA parameter that will cause MCW rank 0 to output a connectivity map
during MPI_INIT.

Complications:

  • In some cases, OMPI doesn't know which networks will be used for
    communications with each MPI process peer; we only know which ones
    we'll try to use when connections are actually established (per
    OMPI's lazy connection model for the OB1 PML). But I think that
    even outputting this information will be useful.
  • Connectivity between MPI processes is likely to be non-uniform.
    E.g., MCW rank 0 may use the sm btl to communicate with some MPI
    processes, but a different btl to communicate with others. This is
    almost certainly a different view than other processes have. The
    connectivity information needs to be conveyed on a process-pair
    basis (e.g., a 2D chart).
  • Since we have to span multiple PMLs, this may require an addition
    to the PML API.

A first cut could display a simple 2D chart of how OMPI thinks it may
send MPI traffic from each process to each process. Perhaps something
like (OB1 6 process job, 2 processes on each of 3 hosts):

MCW rank 0     1     2     3     4     5
0        self  sm    tcp   tcp   tcp   tcp
1        sm    self  tcp   tcp   tcp   tcp
2        tcp   tcp   self  sm    tcp   tcp
3        tcp   tcp   sm    self  tcp   tcp
4        tcp   tcp   tcp   tcp   self  sm
5        tcp   tcp   tcp   tcp   sm    self

Note that the upper and lower triangular portions of the map are the
same, but it's probably more human-readable if both are output.
However, multiple built-in output formats could be useful, such as:

  • Human readable, full map (see above)
  • Human readable, abbreviated (see below for some ideas on this)
  • Machine parsable, full map
  • Machine parsable, abbreviated
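
To make the chart idea concrete, here is a minimal sketch of the
gather-and-print pattern: each rank fills in its own row and MCW rank 0
gathers and prints the NxN map. The guess_transport() stub and the
2-processes-per-host layout are purely illustrative stand-ins for
whatever the PML/BTLs would actually report -- this is not Open MPI
internals.

/* Illustrative sketch: gather per-peer transport names to MCW rank 0
 * and print an NxN connectivity chart.  guess_transport() is a stub,
 * NOT an Open MPI API. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 8

/* Stand-in for real transport discovery: the process itself reports
 * "self", peers assumed on the same host report "sm", others "tcp". */
static void guess_transport(int me, int peer, char *out)
{
    if (peer == me)              strcpy(out, "self");
    else if (peer / 2 == me / 2) strcpy(out, "sm");  /* assumes 2 procs/host */
    else                         strcpy(out, "tcp");
}

int main(int argc, char **argv)
{
    int me, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    char *mine = malloc((size_t)np * NAME_LEN);     /* my row of the map */
    for (int p = 0; p < np; ++p)
        guess_transport(me, p, mine + p * NAME_LEN);

    char *all = NULL;
    if (me == 0) all = malloc((size_t)np * np * NAME_LEN);
    MPI_Gather(mine, np * NAME_LEN, MPI_CHAR,
               all,  np * NAME_LEN, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (me == 0) {                                  /* print the NxN chart */
        printf("MCW rank");
        for (int p = 0; p < np; ++p) printf("%6d", p);
        printf("\n");
        for (int r = 0; r < np; ++r) {
            printf("%-8d", r);
            for (int p = 0; p < np; ++p)
                printf("%6s", all + ((size_t)r * np + p) * NAME_LEN);
            printf("\n");
        }
        free(all);
    }
    free(mine);
    MPI_Finalize();
    return 0;
}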

It may also be worthwhile to investigate a few heuristics to compress
the map where possible (a sketch of the exception-based idea follows
this list). Some random ideas in this direction:

  • The above example could be represented as:
MPI connectivity map, listed by process:
X->X: self
X<->X+1, X in {0,2,4}: sm
other: tcp
  • Another example:
MPI connectivity map, listed by process:
X->X: self
other: tcp
  • Another example:
MPI connectivity map, listed by process:
all: CM PML, MX MTL
  • Perhaps something could be done with "exceptions" -- e.g., where
    the openib BTL is being used for inter-node connectivity *except*
    for one node (where IB is malfunctioning, and OMPI fell back to
    TCP) -- this is a common case that users/sysadmins want to detect.

Another useful concept might be to show some information about each
endpoint in the connectivity map. E.g., show a list of TCP endpoints
on each process, by interface name and/or IP address. Similar for
other transports. This kind of information can show when/if
multi-rail scenarios are active, etc. For example:

MCW rank 0     1     2     3     4     5
0        self      sm        tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0
1        sm        self      tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0
2        tcp:eth0  tcp:eth0  self      sm        tcp:eth0  tcp:eth0
3        tcp:eth0  tcp:eth0  sm        self      tcp:eth0  tcp:eth0
4        tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0  self      sm
5        tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0  sm        self

With more information such as interface names, compression of the
output becomes much more important, such as:

MPI connectivity map, listed by process:
X->X: self
X<->X+1, X in {0,2,4}: sm
other: tcp:eth0,eth1

Note that these ideas can certainly be implemented in stages; there's
no need to do everything at once.

@ompiteam ompiteam self-assigned this Oct 1, 2014
@ompiteam ompiteam added this to the Open MPI 1.9 milestone Oct 1, 2014

ompiteam commented Oct 1, 2014

Imported from trac issue 1207. Created by jsquyres on 2008-02-06T13:06:04, last modified: 2014-04-22T16:23:42


ompiteam commented Oct 1, 2014

Trac comment by bosilca on 2008-02-06 13:20:24:

At some point we should start thinking about how to trim down the size of the MPI shared library. While I agree that such information is useful for the user, I don't think it needs to go deeply inside the library. I see it more like an additional tool/utility bundled with Open MPI.


ompiteam commented Oct 1, 2014

Trac comment by jjhursey on 2008-02-06 15:15:15:

I agree that conceptually this could be a useful tool (or really an addition to the orte-ps/ompi-ps tool).

Actually, when I was originally designing orte-ps I talked with some people at Sun about their mpps command (http://docs.sun.com/source/819-4131-10/DisplayingJobInformation.html). I tried to model orte-ps on their mpps command. mpps has a bit more functionality than orte-ps, and some of that is due to the current limitations of tools in Open MPI.

The limitation is that tools connect through the HNP, which is an ORTE layer application, so it has no (or extremely limited) knowledge of OMPI layer constructs. So a tool is unable to access information about OMPI level collective and point-to-point constructs, for example.

In the short term, the easiest approach is to have the Rank 0 process dump this information. In the long term we may want to reconsider how tools interact with the MPI job, and think about how we can create an ompi-ps command that displays OMPI layer information.


ompiteam commented Oct 1, 2014

Trac comment by jjhursey on 2008-02-07 07:21:41:

As another idea for a compressed representation: it would be useful to display only the unique set of parameters used in the job. This is really what MTT is going to want in the short term, since capturing and querying a 2D space of connectivity information is difficult.

So the following:

MCW rank 0     1     2     3     4     5
0        self  sm    tcp   tcp   tcp   tcp
1        sm    self  tcp   tcp   tcp   tcp
2        tcp   tcp   self  sm    tcp   tcp
3        tcp   tcp   sm    self  tcp   tcp
4        tcp   tcp   tcp   tcp   self  sm
5        tcp   tcp   tcp   tcp   sm    self

Would be represented as:

MPI connectivity map (Active components):
PML: ob1
BTL: tcp,sm,self
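
A minimal sketch of that reduction -- collapsing one process's row of
the map down to the unique set of names (illustrative code, not an MTT
or Open MPI implementation):

/* Illustrative only: reduce one row of transport names to the unique
 * set, i.e., the "active components" summary shown above. */
#include <stdio.h>
#include <string.h>

#define MAX_UNIQ 16

static void print_unique(const char **row, int n)
{
    const char *uniq[MAX_UNIQ];
    int nuniq = 0;

    for (int p = 0; p < n; ++p) {          /* linear-scan de-duplication */
        int seen = 0;
        for (int u = 0; u < nuniq; ++u)
            seen |= !strcmp(row[p], uniq[u]);
        if (!seen && nuniq < MAX_UNIQ)
            uniq[nuniq++] = row[p];
    }

    printf("BTL: ");
    for (int u = 0; u < nuniq; ++u)
        printf("%s%s", uniq[u], (u + 1 < nuniq) ? "," : "\n");
}

int main(void)
{
    const char *row0[] = { "self", "sm", "tcp", "tcp", "tcp", "tcp" };
    print_unique(row0, 6);                 /* prints: BTL: self,sm,tcp */
    return 0;
}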


ompiteam commented Oct 1, 2014

Trac comment by tdd on 2008-02-07 08:08:59:

Replying to [comment:1 bosilca]:

At some point we should start thinking about how to trim down the size of the MPI shared library. While I agree that such information is useful for the user, I don't think it needs to go deeply inside the library. I see it more like an additional tool/utility bundled with Open MPI.

I think I disagree here: you really want this information coming from the actual code so one can detect issues with the actual (B/M)TL selection algorithm; a separate utility risks diverging from the actual code. I also think you lose a nice quick way for a user to confirm their run really is using the appropriate TLs without having to run 2 programs. I know that last point may seem silly.

Note, Sun's original CT base had this feature as part of an env var named MPI_SHOW_INTERFACES, which showed different amounts of information at each verbosity level, giving one anything from a broad idea of how things are connected to a detailed view of all the decisions the library considers when choosing BTLs and interfaces. This proved incredibly helpful in debugging complicated customer networks.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-02-07 08:24:12:

Replying to [comment:4 tdd]:

Replying to [comment:1 bosilca]:

At some point we should start thinking about how to trim down the size of the MPI shared library. While I agree that such information is useful for the user, I don't think it needs to go deeply inside the library. I see it more like an additional tool/utility bundled with Open MPI.

I think I disagree here: you really want this information coming from the actual code so one can detect issues with the actual (B/M)TL selection algorithm

...I think George is talking about a different issue (just overall reducing the size of the MPI library). I agree that this is a good thing to do, and perhaps we can modularize features like this (e.g., make the display map functionality a DSO plugin that is loaded on demand), but I think that is outside the scope of this ticket. Please create a new ticket for that kind of functionality; the display map functionality can easily be fit into a plugin framework someday if desired.

Thanks.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-02-07 08:25:11:

Replying to [comment:4 tdd]:

Note, Sun's original CT base had this feature as part of an env var named MPI_SHOW_INTERFACES, which showed different amounts of information at each verbosity level, giving one anything from a broad idea of how things are connected to a detailed view of all the decisions the library considers when choosing BTLs and interfaces. This proved incredibly helpful in debugging complicated customer networks.

Terry: can you include some samples of what the output looked like from when users invoked MPI_SHOW_INTERFACES?


ompiteam commented Oct 1, 2014

Trac comment by tdd on 2008-02-07 09:04:38:

Replying to [comment:6 jsquyres]:

Replying to [comment:4 tdd]:

Note, Sun's original CT base had this feature as part of an env var named MPI_SHOW_INTERFACES, which showed different amounts of information at each verbosity level, giving one anything from a broad idea of how things are connected to a detailed view of all the decisions the library considers when choosing BTLs and interfaces. This proved incredibly helpful in debugging complicated customer networks.

Terry: can you include some samples of what the output looked like from when users invoked MPI_SHOW_INTERFACES?

OK, first here is a description of the option, which doesn't completely jibe with what you are proposing:

MPI_SHOW_INTERFACES

    When set to 1, 2 or 3, information regarding which interfaces are being used by an MPI application prints to stdout. Set MPI_SHOW_INTERFACES to 1 to print the selected internode interface. Set it to 2 to print all the interfaces and their rankings. Set it to 3 for verbose output. The default value, 0, does not print information to stdout.

The following are some examples of its usage:

burl-ct-v40z-0 129 =>setenv MPI_SHOW_INTERFACES 1
burl-ct-v40z-0 130 =>mprun -np 4 -Ns -W initme.6

(j34, r2): using "shm" PM from burl-ct-v40z-1 to burl-ct-v40z-1 
(j34, r2): using "tcp" PM from burl-ct-v40z-1 to burl-ct-v40z-0 
(j34, r0): using "tcp" PM from burl-ct-v40z-1 to burl-ct-v40z-0 
(j34, r0): using "shm" PM from burl-ct-v40z-1 to burl-ct-v40z-1 
(j34, r3): using "tcp" PM from burl-ct-v40z-0 to burl-ct-v40z-1 
(j34, r1): using "tcp" PM from burl-ct-v40z-0 to burl-ct-v40z-1 
(j34, r3): using "shm" PM from burl-ct-v40z-0 to burl-ct-v40z-0 
(j34, r1): using "shm" PM from burl-ct-v40z-0 to burl-ct-v40z-0

burl-ct-v40z-0 131 =>setenv MPI_SHOW_INTERFACES 2 
burl-ct-v40z-0 132 =>mprun -np 4 -Ns -W initme.6

(TCP j43, burl-ct-v40z-0, r3): using interface bge1 (IP=10.8.31.85) to burl-ct-v40z-1
(TCP j43, burl-ct-v40z-0, r1): using interface bge1 (IP=10.8.31.85) to burl-ct-v40z-1
(TCP j43, burl-ct-v40z-1, r2): using interface bge1 (IP=10.8.31.83) to burl-ct-v40z-0
(TCP j43, burl-ct-v40z-1, r0): using interface bge1 (IP=10.8.31.83) to burl-ct-v40z-0

burl-ct-v40z-0 135 =>setenv MPI_SHOW_INTERFACES 3
burl-ct-v40z-0 136 =>mprun -np 2 -Ns initme.6

(tcp j64, burl-ct-v40z-1, r0): interface 0 "lo0" netrank=230 (IP=127.0.0.1) 
(tcp j64, burl-ct-v40z-0, r0): interface 0 "lo0" netrank=230 (IP=127.0.0.1) 
(tcp j64, burl-ct-v40z-0, r0): interface 1 "bge1" netrank=47 (IP=10.8.31.83) 
(tcp j64, burl-ct-v40z-0, r0): interface 2 "ibd0" netrank=1002 (IP=192.168.1.100) 
(tcp j64, burl-ct-v40z-0, r0): interface 3 "default" netrank=1003 (IP=10.8.31.83) 
(tcp j64, burl-ct-v40z-1, r0): interface 1 "bge1" netrank=47 (IP=10.8.31.85) 
(tcp j64, burl-ct-v40z-1, r0): interface 2 "ibd0" netrank=1002 (IP=192.168.1.101)
(tcp j64, burl-ct-v40z-1, r0): interface 3 "default" netrank=1003 (IP=10.8.31.85)
(TCP j64, burl-ct-v40z-1, r0): using interface bge1 (IP=10.8.31.83) to burl-ct-v40z-0
(tcp j64, burl-ct-v40z-0, r1): interface 0 "lo0" netrank=230 (IP=127.0.0.1) 
(tcp j64, burl-ct-v40z-1, r1): interface 0 "lo0" netrank=230 (IP=127.0.0.1) 
(tcp j64, burl-ct-v40z-1, r1): interface 1 "bge1" netrank=47 (IP=10.8.31.85) 
(tcp j64, burl-ct-v40z-1, r1): interface 2 "ibd0" netrank=1002 (IP=192.168.1.101)
(tcp j64, burl-ct-v40z-1, r1): interface 3 "default" netrank=1003 (IP=10.8.31.85) 
(tcp j64, burl-ct-v40z-0, r1): interface 1 "bge1" netrank=47 (IP=10.8.31.83) 
(tcp j64, burl-ct-v40z-0, r1): interface 2 "ibd0" netrank=1002 (IP=192.168.1.100) 
(tcp j64, burl-ct-v40z-0, r1): interface 3 "default" netrank=1003 (IP=10.8.31.83) 
(TCP j64, burl-ct-v40z-0, r1): using interface bge1 (IP=10.8.31.85) to burl-ct-v40z-1 
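
For reference, the gating behind an env var like this is typically just
an integer threshold checked at each reporting site. A minimal
illustrative sketch (not Sun CT's actual implementation):

/* Illustrative only -- not Sun CT's code.  MPI_SHOW_INTERFACES-style
 * gating: an integer level read from the environment, checked at each
 * reporting site. */
#include <stdio.h>
#include <stdlib.h>

static int show_level(void)
{
    const char *v = getenv("MPI_SHOW_INTERFACES");
    return v ? atoi(v) : 0;        /* default 0: print nothing */
}

int main(void)
{
    int lvl = show_level();

    if (lvl >= 1)   /* level 1: the selected internode interface */
        printf("using \"tcp\" PM from hostA to hostB\n");
    if (lvl >= 2)   /* level 2: all interfaces and their rankings */
        printf("interface \"bge1\" netrank=47 (IP=10.8.31.85)\n");
    if (lvl >= 3)   /* level 3: verbose trace of selection decisions */
        printf("considering interface \"lo0\" netrank=230 ...\n");
    return 0;
}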


ompiteam commented Oct 1, 2014

Trac comment by jjhursey on 2008-02-07 10:28:30:

(In [17398]) A quick try at ticket refs https://svn.open-mpi.org/trac/ompi/ticket/1207.

Here we are processing the BML structure attached to ompi_proc_t well after
add_procs has been called.

Currently only Rank 0 displays data, and makes no attempt to gather information
from other ranks. I still need to add the MCA parameters to enable/disable this
feature along with a bunch of other stuff.

Examples from this commit on 2 nodes of IU's Odin Machine:

shell$ mpirun -np 6 -mca btl tcp,sm,self hello
[odin001.cs.indiana.edu:28548] Connected to Process 0 on odin001 via: self
[odin001.cs.indiana.edu:28548] Connected to Process 1 on odin001 via: sm
[odin001.cs.indiana.edu:28548] Connected to Process 2 on odin001 via: sm
[odin001.cs.indiana.edu:28548] Connected to Process 3 on odin001 via: sm
[odin001.cs.indiana.edu:28548] Connected to Process 4 on odin002 via: tcp
[odin001.cs.indiana.edu:28548] Connected to Process 4 on odin002 via: tcp
[odin001.cs.indiana.edu:28548] Connected to Process 5 on odin002 via: tcp
[odin001.cs.indiana.edu:28548] Connected to Process 5 on odin002 via: tcp
[odin001.cs.indiana.edu:28548] Unique connection types: self,sm,tcp
(Hello World) I am 0 of 6 running on odin001.cs.indiana.edu (PID 28548)
(Hello World) I am 1 of 6 running on odin001.cs.indiana.edu (PID 28549)
(Hello World) I am 2 of 6 running on odin001.cs.indiana.edu (PID 28550)
(Hello World) I am 3 of 6 running on odin001.cs.indiana.edu (PID 28551)
(Hello World) I am 4 of 6 running on odin002.cs.indiana.edu (PID 7809)
(Hello World) I am 5 of 6 running on odin002.cs.indiana.edu (PID 7810)

In this example you can see that we have 2 tcp connections to odin002 for each
process, since Odin has 2 tcp interfaces to each machine.

shell$ mpirun -np 6 -mca btl tcp,sm,openib,self hello
[odin001.cs.indiana.edu:28566] Connected to Process 0 on odin001 via: self
[odin001.cs.indiana.edu:28566] Connected to Process 1 on odin001 via: sm
[odin001.cs.indiana.edu:28566] Connected to Process 2 on odin001 via: sm
[odin001.cs.indiana.edu:28566] Connected to Process 3 on odin001 via: sm
[odin001.cs.indiana.edu:28566] Connected to Process 4 on odin002 via: openib
[odin001.cs.indiana.edu:28566] Connected to Process 5 on odin002 via: openib
[odin001.cs.indiana.edu:28566] Unique connection types: self,sm,openib
(Hello World) I am 0 of 6 running on odin001.cs.indiana.edu (PID 28566)
(Hello World) I am 1 of 6 running on odin001.cs.indiana.edu (PID 28567)
(Hello World) I am 2 of 6 running on odin001.cs.indiana.edu (PID 28568)
(Hello World) I am 3 of 6 running on odin001.cs.indiana.edu (PID 28569)
(Hello World) I am 4 of 6 running on odin002.cs.indiana.edu (PID 7820)
(Hello World) I am 5 of 6 running on odin002.cs.indiana.edu (PID 7821)

The above also occurs when passing no mca arguments. But here you can see that
tcp is not being used due to exclusivity rules in Open MPI. So even though
we specified -mca btl tcp,sm,openib,self only self,sm,openib are
being used.
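
The traversal in that commit has roughly the following shape. All
structure and field names below are mocked for illustration -- the real
ompi_proc_t/BML/BTL headers differ:

/* Mocked sketch of walking the BML endpoint attached to each proc
 * after add_procs, printing every BTL that may carry traffic to it.
 * Only MCW rank 0 would call this. */
#include <stdio.h>

struct mock_btl      { const char *component_name; };
struct mock_bml_btl  { struct mock_btl *btl; };
struct mock_endpoint { struct mock_bml_btl *btls; int num_btls; };
struct mock_proc     { const char *hostname; struct mock_endpoint *bml; };

static void print_connections(struct mock_proc *procs, int nprocs)
{
    for (int i = 0; i < nprocs; ++i)
        for (int b = 0; b < procs[i].bml->num_btls; ++b)
            printf("Connected to Process %d on %s via: %s\n",
                   i, procs[i].hostname,
                   procs[i].bml->btls[b].btl->component_name);
}

int main(void)
{
    struct mock_btl self = {"self"}, sm = {"sm"}, tcp = {"tcp"};
    struct mock_bml_btl b0[] = {{&self}}, b1[] = {{&sm}}, b2[] = {{&tcp}, {&tcp}};
    struct mock_endpoint e0 = {b0, 1}, e1 = {b1, 1}, e2 = {b2, 2};
    struct mock_proc procs[] = {
        {"odin001", &e0}, {"odin001", &e1}, {"odin002", &e2},
    };
    print_connections(procs, 3);   /* peer 2 shows two tcp paths */
    return 0;
}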


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-02-09 08:07:37:

We had a long discussion about this on the phone (Terry, George,
Jeff). The general conclusion is that this feature comes down to two
items: printing the connectivity map and a new/better "MPI preconnect all".

= Print Connectivity Map =

  • Does not require a PML/BTL/MTL interface change -- we can just put
    a new string (char*) on the endpoint base structure.
    • The endpoint constructor will default it to NULL
    • Still need to figure out who frees the string -- it's not
      necessarily the endpoint (see below for scalability issues).
      Should there be a flag on the endpoint indicating whether the
      endpoint frees it or not? (A sketch of this idea appears after
      this list.)
    • BTLs/MTLs can fill in a string value or leave it NULL; if they
      fill it in, they can fill it differently depending on the
      verbosity level (see below).
    • The print_the_map() functionality will simply traverse all the
      endpoints hanging off each ompi_proc_t and use that to print the
      strings.
    • Need to implement a gather-like operation where MCW rank 0 will
      gather all the connectivity information from all other processes
      and print out some kind of map.
  • How accurate the print_the_map() information is depends on whether
    MPI preconnect has been invoked or not.
    • If a preconnect has not been invoked, the information is a "best
      guess" (i.e., components that successfully opened/initialized and
      passed around info in the modex, and then were subject to
      first-cut elimination such as BTL exclusivity). But it may not
      represent the *final* list of components that will be used for
      connectivity. However, in many (most?) scenarios, this is likely
      a "good enough" estimation (e.g., homogeneous clusters with one
      high speed network).
    • If a preconnect *has* been invoked, then the information should
      be accurate because any endpoints that will not be used will have
      been trimmed from the ompi_proc_t entries.
    • Keep the preconnect functionality separate from the "print the
      map" functionality; users can choose what level of accuracy they
      want separate from what level of print "verbosity" they want (see
      below).
  • Have an MCA param indicating the verbosity level, perhaps something
    like Sun CT6's:
    • 0 = print nothing.
    • 1 = print local interface info (e.g., "mthca0:0", "eth1", etc.).
      This can be scalable if MTLs/BTLs are careful (e.g., put same
      pointer value on each endpoint; don't dup the string for each
      endpoint). However, the issue of "who frees the string?" comes
      up -- if the *same* pointer is on every endpoint, then somehow
      we have to know to free it only once (see above).
    • 2 = print local+remote interface info (e.g.,
      "mthca0:0:[guid]->[guid]",
      "eth1:192.168.0.1:1234->eth0:192.168.0.2:5678"). Note that this
      is a memory hog (and potentially unscalable) because each
      endpoint's string is unique!
    • 3 = print out both ends of a connection every time a connection
      is made (i.e., not necessarily print out a map during MPI_INIT,
      but print the information asynchronously as it happens). Haven't
      quite figured out how this one will work yet (MTLs and BTLs might
      have to do this themselves?); perhaps this will be a later
      feature.
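
A minimal sketch of the string-ownership idea from the bullets above,
assuming a hypothetical common endpoint base (in reality the endpoint
structures are per-component; see later comments in this ticket):

/* Hypothetical sketch of "who frees the string?": many endpoints may
 * share one conn_string pointer, so exactly one of them is marked as
 * the owner and only the owner frees it. */
#include <stdbool.h>
#include <stdlib.h>

struct endpoint_base {
    char *conn_string;       /* may be the same pointer on many endpoints */
    bool  owns_conn_string;  /* true on exactly one endpoint per string */
};

static void endpoint_destruct(struct endpoint_base *ep)
{
    if (ep->owns_conn_string)        /* non-owners must not free */
        free(ep->conn_string);
    ep->conn_string = NULL;
}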

= New "preconnect all" functionaliy =

  • Should completely replace old MPI preconnect functionality.
  • Need a new PML interface function: connect_all() that will connect
    this process to all others that it knows about (i.e., all
    ompi_proc_t's that it's aware of, which takes care of the MPI-2
    dynamics cases). The main idea is to use the new active-message
    functionality to send an AM message tag to the remote PML peer.
    The message will cause a no-op function to occur on the other side,
    but it will force the connection to be made.
    • For BTL-related PMLs: do a btl_alloc() followed by a btl_send().
      Loop over the btl_send's until they all complete or fail (i.e.,
      keep checking the ones that return RESOURCE_BUSY). A sketch of
      this loop appears below.
    • For MTL-related PMLs: the function may be a no-op if there's no
      way to *guarantee* that connections are made. Or it may use
      the same general technique as the BTL-related PMLs: send an AM
      tag to its remote PML peer that causes a no-op on the remote
      side, but forces the connection to be made. The MTL may have
      specific knowledge about what needs to be done to *force* a
      connection of its lower layer.
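
A sketch of the BTL-side preconnect loop described in the first bullet,
with mocked types and return codes standing in for the real BTL API (in
real code btl_send_noop_am() would be a btl_alloc() plus a btl_send()
of an AM tag whose remote handler is a no-op):

/* Mocked sketch: send a no-op active-message tag to every peer and
 * retry any sends that report "resource busy" until every connection
 * has been forced open.  Types and constants are illustrative. */
#include <stdbool.h>
#include <stddef.h>

enum send_rc { RC_OK, RC_RESOURCE_BUSY, RC_ERROR };

struct peer;                                  /* opaque peer handle */

/* trivial mock: a real implementation would do btl_alloc + btl_send */
static enum send_rc btl_send_noop_am(struct peer *p) { (void)p; return RC_OK; }

static int preconnect_all(struct peer **peers, size_t npeers)
{
    bool pending = true;
    while (pending) {
        pending = false;
        for (size_t i = 0; i < npeers; ++i) {
            if (peers[i] == NULL) continue;             /* already connected */
            switch (btl_send_noop_am(peers[i])) {
            case RC_OK:            peers[i] = NULL; break;
            case RC_RESOURCE_BUSY: pending = true;  break;  /* retry later */
            case RC_ERROR:         return -1;
            }
        }
    }
    return 0;
}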


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-02-25 21:47:03:

In taking a first-pass at the "print the map" functionality, I'm running into two problems:

  1. OB1 doesn't seem to set procs[x]->proc_pml. I think I know where to fix this, though. But clearly it isn't used anywhere.
  2. The PML and BML do not seem to actually have a "base" endpoint_t struct. pml.h just declares the mca_pml_base_endpoint_t class and then lets each PML provide its own definition (OB1 shuffles off the definition to each BTL). Hence, there is no common endpoint struct for me to add a string field to (or traverse). I can fix this by having a tiny mca_pml_base_endpoint_t struct that each PML must then use as a super/base kind of member (in OB1's case, this means that the BTLs must use it); see the sketch below.

I wanted to run this by everyone before doing it, since it would be a bit bigger change than we thought...
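
For concreteness, a sketch of the tiny base struct proposed in item 2
(names hypothetical; the real fix would live in pml.h):

/* Hypothetical sketch of item 2: a tiny common base that every PML
 * (and, under OB1, every BTL) places as the first member of its own
 * endpoint struct, so generic code can reach the shared fields. */
typedef struct mca_pml_base_endpoint_t {
    char *conn_string;               /* filled in by the owning component */
} mca_pml_base_endpoint_t;

/* Example: a component-specific endpoint "inherits" by embedding the
 * base first, so a cast to mca_pml_base_endpoint_t * is valid C. */
typedef struct mca_btl_foo_endpoint_t {
    mca_pml_base_endpoint_t super;   /* must be the first member */
    int socket_fd;                   /* ...component-specific state... */
} mca_btl_foo_endpoint_t;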


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-03-19 16:10:25:

George and I talked about this...

  1. We decided that since there really is no "base" endpoint_t structure, it would be better to add a PML interface function that takes a proc pointer as input and returns (at least) a set of strings for the endpoints on that proc (guaranteed to be 1 or more). The detail in the string depends on the verbosity level (it's probably easiest to pass the verbosity level in to the PML function so that not every PML/BTL/MTL will need to do the MCA parameter lookup). A sketch of the shape follows this list.
  2. It *may* also be useful to return some additional information about each endpoint, such as a pointer to its component, a bool indicating whether it's connected or not, etc.
  3. This function can be used by ompi_mpi_init(); it can loop over the procs, calling the PML for each. The resulting data needs to be gatherv'ed to MPI_COMM_WORLD rank 0 and displayed (per above in this ticket).
  4. Note that this scheme will require new interface functions on the PML, BTL, and MTL.
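
A sketch of the shape such a PML interface function might take, per
item 1 (hypothetical -- this is the proposal, not an existing Open MPI
API):

/* Hypothetical: given a proc, return one descriptive string per
 * endpoint at the requested verbosity level. */
#include <stddef.h>

struct ompi_proc_t;                 /* opaque here */

typedef int (*mca_pml_base_module_endpoint_strings_fn_t)(
    struct ompi_proc_t *proc,       /* which peer to describe */
    int verbosity,                  /* 0..3, per the levels above */
    char ***strings,                /* out: 1 or more malloc'ed strings */
    size_t *num_strings);           /* out: length of that array */

ompi_mpi_init() would loop this over every known proc and gatherv the
resulting strings to MCW rank 0 for display, per item 3.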


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-03-19 16:10:47:

(In [17881]) Playground for implementing the "print the MPI connection map"
functionality.

Refs https://svn.open-mpi.org/trac/ompi/ticket/1207.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-03-19 16:16:36:

Split the "new MPI preconnect" functionality out into its own ticket: https://svn.open-mpi.org/trac/ompi/ticket/1249.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-05-29 20:12:06:

This unfortunately didn't make the cut for v1.3.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-07-24 18:59:23:

See the SVN tree source:/tmp-public/connect-map; Josh did some initial work in there.


ompiteam commented Oct 1, 2014

Trac comment by rhc on 2008-08-22 11:52:17:

Jeff asked that I add this here - it represents a request from some power-users at LANL, but I suspect others may want it too:

It seems to me that having OMPI report out (when a param is set, of course!) the
processor to which each process is bound would be a good thing. Doing that scalably is
perhaps a tad tricky - just having each process blurt it out on its stdout would be
simple, but perhaps unusable. What I am thinking is to have the paffinity action take
place a little earlier in mpi_init so that the data can be reported back as part of
some other message (avoid yet another comm), and then let mpirun output some nicely
formatted proc vs processor map.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2009-01-12 13:14:44:

This really needs to get done for v1.4. Bumping up to critical.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2009-05-07 07:44:16:

With the change in release methodology, what we used to call "v1.4" is now called "v1.5".


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2011-07-12 10:31:01:

Bumping to v1.5.5.


ompiteam commented Oct 1, 2014

Trac comment by brbarret on 2013-01-09 12:13:40:

This isn't a critical issue for 1.7.

rhc54 pushed a commit that referenced this issue Oct 31, 2014
yosefe pushed a commit to yosefe/ompi that referenced this issue Mar 5, 2015
HCOLL: Fix hcoll supported datatype checks corretcly
@jsquyres jsquyres modified the milestones: Future, Open MPI v2.0.0 Jun 25, 2015
@jsquyres

@gpaulsen @markalle We had a lengthy discussion about this connectivity map yesterday during the 2016 Feb Dallas Open MPI dev meeting, and then a further lengthy conversation about this at dinner last night. Main points:

  • Platform MPI shows several things in their -prot output:
    1. See https://gist.github.com/markalle/52fb42c701267c49b450 for an example
    2. Per-host detail of IP address and which MCW rank processes are on that host
    3. Brief NxN chart showing connectivity methods
    4. Processor affinity info
  • We agreed that -- at least under a certain size -- users like to see the NxN map. Printing on a per-server basis seems much more scalable than a per-process basis.
    • The thought is that individual components can report their "short name" (e.g., "SHM", "TCP", "IB", etc.) to be reported in the map. Maybe put a 3-letter cap on the short name, or somesuch.
  • We also like the idea of printing exceptions to that map (i.e., if one server is different than the others)
  • We also agree that this functionality must be optional. It will likely incur additional startup / shutdown cost.
  • Platform MPI's affinity info is expressed in terms of physical Linux virtual processor IDs. We'd want logical core (or hyperthread) IDs for an Open MPI display.
  • We'd also like to see, in the per-host detail:
    • A listing of all the network interfaces (perhaps only the interfaces being used...?): eth7, mlx4_0, usnic_1, ...etc.
    • A human name for the network type of each interface: Ethernet, InfiniBand, Omnipath, ...etc.
    • Relevant addressing info for each interface: e.g., for IP-based interfaces, display the IP address/netmask; for IB-based interfaces, the GID+LID. ...etc.
    • Protocol type for the connection: RC, UD, TCP, UDP, ...etc.
  • Concerns were brought up that MPI point-to-point connectivity could be different than one-sided and/or collective connectivity. An idea was floated that perhaps we could have some new infrastructure -- possibly in the MPI base? -- that allows self-reporting of connectivity.
    • E.g., a BTL can make a function call when it makes a connection to a peer, identifying (see the sketch after this list):
      • The network interface name
      • Human name for the network type
      • Short name for the connection NxN map
      • Addressing info (as a string)
      • Framework/component name that made the connection
      • Human name for the protocol used
    • In this way, the infrastructure can basically de-duplicate all these reports (e.g., if both a BTL and an OSC report a connection to a common peer), and we'll have an accurate listing of all connections from all types of MPI communications
    • To be clear, BTLs, MTLs, and possibly some of the PMLs will need to report this info. So will any OSCs that make their own connections, and any COLLs.
    • Further, this infrastructure can be invoked to obtain all the connection information so far. Two common use cases:
      1. Print this info during MPI_INIT. In this case, you almost certainly want to do a "preconnect" type of call first, to force all PML-based connections to all peers to be created.
      2. Print this info during MPI_FINALIZE. In this case, you do not need "preconnect" functionality in MPI_INIT, but rather only show what connections were actually created during the run.
    • In both cases, the info can be gathered -- either to MCW rank 0 or mpirun? -- and prettyprinted nicely.
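
A sketch of what that self-reporting infrastructure might look like:
components call report_connection() as they establish connections, and
the registry de-duplicates identical reports (e.g., a BTL and an OSC
reporting the same peer). Everything here is hypothetical:

/* Hypothetical self-reporting registry -- not existing Open MPI code. */
#include <stdio.h>
#include <string.h>

#define MAX_REPORTS 1024

struct conn_report {
    int         peer;        /* MCW rank of the peer */
    const char *iface;       /* network interface name, e.g., "eth0" */
    const char *net_name;    /* human name, e.g., "Ethernet", "InfiniBand" */
    const char *short_name;  /* for the NxN map, e.g., "TCP", "IB" */
    const char *addr;        /* addressing info as a string */
    const char *component;   /* framework/component that connected */
    const char *proto;       /* protocol, e.g., "RC", "UD", "TCP" */
};

static struct conn_report reports[MAX_REPORTS];
static int nreports;

/* Called by BTLs/MTLs/OSCs/COLLs when they connect to a peer;
 * de-duplicates on (peer, component, iface). */
void report_connection(const struct conn_report *r)
{
    for (int i = 0; i < nreports; ++i)
        if (reports[i].peer == r->peer &&
            !strcmp(reports[i].component, r->component) &&
            !strcmp(reports[i].iface, r->iface))
            return;                        /* duplicate report, ignore */
    if (nreports < MAX_REPORTS)
        reports[nreports++] = *r;
}

/* Called at MPI_INIT (after a preconnect) or at MPI_FINALIZE; the
 * results would be gathered to MCW rank 0 or mpirun for printing. */
void dump_connections(void)
{
    for (int i = 0; i < nreports; ++i)
        printf("peer %d: %s via %s/%s (%s, %s, %s)\n",
               reports[i].peer, reports[i].short_name,
               reports[i].component, reports[i].iface,
               reports[i].net_name, reports[i].addr, reports[i].proto);
}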


jsquyres commented Oct 4, 2016

IBM is taking on this feature enhancement.

@jsquyres jsquyres assigned jjhursey, gpaulsen and markalle and unassigned ompiteam Oct 4, 2016
@jsquyres jsquyres modified the milestones: v2.x, Future Oct 4, 2016
@jjhursey

See PR #2825

@hppritcha

Moving to 3.x as it probably will not get into 2.1.

@hppritcha hppritcha modified the milestones: v3.0.0, v2.x Feb 14, 2017
@hppritcha hppritcha modified the milestones: v3.1.0, v3.0.0 Mar 14, 2017
@bwbarrett

@jjhursey I'm going to punt this off any milestone, since it looks like it has died out on your side.

@bwbarrett bwbarrett removed this from the v3.1.0 milestone Mar 2, 2018

jjhursey commented Mar 3, 2018

That's fine. I'll add this to the face-to-face meeting agenda so we can see where things stand. @markalle maybe we can chat about this again sometime before the meeting.

@jjhursey

Some discussion at the March 2018 Face-to-Face meeting.

Step 1: Display basic table for pt2pt connections (output only)

  • Need to bring in the '-prot' framework
  • Need to add interface to pml/mtl/btl to get a "short name" (e.g., "yalla", "ob1") and a "long name" with information about, for example, interfaces used.
  • For the long names think about aggregation ability.

Step 2: Future

  • Add table for osc and maybe coll components
  • Another effort to select subsets of components based upon interconnect (-TCP, -UCX)

@markalle and @jjhursey will investigate a PR for Step 1.

@jjhursey jjhursey assigned jjhursey and unassigned jjhursey Mar 22, 2018
devreal added a commit to devreal/ompi that referenced this issue Sep 9, 2020
devreal pushed a commit to devreal/ompi that referenced this issue Sep 11, 2020
Add support for fallback to previous coll module on non-commutative operations (open-mpi#30)
Replace mutexes by atomic operations.
Use the correct nbc request type (open-mpi#31)
* coll/base: document type casts in ompi_coll_base_retain_*
Other minor fixes.

Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: Joseph Schuchart <[email protected]>
bosilca added a commit that referenced this issue Sep 18, 2020
- Add support for fallback to previous coll module on non-commutative operations (#30)
- Replace mutexes by atomic operations.
- Use the correct nbc request type (for both ibcast and ireduce)
  * coll/base: document type casts in ompi_coll_base_retain_*
- add module-wide topology cache
- use standard instead of synchronous send and add mca parameter to control mode of initial send in ireduce/ibcast
- reduce number of memory allocations
- call the default request completion.
  - Remove the requests from the Fortran lookup conversion tables before completing
    and free it.

Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: Joseph Schuchart <[email protected]>

Co-authored-by: Joseph Schuchart <[email protected]>
jsquyres pushed a commit that referenced this issue Sep 23, 2020
This is a meta commit, that encapsulate all the ADAPT commits in the master
into a single PR for 4.1. The master commits included here are:
fe73586, a4be3bb, d712645, c2970a3, e59bde9, ee592f3 and c98e387.

Here is a detailed list of added capabilities:
* coll/adapt: Fix naming conventions and C11 atomic use
* coll/adapt: Remove unused component field in module
* Consistent handling of zero counts in the MPI API.
* Correctly handle non-blocking collectives tags
  * As it is possible to have multiple outstanding non-blocking collectives
    provided by different collective modules, we need a consistent
    mechanism to allow them to select unique tags for each instance of a
    collective.
* Add support for fallback to previous coll module on non-commutative operations (#30)
* Replace mutexes by atomic operations.
* Use the correct nbc request type (for both ibcast and ireduce)
  * coll/base: document type casts in ompi_coll_base_retain_*
* add module-wide topology cache
* use standard instead of synchronous send and add mca parameter to control mode of initial send in ireduce/ibcast
* reduce number of memory allocations
* call the default request completion.
  * Remove the requests from the Fortran lookup conversion tables before completing
    and free it.
* piggybacking Bull functionalities

Signed-off-by: Xi Luo <[email protected]>
Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: Marc Sergent <[email protected]>
Co-authored-by: Joseph Schuchart <[email protected]>
Co-authored-by: Lemarinier, Pierre <[email protected]>
Co-authored-by: pierrele <[email protected]>
@jjhursey

This was implemented in the master and v5.0.x branches as the --mca ompi_display_comm command line option. See the PRs referencing this issue for more details.
