Show MPI connectivity map during MPI_INIT #30

Closed
ompiteam opened this issue Oct 1, 2014 · 29 comments
ompiteam commented Oct 1, 2014

It has long been discussed, and I swear there was a ticket about this
at some point but I can't find it now. So I'm filing a new one --
close this as a dupe if someone can find an older one.


OMPI currently uses a negative ACK system to indicate if high-speed
networks are not used for MPI communications. For example, if you
have the openib BTL available but it can't find any active ports in a
given MPI process, it'll display a warning message.

But some users want a *positive* acknowledgement of what networks
are being used for MPI communications (this can also help with
regression testing, per a thread on the MTT mailing list). HP MPI
offers this feature, for example. It would be nice to have a simple
MCA parameter that will cause MCW rank 0 to output a connectivity map
during MPI_INIT.

Complications:

  • In some cases, OMPI doesn't know which networks will be used for
    communications with each MPI process peer; we only know which ones
    we'll try to use when connections are actually established (per
    OMPI's lazy connection model for the OB1 PML). But I think that
    even outputting this information will be useful.
  • Connectivity between MPI processes is likely to be non-uniform.
    E.g., MCW rank 0 may use the sm btl to communicate with some MPI
    processes, but a different btl to communicate with others. This is
    almost certainly a different view than other processes have. The
    connectivity information needs to be conveyed on a process-pair
    basis (e.g., a 2D chart).
  • Since we have to span multiple PMLs, this may require an addition
    to the PML API.

A first cut could display a simple 2D chart of how OMPI thinks it may
send MPI traffic from each process to each process. Perhaps something
like (OB1 6 process job, 2 processes on each of 3 hosts):

MCW rank 0     1     2     3     4     5
0        self  sm    tcp   tcp   tcp   tcp
1        sm    self  tcp   tcp   tcp   tcp
2        tcp   tcp   self  sm    tcp   tcp
3        tcp   tcp   sm    self  tcp   tcp
4        tcp   tcp   tcp   tcp   self  sm
5        tcp   tcp   tcp   tcp   sm    self

Note that the upper and lower triangular portions of the map are the
same, but it's probably more human-readable if both are output.
However, multiple built-in output formats could be useful, such as:

  • Human readable, full map (see above)
  • Human readable, abbreviated (see below for some ideas on this)
  • Machine parsable, full map
  • Machine parsable, abbreviated
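
To make the chart idea concrete, here is a minimal sketch of the
gather-and-print pattern: each rank fills in its own row and MCW rank 0
gathers and prints the NxN map. The guess_transport() stub and the
2-processes-per-host layout are purely illustrative stand-ins for
whatever the PML/BTLs would actually report -- this is not Open MPI
internals.

/* Illustrative sketch: gather per-peer transport names to MCW rank 0
 * and print an NxN connectivity chart.  guess_transport() is a stub,
 * NOT an Open MPI API. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 8

/* Stand-in for real transport discovery: the process itself reports
 * "self", peers assumed on the same host report "sm", others "tcp". */
static void guess_transport(int me, int peer, char *out)
{
    if (peer == me)              strcpy(out, "self");
    else if (peer / 2 == me / 2) strcpy(out, "sm");  /* assumes 2 procs/host */
    else                         strcpy(out, "tcp");
}

int main(int argc, char **argv)
{
    int me, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    char *mine = malloc((size_t)np * NAME_LEN);     /* my row of the map */
    for (int p = 0; p < np; ++p)
        guess_transport(me, p, mine + p * NAME_LEN);

    char *all = NULL;
    if (me == 0) all = malloc((size_t)np * np * NAME_LEN);
    MPI_Gather(mine, np * NAME_LEN, MPI_CHAR,
               all,  np * NAME_LEN, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (me == 0) {                                  /* print the NxN chart */
        printf("MCW rank");
        for (int p = 0; p < np; ++p) printf("%6d", p);
        printf("\n");
        for (int r = 0; r < np; ++r) {
            printf("%-8d", r);
            for (int p = 0; p < np; ++p)
                printf("%6s", all + ((size_t)r * np + p) * NAME_LEN);
            printf("\n");
        }
        free(all);
    }
    free(mine);
    MPI_Finalize();
    return 0;
}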

It may also be worthwhile to investigate a few heuristics to compress
the map where possible (a sketch of the exception-based idea follows
this list). Some random ideas in this direction:

  • The above example could be represented as:
MPI connectivity map, listed by process:
X->X: self
X<->X+1, X in {0,2,4}: sm
other: tcp
  • Another example:
MPI connectivity map, listed by process:
X->X: self
other: tcp
  • Another example:
MPI connectivity map, listed by process:
all: CM PML, MX MTL
  • Perhaps something could be done with "exceptions" -- e.g., where
    the openib BTL is being used for inter-node connectivity *except*
    for one node (where IB is malfunctioning, and OMPI fell back to
    TCP) -- this is a common case that users/sysadmins want to detect.

Another useful concept might be to show some information about each
endpoint in the connectivity map. E.g., show a list of TCP endpoints
on each process, by interface name and/or IP address. Similar for
other transports. This kind of information can show when/if
multi-rail scenarios are active, etc. For example:

MCW rank 0     1     2     3     4     5
0        self      sm        tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0
1        sm        self      tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0
2        tcp:eth0  tcp:eth0  self      sm        tcp:eth0  tcp:eth0
3        tcp:eth0  tcp:eth0  sm        self      tcp:eth0  tcp:eth0
4        tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0  self      sm
5        tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0  sm        self

With more information such as interface names, compression of the
output becomes much more important, such as:

MPI connectivity map, listed by process:
X->X: self
X<->X+1, X in {0,2,4}: sm
other: tcp:eth0,eth1

Note that these ideas can certainly be implemented in stages; there's
no need to do everything at once.

@ompiteam ompiteam self-assigned this Oct 1, 2014
@ompiteam ompiteam added this to the Open MPI 1.9 milestone Oct 1, 2014

ompiteam commented Oct 1, 2014

Imported from trac issue 1207. Created by jsquyres on 2008-02-06T13:06:04, last modified: 2014-04-22T16:23:42


ompiteam commented Oct 1, 2014

Trac comment by bosilca on 2008-02-06 13:20:24:

At some point we should start thinking about how to trim down the size of the MPI shared library. While I agree that such information is useful for the user, I don't think it needs to go deeply inside the library. I see it more like an additional tool/utility bundled with Open MPI.


ompiteam commented Oct 1, 2014

Trac comment by jjhursey on 2008-02-06 15:15:15:

I agree that conceptually this could be a useful tool (or really an addition to the orte-ps/ompi-ps tool).

Actually, when I was originally designing orte-ps I talked with some people at Sun about their mpps command (http://docs.sun.com/source/819-4131-10/DisplayingJobInformation.html). I tried to model orte-ps on their mpps command. mpps has a bit more functionality than orte-ps, and some of that is due to the current limitations of tools in Open MPI.

The limitation is that tools connect through the HNP, which is an ORTE layer application, so it has no (or extremely limited) knowledge of OMPI layer constructs. So a tool is unable to access information about OMPI level collective and point-to-point constructs, for example.

In the short term, the easiest approach is to have the Rank 0 process dump this information. In the long term we may want to reconsider how tools interact with the MPI job, and think about how we can create an ompi-ps command that displays OMPI layer information.


ompiteam commented Oct 1, 2014

Trac comment by jjhursey on 2008-02-07 07:21:41:

As another idea for a compressed representation: it would be useful to display only the unique set of parameters used in the job. This is really what MTT is going to want in the short term, since capturing and querying a 2D space of connectivity information is difficult.

So the following:

MCW rank 0     1     2     3     4     5
0        self  sm    tcp   tcp   tcp   tcp
1        sm    self  tcp   tcp   tcp   tcp
2        tcp   tcp   self  sm    tcp   tcp
3        tcp   tcp   sm    self  tcp   tcp
4        tcp   tcp   tcp   tcp   self  sm
5        tcp   tcp   tcp   tcp   sm    self

Would be represented as:

MPI connectivity map (Active components):
PML: ob1
BTL: tcp,sm,self
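
A minimal sketch of that reduction -- collapsing one process's row of
the map down to the unique set of names (illustrative code, not an MTT
or Open MPI implementation):

/* Illustrative only: reduce one row of transport names to the unique
 * set, i.e., the "active components" summary shown above. */
#include <stdio.h>
#include <string.h>

#define MAX_UNIQ 16

static void print_unique(const char **row, int n)
{
    const char *uniq[MAX_UNIQ];
    int nuniq = 0;

    for (int p = 0; p < n; ++p) {          /* linear-scan de-duplication */
        int seen = 0;
        for (int u = 0; u < nuniq; ++u)
            seen |= !strcmp(row[p], uniq[u]);
        if (!seen && nuniq < MAX_UNIQ)
            uniq[nuniq++] = row[p];
    }

    printf("BTL: ");
    for (int u = 0; u < nuniq; ++u)
        printf("%s%s", uniq[u], (u + 1 < nuniq) ? "," : "\n");
}

int main(void)
{
    const char *row0[] = { "self", "sm", "tcp", "tcp", "tcp", "tcp" };
    print_unique(row0, 6);                 /* prints: BTL: self,sm,tcp */
    return 0;
}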


ompiteam commented Oct 1, 2014

Trac comment by tdd on 2008-02-07 08:08:59:

Replying to [comment:1 bosilca]:

At some point we should start thinking about how to trim down the size of the MPI shared library. While I agree that such information is useful for the user, I don't think it needs to go deeply inside the library. I see it more like an additional tool/utility bundled with Open MPI.

I think I disagree here: you really want this information coming from the actual code so one can detect issues with the actual (B/M)TL selection algorithm; a separate utility risks diverging from the actual code. I also think you lose a nice quick way for a user to confirm their run really is using the appropriate TLs without having to run 2 programs. I know that last point may seem silly.

Note, Sun's original CT base had this feature as part of an env var named MPI_SHOW_INTERFACES, which showed different amounts of information at each verbosity level, giving one anything from a broad idea of how things are connected to a detailed view of all the decisions the library considers when choosing BTLs and interfaces. This proved incredibly helpful in debugging complicated customer networks.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-02-07 08:24:12:

Replying to [comment:4 tdd]:

Replying to [comment:1 bosilca]:

At some point we should start thinking about how to trim down the size of the MPI shared library. While I agree that such information is useful for the user, I don't think it needs to go deeply inside the library. I see it more like an additional tool/utility bundled with Open MPI.

I think I disagree here: you really want this information coming from the actual code so one can detect issues with the actual (B/M)TL selection algorithm

...I think George is talking about a different issue (just overall reducing the size of the MPI library). I agree that this is a good thing to do, and perhaps we can modularize features like this (e.g., make the display map functionality a DSO plugin that is loaded on demand), but I think that is outside the scope of this ticket. Please create a new ticket for that kind of functionality; the display map functionality can easily be fit into a plugin framework someday if desired.

Thanks.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-02-07 08:25:11:

Replying to [comment:4 tdd]:

Note, Sun's original CT base had this feature as part of an env var named MPI_SHOW_INTERFACES, which showed different amounts of information at each verbosity level, giving one anything from a broad idea of how things are connected to a detailed view of all the decisions the library considers when choosing BTLs and interfaces. This proved incredibly helpful in debugging complicated customer networks.

Terry: can you include some samples of what the output looked like from when users invoked MPI_SHOW_INTERFACES?


ompiteam commented Oct 1, 2014

Trac comment by tdd on 2008-02-07 09:04:38:

Replying to [comment:6 jsquyres]:

Replying to [comment:4 tdd]:

Note, Sun's original CT base had this feature as part of an env var named MPI_SHOW_INTERFACES, which showed different amounts of information at each verbosity level, giving one anything from a broad idea of how things are connected to a detailed view of all the decisions the library considers when choosing BTLs and interfaces. This proved incredibly helpful in debugging complicated customer networks.

Terry: can you include some samples of what the output looked like from when users invoked MPI_SHOW_INTERFACES?

OK, first here is a description of the option, which doesn't completely jibe with what you are proposing:

MPI_SHOW_INTERFACES

    When set to 1, 2 or 3, information regarding which interfaces are being used by an MPI application prints to stdout. Set MPI_SHOW_INTERFACES to 1 to print the selected internode interface. Set it to 2 to print all the interfaces and their rankings. Set it to 3 for verbose output. The default value, 0, does not print information to stdout.

The following are some examples of its usage:

burl-ct-v40z-0 129 =>setenv MPI_SHOW_INTERFACES 1
burl-ct-v40z-0 130 =>mprun -np 4 -Ns -W initme.6

(j34, r2): using "shm" PM from burl-ct-v40z-1 to burl-ct-v40z-1 
(j34, r2): using "tcp" PM from burl-ct-v40z-1 to burl-ct-v40z-0 
(j34, r0): using "tcp" PM from burl-ct-v40z-1 to burl-ct-v40z-0 
(j34, r0): using "shm" PM from burl-ct-v40z-1 to burl-ct-v40z-1 
(j34, r3): using "tcp" PM from burl-ct-v40z-0 to burl-ct-v40z-1 
(j34, r1): using "tcp" PM from burl-ct-v40z-0 to burl-ct-v40z-1 
(j34, r3): using "shm" PM from burl-ct-v40z-0 to burl-ct-v40z-0 
(j34, r1): using "shm" PM from burl-ct-v40z-0 to burl-ct-v40z-0

burl-ct-v40z-0 131 =>setenv MPI_SHOW_INTERFACES 2 
burl-ct-v40z-0 132 =>mprun -np 4 -Ns -W initme.6

(TCP j43, burl-ct-v40z-0, r3): using interface bge1 (IP=10.8.31.85) to burl-ct-v40z-1
(TCP j43, burl-ct-v40z-0, r1): using interface bge1 (IP=10.8.31.85) to burl-ct-v40z-1
(TCP j43, burl-ct-v40z-1, r2): using interface bge1 (IP=10.8.31.83) to burl-ct-v40z-0
(TCP j43, burl-ct-v40z-1, r0): using interface bge1 (IP=10.8.31.83) to burl-ct-v40z-0

burl-ct-v40z-0 135 =>setenv MPI_SHOW_INTERFACES 3
burl-ct-v40z-0 136 =>mprun -np 2 -Ns initme.6

(tcp j64, burl-ct-v40z-1, r0): interface 0 "lo0" netrank=230 (IP=127.0.0.1) 
(tcp j64, burl-ct-v40z-0, r0): interface 0 "lo0" netrank=230 (IP=127.0.0.1) 
(tcp j64, burl-ct-v40z-0, r0): interface 1 "bge1" netrank=47 (IP=10.8.31.83) 
(tcp j64, burl-ct-v40z-0, r0): interface 2 "ibd0" netrank=1002 (IP=192.168.1.100) 
(tcp j64, burl-ct-v40z-0, r0): interface 3 "default" netrank=1003 (IP=10.8.31.83) 
(tcp j64, burl-ct-v40z-1, r0): interface 1 "bge1" netrank=47 (IP=10.8.31.85) 
(tcp j64, burl-ct-v40z-1, r0): interface 2 "ibd0" netrank=1002 (IP=192.168.1.101)
(tcp j64, burl-ct-v40z-1, r0): interface 3 "default" netrank=1003 (IP=10.8.31.85)
(TCP j64, burl-ct-v40z-1, r0): using interface bge1 (IP=10.8.31.83) to burl-ct-v40z-0
(tcp j64, burl-ct-v40z-0, r1): interface 0 "lo0" netrank=230 (IP=127.0.0.1) 
(tcp j64, burl-ct-v40z-1, r1): interface 0 "lo0" netrank=230 (IP=127.0.0.1) 
(tcp j64, burl-ct-v40z-1, r1): interface 1 "bge1" netrank=47 (IP=10.8.31.85) 
(tcp j64, burl-ct-v40z-1, r1): interface 2 "ibd0" netrank=1002 (IP=192.168.1.101)
(tcp j64, burl-ct-v40z-1, r1): interface 3 "default" netrank=1003 (IP=10.8.31.85) 
(tcp j64, burl-ct-v40z-0, r1): interface 1 "bge1" netrank=47 (IP=10.8.31.83) 
(tcp j64, burl-ct-v40z-0, r1): interface 2 "ibd0" netrank=1002 (IP=192.168.1.100) 
(tcp j64, burl-ct-v40z-0, r1): interface 3 "default" netrank=1003 (IP=10.8.31.83) 
(TCP j64, burl-ct-v40z-0, r1): using interface bge1 (IP=10.8.31.85) to burl-ct-v40z-1 
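
For reference, the gating behind an env var like this is typically just
an integer threshold checked at each reporting site. A minimal
illustrative sketch (not Sun CT's actual implementation):

/* Illustrative only -- not Sun CT's code.  MPI_SHOW_INTERFACES-style
 * gating: an integer level read from the environment, checked at each
 * reporting site. */
#include <stdio.h>
#include <stdlib.h>

static int show_level(void)
{
    const char *v = getenv("MPI_SHOW_INTERFACES");
    return v ? atoi(v) : 0;        /* default 0: print nothing */
}

int main(void)
{
    int lvl = show_level();

    if (lvl >= 1)   /* level 1: the selected internode interface */
        printf("using \"tcp\" PM from hostA to hostB\n");
    if (lvl >= 2)   /* level 2: all interfaces and their rankings */
        printf("interface \"bge1\" netrank=47 (IP=10.8.31.85)\n");
    if (lvl >= 3)   /* level 3: verbose trace of selection decisions */
        printf("considering interface \"lo0\" netrank=230 ...\n");
    return 0;
}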


ompiteam commented Oct 1, 2014

Trac comment by jjhursey on 2008-02-07 10:28:30:

(In [17398]) A quick try at ticket refs https://svn.open-mpi.org/trac/ompi/ticket/1207.

Here we are processing the BML structure attached to ompi_proc_t well after
add_procs has been called.

Currently only Rank 0 displays data, and makes no attempt to gather information
from other ranks. I still need to add the MCA parameters to enable/disable this
feature along with a bunch of other stuff.

Examples from this commit on 2 nodes of IU's Odin Machine:

shell$ mpirun -np 6 -mca btl tcp,sm,self hello
[odin001.cs.indiana.edu:28548] Connected to Process 0 on odin001 via: self
[odin001.cs.indiana.edu:28548] Connected to Process 1 on odin001 via: sm
[odin001.cs.indiana.edu:28548] Connected to Process 2 on odin001 via: sm
[odin001.cs.indiana.edu:28548] Connected to Process 3 on odin001 via: sm
[odin001.cs.indiana.edu:28548] Connected to Process 4 on odin002 via: tcp
[odin001.cs.indiana.edu:28548] Connected to Process 4 on odin002 via: tcp
[odin001.cs.indiana.edu:28548] Connected to Process 5 on odin002 via: tcp
[odin001.cs.indiana.edu:28548] Connected to Process 5 on odin002 via: tcp
[odin001.cs.indiana.edu:28548] Unique connection types: self,sm,tcp
(Hello World) I am 0 of 6 running on odin001.cs.indiana.edu (PID 28548)
(Hello World) I am 1 of 6 running on odin001.cs.indiana.edu (PID 28549)
(Hello World) I am 2 of 6 running on odin001.cs.indiana.edu (PID 28550)
(Hello World) I am 3 of 6 running on odin001.cs.indiana.edu (PID 28551)
(Hello World) I am 4 of 6 running on odin002.cs.indiana.edu (PID 7809)
(Hello World) I am 5 of 6 running on odin002.cs.indiana.edu (PID 7810)

In this example you can see that we have 2 tcp connections to odin002 for each
process, since Odin has 2 tcp interfaces to each machine.

shell$ mpirun -np 6 -mca btl tcp,sm,openib,self hello
[odin001.cs.indiana.edu:28566] Connected to Process 0 on odin001 via: self
[odin001.cs.indiana.edu:28566] Connected to Process 1 on odin001 via: sm
[odin001.cs.indiana.edu:28566] Connected to Process 2 on odin001 via: sm
[odin001.cs.indiana.edu:28566] Connected to Process 3 on odin001 via: sm
[odin001.cs.indiana.edu:28566] Connected to Process 4 on odin002 via: openib
[odin001.cs.indiana.edu:28566] Connected to Process 5 on odin002 via: openib
[odin001.cs.indiana.edu:28566] Unique connection types: self,sm,openib
(Hello World) I am 0 of 6 running on odin001.cs.indiana.edu (PID 28566)
(Hello World) I am 1 of 6 running on odin001.cs.indiana.edu (PID 28567)
(Hello World) I am 2 of 6 running on odin001.cs.indiana.edu (PID 28568)
(Hello World) I am 3 of 6 running on odin001.cs.indiana.edu (PID 28569)
(Hello World) I am 4 of 6 running on odin002.cs.indiana.edu (PID 7820)
(Hello World) I am 5 of 6 running on odin002.cs.indiana.edu (PID 7821)

The above also occurs when passing no mca arguments. But here you can see that
tcp is not being used due to exclusivity rules in Open MPI. So even though
we specified -mca btl tcp,sm,openib,self only self,sm,openib are
being used.
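
The traversal in that commit has roughly the following shape. All
structure and field names below are mocked for illustration -- the real
ompi_proc_t/BML/BTL headers differ:

/* Mocked sketch of walking the BML endpoint attached to each proc
 * after add_procs, printing every BTL that may carry traffic to it.
 * Only MCW rank 0 would call this. */
#include <stdio.h>

struct mock_btl      { const char *component_name; };
struct mock_bml_btl  { struct mock_btl *btl; };
struct mock_endpoint { struct mock_bml_btl *btls; int num_btls; };
struct mock_proc     { const char *hostname; struct mock_endpoint *bml; };

static void print_connections(struct mock_proc *procs, int nprocs)
{
    for (int i = 0; i < nprocs; ++i)
        for (int b = 0; b < procs[i].bml->num_btls; ++b)
            printf("Connected to Process %d on %s via: %s\n",
                   i, procs[i].hostname,
                   procs[i].bml->btls[b].btl->component_name);
}

int main(void)
{
    struct mock_btl self = {"self"}, sm = {"sm"}, tcp = {"tcp"};
    struct mock_bml_btl b0[] = {{&self}}, b1[] = {{&sm}}, b2[] = {{&tcp}, {&tcp}};
    struct mock_endpoint e0 = {b0, 1}, e1 = {b1, 1}, e2 = {b2, 2};
    struct mock_proc procs[] = {
        {"odin001", &e0}, {"odin001", &e1}, {"odin002", &e2},
    };
    print_connections(procs, 3);   /* peer 2 shows two tcp paths */
    return 0;
}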


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-02-09 08:07:37:

We had a long discussion about this on the phone (Terry, George,
Jeff). The general conclusion is that this feature comes down to two
items: printing the connectivity map and a new/better "MPI preconnect all".

= Print Connectivity Map =

  • Does not require a PML/BTL/MTL interface change -- we can just put
    a new string (char*) on the endpoint base structure.
    • The endpoint constructor will default it to NULL
    • Still need to figure out who frees the string -- it's not
      necessarily the endpoint (see below for scalability issues).
      Should there be a flag on the endpoint indicating whether the
      endpoint frees it or not? (A sketch of this idea appears after
      this list.)
    • BTLs/MTLs can fill in a string value or leave it NULL; if they
      fill it in, they can fill it differently depending on the
      verbosity level (see below).
    • The print_the_map() functionality will simply traverse all the
      endpoints hanging off each ompi_proc_t and use that to print the
      strings.
    • Need to implement a gather-like operation where MCW rank 0 will
      gather all the connectivity information from all other processes
      and print out some kind of map.
  • How accurate the print_the_map() information is depends on whether
    MPI preconnect has been invoked or not.
    • If a preconnect has not been invoked, the information is a "best
      guess" (i.e., components that successfully opened/initialized and
      passed around info in the modex, and then were subject to
      first-cut elimination such as BTL exclusivity). But it may not
      represent the *final* list of components that will be used for
      connectivity. However, in many (most?) scenarios, this is likely
      a "good enough" estimation (e.g., homogeneous clusters with one
      high speed network).
    • If a preconnect *has* been invoked, then the information should
      be accurate because any endpoints that will not be used will have
      been trimmed from the ompi_proc_t entries.
    • Keep the preconnect functionality separate from the "print the
      map" functionality; users can choose what level of accuracy they
      want separate from what level of print "verbosity" they want (see
      below).
  • Have an MCA param indicating the verbosity level, perhaps something
    like Sun CT6's:
    • 0 = print nothing.
    • 1 = print local interface info (e.g., "mthca0:0", "eth1", etc.).
      This can be scalable if MTLs/BTLs are careful (e.g., put same
      pointer value on each endpoint; don't dup the string for each
      endpoint). However, the issue of "who frees the string?" comes
      up -- if the *same* pointer is on every endpoint, then somehow
      we have to know to free it only once (see above).
    • 2 = print local+remote interface info (e.g.,
      "mthca0:0:[guid]->[guid]",
      "eth1:192.168.0.1:1234->eth0:192.168.0.2:5678"). Note that this
      is a memory hog (and potentially unscalable) because each
      endpoint's string is unique!
    • 3 = print out both ends of a connection every time a connection
      is made (i.e., not necessarily print out a map during MPI_INIT,
      but print the information asynchronously as it happens). Haven't
      quite figured out how this one will work yet (MTLs and BTLs might
      have to do this themselves?); perhaps this will be a later
      feature.
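
A minimal sketch of the string-ownership idea from the bullets above,
assuming a hypothetical common endpoint base (in reality the endpoint
structures are per-component; see later comments in this ticket):

/* Hypothetical sketch of "who frees the string?": many endpoints may
 * share one conn_string pointer, so exactly one of them is marked as
 * the owner and only the owner frees it. */
#include <stdbool.h>
#include <stdlib.h>

struct endpoint_base {
    char *conn_string;       /* may be the same pointer on many endpoints */
    bool  owns_conn_string;  /* true on exactly one endpoint per string */
};

static void endpoint_destruct(struct endpoint_base *ep)
{
    if (ep->owns_conn_string)        /* non-owners must not free */
        free(ep->conn_string);
    ep->conn_string = NULL;
}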

= New "preconnect all" functionaliy =

  • Should completely replace old MPI preconnect functionality.
  • Need a new PML interface function: connect_all() that will connect
    this process to all others that it knows about (i.e., all
    ompi_proc_t's that it's aware of, which takes care of the MPI-2
    dynamics cases). The main idea is to use the new active-message
    functionality to send an AM message tag to the remote PML peer.
    The message will cause a no-op function to occur on the other side,
    but it will force the connection to be made.
    • For BTL-related PMLs: do a btl_alloc() followed by a btl_send().
      Loop over the btl_send's until they all complete or fail (i.e.,
      keep checking the ones that return RESOURCE_BUSY). A sketch of
      this loop appears below.
    • For MTL-related PMLs: the function may be a no-op if there's no
      way to *guarantee* that connections are made. Or it may use
      the same general technique as the BTL-related PMLs: send an AM
      tag to its remote PML peer that causes a no-op on the remote
      side, but forces the connection to be made. The MTL may have
      specific knowledge about what needs to be done to *force* a
      connection of its lower layer.
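
A sketch of the BTL-side preconnect loop described in the first bullet,
with mocked types and return codes standing in for the real BTL API (in
real code btl_send_noop_am() would be a btl_alloc() plus a btl_send()
of an AM tag whose remote handler is a no-op):

/* Mocked sketch: send a no-op active-message tag to every peer and
 * retry any sends that report "resource busy" until every connection
 * has been forced open.  Types and constants are illustrative. */
#include <stdbool.h>
#include <stddef.h>

enum send_rc { RC_OK, RC_RESOURCE_BUSY, RC_ERROR };

struct peer;                                  /* opaque peer handle */

/* trivial mock: a real implementation would do btl_alloc + btl_send */
static enum send_rc btl_send_noop_am(struct peer *p) { (void)p; return RC_OK; }

static int preconnect_all(struct peer **peers, size_t npeers)
{
    bool pending = true;
    while (pending) {
        pending = false;
        for (size_t i = 0; i < npeers; ++i) {
            if (peers[i] == NULL) continue;             /* already connected */
            switch (btl_send_noop_am(peers[i])) {
            case RC_OK:            peers[i] = NULL; break;
            case RC_RESOURCE_BUSY: pending = true;  break;  /* retry later */
            case RC_ERROR:         return -1;
            }
        }
    }
    return 0;
}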


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-02-25 21:47:03:

In taking a first-pass at the "print the map" functionality, I'm running into two problems:

  1. OB1 doesn't seem to set procs[x]->proc_pml. I think I know where to fix this, though. But clearly it isn't used anywhere.
  2. The PML and BML do not seem to actually have a "base" endpoint_t struct. pml.h just declares the mca_pml_base_endpoint_t class and then lets each PML provide its own definition (OB1 shuffles off the definition to each BTL). Hence, there is no common endpoint struct for me to add a string field to (or traverse). I can fix this by having a tiny mca_pml_base_endpoint_t struct that each PML must then use as a super/base kind of member (in OB1's case, this means that the BTLs must use it); see the sketch below.

I wanted to run this by everyone before doing it, since it would be a bit bigger change than we thought...
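
For concreteness, a sketch of the tiny base struct proposed in item 2
(names hypothetical; the real fix would live in pml.h):

/* Hypothetical sketch of item 2: a tiny common base that every PML
 * (and, under OB1, every BTL) places as the first member of its own
 * endpoint struct, so generic code can reach the shared fields. */
typedef struct mca_pml_base_endpoint_t {
    char *conn_string;               /* filled in by the owning component */
} mca_pml_base_endpoint_t;

/* Example: a component-specific endpoint "inherits" by embedding the
 * base first, so a cast to mca_pml_base_endpoint_t * is valid C. */
typedef struct mca_btl_foo_endpoint_t {
    mca_pml_base_endpoint_t super;   /* must be the first member */
    int socket_fd;                   /* ...component-specific state... */
} mca_btl_foo_endpoint_t;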


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-03-19 16:10:25:

George and I talked about this...

  1. We decided that since there really is no "base" endpoint_t structure, it would be better to add a PML interface function that takes a proc pointer as input and returns (at least) a set of strings for the endpoints on that proc (guaranteed to be 1 or more). The detail in the string depends on the verbosity level (it's probably easiest to pass the verbosity level in to the PML function so that not every PML/BTL/MTL will need to do the MCA parameter lookup). A sketch of the shape follows this list.
  2. It *may* also be useful to return some additional information about each endpoint, such as a pointer to its component, a bool indicating whether it's connected or not, etc.
  3. This function can be used by ompi_mpi_init(); it can loop over the procs, calling the PML for each. The resulting data needs to be gatherv'ed to MPI_COMM_WORLD rank 0 and displayed (per above in this ticket).
  4. Note that this scheme will require new interface functions on the PML, BTL, and MTL.
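
A sketch of the shape such a PML interface function might take, per
item 1 (hypothetical -- this is the proposal, not an existing Open MPI
API):

/* Hypothetical: given a proc, return one descriptive string per
 * endpoint at the requested verbosity level. */
#include <stddef.h>

struct ompi_proc_t;                 /* opaque here */

typedef int (*mca_pml_base_module_endpoint_strings_fn_t)(
    struct ompi_proc_t *proc,       /* which peer to describe */
    int verbosity,                  /* 0..3, per the levels above */
    char ***strings,                /* out: 1 or more malloc'ed strings */
    size_t *num_strings);           /* out: length of that array */

ompi_mpi_init() would loop this over every known proc and gatherv the
resulting strings to MCW rank 0 for display, per item 3.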


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-03-19 16:10:47:

(In [17881]) Playground for implementing the "print the MPI connection map"
functionality.

Refs https://svn.open-mpi.org/trac/ompi/ticket/1207.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-03-19 16:16:36:

Split the "new MPI preconnect" functionality out into its own ticket: https://svn.open-mpi.org/trac/ompi/ticket/1249.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-05-29 20:12:06:

This unfortunately didn't make the cut for v1.3.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2008-07-24 18:59:23:

See the SVN tree source:/tmp-public/connect-map; Josh did some initial work in there.


ompiteam commented Oct 1, 2014

Trac comment by rhc on 2008-08-22 11:52:17:

Jeff asked that I add this here - it represents a request from some power-users at LANL, but I suspect others may want it too:

It seems to me that having OMPI report out (when a param is set, of course!) the
processor to which each process is bound would be a good thing. Doing that scalably is
perhaps a tad tricky - just having each process blurt it out on its stdout would be
simple, but perhaps unusable. What I am thinking is to have the paffinity action take
place a little earlier in mpi_init so that the data can be reported back as part of
some other message (avoid yet another comm), and then let mpirun output some nicely
formatted proc vs processor map.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2009-01-12 13:14:44:

This really needs to get done for v1.4. Bumping up to critical.


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2009-05-07 07:44:16:

With the change in release methodology, what we used to call "v1.4" is now called "v1.5".


ompiteam commented Oct 1, 2014

Trac comment by jsquyres on 2011-07-12 10:31:01:

Bumping to v1.5.5.


ompiteam commented Oct 1, 2014

Trac comment by brbarret on 2013-01-09 12:13:40:

This isn't a critical issue for 1.7.

rhc54 pushed a commit that referenced this issue Oct 31, 2014
yosefe pushed a commit to yosefe/ompi that referenced this issue Mar 5, 2015
HCOLL: Fix hcoll supported datatype checks corretcly
@jsquyres jsquyres modified the milestones: Future, Open MPI v2.0.0 Jun 25, 2015
@jsquyres

@gpaulsen @markalle We had a lengthy discussion about this connectivity map yesterday during the 2016 Feb Dallas Open MPI dev meeting, and then a further lengthy conversation about this at dinner last night. Main points:

  • Platform MPI shows several things in their -prot output:
    1. See https://gist.github.com/markalle/52fb42c701267c49b450 for an example
    2. Per-host detail of IP address and which MCW rank processes are on that host
    3. Brief NxN chart showing connectivity methods
    4. Processor affinity info
  • We agreed that -- at least under a certain size -- users like to see the NxN map. Printing on a per-server basis seems much more scalable than a per-process basis.
    • The thought is that individual components can report their "short name" (e.g., "SHM", "TCP", "IB", etc.) to be reported in the map. Maybe put a 3-letter cap on the short name, or somesuch.
  • We also like the idea of printing exceptions to that map (i.e., if one server is different than the others)
  • We also agree that this functionality must be optional. It will likely incur additional startup / shutdown cost.
  • Platform MPI's affinity info is expressed in terms of physical Linux virtual processor IDs. We'd want logical core (or hyperthread) IDs for an Open MPI display.
  • We'd also like to see, in the per-host detail:
    • A listing of all the network interfaces (perhaps only the interfaces being used...?): eth7, mlx4_0, usnic_1, ...etc.
    • A human name for the network type of each interface: Ethernet, InfiniBand, Omnipath, ...etc.
    • Relevant addressing info for each interface: e.g., for IP-based interfaces, display the IP address/netmask; for IB-based interfaces, the GID+LID. ...etc.
    • Protocol type for the connection: RC, UD, TCP, UDP, ...etc.
  • Concerns were brought up that MPI point-to-point connectivity could be different than one-sided and/or collective connectivity. An idea was floated that perhaps we could have some new infrastructure -- possibly in the MPI base? -- that allows self-reporting of connectivity.
    • E.g., a BTL can make a function call when it makes a connection to a peer, identifying (see the sketch after this list):
      • The network interface name
      • Human name for the network type
      • Short name for the connection NxN map
      • Addressing info (as a string)
      • Framework/component name that made the connection
      • Human name for the protocol used
    • In this way, the infrastructure can basically de-duplicate all these reports (e.g., if both a BTL and an OSC report a connection to a common peer), and we'll have an accurate listing of all connections from all types of MPI communications
    • To be clear, BTLs, MTLs, and possibly some of the PMLs will need to report this info. So will any OSCs that make their own connections, and any COLLs.
    • Further, this infrastructure can be invoked to obtain all the connection information so far. Two common use cases:
      1. Print this info during MPI_INIT. In this case, you almost certainly want to do a "preconnect" type of call first, to force all PML-based connections to all peers to be created.
      2. Print this info during MPI_FINALIZE. In this case, you do not need "preconnect" functionality in MPI_INIT, but rather only show what connections were actually created during the run.
    • In both cases, the info can be gathered -- either to MCW rank 0 or mpirun? -- and prettyprinted nicely.
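
A sketch of what that self-reporting infrastructure might look like:
components call report_connection() as they establish connections, and
the registry de-duplicates identical reports (e.g., a BTL and an OSC
reporting the same peer). Everything here is hypothetical:

/* Hypothetical self-reporting registry -- not existing Open MPI code. */
#include <stdio.h>
#include <string.h>

#define MAX_REPORTS 1024

struct conn_report {
    int         peer;        /* MCW rank of the peer */
    const char *iface;       /* network interface name, e.g., "eth0" */
    const char *net_name;    /* human name, e.g., "Ethernet", "InfiniBand" */
    const char *short_name;  /* for the NxN map, e.g., "TCP", "IB" */
    const char *addr;        /* addressing info as a string */
    const char *component;   /* framework/component that connected */
    const char *proto;       /* protocol, e.g., "RC", "UD", "TCP" */
};

static struct conn_report reports[MAX_REPORTS];
static int nreports;

/* Called by BTLs/MTLs/OSCs/COLLs when they connect to a peer;
 * de-duplicates on (peer, component, iface). */
void report_connection(const struct conn_report *r)
{
    for (int i = 0; i < nreports; ++i)
        if (reports[i].peer == r->peer &&
            !strcmp(reports[i].component, r->component) &&
            !strcmp(reports[i].iface, r->iface))
            return;                        /* duplicate report, ignore */
    if (nreports < MAX_REPORTS)
        reports[nreports++] = *r;
}

/* Called at MPI_INIT (after a preconnect) or at MPI_FINALIZE; the
 * results would be gathered to MCW rank 0 or mpirun for printing. */
void dump_connections(void)
{
    for (int i = 0; i < nreports; ++i)
        printf("peer %d: %s via %s/%s (%s, %s, %s)\n",
               reports[i].peer, reports[i].short_name,
               reports[i].component, reports[i].iface,
               reports[i].net_name, reports[i].addr, reports[i].proto);
}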


jsquyres commented Oct 4, 2016

IBM is taking on this feature enhancement.

@jsquyres jsquyres assigned jjhursey, gpaulsen and markalle and unassigned ompiteam Oct 4, 2016
@jsquyres jsquyres modified the milestones: v2.x, Future Oct 4, 2016
@jjhursey

See PR #2825

@hppritcha

Moving to 3.x as it probably will not get into 2.1.

@hppritcha hppritcha modified the milestones: v3.0.0, v2.x Feb 14, 2017
@hppritcha hppritcha modified the milestones: v3.1.0, v3.0.0 Mar 14, 2017
@bwbarrett

@jjhursey I'm going to punt this off any milestone, since it looks like it has died out on your side.

@bwbarrett bwbarrett removed this from the v3.1.0 milestone Mar 2, 2018

jjhursey commented Mar 3, 2018

That's fine. I'll add this to the face-to-face meeting agenda so we can see where things stand. @markalle maybe we can chat about this again sometime before the meeting.

@jjhursey

Some discussion at the March 2018 Face-to-Face meeting.

Step 1: Display basic table for pt2pt connections (output only)

  • Need to bring in the '-prot' framework
  • Need to add interface to pml/mtl/btl to get a "short name" (e.g., "yalla", "ob1") and a "long name" with information about, for example, interfaces used.
  • For the long names think about aggregation ability.

Step 2: Future

  • Add table for osc and maybe coll components
  • Another effort to select subsets of components based upon interconnect (-TCP, -UCX)

@markalle and @jjhursey will investigate a PR for Step 1.

@jjhursey jjhursey assigned jjhursey and unassigned jjhursey Mar 22, 2018
devreal added a commit to devreal/ompi that referenced this issue Sep 9, 2020
devreal pushed a commit to devreal/ompi that referenced this issue Sep 11, 2020
Add support for fallback to previous coll module on non-commutative operations (open-mpi#30)
Replace mutexes by atomic operations.
Use the correct nbc request type (open-mpi#31)
* coll/base: document type casts in ompi_coll_base_retain_*
Other minor fixes.

Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: Joseph Schuchart <[email protected]>
bosilca added a commit that referenced this issue Sep 18, 2020
- Add support for fallback to previous coll module on non-commutative operations (#30)
- Replace mutexes by atomic operations.
- Use the correct nbc request type (for both ibcast and ireduce)
  * coll/base: document type casts in ompi_coll_base_retain_*
- add module-wide topology cache
- use standard instead of synchronous send and add mca parameter to control mode of initial send in ireduce/ibcast
- reduce number of memory allocations
- call the default request completion.
  - Remove the requests from the Fortran lookup conversion tables before completing
    and free it.

Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: Joseph Schuchart <[email protected]>

Co-authored-by: Joseph Schuchart <[email protected]>
jsquyres pushed a commit that referenced this issue Sep 23, 2020
This is a meta commit, that encapsulate all the ADAPT commits in the master
into a single PR for 4.1. The master commits included here are:
fe73586, a4be3bb, d712645, c2970a3, e59bde9, ee592f3 and c98e387.

Here is a detailed list of added capabilities:
* coll/adapt: Fix naming conventions and C11 atomic use
* coll/adapt: Remove unused component field in module
* Consistent handling of zero counts in the MPI API.
* Correctly handle non-blocking collectives tags
  * As it is possible to have multiple outstanding non-blocking collectives
    provided by different collective modules, we need a consistent
    mechanism to allow them to select unique tags for each instance of a
    collective.
* Add support for fallback to previous coll module on non-commutative operations (#30)
* Replace mutexes by atomic operations.
* Use the correct nbc request type (for both ibcast and ireduce)
  * coll/base: document type casts in ompi_coll_base_retain_*
* add module-wide topology cache
* use standard instead of synchronous send and add mca parameter to control mode of initial send in ireduce/ibcast
* reduce number of memory allocations
* call the default request completion.
  * Remove the requests from the Fortran lookup conversion tables before completing
    and free it.
* piggybacking Bull functionalities

Signed-off-by: Xi Luo <[email protected]>
Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: Marc Sergent <[email protected]>
Co-authored-by: Joseph Schuchart <[email protected]>
Co-authored-by: Lemarinier, Pierre <[email protected]>
Co-authored-by: pierrele <[email protected]>
@jjhursey

This was implemented in the master and v5.0.x branches as the --mca ompi_display_comm command line option. See the PRs referencing this issue for more details.
