including openib in btl list causes horrible vader or sm performance. #1252
@hjelmn - Jeff Squyres said I should tag you on this, that you might have some insight.
Yeah. This happens because we are polling the completion queue. I have been planning to work on this when I get a chance.
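For context, a verbs-based BTL progress function ends up draining its completion queues with ibv_poll_cq() on every progress call, even when nothing is in flight. A rough sketch of that polling pattern (not the actual openib code; the batch size and error handling are simplified):

```c
#include <infiniband/verbs.h>

/* Sketch of a progress-style CQ drain: every progress call pays for at
 * least one ibv_poll_cq(), even when the queue is empty.  "cq" is an
 * already-created completion queue; 16 is an arbitrary batch size. */
static int poll_cq_once(struct ibv_cq *cq)
{
    struct ibv_wc wc[16];
    int completed = 0, n;

    while ((n = ibv_poll_cq(cq, 16, wc)) > 0) {
        for (int i = 0; i < n; ++i) {
            if (IBV_WC_SUCCESS != wc[i].status) {
                return -1;          /* real code reports the failed QP */
            }
            ++completed;            /* completion handled elsewhere */
        }
    }
    return (n < 0) ? -1 : completed;
}
```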
@gpaulsen, which milestone would you like to give this? If it involves a lot of changes I'd prefer it to be in 2.1.0 or maybe 2.0.1? (I'll add the 2.0.1 milestone in a bit)
I was hoping for 2.0 (I didn't see a 2.0 label), as this makes runs with openib pretty much useless.
Hey @hjelmn, can we talk about this tomorrow on the call? I'm wondering:
Discussion on the call... This seems to be because of how we now add all endpoints for all BTLs (because of the new dynamic add_procs behavior). There's a secondary implication: one-sided atomics. It's unfortunate that CPU atomics != network hardware atomics, so we have to go 100% one way or the other. The current implementation uses the network stack atomics (for the same reason as above: a dynamic process may come along and therefore require the use of network atomics). @hjelmn is going to look at a slightly different approach: have components that are not actually progressing anything get polled less often.
We discussed in today's call that this is a complex issue with the new dynamic add_procs, and that we must include network endpoints for two reasons. One is that if we use any network atomics, all atomics must be network atomics. The other is to support spawning a new job on another node. Nathan's proposed solution is to add a decay function to the progression loop, so that any components that are not actually progressing anything won't get called as often.
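For illustration, a hedged sketch of what such a decay could look like (hypothetical structures and names, not the code that was eventually written): a callback that keeps reporting zero events is called exponentially less often, and any real work resets it.

```c
#include <stddef.h>

/* Hypothetical decaying progress loop.  Initialize each entry with
 * interval = 1, countdown = 0 so it starts out polled every iteration. */
typedef int (*progress_cb_t)(void);

struct decayed_cb {
    progress_cb_t cb;
    unsigned interval;    /* call once every "interval" iterations */
    unsigned countdown;   /* iterations left until the next call */
};

#define MAX_INTERVAL 1024u

static int run_progress(struct decayed_cb *cbs, size_t ncbs)
{
    int total = 0;

    for (size_t i = 0; i < ncbs; ++i) {
        if (cbs[i].countdown > 0) {
            --cbs[i].countdown;            /* still backing off */
            continue;
        }
        int events = cbs[i].cb();
        if (events > 0) {
            cbs[i].interval = 1;           /* active: poll every iteration */
        } else if (cbs[i].interval < MAX_INTERVAL) {
            cbs[i].interval *= 2;          /* idle: back off exponentially */
        }
        cbs[i].countdown = cbs[i].interval - 1;
        total += events;
    }
    return total;
}
```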
Talking specifically about IB, we should take advantage of the fact that it is a connection-based network and only register the progress callback when there are established connections. More generally, a more reasonable approach would be to delay the progress registration until there is something to progress (this is under the assumption that connection establishment is handled by a separate thread).
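A minimal sketch of that connection-gated registration, assuming the opal_progress_register()/opal_progress_unregister() interface; the counter and callback names are made up for illustration, and the real BTL would track per-endpoint state rather than one global counter.

```c
#include <stdbool.h>
#include "opal/runtime/opal_progress.h"

static int active_connections = 0;
static bool progress_registered = false;

static int example_btl_progress(void)
{
    return 0;   /* the real callback would drain the CQs here */
}

/* Register the progress callback only once the first connection is up. */
static void on_connection_established(void)
{
    if (0 == active_connections++ && !progress_registered) {
        opal_progress_register(example_btl_progress);
        progress_registered = true;
    }
}

/* Unregister it again when the last connection goes away. */
static void on_connection_closed(void)
{
    if (0 == --active_connections && progress_registered) {
        opal_progress_unregister(example_btl_progress);
        progress_registered = false;
    }
}
```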
Agreed: that, too.
I should add that this was built multi-threaded, for x86 with the GNU compilers.
I'm ready to retest. Please comment in this issue with which PR to try.
@hjelmn do you have an ETA for the backoff fix?
Should have it ready to test later today.
Hmm, I see why I am not getting the same level of slowdown as you. The problem is less the progress function (which adds ~100 ns) and more the asynchronous progress thread (connections, errors, etc.). It should be completely quiet in this case, but something is clearly causing the thread to wake up. @gpaulsen Could you run the reproducer with a debug build and the -mca btl_base_verbose 100 option and send me the output?
I just pinged @gpaulsen -- he's going to check this out today.
Good. I am sure a BTL log will show what is causing the async thread to wake up.
I emailed output directly to Nathan.
FWIW, for large text outputs, we typically use the gist.github.com service and then post the link here.
I've been trying to instrument the code to provide additional output, but when I do, I see nothing new in the output. I'll have some more time after 3pm Central to work on this again today.
Just finished a WebEx with Jeff, and we put some more opal_outputs in and around the openib component. It looks like openib IS calling its progress thread, even on a single host. The bad performance only reproduces without --enable-debug.
I don't see any calls to btl_openib_async_device or udcm_cq_event_dispatch. I DO see calls to btl_openib_component_open and udcm_module_init (which returns success). And when I'm running, I see many, many calls to the progress function, btl_openib_component_progress(). Jeff thinks this might be evidence for your initial thoughts, @hjelmn. Could you please take another look?
@hjelmn, any luck? Do you want to do a shared screen with me?
I still think this is unrelated to the progress function since --bind-to none helps as well. I will look deeper and see what is different between the optimized and debug builds that could be having an impact.
Can you put an opal_output in the loop in progress_engine in opal/runtime/opal_progress_threads.c? That should be waking up almost never during steady state.
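For example, something along these lines (a sketch modeled loosely on that function; the event-base plumbing is simplified and "ev_active"/"ev_base" stand in for the fields the real code reads from its tracker structure):

```c
#include <stdbool.h>
#include "opal/util/output.h"
#include "opal/mca/event/event.h"

static volatile bool ev_active = true;
static opal_event_base_t *ev_base;      /* assumed to be created elsewhere */

/* Progress-thread body with a wakeup counter reported via opal_output(). */
static void *instrumented_progress_engine(void *arg)
{
    unsigned long wakeups = 0;

    while (ev_active) {
        opal_output(0, "progress_engine: wakeup #%lu", ++wakeups);
        opal_event_loop(ev_base, OPAL_EVLOOP_ONCE);
    }
    return NULL;
}
```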
That is correct. I've verified that each rank is only calling progress_engine() (in opal/runtime/opal_progress_threads.c) once, but we're still getting many, many calls to the openib progress function.
@gpaulsen Strange. I will have to do some more digging to figure out what might be going on. I measured the effect of the openib BTL's progress engine on a shared-memory ping-pong and it is at best 50-100 ns. I tested this by just setting btl_progress to NULL in the BTL and comparing to the normal latency.
@gpaulsen Can you try removing the openib progress function (i.e., maybe just hack up the code to not register the openib progress function) and see what happens? I know @hjelmn says it only takes something like 40 ns, but there are also caching effects, and vader/sm are sooo sensitive to that kind of stuff. If it's easy, it's worth trying.
Well, we can throw out the udcm cause. We did some more digging and it is indeed the openib progress function. The problem is that whatever is causing the slowdown for shared memory is also causing a slowdown for internode ping-pong. The slowdown only happens for processes on the second socket of either node. @gpaulsen is going to experiment with different OFED/MOFED versions to see if the slowdown is an OFED bug.
As such, this may not be a blocker for 2.0.0. We certainly can add code to reduce the number of calls to the openib progress function when the BTL is not in active use, or we could choose not to poll the openib completion queues unless there are connections, but this would fix the symptom, not the cause.
Sounds like this could use some testing on other people's IB hardware -- do the same thing as @gpaulsen (including an MPI process that is NUMA-far from the HCA) and see if they get the same kind of bad performance.
Is this Sandy Bridge? I vaguely recollect adding a stall loop before calling ibv_poll_cq() for processes on the second socket to work around a hardware issue.
Yes! This IS a Sandy Bridge. Intel Xeon E5-2660.
So, I talked with @nysal. He remembers finding this problem internally at IBM with another MPI (PE MPI), and putting in a spin delay before ibv_poll_cq() when running on a non-first socket on Sandy Bridge with certain OFED versions. That's all he remembers.
@hjelmn, the two "good" changes to openib that we made yesterday: do you want to get those upstream, or would you like me to? That might be good practice for me getting stuff upstream, unless you wanted to go and look at other stuff too. Also, I am not as confident that it's only ibv_poll_cq's fault for the horrible performance. I really thought that I NULLed out the progression pointer as stated above. I'm going to try to reproduce that today.
Okay, this brings back memories. On Cray, for processes on the remote socket from the Aries/HCA, in Cray MPICH we had to add a backoff for functions we knew polled cache lines that were also potentially written to by the I/O device. It was a bug in the Sandy Bridge northbridge. Cray MPICH only turned on the backoff if the rank was on the remote socket and the CPU was Sandy Bridge. The problem was fixed in Ivy Bridge, as I recall.
OK, it sounds like we can fix this in a similar way in Open MPI. We can easily detect both Sandy Bridge and a remote socket. I will take a crack at writing this later today.
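For reference, a rough sketch of those two checks (the CPUID model numbers and the hwloc-based locality test are my assumptions, not necessarily what the fix will use):

```c
#include <stdbool.h>
#include <cpuid.h>     /* GCC/clang __get_cpuid() */
#include <hwloc.h>

/* Return true on Intel family 6, models 0x2A/0x2D (client / E5 "EP"
 * Sandy Bridge).  The model list is an assumption for this sketch. */
static bool cpu_is_sandy_bridge(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    unsigned family = (eax >> 8) & 0xf;
    unsigned model  = ((eax >> 4) & 0xf) | ((eax >> 12) & 0xf0);
    return 6 == family && (0x2a == model || 0x2d == model);
}

/* Return true if this process is bound outside the NUMA node the HCA
 * hangs off of.  "topo" must have been loaded with I/O devices and
 * "hca_pcidev" is the hwloc PCI object for the HCA. */
static bool process_is_remote_from_hca(hwloc_topology_t topo,
                                       hwloc_obj_t hca_pcidev)
{
    hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
    bool remote = false;

    if (0 == hwloc_get_cpubind(topo, cpuset, HWLOC_CPUBIND_PROCESS)) {
        hwloc_obj_t local = hwloc_get_non_io_ancestor_obj(topo, hca_pcidev);
        remote = (NULL != local) &&
                 !hwloc_bitmap_isincluded(cpuset, local->cpuset);
    }
    hwloc_bitmap_free(cpuset);
    return remote;
}
```

If both tests fire, the openib progress path could insert a short stall (as Cray MPICH apparently did) before touching the completion queue.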
Looks like we have a Sandy Bridge system with QLogic. Let me see if I can get the same slowdown with this system.
No luck getting a similar slowdown with libibverbs 1.0.8 from RHEL 6.7 with QLogic. Will have to test the workaround on the system @gpaulsen is running on.
Update from discussion on the Feb 9, 2016 WebEx: it may not be the well-known Sandy Bridge bug -- it seems like the latency added is too high (the Sandy Bridge bug only added hundreds of nanoseconds, not multiple microseconds). @gpaulsen is going to test with MXM/RC to see if he can duplicate the issue -- if it's a hardware/driver issue, it should show up with MXM, too. That being said, we still want the progressive backoff progress functionality, but probably not for v2.0.0. A good target would likely be v2.1.x.
@jsquyres multiple microseconds, not milliseconds.
@hjelmn Thanks -- I edited/fixed the comment.
I THINK it's showing up with MXM also. I can reproduce with just MXM across 2 nodes (fast on the first socket and slow on the 2nd), or MXM on the same node. The thing is, I'm getting nasty messages saying that MXM was unable to be opened, so now I don't know WHAT is happening.
And then if I run the same command with --cpu_set 1,4 (both on the first socket), I see that they're bound correctly, each to a different core on the first socket, and I get good latency. BUT I still see this "Could not find component" error message. Is my command correct? Is it using MXM intra-node? Or is it falling back to TCP or SHMEM or something else?
ANOTHER thing... I do NOT see this behavior with Platform MPI using VERBS RC, but it's not calling ibv_poll_cq():
When I try Platform MPI and turn on SRQ mode (which I think calls ibv_poll_cq), I do notice the change, but it's not this horrible performance. Perhaps the SRQ mode introduces enough of a delay that we're not hitting this issue? I'm checking to see if we call ibv_poll_cq in this mode.
Geoff, looks like you haven't built your OMPI with MXM support.
Your experiment 'With MXM' is not valid. It's falling back onto the BTL.
Ah, yes. Drat. Thanks. ompi_info | grep -i mxm shows nothing...
I updated to the latest master, and can no longer reproduce.
I reran with the Open MPI 2.0 branch, and the latency hit is about 500 ns to RDMA via openib to the far socket on Haswell.
Nice. Exactly what I would expect.
Okay, great, good to know. I have the additional item to test on 1.10 to see if it was a regression, but we're pretty sure it's not a regression. Assuming it's not, we'll close this.
Geoff, did you mean Sandy Bridge?
Oh, right, sorry. Sandy Bridge: Intel(R) Xeon(R) CPU E5-2660
OK. This is expected on SB.
So what's the outcome here? I thought we'd decided that this could be closed in discussions last week?
Yes, I think so too. I committed to testing with 1.10 to prove it's not a regression, but a regression seems very unlikely, and I'm swamped at the moment.
On the master branch I observe strange behavior. I think that openib may be using too large a hammer for NUMA memory binding, possibly setting the wrong memory-binding policy for the vader and sm shared memory segments. I've only come to this conclusion empirically, based on performance numbers.
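One way to sanity-check that hypothesis would be to dump the process-wide memory-binding policy from inside a rank, e.g. before and after MPI_Init(). A hedged hwloc sketch (function placement and output format are arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

/* Print the process-wide memory-binding policy, e.g. before and after
 * MPI_Init(), to see whether any component changed it behind our back. */
static void print_membind(const char *label)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    hwloc_membind_policy_t policy;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    if (0 == hwloc_get_membind(topo, set, &policy, HWLOC_MEMBIND_PROCESS)) {
        char *str = NULL;
        hwloc_bitmap_asprintf(&str, set);
        printf("%s: membind policy=%d set=%s\n", label, (int)policy, str);
        free(str);
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
}
```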
For example, I have a RHEL 6.5 node with a single Mellanox Technologies MT25204 [InfiniHost III Lx HCA] ConnectX-3 card with a single port active.
Bad latency run on a single host:
```
$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl openib,vader,self ./ping_pong_ring.x2
[mpi03:12941] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:12941] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:12941] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:12941] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ... 0 bytes: 7.11 usec/msg
[1:mpi03] ping-pong 0 bytes ... 0 bytes: 7.10 usec/msg
[2:mpi03] ping-pong 0 bytes ... 0 bytes: 7.15 usec/msg
[3:mpi03] ping-pong 0 bytes ... 0 bytes: 7.17 usec/msg
```
Similar behavior with sm:
```
$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl openib,sm,self ./ping_pong_ring.x2
[mpi03:14928] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:14928] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:14928] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:14928] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ... 0 bytes: 7.45 usec/msg
[1:mpi03] ping-pong 0 bytes ... 0 bytes: 7.38 usec/msg
[2:mpi03] ping-pong 0 bytes ... 0 bytes: 7.35 usec/msg
[3:mpi03] ping-pong 0 bytes ... 0 bytes: 7.38 usec/msg
```
When I remove openib, the results look much better:
```
$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl vader,self ./ping_pong_ring.x2
[mpi03:15819] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:15819] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:15819] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:15819] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ... 0 bytes: 0.50 usec/msg
[1:mpi03] ping-pong 0 bytes ... 0 bytes: 0.50 usec/msg
[2:mpi03] ping-pong 0 bytes ... 0 bytes: 0.49 usec/msg
[3:mpi03] ping-pong 0 bytes ... 0 bytes: 0.51 usec/msg
```
Similar behavior with sm (though it's half as fast as vader):
```
$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl sm,self ./ping_pong_ring.x2
[mpi03:16608] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:16608] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:16608] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:16608] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ... 0 bytes: 0.98 usec/msg
[1:mpi03] ping-pong 0 bytes ... 0 bytes: 1.00 usec/msg
[2:mpi03] ping-pong 0 bytes ... 0 bytes: 0.95 usec/msg
[3:mpi03] ping-pong 0 bytes ... 0 bytes: 0.93 usec/msg
```
If I disable binding explicitly with --bind-to none, even when specifying openib I see the expected results (with either vader or sm, but now sm is the same speed as vader... weird):
```
$ mpirun -host "mpi03" -np 4 --bind-to none --report-bindings --mca btl openib,vader,self ./ping_pong_ring.x2
[mpi03:20206] MCW rank 1 is not bound (or bound to all available processors)
[mpi03:20205] MCW rank 0 is not bound (or bound to all available processors)
[mpi03:20207] MCW rank 2 is not bound (or bound to all available processors)
[mpi03:20208] MCW rank 3 is not bound (or bound to all available processors)
[0:mpi03] ping-pong 0 bytes ... 0 bytes: 0.50 usec/msg
[1:mpi03] ping-pong 0 bytes ... 0 bytes: 0.50 usec/msg
[2:mpi03] ping-pong 0 bytes ... 0 bytes: 0.50 usec/msg
[3:mpi03] ping-pong 0 bytes ... 0 bytes: 0.49 usec/msg
```
```
$ mpirun -host "mpi03" -np 4 --bind-to none --report-bindings --mca btl openib,sm,self ./ping_pong_ring.x2
[mpi03:21058] MCW rank 0 is not bound (or bound to all available processors)
[mpi03:21059] MCW rank 1 is not bound (or bound to all available processors)
[mpi03:21060] MCW rank 2 is not bound (or bound to all available processors)
[mpi03:21061] MCW rank 3 is not bound (or bound to all available processors)
[0:mpi03] ping-pong 0 bytes ... 0 bytes: 0.50 usec/msg
[1:mpi03] ping-pong 0 bytes ... 0 bytes: 0.51 usec/msg
[2:mpi03] ping-pong 0 bytes ... 0 bytes: 0.51 usec/msg
[3:mpi03] ping-pong 0 bytes ... 0 bytes: 0.49 usec/msg
```
Finally, just for completeness... the best 0-byte ping-pong ring times I could get were with --bind-to core --map-by core.
I've attached my source for ping_pong_ring.c:
ping_pong_ring.txt
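For readers who can't open the attachment, a minimal 0-byte ping-pong benchmark along these lines (a sketch with simplified pairwise exchange, not the attached file) shows the same kind of numbers:

```c
/* Minimal 0-byte pairwise ping-pong (assumes an even number of ranks):
 * even ranks ping-pong with the next higher rank; everyone reports
 * usec/msg.  A sketch for reproduction only, not the attached source. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 100000;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size % 2) {
        if (0 == rank) fprintf(stderr, "run with an even number of ranks\n");
        MPI_Finalize();
        return 1;
    }

    int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank % 2 == 0) {
            MPI_Send(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        }
    }
    double usec = (MPI_Wtime() - t0) * 1e6 / (2.0 * iters);
    printf("[%d] ping-pong 0 bytes: %.2f usec/msg\n", rank, usec);

    MPI_Finalize();
    return 0;
}
```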