including openib in btl list causes horrible vader or sm performance. #1252
@hjelmn - Jeff Squyres said I should tag you on this, that you might have some insight.
Yeah. This happens because we are polling the completion queue. I have been planning to work on this when I get a chance.
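For context, a verbs-based BTL progress function ends up draining its completion queues with ibv_poll_cq() on every progress call, even when nothing is in flight. A rough sketch of that polling pattern (not the actual openib code; the batch size and error handling are simplified):

```c
#include <infiniband/verbs.h>

/* Sketch of a progress-style CQ drain: every progress call pays for at
 * least one ibv_poll_cq(), even when the queue is empty.  "cq" is an
 * already-created completion queue; 16 is an arbitrary batch size. */
static int poll_cq_once(struct ibv_cq *cq)
{
    struct ibv_wc wc[16];
    int completed = 0, n;

    while ((n = ibv_poll_cq(cq, 16, wc)) > 0) {
        for (int i = 0; i < n; ++i) {
            if (IBV_WC_SUCCESS != wc[i].status) {
                return -1;          /* real code reports the failed QP */
            }
            ++completed;            /* completion handled elsewhere */
        }
    }
    return (n < 0) ? -1 : completed;
}
```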
@gpaulsen, which milestone would you like to give this? If it involves a lot of changes I'd prefer it to be in 2.1.0 or maybe 2.0.1? (I'll add the 2.0.1 milestone in a bit)
I was hoping for 2.0 (I didn't see a 2.0 label), as this makes runs with openib pretty much useless.
Hey @hjelmn, can we talk about this tomorrow on the call? I'm wondering:
Discussion on the call... This seems to be because of how we now add all endpoints for all BTLs (because of the new dynamic add_procs behavior). There's a secondary implication: one-sided atomics. It's unfortunate that CPU atomics != network hardware atomics, so we have to go 100% one way or the other. The current implementation uses the network stack atomics (for the same reason as above: a dynamic process may come along and therefore require the use of network atomics). @hjelmn is going to look at a slightly different approach: have components that are not actually progressing anything get polled less often.
We discussed in today's call that this is a complex issue with the new dynamic add_procs, and that we must include network endpoints for two reasons. One is that if we use any network atomics, all atomics must be network atomics. The other is to support spawning a new job on another node. Nathan's proposed solution is to add a decay function to the progression loop, so that any components that are not actually progressing anything won't get called as often.
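For illustration, a hedged sketch of what such a decay could look like (hypothetical structures and names, not the code that was eventually written): a callback that keeps reporting zero events is called exponentially less often, and any real work resets it.

```c
#include <stddef.h>

/* Hypothetical decaying progress loop.  Initialize each entry with
 * interval = 1, countdown = 0 so it starts out polled every iteration. */
typedef int (*progress_cb_t)(void);

struct decayed_cb {
    progress_cb_t cb;
    unsigned interval;    /* call once every "interval" iterations */
    unsigned countdown;   /* iterations left until the next call */
};

#define MAX_INTERVAL 1024u

static int run_progress(struct decayed_cb *cbs, size_t ncbs)
{
    int total = 0;

    for (size_t i = 0; i < ncbs; ++i) {
        if (cbs[i].countdown > 0) {
            --cbs[i].countdown;            /* still backing off */
            continue;
        }
        int events = cbs[i].cb();
        if (events > 0) {
            cbs[i].interval = 1;           /* active: poll every iteration */
        } else if (cbs[i].interval < MAX_INTERVAL) {
            cbs[i].interval *= 2;          /* idle: back off exponentially */
        }
        cbs[i].countdown = cbs[i].interval - 1;
        total += events;
    }
    return total;
}
```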
Talking specifically about IB, we should take advantage of the fact that it is a connection-based network and only register the progress callback when there are established connections. More generally, a more reasonable approach would be to delay the progress registration until there is something to progress (this is under the assumption that connection establishment is handled by a separate thread).
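A minimal sketch of that connection-gated registration, assuming the opal_progress_register()/opal_progress_unregister() interface; the counter and callback names are made up for illustration, and the real BTL would track per-endpoint state rather than one global counter.

```c
#include <stdbool.h>
#include "opal/runtime/opal_progress.h"

static int active_connections = 0;
static bool progress_registered = false;

static int example_btl_progress(void)
{
    return 0;   /* the real callback would drain the CQs here */
}

/* Register the progress callback only once the first connection is up. */
static void on_connection_established(void)
{
    if (0 == active_connections++ && !progress_registered) {
        opal_progress_register(example_btl_progress);
        progress_registered = true;
    }
}

/* Unregister it again when the last connection goes away. */
static void on_connection_closed(void)
{
    if (0 == --active_connections && progress_registered) {
        opal_progress_unregister(example_btl_progress);
        progress_registered = false;
    }
}
```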
Agreed: that, too.
I should add that this was built multi-threaded, for x86 with the GNU compilers.
I'm ready to retest. Please comment in this issue with which PR to try.
@hjelmn do you have an ETA for the backoff fix?
Should have it ready to test later today.
Hmm, I see why I am not getting the same level of slowdown as you. The problem is less the progress function (which adds ~100 ns) and more the asynchronous progress thread (connections, errors, etc.). It should be completely quiet in this case, but something is clearly causing the thread to wake up. @gpaulsen Could you run the reproducer with a debug build and the -mca btl_base_verbose 100 option and send me the output?
I just pinged @gpaulsen -- he's going to check this out today.
Good. I am sure a BTL log will show what is causing the async thread to wake up.
I emailed output directly to Nathan.
FWIW, for large text outputs, we typically use the gist.github.com service and then post the link here.
I've been trying to instrument the code to provide additional output, but when I do, I see nothing new in the output. I'll have some more time after 3pm Central to work on this again today.
Just finished a WebEx with Jeff, and we put some more opal_outputs in and around the openib component. It looks like openib IS calling its progress thread, even on a single host. The bad performance only reproduces without --enable-debug.
I don't see any calls to btl_openib_async_device or udcm_cq_event_dispatch. I DO see calls to btl_openib_component_open and udcm_module_init (which returns success). And when I'm running, I see many, many calls to the progress function, btl_openib_component_progress(). Jeff thinks this might be evidence for your initial thoughts, @hjelmn. Could you please take another look?
@hjelmn, any luck? Do you want to do a shared screen with me?
I still think this is unrelated to the progress function since --bind-to none helps as well. I will look deeper and see what is different between the optimized and debug builds that could be having an impact.
Can you put an opal_output in the loop in progress_engine in opal/runtime/opal_progress_threads.c? That should be waking up almost never during steady state.
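For example, something along these lines (a sketch modeled loosely on that function; the event-base plumbing is simplified and "ev_active"/"ev_base" stand in for the fields the real code reads from its tracker structure):

```c
#include <stdbool.h>
#include "opal/util/output.h"
#include "opal/mca/event/event.h"

static volatile bool ev_active = true;
static opal_event_base_t *ev_base;      /* assumed to be created elsewhere */

/* Progress-thread body with a wakeup counter reported via opal_output(). */
static void *instrumented_progress_engine(void *arg)
{
    unsigned long wakeups = 0;

    while (ev_active) {
        opal_output(0, "progress_engine: wakeup #%lu", ++wakeups);
        opal_event_loop(ev_base, OPAL_EVLOOP_ONCE);
    }
    return NULL;
}
```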
That is correct. I've verified that each rank is only calling progress_engine() (in opal/runtime/opal_progress_threads.c) once, but we're still getting many, many calls to the openib progress function.
@gpaulsen Strange. I will have to do some more digging to figure out what might be going on. I measured the effect of the openib BTL's progress engine on a shared-memory ping-pong and it is at best 50-100 ns. I tested this by just setting btl_progress to NULL in the BTL and comparing to the normal latency.
@gpaulsen Can you try removing the openib progress function (i.e., maybe just hack up the code to not register the openib progress function) and see what happens? I know @hjelmn says it only takes something like 40 ns, but there are also caching effects, and vader/sm are sooo sensitive to that kind of stuff. If it's easy, it's worth trying.
Well, we can throw out the udcm cause. We did some more digging and it is indeed the openib progress function. The problem is that whatever is causing the slowdown for shared memory is also causing a slowdown for internode ping-pong. The slowdown only happens for processes on the second socket of either node. @gpaulsen is going to experiment with different OFED/MOFED versions to see if the slowdown is an OFED bug.
As such, this may not be a blocker for 2.0.0. We certainly can add code to reduce the number of calls to the openib progress function when the BTL is not in active use, or we could choose not to poll the openib completion queues unless there are connections, but this would fix the symptom, not the cause.
Sounds like this could use some testing on other people's IB hardware -- do the same thing as @gpaulsen (including an MPI process that is NUMA-far from the HCA) and see if they get the same kind of bad performance.
Is this Sandy Bridge? I vaguely recollect adding a stall loop before calling ibv_poll_cq() for processes on the second socket to work around a hardware issue.
Yes! This IS a Sandy Bridge. Intel Xeon E5-2660.
So, I talked with @nysal. He remembers finding this problem internally at IBM with another MPI (PE MPI), and putting in a spin delay before ibv_poll_cq() when running on a non-first socket on Sandy Bridge with certain OFED versions. That's all he remembers.
@hjelmn, the two "good" changes to openib that we made yesterday: do you want to get those upstream, or would you like me to? That might be good practice for me getting stuff upstream, unless you wanted to go and look at other stuff too. Also, I am not as confident that it's only ibv_poll_cq's fault for the horrible performance. I really thought that I NULLed out the progression pointer as stated above. I'm going to try to reproduce that today.
Okay, this brings back memories. On Cray, for processes on the remote socket from the Aries/HCA, in Cray MPICH we had to add a backoff for functions we knew polled cache lines that were also potentially written to by the I/O device. It was a bug in the Sandy Bridge northbridge. Cray MPICH only turned on the backoff if the rank was on the remote socket and the CPU was Sandy Bridge. The problem was fixed in Ivy Bridge, as I recall.
OK, it sounds like we can fix this in a similar way in Open MPI. We can easily detect both Sandy Bridge and a remote socket. I will take a crack at writing this later today.
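For reference, a rough sketch of those two checks (the CPUID model numbers and the hwloc-based locality test are my assumptions, not necessarily what the fix will use):

```c
#include <stdbool.h>
#include <cpuid.h>     /* GCC/clang __get_cpuid() */
#include <hwloc.h>

/* Return true on Intel family 6, models 0x2A/0x2D (client / E5 "EP"
 * Sandy Bridge).  The model list is an assumption for this sketch. */
static bool cpu_is_sandy_bridge(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;
    }
    unsigned family = (eax >> 8) & 0xf;
    unsigned model  = ((eax >> 4) & 0xf) | ((eax >> 12) & 0xf0);
    return 6 == family && (0x2a == model || 0x2d == model);
}

/* Return true if this process is bound outside the NUMA node the HCA
 * hangs off of.  "topo" must have been loaded with I/O devices and
 * "hca_pcidev" is the hwloc PCI object for the HCA. */
static bool process_is_remote_from_hca(hwloc_topology_t topo,
                                       hwloc_obj_t hca_pcidev)
{
    hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
    bool remote = false;

    if (0 == hwloc_get_cpubind(topo, cpuset, HWLOC_CPUBIND_PROCESS)) {
        hwloc_obj_t local = hwloc_get_non_io_ancestor_obj(topo, hca_pcidev);
        remote = (NULL != local) &&
                 !hwloc_bitmap_isincluded(cpuset, local->cpuset);
    }
    hwloc_bitmap_free(cpuset);
    return remote;
}
```

If both tests fire, the openib progress path could insert a short stall (as Cray MPICH apparently did) before touching the completion queue.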
Looks like we have a Sandy Bridge system with QLogic. Let me see if I can get the same slowdown with this system.
No luck getting a similar slowdown with libibverbs 1.0.8 from RHEL 6.7 with QLogic. Will have to test the workaround on the system @gpaulsen is running on.
Update from discussion on the Feb 9, 2016 WebEx: it may not be the well-known Sandy Bridge bug -- it seems like the latency added is too high (the Sandy Bridge bug only added hundreds of nanoseconds, not multiple microseconds). @gpaulsen is going to test with MXM/RC to see if he can duplicate the issue -- if it's a hardware/driver issue, it should show up with MXM, too. That being said, we still want the progressive backoff progress functionality, but probably not for v2.0.0. A good target would likely be v2.1.x.
@jsquyres multiple microseconds, not milliseconds.
@hjelmn Thanks -- I edited/fixed the comment.
I THINK it's showing up with MXM also. I can reproduce with just MXM across 2 nodes (fast on the first socket and slow on the 2nd), or MXM on the same node. The thing is, I'm getting nasty messages saying that MXM was unable to be opened, so now I don't know WHAT is happening.
And then if I run the same command with --cpu_set 1,4 (both on the first socket), I see that they're bound correctly, each to a different core on the first socket, and I get good latency. BUT I still see this "Could not find component" error message. Is my command correct? Is it using MXM intra-node? Or is it falling back to TCP or SHMEM or something else?
ANOTHER thing... I do NOT see this behavior with Platform MPI using VERBS RC, but it's not calling ibv_poll_cq():
When I try Platform MPI and turn on SRQ mode (which I think calls ibv_poll_cq), I do notice the change, but it's not this horrible performance. Perhaps the SRQ mode introduces enough of a delay that we're not hitting this issue? I'm checking to see if we call ibv_poll_cq in this mode.
Geoff, looks like you haven't built your OMPI with MXM support.
Your experiment 'With MXM' is not valid. It's falling back onto the BTL.
Ah, yes. Drat. Thanks. ompi_info | grep -i mxm shows nothing...
I updated to the latest master, and can no longer reproduce.
I reran with the Open MPI 2.0 branch, and the latency hit is about 500 ns to RDMA via openib to the far socket on Haswell.
Nice. Exactly what I would expect.
Okay, great, good to know. I have the additional item to test on 1.10 to see if it was a regression, but we're pretty sure it's not a regression. Assuming it's not, we'll close this.
Geoff, did you mean Sandy Bridge?
Oh, right, sorry. Sandy Bridge: Intel(R) Xeon(R) CPU E5-2660
OK. This is expected on SB.
So what's the outcome here? I thought we'd decided that this could be closed in discussions last week?
Yes, I think so too. I committed to testing with 1.10 to prove it's not a regression, but a regression seems very unlikely, and I'm swamped at the moment.
On the master branch I observe strange behavior. I think that openib may be using too large a hammer for NUMA memory binding, possibly setting the wrong memory-binding policy for the vader and sm shared memory segments. I've only come to this conclusion empirically, based on performance numbers.
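One way to sanity-check that hypothesis would be to dump the process-wide memory-binding policy from inside a rank, e.g. before and after MPI_Init(). A hedged hwloc sketch (function placement and output format are arbitrary):

```c
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

/* Print the process-wide memory-binding policy, e.g. before and after
 * MPI_Init(), to see whether any component changed it behind our back. */
static void print_membind(const char *label)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    hwloc_membind_policy_t policy;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    if (0 == hwloc_get_membind(topo, set, &policy, HWLOC_MEMBIND_PROCESS)) {
        char *str = NULL;
        hwloc_bitmap_asprintf(&str, set);
        printf("%s: membind policy=%d set=%s\n", label, (int)policy, str);
        free(str);
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
}
```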
For example, I have a RHEL 6.5 node with a single Mellanox Technologies MT25204 [InfiniHost III Lx HCA] ConnectX-3 card with a single port active.
Bad latency run on a single host:
```
$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl openib,vader,self ./ping_pong_ring.x2
[mpi03:12941] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:12941] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:12941] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:12941] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ... 0 bytes: 7.11 usec/msg
[1:mpi03] ping-pong 0 bytes ... 0 bytes: 7.10 usec/msg
[2:mpi03] ping-pong 0 bytes ... 0 bytes: 7.15 usec/msg
[3:mpi03] ping-pong 0 bytes ... 0 bytes: 7.17 usec/msg
```
Similar behavior with sm:
```
$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl openib,sm,self ./ping_pong_ring.x2
[mpi03:14928] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:14928] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:14928] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:14928] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ... 0 bytes: 7.45 usec/msg
[1:mpi03] ping-pong 0 bytes ... 0 bytes: 7.38 usec/msg
[2:mpi03] ping-pong 0 bytes ... 0 bytes: 7.35 usec/msg
[3:mpi03] ping-pong 0 bytes ... 0 bytes: 7.38 usec/msg
```
When I remove openib, the results look much better:
```
$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl vader,self ./ping_pong_ring.x2
[mpi03:15819] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:15819] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:15819] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:15819] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ... 0 bytes: 0.50 usec/msg
[1:mpi03] ping-pong 0 bytes ... 0 bytes: 0.50 usec/msg
[2:mpi03] ping-pong 0 bytes ... 0 bytes: 0.49 usec/msg
[3:mpi03] ping-pong 0 bytes ... 0 bytes: 0.51 usec/msg
```
Similar behavior with sm (though it's half as fast as vader):
```
$ mpirun -host "mpi03" -np 4 --bind-to core --report-bindings --mca btl sm,self ./ping_pong_ring.x2
[mpi03:16608] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
[mpi03:16608] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
[mpi03:16608] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
[mpi03:16608] MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
[0:mpi03] ping-pong 0 bytes ... 0 bytes: 0.98 usec/msg
[1:mpi03] ping-pong 0 bytes ... 0 bytes: 1.00 usec/msg
[2:mpi03] ping-pong 0 bytes ... 0 bytes: 0.95 usec/msg
[3:mpi03] ping-pong 0 bytes ... 0 bytes: 0.93 usec/msg
```
If I disable binding explicitly with --bind-to none, even when specifying openib I see the expected results (with either vader or sm, but now sm is the same speed as vader... weird):
```
$ mpirun -host "mpi03" -np 4 --bind-to none --report-bindings --mca btl openib,vader,self ./ping_pong_ring.x2
[mpi03:20206] MCW rank 1 is not bound (or bound to all available processors)
[mpi03:20205] MCW rank 0 is not bound (or bound to all available processors)
[mpi03:20207] MCW rank 2 is not bound (or bound to all available processors)
[mpi03:20208] MCW rank 3 is not bound (or bound to all available processors)
[0:mpi03] ping-pong 0 bytes ... 0 bytes: 0.50 usec/msg
[1:mpi03] ping-pong 0 bytes ... 0 bytes: 0.50 usec/msg
[2:mpi03] ping-pong 0 bytes ... 0 bytes: 0.50 usec/msg
[3:mpi03] ping-pong 0 bytes ... 0 bytes: 0.49 usec/msg
```
```
$ mpirun -host "mpi03" -np 4 --bind-to none --report-bindings --mca btl openib,sm,self ./ping_pong_ring.x2
[mpi03:21058] MCW rank 0 is not bound (or bound to all available processors)
[mpi03:21059] MCW rank 1 is not bound (or bound to all available processors)
[mpi03:21060] MCW rank 2 is not bound (or bound to all available processors)
[mpi03:21061] MCW rank 3 is not bound (or bound to all available processors)
[0:mpi03] ping-pong 0 bytes ... 0 bytes: 0.50 usec/msg
[1:mpi03] ping-pong 0 bytes ... 0 bytes: 0.51 usec/msg
[2:mpi03] ping-pong 0 bytes ... 0 bytes: 0.51 usec/msg
[3:mpi03] ping-pong 0 bytes ... 0 bytes: 0.49 usec/msg
```
Finally, just for completeness... the best 0-byte ping-pong ring times I could get were with --bind-to core --map-by core.
I've attached my source for ping_pong_ring.c:
ping_pong_ring.txt
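For readers who can't open the attachment, a minimal 0-byte ping-pong benchmark along these lines (a sketch with simplified pairwise exchange, not the attached file) shows the same kind of numbers:

```c
/* Minimal 0-byte pairwise ping-pong (assumes an even number of ranks):
 * even ranks ping-pong with the next higher rank; everyone reports
 * usec/msg.  A sketch for reproduction only, not the attached source. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 100000;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size % 2) {
        if (0 == rank) fprintf(stderr, "run with an even number of ranks\n");
        MPI_Finalize();
        return 1;
    }

    int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank % 2 == 0) {
            MPI_Send(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        }
    }
    double usec = (MPI_Wtime() - t0) * 1e6 / (2.0 * iters);
    printf("[%d] ping-pong 0 bytes: %.2f usec/msg\n", rank, usec);

    MPI_Finalize();
    return 0;
}
```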