Shared Memory Performance - missing wakeup support #5322
@quasiben can you please check the main CPU-consuming functions in each case with the "perf top" tool?
It seems that the primary issue in the results of the description was that both processes were not pinned to CPUs. On a DGX-1:
Looking at top, it seems that if the processes are unpinned they will always fall on different physical CPUs, so there will not be much gain. On an Intel NUC with a single processor (2 cores, 4 threads) the results were a bit more homogeneous, but we saw bandwidth from ~6.5 GB/s up to ~8.0 GB/s, depending on where you pin the processes.
Thank you for updating @pentschev. @yosefe, are these results expected? My hope was that adding `sm` would show a clearer benefit.
When the processes are pinned on different NUMA nodes, sm is capped at ~3 GB/s due to single-core memory bandwidth.
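For reference, a minimal Python sketch of the pinning idea (Linux-only; it reads the sysfs NUMA topology and pins the current process to a core on node 0, with the peer process expected to be pinned to another core on the same node; the sysfs paths and the choice of node 0 are assumptions for illustration, not part of the original experiments):

```python
import glob
import os

def numa_node_cpus():
    """Map NUMA node id -> list of CPU ids, read from Linux sysfs."""
    nodes = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node*/cpulist")):
        node_id = int(path.split("node")[-1].split("/")[0])
        cpus = []
        with open(path) as f:
            for chunk in f.read().strip().split(","):
                if "-" in chunk:
                    lo, hi = chunk.split("-")
                    cpus.extend(range(int(lo), int(hi) + 1))
                else:
                    cpus.append(int(chunk))
        nodes[node_id] = cpus
    return nodes

# Pin this process to the first core of NUMA node 0; pinning the peer process
# to another core on the *same* node avoids the ~3 GB/s cross-NUMA cap.
cpus = numa_node_cpus()[0]
os.sched_setaffinity(0, {cpus[0]})
```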
The command was:
Configure flags:
@yosefe these are experiments on an Intel NUC with:
tcp,sm,sockcm ucx_perftest
perf top
tcp,sockcm
perf top
And from a DGX-1, release build:
tcp,sm,sockcm ucx_perftest
perf top
sm,sockcm ucx_perftest
perf top
It indeed looks much better now in both cases, at around 10.9 GB/s.
@pentschev this is as expected. In both cases shared memory is being used, via a single-copy mechanism (`copy_user_enhanced_fast_string` is mostly what is used). For completeness, we can also measure TCP only.
Thanks @yosefe, that's right, TCP only is slower, at around 5.3 GB/s, results below:
tcp,sockcm ucx_perftest
perf top
Still, perf top shows copy_user_enhanced_fast_string taking a large share of the time.
@pentschev probably TCP eventually uses the same memory copy function in the kernel. It takes 40% of the time vs. ~34% with shared memory, which can indicate that in the TCP case the copy is done on both sides (since the actual BW for TCP is ~2x lower).
That makes sense, thanks for the comment. To me, the performance we obtained here, around 11 GB/s with SM vs. 5.3 GB/s with TCP, seems reasonable. @quasiben, do we have any other questions or issues?
No, I think we can close. Thank you @pentschev and @yosefe.
We've been trying to run UCX-Py with shared memory as well; we have a simple benchmark that spawns two processes, which I patched to make sure each process is pinned to a CPU.

local-send-recv.py CPU pinning patch:

diff --git a/benchmarks/local-send-recv.py b/benchmarks/local-send-recv.py
index 443903c..de1b12e 100644
--- a/benchmarks/local-send-recv.py
+++ b/benchmarks/local-send-recv.py
@@ -22,6 +22,7 @@ UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py --server-dev 0 \
import argparse
import asyncio
import multiprocessing as mp
+import os
from time import perf_counter as clock
from distributed.utils import format_bytes, parse_bytes
@@ -32,6 +33,7 @@ mp = mp.get_context("spawn")
def server(queue, args):
+ os.sched_setaffinity(0, [0])
ucp.init()
if args.object_type == "numpy":
@@ -84,6 +86,7 @@ def server(queue, args):
def client(queue, port, server_address, args):
+ os.sched_setaffinity(0, [1])
import ucp
ucp.init()

It can be reproduced as follows:
By enabling UCX-Py's debug information, it prints what configuration was used by checking
Running
I'm not really sure what happens here, as just enabling/disabling other transports with UCX-Py has never presented any issues; in fact, adding
Any help is appreciated @yosefe!
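For context, a minimal sketch of how transports are toggled in these UCX-Py runs (the exact benchmark command line is not shown above; the specific transport list below is only an example, and it assumes the UCX_* environment variables are set before ucp.init() is called, which is how UCX picks up its configuration):

```python
import os

# Select the UCX transports before UCX-Py initializes its context; adding or
# removing "sm" here is what toggles shared memory in the runs described above.
os.environ["UCX_TLS"] = "tcp,sm,sockcm"
os.environ["UCX_SOCKADDR_TLS_PRIORITY"] = "sockcm"

import ucp  # imported after setting the environment, before ucp.init()

ucp.init()
```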
@pentschev can you please point to the UCX-Py code which is used to create the context, worker, and endpoint?
Context:
Worker:
Endpoints:
Endpoint parameters:
Listener:
Listener parameters:
Are there any special flags we should be including for shared memory? I've been looking at that, and it seems like we may be missing something.
It seems the WAKEUP feature request disabled sm (https://github.com/rapidsai/ucx-py/blob/43c32e6223c42449fc0d89b52773016c82380073/ucp/_libs/ucx_api.pyx#L204) because of the missing UCT_IFACE_FLAG_EVENT_RECV capability.
Thanks @bureddy, removing the blocking mode indeed allows sm to be used.
Sorry, I meant to say using non-blocking mode: I disabled blocking mode, which is what the code from my previous comment sets.
This is indeed the case. We can say it's a missing feature: wakeup support for shared memory transports. |
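As a sketch of the workaround being discussed (it assumes UCX-Py's ucp.init() exposes a blocking_progress_mode argument, mirroring the logic in the linked ucx_api.pyx; treat the parameter name as an assumption rather than a confirmed API):

```python
import ucp

# Blocking progress mode requests the UCP WAKEUP feature so the worker can be
# waited on through an event fd; at the time of this issue that excluded the
# shared memory transports, which lacked UCT_IFACE_FLAG_EVENT_RECV.
#
# Non-blocking (polling) progress mode does not request WAKEUP, so "sm" stays
# eligible, at the cost of the application having to poll the worker.
ucp.init(blocking_progress_mode=False)
```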
In order to make progress within UCX reliably and in a timely manner, the user is responsible for occasionally calling `progress` on the UCX worker. Originally I used a Julia `Timer` object to guarantee progress, especially in the context of asymmetric communication, e.g. active messages. The `Timer` object would trigger every millisecond, resulting in much higher latency. Following that, I implemented the polling interface using the WAKEUP feature, but that turns off support for shared memory (openucx/ucx#5322) and turned out to have relatively high overhead in a pure latency test, on the order of ~20 microseconds. I experimented with two other modes: (1) busy waiting, which relies on unfair scheduling and might livelock, and (2) a libuv `Idle` handle. Idle callbacks are a bit odd, but seem to work well: every time Julia ticks the event loop, libuv will call the progress function. The performance of busy waiting seems to degrade with multiple threads, while the idler performs well, but I have not yet performed a whole-system comparison.
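To illustrate the trade-off described above, here is a generic asyncio sketch (not the UCX.jl implementation; worker_progress is a hypothetical stand-in for the actual UCX worker progress call):

```python
import asyncio

def worker_progress():
    # Hypothetical placeholder for the real UCX worker progress call.
    pass

async def timer_progress(interval=0.001):
    # Timer-driven progress: simple, but any message arriving between ticks
    # waits up to `interval`, which is where the extra latency comes from.
    while True:
        worker_progress()
        await asyncio.sleep(interval)

async def per_tick_progress():
    # Idle-style progress: run once per event-loop turn (analogous to a libuv
    # Idle handle), keeping latency low without a dedicated busy-wait loop.
    while True:
        worker_progress()
        await asyncio.sleep(0)  # yield to the loop; rescheduled on the next tick
```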
@vchuravy FYI this issue was addressed
Would be good to close this then :)
I am interested in adding shared memory support to UCX-Py/Dask (currently we don't add `sm` to the `UCX_TLS` list) and, in testing what kind of performance I should expect with core UCX, I am not seeing any benefit with smaller messages (~1MB). Below I've benchmarked 1 and 10 MB transfers with `ucx_perftest`, tuned to typical UCX-Py settings. I am using the tip of 1.8.x to perform these benchmarks, and all runs were executed on a DGX-1. In building UCX from source I added `--with-dm`, which is not something we do when building a UCX binary: https://github.com/rapidsai/ucx-split-feedstock/blob/228916cc633aad0408d4c3b4c3649d52d8f3802c/recipe/install_ucx.sh#L10-L24
1MB  UCX_TLS=tcp,sockcm,sm -> ~30 MB/s
1MB  UCX_TLS=tcp,sockcm    -> ~30 MB/s
10MB UCX_TLS=tcp,sockcm,sm -> ~300 MB/s
10MB UCX_TLS=tcp,sockcm    -> ~100 MB/s
UCX_RNDV_THRESH=8192 UCX_TLS=sm,tcp,sockcm ucx_info -e -u t -D shm
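For reference, a rough Python sketch of how such a local ucx_perftest run can be scripted (the message size, iteration count, and test type below are illustrative, not necessarily the exact invocation behind the numbers above):

```python
import os
import subprocess

# Environment mirroring the settings described in this issue; drop "sm" from
# UCX_TLS for the TCP-only comparison.
env = dict(os.environ,
           UCX_TLS="tcp,sockcm,sm",
           UCX_RNDV_THRESH="8192")

# Illustrative test parameters: tag bandwidth test, 1 MB messages, 1000 iters.
args = ["-t", "tag_bw", "-s", str(1024 * 1024), "-n", "1000"]

# Server side (no peer address) and client side (connects to localhost).
server = subprocess.Popen(["ucx_perftest"] + args, env=env)
client = subprocess.run(["ucx_perftest", "localhost"] + args, env=env)
server.wait()
```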