Shared Memory Performance - missing wakeup support #5322
@quasiben can you please check the main CPU-consuming functions in each case with the "perf top" tool?
It seems that the primary issue in the results of the description was that both processes were not pinned to CPUs. On a DGX-1:
Looking at top, it seems that if the processes are unpinned they will always fall on different physical CPUs, so there will not be much gain. On an Intel NUC with a single processor (2 cores, 4 threads) the results were a bit more homogeneous, but we saw bandwidth from ~6.5 GB/s up to ~8.0 GB/s, depending on where you pin the processes.
Thank you for updating @pentschev. @yosefe, are these results expected? My hope was that adding `sm` would show a clearer benefit.
When the processes are pinned on different NUMA nodes, sm is capped at ~3 GB/s due to single-core memory bandwidth.
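For reference, a minimal Python sketch of the pinning idea (Linux-only; it reads the sysfs NUMA topology and pins the current process to a core on node 0, with the peer process expected to be pinned to another core on the same node; the sysfs paths and the choice of node 0 are assumptions for illustration, not part of the original experiments):

```python
import glob
import os

def numa_node_cpus():
    """Map NUMA node id -> list of CPU ids, read from Linux sysfs."""
    nodes = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node*/cpulist")):
        node_id = int(path.split("node")[-1].split("/")[0])
        cpus = []
        with open(path) as f:
            for chunk in f.read().strip().split(","):
                if "-" in chunk:
                    lo, hi = chunk.split("-")
                    cpus.extend(range(int(lo), int(hi) + 1))
                else:
                    cpus.append(int(chunk))
        nodes[node_id] = cpus
    return nodes

# Pin this process to the first core of NUMA node 0; pinning the peer process
# to another core on the *same* node avoids the ~3 GB/s cross-NUMA cap.
cpus = numa_node_cpus()[0]
os.sched_setaffinity(0, {cpus[0]})
```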
The command was:
Configure flags:
@yosefe these are experiments on an Intel NUC with:
tcp,sm,sockcm ucx_perftest
perf top
tcp,sockcm
perf top
And from a DGX-1, release build:
tcp,sm,sockcm ucx_perftest
perf top
sm,sockcm ucx_perftest
perf top
It indeed looks much better now in both cases, at around 10.9 GB/s.
@pentschev this is as expected. In both cases shared memory is being used, via a single-copy mechanism (`copy_user_enhanced_fast_string` is mostly what is used). For completeness, we can also measure TCP only.
Thanks @yosefe, that's right, TCP only is slower, at around 5.3 GB/s, results below:
tcp,sockcm ucx_perftest
perf top
Still, perf top shows copy_user_enhanced_fast_string taking a large share of the time.
@pentschev probably TCP eventually uses the same memory copy function in the kernel. It takes 40% of the time vs. ~34% with shared memory, which can indicate that in the TCP case the copy is done on both sides (since the actual BW for TCP is ~2x lower).
That makes sense, thanks for the comment. To me, the performance we obtained here, around 11 GB/s with SM vs. 5.3 GB/s with TCP, seems reasonable. @quasiben, do we have any other questions or issues?
No, I think we can close. Thank you @pentschev and @yosefe.
We've been trying to run UCX-Py with shared memory as well; we have a simple benchmark that spawns two processes, which I patched to make sure each process is pinned to a CPU.

local-send-recv.py CPU pinning patch:

diff --git a/benchmarks/local-send-recv.py b/benchmarks/local-send-recv.py
index 443903c..de1b12e 100644
--- a/benchmarks/local-send-recv.py
+++ b/benchmarks/local-send-recv.py
@@ -22,6 +22,7 @@ UCX_SOCKADDR_TLS_PRIORITY=sockcm python local-send-recv.py --server-dev 0 \
import argparse
import asyncio
import multiprocessing as mp
+import os
from time import perf_counter as clock
from distributed.utils import format_bytes, parse_bytes
@@ -32,6 +33,7 @@ mp = mp.get_context("spawn")
def server(queue, args):
+ os.sched_setaffinity(0, [0])
ucp.init()
if args.object_type == "numpy":
@@ -84,6 +86,7 @@ def server(queue, args):
def client(queue, port, server_address, args):
+ os.sched_setaffinity(0, [1])
import ucp
ucp.init()

It can be reproduced as follows:
By enabling UCX-Py's debug information, it prints what configuration was used by checking
Running
I'm not really sure what happens here, as just enabling/disabling other transports with UCX-Py has never presented any issues; in fact, adding
Any help is appreciated @yosefe!
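For context, a minimal sketch of how transports are toggled in these UCX-Py runs (the exact benchmark command line is not shown above; the specific transport list below is only an example, and it assumes the UCX_* environment variables are set before ucp.init() is called, which is how UCX picks up its configuration):

```python
import os

# Select the UCX transports before UCX-Py initializes its context; adding or
# removing "sm" here is what toggles shared memory in the runs described above.
os.environ["UCX_TLS"] = "tcp,sm,sockcm"
os.environ["UCX_SOCKADDR_TLS_PRIORITY"] = "sockcm"

import ucp  # imported after setting the environment, before ucp.init()

ucp.init()
```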
@pentschev can you please point to the UCX-Py code which is used to create the context, worker, and endpoint?
Context:
Worker:
Endpoints:
Endpoint parameters:
Listener:
Listener parameters:
Are there any special flags we should be including for shared memory? I've been looking at that, and it seems like we may be missing something.
It seems the WAKEUP feature request disabled sm (https://github.com/rapidsai/ucx-py/blob/43c32e6223c42449fc0d89b52773016c82380073/ucp/_libs/ucx_api.pyx#L204) because of the missing UCT_IFACE_FLAG_EVENT_RECV capability.
Thanks @bureddy, removing the blocking mode indeed allows sm to be used.
Sorry, I meant to say using non-blocking mode: I disabled blocking mode, which is what the code from my previous comment sets.
This is indeed the case. We can say it's a missing feature: wakeup support for shared memory transports. |
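As a sketch of the workaround being discussed (it assumes UCX-Py's ucp.init() exposes a blocking_progress_mode argument, mirroring the logic in the linked ucx_api.pyx; treat the parameter name as an assumption rather than a confirmed API):

```python
import ucp

# Blocking progress mode requests the UCP WAKEUP feature so the worker can be
# waited on through an event fd; at the time of this issue that excluded the
# shared memory transports, which lacked UCT_IFACE_FLAG_EVENT_RECV.
#
# Non-blocking (polling) progress mode does not request WAKEUP, so "sm" stays
# eligible, at the cost of the application having to poll the worker.
ucp.init(blocking_progress_mode=False)
```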
In order to make progress within UCX reliably and in a timely manner, the user is responsible for occasionally calling `progress` on the UCX worker. Originally I used a Julia `Timer` object to guarantee progress, especially in the context of asymmetric communication, e.g. active messages. The `Timer` object would trigger every millisecond, resulting in much higher latency. Following that, I implemented the polling interface using the WAKEUP feature, but that turns off support for shared memory (openucx/ucx#5322) and turned out to have relatively high overhead in a pure latency test, on the order of ~20 microseconds. I experimented with two other modes: (1) busy waiting, which relies on unfair scheduling and might livelock, and (2) a libuv `Idle` handle. Idle callbacks are a bit odd, but seem to work well: every time Julia ticks the event loop, libuv will call the progress function. The performance of busy waiting seems to degrade with multiple threads, while the idler performs well, but I have not yet performed a whole-system comparison.
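To illustrate the trade-off described above, here is a generic asyncio sketch (not the UCX.jl implementation; worker_progress is a hypothetical stand-in for the actual UCX worker progress call):

```python
import asyncio

def worker_progress():
    # Hypothetical placeholder for the real UCX worker progress call.
    pass

async def timer_progress(interval=0.001):
    # Timer-driven progress: simple, but any message arriving between ticks
    # waits up to `interval`, which is where the extra latency comes from.
    while True:
        worker_progress()
        await asyncio.sleep(interval)

async def per_tick_progress():
    # Idle-style progress: run once per event-loop turn (analogous to a libuv
    # Idle handle), keeping latency low without a dedicated busy-wait loop.
    while True:
        worker_progress()
        await asyncio.sleep(0)  # yield to the loop; rescheduled on the next tick
```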
@vchuravy FYI this issue was addressed
Would be good to close this then :)
I am interested in adding shared memory support to UCX-Py/Dask (currently we don't add `sm` to the `UCX_TLS` list) and, in testing what kind of performance I should expect with core UCX, I am not seeing any benefit with smaller messages (~1MB). Below I've benchmarked 1 and 10 MB transfers with `ucx_perftest`, tuned to typical UCX-Py settings. I am using the tip of 1.8.x to perform these benchmarks, and all runs were executed on a DGX-1. In building UCX from source I added `--with-dm`, which is not something we do when building a UCX binary: https://github.com/rapidsai/ucx-split-feedstock/blob/228916cc633aad0408d4c3b4c3649d52d8f3802c/recipe/install_ucx.sh#L10-L24
1MB  UCX_TLS=tcp,sockcm,sm -> ~30 MB/s
1MB  UCX_TLS=tcp,sockcm    -> ~30 MB/s
10MB UCX_TLS=tcp,sockcm,sm -> ~300 MB/s
10MB UCX_TLS=tcp,sockcm    -> ~100 MB/s
UCX_RNDV_THRESH=8192 UCX_TLS=sm,tcp,sockcm ucx_info -e -u t -D shm
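For reference, a rough Python sketch of how such a local ucx_perftest run can be scripted (the message size, iteration count, and test type below are illustrative, not necessarily the exact invocation behind the numbers above):

```python
import os
import subprocess

# Environment mirroring the settings described in this issue; drop "sm" from
# UCX_TLS for the TCP-only comparison.
env = dict(os.environ,
           UCX_TLS="tcp,sockcm,sm",
           UCX_RNDV_THRESH="8192")

# Illustrative test parameters: tag bandwidth test, 1 MB messages, 1000 iters.
args = ["-t", "tag_bw", "-s", str(1024 * 1024), "-n", "1000"]

# Server side (no peer address) and client side (connects to localhost).
server = subprocess.Popen(["ucx_perftest"] + args, env=env)
client = subprocess.run(["ucx_perftest", "localhost"] + args, env=env)
server.wait()
```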