ucx_perftest core dump when specifying two RoCE NICs using UCX_NET_DEVICES #9724

Closed
huzhijiang opened this issue Mar 2, 2024 · 5 comments

huzhijiang commented Mar 2, 2024

Describe the bug

My machine has two ConnectX-4 RoCE cards, and thus 4 physical ports. I pass two of them through (one from each card) to a KVM guest VM running on the machine. Then I connect the 2 remaining physical ports of the host machine (one from each RoCE card) directly to the 2 physical ports of the guest machine using cables.

If I use UCX_NET_DEVICES to select any single physical port, ucx_perftest runs fine between the two nodes. But without UCX_NET_DEVICES, it simply crashes.

Steps to Reproduce

  1. With UCX_NET_DEVICES specifying one device:
[root@promote ucx-1.15.0]# UCX_NET_DEVICES=mlx5_1:1 ucx_perftest 192.168.1.199 -t tag_bw -s 16384 -n 100000
[1709393444.062904] [promote:53845:0]        perftest.c:783  UCX  WARN  CPU affinity is not set (bound to 12 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]             96801      2.000    10.334    10.334     1511.93    1511.93       96764       96764
Final:                100000      2.000    10.634    10.344     1469.35    1510.53       94038       96674
  2. With UCX_NET_DEVICES specifying the other device:
[root@promote ucx-1.15.0]# UCX_NET_DEVICES=mlx5_0:1 ucx_perftest 192.168.1.199 -t tag_bw -s 16384 -n 100000
[1709393484.677329] [promote:53864:0]        perftest.c:783  UCX  WARN  CPU affinity is not set (bound to 12 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[thread 0]             96801      2.000    10.336    10.336     1511.76    1511.76       96752       96752
Final:                100000      2.000    10.629    10.345     1470.00    1510.38       94080       96665
  3. Without UCX_NET_DEVICES:
[1709393226.788365] [promote:53773:0]        perftest.c:783  UCX  WARN  CPU affinity is not set (bound to 12 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[promote:53773:0:53773]       ud_ep.c:280  Fatal: UD endpoint 0x1a9a010 to <no debug data>: unhandled timeout error
==== backtrace (tid:  53773) ====
 0  /lib64/libucs.so.0(ucs_handle_error+0x294) [0x7f15914a6a74]
 1  /lib64/libucs.so.0(ucs_fatal_error_message+0xb0) [0x7f15914a3a30]
 2  /lib64/libucs.so.0(ucs_fatal_error_format+0xd1) [0x7f15914a3b11]
 3  /lib64/ucx/libuct_ib.so.0(+0x88df8) [0x7f15830a1df8]
 4  /lib64/libucs.so.0(+0x24062) [0x7f159149a062]
 5  /lib64/libucp.so.0(ucp_worker_progress+0x3a) [0x7f159198fdba]
 6  ucx_perftest() [0x409faf]
 7  ucx_perftest() [0x40aebd]
 8  ucx_perftest() [0x40b142]
 9  ucx_perftest() [0x4054fc]
10  ucx_perftest() [0x405643]
11  ucx_perftest() [0x403727]
12  /lib64/libc.so.6(__libc_start_main+0xe5) [0x7f15902d87e5]
13  ucx_perftest() [0x4037ae]
=================================
Aborted (core dumped)
[root@promote ucx-1.15.0]# ucx_info -v
# Library version: 1.15.0
# Library path: /lib64/libucs.so.0
# API headers version: 1.15.0
# Git branch '', revision
# Configured with: --enable-examples

No UCX environment variables were used.
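
As a sanity check on the two successful runs above, the reported bandwidth is consistent with message rate multiplied by message size, provided "MB" in the ucx_perftest table means 2^20 bytes. That unit is an assumption inferred from the numbers, not something the output states. A minimal Python sketch using the [thread 0] average values from the first run:

```python
# Values copied from the first run above ([thread 0] row, "average" columns).
MESSAGE_SIZE_BYTES = 16384   # -s 16384
MSG_RATE_PER_SEC = 96764     # message rate (msg/s), average
REPORTED_BW_MB_S = 1511.93   # bandwidth (MB/s), average

# Assumption: "MB" in the table means 2**20 bytes.
BYTES_PER_MB = 2 ** 20


def bandwidth_mb_per_sec(msg_rate: float, msg_size: int) -> float:
    """Bandwidth implied by a message rate and a fixed message size."""
    return msg_rate * msg_size / BYTES_PER_MB


computed = bandwidth_mb_per_sec(MSG_RATE_PER_SEC, MESSAGE_SIZE_BYTES)
print(f"computed {computed:.2f} MB/s, reported {REPORTED_BW_MB_S:.2f} MB/s")
```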

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)

[root@promote ucx-1.15.0]# cat /etc/redhat-release
CentOS Stream release 8

[root@promote ucx-1.15.0]# uname -a
Linux promote.ldns.rate.local 4.18.0-529.el8.x86_64 #1 SMP Wed Dec 6 01:03:46 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  • For RDMA/IB/RoCE related issues:
    • Driver version:
      [root@promote ucx-1.15.0]# rpm -q rdma-core
      rdma-core-48.0-1.el8.x86_64

    • HW information from ibstat or ibv_devinfo -vv command
      [root@promote ucx-1.15.0]# ibv_devinfo -vv
      hca_id: mlx5_0
      transport: InfiniBand (0)
      fw_ver: 14.32.1010
      node_guid: 9803:9b03:000c:5e13
      sys_image_guid: 9803:9b03:000c:5e12
      vendor_id: 0x02c9
      vendor_part_id: 4117
      hw_ver: 0x0
      board_id: MT_2420110034
      phys_port_cnt: 1
      max_mr_size: 0xffffffffffffffff
      page_size_cap: 0xfffffffffffff000
      max_qp: 131072
      max_qp_wr: 32768
      device_cap_flags: 0x25321c36
      BAD_PKEY_CNTR
      BAD_QKEY_CNTR
      AUTO_PATH_MIG
      CHANGE_PHY_PORT
      PORT_ACTIVE_EVENT
      SYS_IMAGE_GUID
      RC_RNR_NAK_GEN
      MEM_WINDOW
      XRC
      MEM_MGT_EXTENSIONS
      MEM_WINDOW_TYPE_2B
      RAW_IP_CSUM
      MANAGED_FLOW_STEERING
      max_sge: 30
      max_sge_rd: 30
      max_cq: 16777216
      max_cqe: 4194303
      max_mr: 16777216
      max_pd: 8388608
      max_qp_rd_atom: 16
      max_ee_rd_atom: 0
      max_res_rd_atom: 2097152
      max_qp_init_rd_atom: 16
      max_ee_init_rd_atom: 0
      atomic_cap: ATOMIC_HCA (1)
      max_ee: 0
      max_rdd: 0
      max_mw: 16777216
      max_raw_ipv6_qp: 0
      max_raw_ethy_qp: 0
      max_mcast_grp: 2097152
      max_mcast_qp_attach: 240
      max_total_mcast_qp_attach: 503316480
      max_ah: 2147483647
      max_fmr: 0
      max_srq: 8388608
      max_srq_wr: 32767
      max_srq_sge: 31
      max_pkeys: 128
      local_ca_ack_delay: 16
      general_odp_caps:
      ODP_SUPPORT
      ODP_SUPPORT_IMPLICIT
      rc_odp_caps:
      SUPPORT_SEND
      SUPPORT_RECV
      SUPPORT_WRITE
      SUPPORT_READ
      SUPPORT_SRQ
      uc_odp_caps:
      NO SUPPORT
      ud_odp_caps:
      SUPPORT_SEND
      xrc_odp_caps:
      SUPPORT_SEND
      SUPPORT_WRITE
      SUPPORT_READ
      SUPPORT_SRQ
      completion timestamp_mask: 0x7fffffffffffffff
      hca_core_clock: 156250kHZ
      raw packet caps:
      C-VLAN stripping offload
      Scatter FCS offload
      IP csum offload
      Delay drop
      device_cap_flags_ex: 0x1425321C36
      RAW_SCATTER_FCS
      PCI_WRITE_END_PADDING
      tso_caps:
      max_tso: 262144
      supported_qp:
      SUPPORT_RAW_PACKET
      rss_caps:
      max_rwq_indirection_tables: 1048576
      max_rwq_indirection_table_size: 2048
      rx_hash_function: 0x1
      rx_hash_fields_mask: 0x800000FF
      supported_qp:
      SUPPORT_RAW_PACKET
      max_wq_type_rq: 8388608
      packet_pacing_caps:
      qp_rate_limit_min: 0kbps
      qp_rate_limit_max: 0kbps
      tag matching not supported

      cq moderation caps:
      max_cq_count: 65535
      max_cq_period: 4095 us

      num_comp_vectors: 12
      port: 1
      state: PORT_ACTIVE (4)
      max_mtu: 4096 (5)
      active_mtu: 4096 (5)
      sm_lid: 0
      port_lid: 0
      port_lmc: 0x00
      link_layer: Ethernet
      max_msg_sz: 0x40000000
      port_cap_flags: 0x04010000
      port_cap_flags2: 0x0000
      max_vl_num: invalid value (0)
      bad_pkey_cntr: 0x0
      qkey_viol_cntr: 0x0
      sm_sl: 0
      pkey_tbl_len: 1
      gid_tbl_len: 255
      subnet_timeout: 0
      init_type_reply: 0
      active_width: 1X (1)
      active_speed: 25.0 Gbps (32)
      phys_state: LINK_UP (5)
      GID[ 0]: fe80:0000:0000:0000:9a03:9bff:fe0c:5e13, RoCE v1
      GID[ 1]: fe80::9a03:9bff:fe0c:5e13, RoCE v2
      GID[ 2]: 0000:0000:0000:0000:0000:ffff:c0a8:c802, RoCE v1
      GID[ 3]: ::ffff:192.168.200.2, RoCE v2
      GID[ 4]: fe80:0000:0000:0000:8b43:a800:a6f7:d07a, RoCE v1
      GID[ 5]: fe80::8b43:a800:a6f7:d07a, RoCE v2

hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 14.32.1010
node_guid: b8ce:f603:0027:49cb
sys_image_guid: b8ce:f603:0027:49ca
vendor_id: 0x02c9
vendor_part_id: 4117
hw_ver: 0x0
board_id: MT_2420110034
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 131072
max_qp_wr: 32768
device_cap_flags: 0x25321c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
MEM_WINDOW
XRC
MEM_MGT_EXTENSIONS
MEM_WINDOW_TYPE_2B
RAW_IP_CSUM
MANAGED_FLOW_STEERING
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 8388608
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 2097152
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 240
max_total_mcast_qp_attach: 503316480
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
general_odp_caps:
ODP_SUPPORT
ODP_SUPPORT_IMPLICIT
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
SUPPORT_SEND
xrc_odp_caps:
SUPPORT_SEND
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
completion timestamp_mask: 0x7fffffffffffffff
hca_core_clock: 156250kHZ
raw packet caps:
C-VLAN stripping offload
Scatter FCS offload
IP csum offload
Delay drop
device_cap_flags_ex: 0x1425321C36
RAW_SCATTER_FCS
PCI_WRITE_END_PADDING
tso_caps:
max_tso: 262144
supported_qp:
SUPPORT_RAW_PACKET
rss_caps:
max_rwq_indirection_tables: 1048576
max_rwq_indirection_table_size: 2048
rx_hash_function: 0x1
rx_hash_fields_mask: 0x800000FF
supported_qp:
SUPPORT_RAW_PACKET
max_wq_type_rq: 8388608
packet_pacing_caps:
qp_rate_limit_min: 0kbps
qp_rate_limit_max: 0kbps
tag matching not supported

    cq moderation caps:
            max_cq_count:   65535
            max_cq_period:  4095 us

    num_comp_vectors:               12
            port:   1
                    state:                  PORT_ACTIVE (4)
                    max_mtu:                4096 (5)
                    active_mtu:             4096 (5)
                    sm_lid:                 0
                    port_lid:               0
                    port_lmc:               0x00
                    link_layer:             Ethernet
                    max_msg_sz:             0x40000000
                    port_cap_flags:         0x04010000
                    port_cap_flags2:        0x0000
                    max_vl_num:             invalid value (0)
                    bad_pkey_cntr:          0x0
                    qkey_viol_cntr:         0x0
                    sm_sl:                  0
                    pkey_tbl_len:           1
                    gid_tbl_len:            255
                    subnet_timeout:         0
                    init_type_reply:        0
                    active_width:           1X (1)
                    active_speed:           25.0 Gbps (32)
                    phys_state:             LINK_UP (5)
                    GID[  0]:               fe80:0000:0000:0000:bace:f6ff:fe27:49cb, RoCE v1
                    GID[  1]:               fe80::bace:f6ff:fe27:49cb, RoCE v2
                    GID[  2]:               0000:0000:0000:0000:0000:ffff:c0a8:6402, RoCE v1
                    GID[  3]:               ::ffff:192.168.100.2, RoCE v2
                    GID[  4]:               fe80:0000:0000:0000:4bc8:3945:5e94:fa20, RoCE v1
                    GID[  5]:               fe80::4bc8:3945:5e94:fa20, RoCE v2

Additional information (depending on the issue)

  • OpenMPI version (not used)
  • Output of ucx_info -d to show transports and devices recognized by UCX
[root@promote ucx-1.15.0]# ucx_info -d
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#           rkey_ptr is supported
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: self
#         Device: memory
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 19360.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: enp8s0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 2200.00/ppn + 0.00 MB/sec
#              latency: 5223 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: enp1s0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.32/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: enp7s0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 2200.00/ppn + 0.00 MB/sec
#              latency: 5223 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 32710208K
#           remote key: 24 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: mlx5_0
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#         memory types: host (access,reg,cache)
#
#      Transport: rc_verbs
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 1719.30/ppn + 0.00 MB/sec
#              latency: 800 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 28
#     device num paths: 1
#              max eps: 256
#       device address: 17 bytes
#           ep address: 7 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: rc_mlx5
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 1719.30/ppn + 0.00 MB/sec
#              latency: 800 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 220
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 234
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 28
#     device num paths: 1
#              max eps: 256
#       device address: 17 bytes
#           ep address: 10 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 1719.30/ppn + 0.00 MB/sec
#              latency: 830 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3992
#           connection: to ep, to iface
#      device priority: 28
#     device num paths: 1
#              max eps: inf
#       device address: 17 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_0:1
#           Type: network
#  System device: mlx5_0 (0)
#
#      capabilities:
#            bandwidth: 1719.30/ppn + 0.00 MB/sec
#              latency: 830 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 28
#     device num paths: 1
#              max eps: inf
#       device address: 17 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: mlx5_1
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#         memory types: host (access,reg,cache)
#
#      Transport: rc_verbs
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (1)
#
#      capabilities:
#            bandwidth: 1719.30/ppn + 0.00 MB/sec
#              latency: 800 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 28
#     device num paths: 1
#              max eps: 256
#       device address: 17 bytes
#           ep address: 7 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: rc_mlx5
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (1)
#
#      capabilities:
#            bandwidth: 1719.30/ppn + 0.00 MB/sec
#              latency: 800 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 220
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 234
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 28
#     device num paths: 1
#              max eps: 256
#       device address: 17 bytes
#           ep address: 10 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (1)
#
#      capabilities:
#            bandwidth: 1719.30/ppn + 0.00 MB/sec
#              latency: 830 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3992
#           connection: to ep, to iface
#      device priority: 28
#     device num paths: 1
#              max eps: inf
#       device address: 17 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (1)
#
#      capabilities:
#            bandwidth: 1719.30/ppn + 0.00 MB/sec
#              latency: 830 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 28
#     device num paths: 1
#              max eps: inf
#       device address: 17 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Connection manager: rdmacm
#      max_conn_priv: 54 bytes
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
  • Configure result - config.log
    config.log

  • Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"
    run_log.txt

@huzhijiang huzhijiang added the Bug label Mar 2, 2024
@huzhijiang
Author

A quick follow-up: even if both nodes use two physical ports from the same card, the same issue still happens.
I also checked the master branch, and it has the same issue.
Please take a look. Many thanks!

@huzhijiang
Author

Another follow-up:
If I replace one of the two identical ConnectX-4 RoCE cards with an InfiniBand card (CX454A), everything works fine!
It seems the problem is associated with having two RoCE cards?

@huzhijiang
Author

This seems related to the reachability detection method. After setting the UCX_IB_ROCE_LOCAL_SUBNET=y environment variable, ucx_perftest finally works. However, the tag_bw test case seems to use only one RoCE card. Which test case can make use of more than one rail?
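
The idea behind a subnet-based reachability check can be sketched as follows. This is an illustration only, not UCX's implementation. The two local addresses come from the RoCE v2 GID entries in the ibv_devinfo output above; the /24 prefix length and the peer addresses are hypothetical, since the issue does not give them.

```python
import ipaddress

# Local RoCE v2 addresses taken from the GID tables above.
LOCAL_DEVICES = {
    "mlx5_0:1": "192.168.200.2",
    "mlx5_1:1": "192.168.100.2",
}

# Hypothetical values: the issue states neither the netmask nor the
# peer's RoCE addresses.
ASSUMED_PREFIX_LEN = 24
HYPOTHETICAL_PEERS = ["192.168.200.1", "192.168.100.1"]


def same_subnet(local_ip: str, remote_ip: str, prefix_len: int) -> bool:
    """Return True if remote_ip falls inside local_ip's subnet."""
    network = ipaddress.ip_network(f"{local_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(remote_ip) in network


def reachable_pairs(devices, peers, prefix_len):
    """List the (device, peer) pairs that share a subnet."""
    return [
        (dev, peer)
        for dev, ip in devices.items()
        for peer in peers
        if same_subnet(ip, peer, prefix_len)
    ]


pairs = reachable_pairs(LOCAL_DEVICES, HYPOTHETICAL_PEERS, ASSUMED_PREFIX_LEN)
for dev, peer in pairs:
    print(f"{dev} -> {peer}")
```

Under these assumptions each device is paired only with the peer on its own subnet. A check that treated every RoCE address as reachable from every device would also attempt the cross-subnet pairs, which is one plausible reading of the UD timeout reported above.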

@huzhijiang
Author

After increasing the data size to 32K (tag_bw -s 32768), the two RoCE cards finally work together and the bandwidth doubles! There must be some kind of threshold that controls this behavior, right?

@huzhijiang
Author

MIN_RNDV_CHUNK_SIZE seems to be the threshold. Issue closed.
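
A rough model of how a minimum chunk size could gate multi-rail splitting, consistent with the observation in this thread that 16384-byte messages used one card and 32768-byte messages used two. This is a sketch under assumptions, not UCX's code: the 16384-byte threshold value and the `rails_used` helper are hypothetical, chosen only to match the reported behavior.

```python
# Hypothetical threshold, picked to match the behavior reported above.
ASSUMED_MIN_CHUNK_BYTES = 16384


def rails_used(message_size: int, available_rails: int,
               min_chunk: int = ASSUMED_MIN_CHUNK_BYTES) -> int:
    """Rails a message is split across if no chunk may be smaller than min_chunk."""
    if message_size <= 0 or available_rails <= 0 or min_chunk <= 0:
        raise ValueError("sizes and rail count must be positive")
    # A message can be cut into at most this many chunks of >= min_chunk bytes.
    max_chunks = max(1, message_size // min_chunk)
    return min(available_rails, max_chunks)


for size in (16384, 32768):
    print(f"{size} bytes -> {rails_used(size, available_rails=2)} rail(s)")
```

In this model a message has to be at least twice the minimum chunk size before a second rail is used, which would explain why doubling -s from 16384 to 32768 brought the second card into play.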
