openib error: "ibv_exp_query_device: invalid comp_mask" -- reported by multiple users #5914

Closed
jsquyres opened this issue Oct 12, 2018 · 14 comments
@jsquyres
Member

jsquyres commented Oct 12, 2018

There have been multiple reports of the openib BTL reporting variations of this error:

ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x27800000002 valid_mask = 0x1)

I know that openib is on its way out the door, but it's still supported in v2.x and 3.x. Is there a quick/easy fix for this issue?

I am unable to reproduce the issue with ConnectX-5s on Ethernet and the inbox verbs drivers on RHEL 7.2. Has something changed in upstream OFED and/or MOFED that is causing this issue?

#5810 is the most recent activity where this came up, but it has come up in other issues, too (and possibly on mailing lists...?).
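
For anyone triaging these reports, here is a minimal sketch that pulls the two masks out of the message and shows which `comp_mask` bits fall outside `valid_mask`. It assumes the library rejects any bit outside `valid_mask`; the exact test inside the verbs library is not shown in this thread. The helper name `invalid_bits` is hypothetical.

```python
import re

# Matches the message text as quoted in this issue.
MASK_RE = re.compile(
    r"invalid comp_mask !!! "
    r"\(comp_mask = (0x[0-9a-fA-F]+) valid_mask = (0x[0-9a-fA-F]+)\)"
)


def invalid_bits(line):
    """Return the comp_mask bits outside valid_mask, or None if the line does not match."""
    m = MASK_RE.search(line)
    if m is None:
        return None
    comp_mask = int(m.group(1), 16)
    valid_mask = int(m.group(2), 16)
    return comp_mask & ~valid_mask


line = ("ibv_exp_query_device: invalid comp_mask !!! "
        "(comp_mask = 0x27800000002 valid_mask = 0x1)")
print(hex(invalid_bits(line)))
```

A large, irregular result here is consistent with the field holding leftover memory contents rather than a deliberately chosen mask.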

@jsquyres jsquyres added the bug label Oct 12, 2018
@hppritcha
Member

We are seeing this at LANL, especially on our ARM machines when using many processes (64 or more) per node. We see this using UCX as well.

@yosefe
Contributor

yosefe commented Oct 15, 2018

Seems like it could happen if openib BTL, or an older version of UCX, is compiled with MLNX_OFED 4.4.
@hppritcha it should work with UCX v1.4.x or UCX master

@abeltre1

> Seems like it could happen if openib BTL, or an older version of UCX, is compiled with MLNX_OFED 4.4.
> @hppritcha it should work with UCX v1.4.x or UCX master

Can you point me to the UCX source code so that I can include it in the MLNX_OFED 4.4 installation?

@yosefe
Contributor

yosefe commented Oct 16, 2018

@abeltre1 Not sure I understand the question. Anyway, the UCX binary is included in the MLNX_OFED 4.4 installation, the source code is at https://github.com/openucx/ucx, and the particular code which avoids the error from ibv_exp_query_device() is here

@hppritcha
Member

I did some more checking on one of our clusters and actually do not see the issue when using UCX. Our setup is MOFED 4.4 with the UCX shipped as part of that release:

/home/hpp/ompi/examples@cn805:~/ompi/examples> (v4.0.x)ofed_info
MLNX_OFED_LINUX-4.4-2.0.7.0 (OFED-4.4-2.0.7):
/home/hpp/ompi/examples@cn805:~/ompi/examples> (v4.0.x)ucx_info -v
# UCT version=1.4.0 revision 739b569

I do see this during openib BTL initialization, even though it's not even being used:

/home/hpp/ompi/examples@cn805:~/srun -n 2 -N 2 ./ring_c
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x4000031ba900 valid_mask = 0x1)
[cn805][[5425,6],0][btl_openib_component.c:1698:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   cn805
  Local device: mlx5_1
--------------------------------------------------------------------------
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x27ddfac0 valid_mask = 0x1)
[cn806][[5425,6],1][btl_openib_component.c:1698:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   cn806
  Local device: mlx5_1

@hppritcha hppritcha self-assigned this Oct 17, 2018
hppritcha added a commit to hppritcha/ompi that referenced this issue Oct 31, 2018
Under certain circumstances, ibv_exp_query_device was
returning an error due to uninitialized fields in the
extended attributes struct.

Fixes: open-mpi#5810
Fixes: open-mpi#5914

Signed-off-by: Howard Pritchard <[email protected]>
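
To illustrate the failure mode this commit message describes, here is a small Python model of it, not the actual Open MPI or verbs code. `ExpDeviceAttr` and `mock_query_device` are hypothetical stand-ins, the bitwise check is an assumption, and the value the real fix assigns to `comp_mask` is not shown in this thread.

```python
import errno
from dataclasses import dataclass

VALID_MASK = 0x1  # value reported as valid_mask in the logs above


@dataclass
class ExpDeviceAttr:
    # Hypothetical stand-in for the extended attributes struct.
    comp_mask: int
    max_qp: int = 0


def mock_query_device(attr):
    """Mimic the library check: reject comp_mask bits outside VALID_MASK."""
    if attr.comp_mask & ~VALID_MASK:
        return errno.EINVAL
    attr.max_qp = 1024  # made-up value: pretend the driver filled in the struct
    return 0


# Leftover memory contents standing in for an uninitialized field
# (the value is one of the comp_mask values from the log above).
stale = ExpDeviceAttr(comp_mask=0x4000031BA900)

# The same struct with the field set explicitly before the call.
initialized = ExpDeviceAttr(comp_mask=VALID_MASK)

print(mock_query_device(stale) == errno.EINVAL)
print(mock_query_device(initialized) == 0)
```

In this model the call only succeeds when the caller sets the mask, which would also explain why the reported `comp_mask` differs from node to node and run to run.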
hppritcha added a commit to hppritcha/ompi that referenced this issue Nov 2, 2018
(cherry picked from commit 8126779)
@hppritcha hppritcha reopened this Nov 2, 2018
hppritcha added a commit to hppritcha/ompi that referenced this issue Nov 6, 2018
(cherry picked from commit 8126779)
hppritcha added a commit to hppritcha/ompi that referenced this issue Nov 6, 2018
(cherry picked from commit 8126779)
hppritcha added a commit to hppritcha/ompi that referenced this issue Nov 6, 2018
(cherry picked from commit 8126779)
bosilca pushed a commit to bosilca/ompi that referenced this issue Dec 3, 2018
@hppritcha
Member

Done merging to release branches

@chrissamuel

I've just run into this on an x86-64 cluster with IB here at NERSC and can confirm that this one-line addition fixes it here as well.

@jsquyres
Member Author

Good! We just released v2.1.6 with the fix. It'll eventually come out in new v3.0.x and v3.1.x and v4.0.x releases, too.

@chrissamuel

chrissamuel commented Jan 15, 2019

Thanks Jeff, it would be good to get that in (especially as trying to use UCX from MOFED or 1.5.0RC1 seems to cause MPI_Barrier to segfault in the OSU microbenchmarks).

@yosefe
Contributor

yosefe commented Jan 17, 2019

@chrissamuel any chance you have a backtrace of the UCX segfault in MPI_Barrier?

@chrissamuel

chrissamuel commented Jan 17, 2019

Hi @yosefe,

Here you go - I was about to open a bug with the UCX folks about it.

> srun -C gpu --ntasks=2 --ntasks-per-node=1 ./osu_latency
# OSU MPI Latency Test v5.5
# Size          Latency (us)
[cgpu02:116574:0:116574] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /global/homes/c/csamuel/UCX/ucx-1.5.0/lib/libucs.so.0(+0x1f1a0) [0x2ba62ea7b1a0]
    1  /global/homes/c/csamuel/UCX/ucx-1.5.0/lib/libucs.so.0(+0x1f3fb) [0x2ba62ea7b3fb]
    2  /global/homes/c/csamuel/UCX/ucx-1.5.0/lib/libuct.so.0(uct_ib_address_unpack+0x14) [0x2ba62df3a774]
    3  /global/homes/c/csamuel/UCX/ucx-1.5.0/lib/libuct.so.0(uct_rc_ep_connect_to_ep+0x2c) [0x2ba62df429fc]
    4  /global/homes/c/csamuel/UCX/ucx-1.5.0/lib/libucp.so.0(+0x49d99) [0x2ba62dcf3d99]
    5  /global/homes/c/csamuel/UCX/ucx-1.5.0/lib/libucp.so.0(+0x4bca0) [0x2ba62dcf5ca0]
    6  /global/homes/c/csamuel/UCX/ucx-1.5.0/lib/libuct.so.0(uct_ud_ep_process_rx+0x24a) [0x2ba62df88d5a]
    7  /global/homes/c/csamuel/UCX/ucx-1.5.0/lib/libuct.so.0(+0x7ea72) [0x2ba62df8fa72]
    8  /global/homes/c/csamuel/UCX/ucx-1.5.0/lib/libucp.so.0(ucp_worker_progress+0x32) [0x2ba62dcc63f2]
    9  /global/homes/c/csamuel/OMPI/4.0.0-gcc-ucx_15/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x10a) [0x2ba62d8a079a]
   10  /global/homes/c/csamuel/OMPI/4.0.0-gcc-ucx_15/lib/libmpi.so.40(ompi_coll_base_barrier_intra_two_procs+0xe1) [0x2ba621c78231]
   11  /global/homes/c/csamuel/OMPI/4.0.0-gcc-ucx_15/lib/libmpi.so.40(MPI_Barrier+0xa7) [0x2ba621c35a17]
   12  /global/u2/c/csamuel/bin/osu_ompi_40_ucx_15/./osu_latency() [0x4017a9]
   13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2ba6229b9725]
   14  /global/u2/c/csamuel/bin/osu_ompi_40_ucx_15/./osu_latency() [0x401ab9]
===================
srun: error: cgpu02: task 1: Segmentation fault (core dumped)

I found that the solution was to restrict the devices to just those intended for MPI with:
export UCX_NET_DEVICES=mlx5_0:1,mlx5_2:1,mlx5_4:1,mlx5_6:1
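
If the device list has to be generated per node type, the same setting can be built from a launcher script. This is only a sketch: `ucx_net_devices` is a hypothetical helper, and which HCAs are intended for MPI is site-specific (the names below are copied from the export line above).

```python
import os


def ucx_net_devices(devices, port=1):
    """Join device names into the comma-separated 'device:port' form used above."""
    return ",".join(f"{dev}:{port}" for dev in devices)


# Device names copied from the export line above; adjust for your own nodes.
mpi_hcas = ["mlx5_0", "mlx5_2", "mlx5_4", "mlx5_6"]

# Copy of the current environment with UCX_NET_DEVICES set, ready to hand to
# whatever launches the MPI job (for example via subprocess.run(..., env=env)).
env = dict(os.environ, UCX_NET_DEVICES=ucx_net_devices(mpi_hcas))
print(env["UCX_NET_DEVICES"])
```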

All the best,
Chris

@chrissamuel

@yosefe I've created this UCX issue for the crash. openucx/ucx#3145

@nitinpatil1985

I have compiled Open MPI 2.0.2 with hpcx-v2.2.0-gcc-MLNX_OFED_LINUX-4.4-1.0.0.0-redhat7.3-x86_64 on Red Hat 7.6 using the Intel 2018u2 compiler.

I am using the old version because the latest update does not work on 87++ nodes! (see: ompi/issues/6786)

MLNX_OFED_LINUX-4.6-1.0.1.1 (OFED-4.6-1.0.1)

I am running on 150 nodes as follows:
mpirun --mca btl openib,self -mca mtl mxm -mca plm_rsh_no_tree_spawn true -hostfile ${PBS_NODEFILE} ./binary.exe

============
I am getting the following error:
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x75699f0 valid_mask = 0x3)
[r2i0n0][[22721,1],2][btl_openib_component.c:1646:init_one_device] error obtaining device attributes for mlx5_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x756ceb0 valid_mask = 0x3)
[r2i0n0][[22721,1],3][btl_openib_component.c:1646:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x756ceb0 valid_mask = 0x3)
[r2i0n0][[22721,1],9][btl_openib_component.c:1646:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x756cf70 valid_mask = 0x3)
[r2i0n0][[22721,1],5][btl_openib_component.c:1646:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument

WARNING: There was an error initializing an OpenFabrics device.

Local host: r2i0n0
Local device: mlx5_1
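
On a 150-node run this output gets long quickly. Here is a rough sketch for counting the failures per host and device from a captured log, assuming the line format shown above. The regular expression and the function name `failures_by_device` are my own, not part of Open MPI.

```python
import re
from collections import Counter

# Matches lines such as:
# [r2i0n0][[22721,1],2][btl_openib_component.c:1646:init_one_device] error obtaining device attributes for mlx5_0 ...
DEVICE_RE = re.compile(
    r"\[(\w+)\]\[\[\d+,\d+\],\d+\].*error obtaining device attributes for (\w+)"
)


def failures_by_device(log_text):
    """Count 'error obtaining device attributes' lines per (host, device)."""
    counts = Counter()
    for log_line in log_text.splitlines():
        m = DEVICE_RE.search(log_line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


sample = """\
[r2i0n0][[22721,1],2][btl_openib_component.c:1646:init_one_device] error obtaining device attributes for mlx5_0 errno says Invalid argument
[r2i0n0][[22721,1],3][btl_openib_component.c:1646:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument
[r2i0n0][[22721,1],9][btl_openib_component.c:1646:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument
"""
for (host, device), n in sorted(failures_by_device(sample).items()):
    print(host, device, n)
```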

@xinzhao3
Contributor

xinzhao3 commented Jul 11, 2019 via email
