openib error: "ibv_exp_query_device: invalid comp_mask" -- reported by multiple users #5914
We are seeing this at LANL, especially on our ARM machines when using many processes (64 or more).
Seems like it could happen if the openib BTL, or an older version of UCX, is compiled with MLNX_OFED 4.4.
Can you point out the source code of UCX to include it in the MLNX_OFED 4.4 installation?
@abeltre1 Not sure I understand the question. Anyway, the UCX binary is included in the MLNX_OFED 4.4 installation, the source code is at https://github.com/openucx/ucx, and the particular code which avoids the error from ibv_exp_query_device() is here
I did some more checking on one of our clusters and actually do not see the issue when using
I do see this with openib BTL initialization even though it's not even being used:
Under certain circumstances, ibv_exp_query_device was returning an error due to uninitialized fields in the extended attributes struct. Fixes: open-mpi#5810 Fixes: open-mpi#5914 Signed-off-by: Howard Pritchard <[email protected]>
Under certain circumstances, ibv_exp_query_device was returning an error due to uninitialized fields in the extended attributes struct. Fixes: open-mpi#5810 Fixes: open-mpi#5914 Signed-off-by: Howard Pritchard <[email protected]> (cherry picked from commit 8126779)
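The failure mode described in the commit message can be sketched in plain C without the MLNX_OFED headers. Everything below is a hypothetical stand-in: `demo_device_attr`, `demo_query_device`, and `DEMO_VALID_MASK` are illustrative names, not the real experimental verbs API, and the check `comp_mask & ~valid_mask` is an assumption about how the library validates the mask, inferred from the error text. The stale value `0x75699f0` and the valid mask `0x3` are taken from an error log reported in this thread. The stale mask is assigned explicitly so the example stays deterministic instead of reading truly uninitialized memory.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; NOT the real MLNX_OFED definitions. */
#define DEMO_VALID_MASK 0x3u

struct demo_device_attr {
    uint32_t comp_mask; /* caller sets bits for the optional fields it wants */
    uint32_t max_qp;    /* example output field filled in by the query */
};

/* Mock query: rejects any comp_mask bit outside DEMO_VALID_MASK (assumed check). */
static int demo_query_device(struct demo_device_attr *attr)
{
    if (attr->comp_mask & ~DEMO_VALID_MASK) {
        fprintf(stderr,
                "demo_query_device: invalid comp_mask (comp_mask = 0x%x valid_mask = 0x%x)\n",
                (unsigned)attr->comp_mask, (unsigned)DEMO_VALID_MASK);
        return EINVAL;
    }
    attr->max_qp = 1024; /* arbitrary illustrative value */
    return 0;
}

/* Buggy pattern: the struct is passed with whatever comp_mask happened to be there.
 * The leftover value is simulated with the mask printed in the reported log. */
static int query_with_stale_mask(void)
{
    struct demo_device_attr attr;
    attr.comp_mask = 0x75699f0u; /* simulated stale stack contents */
    attr.max_qp = 0;
    return demo_query_device(&attr);
}

/* Fixed pattern: zero the whole struct, then request only known fields. */
static int query_with_zeroed_attr(void)
{
    struct demo_device_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.comp_mask = DEMO_VALID_MASK;
    return demo_query_device(&attr);
}
```

In this sketch `query_with_stale_mask()` returns `EINVAL` because bits outside `0x3` are set, while `query_with_zeroed_attr()` returns 0. Whether the actual fix used `memset` or an explicit assignment to `comp_mask` is not stated in the thread.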
Done merging to release branches.
I've just run into this on an x86-64 cluster with IB here at NERSC and can confirm that this one-line addition fixes it here as well.
Good! We just released v2.1.6 with the fix. It'll eventually come out in new v3.0.x and v3.1.x and v4.0.x releases, too.
Thanks Jeff, be good to get that in (especially as trying to use UCX from MOFED or 1.5.0RC1 seems to cause MPI_Barrier to segfault in the OSU microbenchmarks).
@chrissamuel any chance you have a backtrace of the UCX segfault in MPI_Barrier?
Hi @yosefe, Here you go - I was about to open a bug with the UCX folks about it.
I found that the solution was to restrict the devices to just those intended for MPI with:
All the best,
@yosefe I've created this UCX issue for the crash. openucx/ucx#3145
I have compiled openmpi 2.0.2 with hpcx-v2.2.0-gcc-MLNX_OFED_LINUX-4.4-1.0.0.0-redhat7.3-x86_64 on redhat 7.6 using Intel 2018u2 compiler.
I am using the old version because the latest update does not work on 87++ nodes! (see: ompi/issues/6786)
MLNX_OFED_LINUX-4.6-1.0.1.1 (OFED-4.6-1.0.1)
I am running on 150 nodes as per below:
============
Please use the UCX PML on Mellanox IB systems. It will provide the best out of box experience.
Josh
From: nitinpatil1985 <[email protected]>
Sent: Thursday, July 11, 2019 12:01:40 PM
To: open-mpi/ompi
Cc: Subscribed
Subject: Re: [open-mpi/ompi] openib error: "ibv_exp_query_device: invalid comp_mask" -- reported by multiple users (#5914)
I have compiled openmpi 2.0.2 with hpcx-v2.2.0-gcc-MLNX_OFED_LINUX-4.4-1.0.0.0-redhat7.3-x86_64 on redhat 7.6 using Intel 2018u2 compiler.
I am using the old version because the latest update does not work on 87++ nodes! (see: ompi/issues/6786)
MLNX_OFED_LINUX-4.6-1.0.1.1 (OFED-4.6-1.0.1)
I am running on 150 nodes as per below:
mpirun --mca btl openib,self -mca mtl mxm -mca plm_rsh_no_tree_spawn true -hostfile ${PBS_NODEFILE} ./binary.exe
============
I am getting the following error:
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x75699f0 valid_mask = 0x3)
[r2i0n0][[22721,1],2][btl_openib_component.c:1646:init_one_device] error obtaining device attributes for mlx5_0 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x756ceb0 valid_mask = 0x3)
[r2i0n0][[22721,1],3][btl_openib_component.c:1646:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x756ceb0 valid_mask = 0x3)
[r2i0n0][[22721,1],9][btl_openib_component.c:1646:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument
ibv_exp_query_device: invalid comp_mask !!! (comp_mask = 0x756cf70 valid_mask = 0x3)
[r2i0n0][[22721,1],5][btl_openib_component.c:1646:init_one_device] error obtaining device attributes for mlx5_1 errno says Invalid argument
WARNING: There was an error initializing an OpenFabrics device.
Local host: r2i0n0
Local device: mlx5_1
There have been multiple reports of the openib BTL reporting variations of this error:
I know that openib is on its way out the door, but it's still supported in v2.x and 3.x. Is there a quick/easy fix for this issue?
I am unable to reproduce the issue with ConnectX 5's on Ethernet and the inbox verbs drivers on RHEL 7.2. Is there something that has changed in upstream OFED and/or MOFED that is causing this issue?
#5810 is the most recent activity where this came up, but it has come up in other issues, too (and possibly on mailing lists...?).