v2.x: XRC UCDM openib failures while running Mellanox CI #3890
@hppritcha @jsquyres @hjelmn @bwbarrett @miked-mellanox @jladd-mlnx @Di0gen
Ah, so there was an actual failure, it was just silent? Got it. Thanks for tracking it down and making it non-silent for the future.
With pml/yalla and pml/ucx those tests are fine as well.
What is the right way to proceed?
In reading the description of the bug, I didn't realize it was a real openib problem -- I think most people will have missed the logfile you put at the bottom of the description. @hjelmn @hppritcha @bharatpotnuri There appears to be a problem with XRC in the openib BTL in the v2.x branch right now. See below for a snippet from the logfile Artem included earlier in the ticket. Who will fix this?
I temporarily disabled openib for the v2.x branch to allow PRs to be tested.
Looks like we need to cherry pick 56bdcd0
Before dynamic add_procs support was committed to master, we called add_procs with every proc in the job. The XRC code in the openib btl was taking advantage of this and setting the number of work queue entries (WQE) based on all the procs on a remote node. Since that is no longer the case, we cannot simply increment the sd_wqe field on the queue pair. To fix the issue, a new field has been added to the xrc queue pair structure to keep track of how many wqes there are in total on the queue pair. If a new endpoint is added that increases the number of wqes and the xrc queue pair is already connected, the code will attempt to modify the number of wqes on the queue pair. A failure is ignored because all that will happen is that the number of active send work requests on an XRC queue pair will be more limited. related to open-mpi#1721 fixes open-mpi#3890 Signed-off-by: Nathan Hjelm <[email protected]> Signed-off-by: Howard Pritchard <[email protected]> (cherry picked from commit 56bdcd0)
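The accounting described in that commit message can be illustrated with a small sketch. This is not the openib BTL code: the struct `xrc_qp_sketch`, the `total_wqes` field name, the `xrc_resize_fn` hook and the `resize_ok`/`resize_fail` helpers are all hypothetical names made up for illustration. Only `sd_wqe` is a name taken from the commit message. The sketch assumes that a queue pair which is not yet connected can simply take the full total, and that a connected one needs an explicit resize attempt whose failure is ignored.

```c
#include <assert.h>

/* Hypothetical stand-in for the XRC queue pair state; not the real openib BTL struct. */
struct xrc_qp_sketch {
    int connected;   /* nonzero once the queue pair is connected */
    int total_wqes;  /* the "new field": WQEs requested by all endpoints so far */
    int sd_wqe;      /* send WQEs the queue pair can actually use */
};

/* Hypothetical resize hook: returns 0 if the connected QP accepted new_total. */
typedef int (*xrc_resize_fn)(struct xrc_qp_sketch *qp, int new_total);

static int resize_ok(struct xrc_qp_sketch *qp, int new_total)
{
    (void)qp; (void)new_total;
    return 0;
}

static int resize_fail(struct xrc_qp_sketch *qp, int new_total)
{
    (void)qp; (void)new_total;
    return -1;
}

static void xrc_add_endpoint(struct xrc_qp_sketch *qp, int ep_wqes, xrc_resize_fn resize)
{
    /* Always record the demand, whether or not the QP can grow right now. */
    qp->total_wqes += ep_wqes;

    if (!qp->connected) {
        /* Not connected yet: the QP can still be set up with the full total. */
        qp->sd_wqe = qp->total_wqes;
        return;
    }

    /* Already connected: try to grow the QP. A failure is deliberately ignored;
     * the only effect is that fewer send work requests can be in flight. */
    if (resize(qp, qp->total_wqes) == 0) {
        qp->sd_wqe += ep_wqes;
    }
}

static void xrc_sketch_demo(void)
{
    struct xrc_qp_sketch qp = { 0, 0, 0 };

    xrc_add_endpoint(&qp, 4, resize_ok);    /* endpoint added before connect */
    assert(qp.total_wqes == 4 && qp.sd_wqe == 4);

    qp.connected = 1;
    xrc_add_endpoint(&qp, 4, resize_ok);    /* connected, resize succeeds */
    assert(qp.total_wqes == 8 && qp.sd_wqe == 8);

    xrc_add_endpoint(&qp, 4, resize_fail);  /* connected, resize fails and is ignored */
    assert(qp.total_wqes == 12 && qp.sd_wqe == 8);
}
```

In the last step the total keeps growing while `sd_wqe` stays at 8, which is the "more limited" outcome the commit message accepts.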
Per discussions at the devel core meeting on 7/25/17, a PR will be opened to disable XRC support in the openib BTL. Mellanox will try to determine why the test fails on their config over the next week. If no progress is made by next Tuesday, this PR will get merged into master.
@hppritcha, I checked and we won't be able to get to this in the near future.
Change the default of the XRC configure enable option to disabled. If a user wants to give it a try, they have to explicitly ask for it. Modify the configury help message to indicate it is not enabled by default. Related to open-mpi#3890 Fixes open-mpi#3969 Signed-off-by: Howard Pritchard <[email protected]>
Disable XRC support for OpenIB BTL Related to open-mpi#3890 Fixes open-mpi#3969 Signed-off-by: Howard Pritchard <[email protected]> (cherry picked from commit 8223d4c) Conflicts: NEWS
disable XRC in OpenIB BTL due to lack of support. Related to open-mpi#3890 Fixes open-mpi#3969 Signed-off-by: Howard Pritchard <[email protected]> (cherry picked from commit 8223d4c) (cherry picked from commit c22a7c7) Conflicts: NEWS config/opal_check_openfabrics.m4
disable XRC in OpenIB BTL due to lack of support. Related to open-mpi#3890 Fixes open-mpi#3969 Signed-off-by: Howard Pritchard <[email protected]> (cherry picked from commit 8223d4c) Conflicts: NEWS
I think this was addressed long ago.
Background information
Silent Mellanox Jenkins failures were observed recently.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Failures seem to be observed on the GitHub v2.x branch only.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Regular Mellanox CI build
Please describe the system on which you are running
Details of the problem
The following command silently fails:
While expected output is
Same command with btl/tcp works fine:
Here is a more detailed log (with btl verbose on):
openib_failure.txt
The Mellanox Jenkins script has been updated to output the exit status, so in the future this behavior will not cause such confusion.
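As a rough illustration of surfacing an exit status so that a failing test cannot pass silently, here is a minimal POSIX sketch. It is not the Mellanox Jenkins script, which is not shown in this ticket; the helper name `run_and_report` and the convention of reporting a signal as 128 plus the signal number are assumptions made for this example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper: run a shell command, print its exit status, and return it.
 * Returns -1 if the command could not be run or ended in an unexpected way. */
static int run_and_report(const char *cmd)
{
    int status = system(cmd);
    int code;

    if (status == -1) {
        code = -1;                        /* the child process could not be created */
    } else if (WIFEXITED(status)) {
        code = WEXITSTATUS(status);       /* normal exit: report the exit code */
    } else if (WIFSIGNALED(status)) {
        code = 128 + WTERMSIG(status);    /* killed by a signal: shell-style encoding */
    } else {
        code = -1;
    }

    /* Make the result visible in the job log even when the command printed nothing. */
    fprintf(stderr, "command '%s' exit status: %d\n", cmd, code);
    return code;
}
```

A CI wrapper built along these lines would fail the job whenever the returned value is nonzero, instead of relying on the command's own output.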