Revert "btl/openib: disable XRC in OpenIB BTL" #4082

Closed

artpol84
Contributor

This reverts commit c22a7c7.

@artpol84
Contributor Author

This is a PR to verify the current XRC status for the v2.x branch.

@artpol84
Contributor Author

@hppritcha @jsquyres - it seems like only v2.x is affected by this problem.
I can't reproduce it if I build manually, but if I re-run the command on that server, it can be reproduced.

@artpol84
Contributor Author

I've started experimenting with this and found the following:

  • My manual config wasn't the same as the Jenkins one. I thought it was enough to provide a platform file, but it seems that I also need to export a couple of variables. I'm rebuilding now.
  • I was confused by the "MXM" warnings and started experimenting with the command line. The command below completes successfully (although with a visible, unexpected 2-3 second delay). The only change was replacing -mca pml ob1 with -mca pml ^yalla. More interesting: I originally mistyped the pml name as -mca pml ^mxm, and that run succeeded as well!
/var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/bin/mpirun -np 8 -bind-to none -mca orte_tmpdir_base /tmp/tmp.T7Zw99rNEJ --report-state-on-timeout --get-stack-traces --timeout 900 -mca btl_openib_if_include mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,cm -mca pml ^yalla -mca btl self,openib -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512  taskset -c 4,5 /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/ompi_install1/examples/hello_c
[1502824235.938294] [jenkins03:28369:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.98
[1502824235.938297] [jenkins03:28366:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.98
[1502824235.961031] [jenkins03:28368:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.98
[1502824235.965176] [jenkins03:28364:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.98
[1502824235.965723] [jenkins03:28370:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.98
[1502824235.968194] [jenkins03:28371:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.98
[1502824235.969163] [jenkins03:28365:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.98
[1502824235.971946] [jenkins03:28367:0]         sys.c:744  MXM  WARN  Conflicting CPU frequencies detected, using: 3499.98
Hello, world, I am 3 of 8, (Open MPI v2.1.2rc1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-157-g922e50a, Unreleased developer copy, 144)
Hello, world, I am 1 of 8, (Open MPI v2.1.2rc1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-157-g922e50a, Unreleased developer copy, 144)
Hello, world, I am 5 of 8, (Open MPI v2.1.2rc1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-157-g922e50a, Unreleased developer copy, 144)
Hello, world, I am 7 of 8, (Open MPI v2.1.2rc1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-157-g922e50a, Unreleased developer copy, 144)
Hello, world, I am 2 of 8, (Open MPI v2.1.2rc1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-157-g922e50a, Unreleased developer copy, 144)
Hello, world, I am 6 of 8, (Open MPI v2.1.2rc1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-157-g922e50a, Unreleased developer copy, 144)
Hello, world, I am 4 of 8, (Open MPI v2.1.2rc1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-157-g922e50a, Unreleased developer copy, 144)
Hello, world, I am 0 of 8, (Open MPI v2.1.2rc1, package: Open MPI jenkins@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-157-g922e50a, Unreleased developer copy, 144)

@artpol84
Contributor Author

OK, the reason why -mca pml ^yalla works is that the pml/cm component is selected.

@artpol84
Contributor Author

@jladd-mlnx @vspetrov @bureddy
Apparently coll/hcoll is causing this failure. If I remove the hcoll component from the components directory, everything works fine.

@jsquyres @hppritcha @hjelmn
My apologies for the confusion - it was really not obvious, and everything was pointing to openib.
It sounds funny, but it looks like we should actually merge these PRs to turn XRC back on. We will take the hcoll issue internally, fix it separately, and advise ASAP.

@artpol84
Contributor Author

P.S. I was able to reproduce it manually when I fixed my config script.

@vspetrov

Hi, the bug is definitely inside the XRC support in btl openib. I’ve modified the hello_c.c test by adding MPI_Allreduce after MPI_Init.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank, size, len;
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_library_version(version, &len);

    /* Run one collective so that the processes actually communicate. */
    int sbuf = 1, rbuf = 0;
    MPI_Allreduce(&sbuf, &rbuf, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("Hello, world, I am %d of %d, (%s, %d)\n",
           rank, size, version, len);
    MPI_Finalize();
    return 0;
}

And now the issue is reproduced w/o hcoll:
artemp@jenkins03 ~/scrap/OMPI/builds/ompi/install/bin
XRC:

$./mpirun -np 2 -bind-to none -mca btl_openib_if_include mlx5_0:1 -mca pml ob1 -mca btl self,openib -mca coll ^hcoll -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 ./hello_c
$echo $?
255

No XRC:

$./mpirun -np 2 -bind-to none -mca btl_openib_if_include mlx5_0:1 -mca pml ob1 -mca btl self,openib -mca coll ^hcoll ./hello_c
Hello, world, I am 1 of 2, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g036990e, Unreleased developer copy, 143)
Hello, world, I am 0 of 2, (Open MPI v2.1.2rc1, package: Open MPI artemp@jenkins03 Distribution, ident: 2.1.2rc1, repo rev: v2.1.1-154-g036990e, Unreleased developer copy, 143)

Whenever the issue is observed, the MPI processes receive a SIGCONT signal that for some reason stops them (this can be checked with either GDB or strace).

@jsquyres
Member

So is there an XRC problem in both hcoll and openib?

@jsquyres
Member

Please conduct all followup discussions on the umbrella issue for all the "revert the disable-XRC patch" PRs: #4087.
