BTL TCP on macOSX broken in v4.0.x and master #5815

Closed
gpaulsen opened this issue Oct 1, 2018 · 8 comments · Fixed by #5819

Comments

@gpaulsen
Member

gpaulsen commented Oct 1, 2018

As also reported by @rhc54 and @jsquyres in https://www.mail-archive.com/[email protected]//msg20758.html

 mpirun -np 2 --mca btl "tcp,self" ./imb.x
[GRP-MBP][[39843,1],0][btl_tcp_endpoint.c:731:mca_btl_tcp_endpoint_start_connect] bind() failed: Invalid argument (22)
[GRP-MBP:04122] *** An error occurred in MPI_Bcast
[GRP-MBP:04122] *** reported by process [2611150849,0]
[GRP-MBP:04122] *** on communicator MPI_COMM_WORLD
[GRP-MBP:04122] *** MPI_ERR_OTHER: known error not in list
[GRP-MBP:04122] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[GRP-MBP:04122] ***    and potentially your MPI job)
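
The log shows bind() failing inside the connect path (mca_btl_tcp_endpoint_start_connect) with errno 22, which is EINVAL on both macOS and Linux. As a rough illustration only, using plain Python sockets on the loopback interface rather than Open MPI code, the bind-then-connect pattern and the errno decoding might look like the sketch below. The helper names `describe_errno`, `connect_from`, and `demo` are hypothetical.

```python
import errno
import os
import socket


def describe_errno(code):
    """Return the symbolic name and message for an errno value."""
    return f"{errno.errorcode.get(code, 'UNKNOWN')}: {os.strerror(code)}"


def connect_from(source_ip, dest):
    """Bind an outgoing TCP socket to source_ip, then connect to dest.

    Hypothetical helper for illustration; returns the connected socket.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 lets the kernel pick an ephemeral source port.
        sock.bind((source_ip, 0))
        sock.connect(dest)
    except OSError:
        # Do not leak the descriptor if bind() or connect() fails.
        sock.close()
        raise
    return sock


def demo():
    """Connect to a throwaway loopback listener and report the source IP."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = connect_from("127.0.0.1", listener.getsockname())
    local_ip = client.getsockname()[0]
    client.close()
    listener.close()
    return local_ip
```

Calling `describe_errno(22)` decodes the number printed in the log, and `demo()` exercises the same bind-before-connect sequence on an address that is known to be valid. This does not reproduce the macOS failure; it only shows where in the sequence the reported error occurs.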
@gpaulsen gpaulsen changed the title BTL TCP on mac broken in v4.0.x and master BTL TCP on macOSX broken in v4.0.x and master Oct 1, 2018
@gpaulsen
Member Author

gpaulsen commented Oct 1, 2018

I haven't tried v3.1.x or earlier yet.

@bwbarrett bwbarrett self-assigned this Oct 1, 2018
@bosilca
Member

bosilca commented Oct 2, 2018

#5819 is what you are looking for.

@bosilca bosilca self-assigned this Oct 2, 2018
@gpaulsen
Member Author

gpaulsen commented Oct 2, 2018

I'll give it a try tonight.

I did learn that -mca btl self,tcp -mca btl_tcp_if_include lo0 lets an MPI hello world pass, but more complex apps like IMB or rings still fail in MPI_Bcast or MPI_Send.

@bosilca
Member

bosilca commented Oct 2, 2018

This might be the right time to have a clear discussion about what we want to support and what we don't. The only reasons to have multiple IPs on the same interface are load balancing and QoS, two things that we only mildly care about (there is also "because we can", but let's ignore that for now). If we drop all addresses but the one on the first device and use multi-links, we will be able to achieve the same outcome with a greatly simplified selection logic and a cleaner code base. I added this topic to the agenda of the developer meeting in a few weeks.
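
To make the proposal concrete, here is a minimal sketch that reads it as "keep a single address per interface". That reading, the `(interface_name, ip_address)` pair layout, and the function name `first_address_per_interface` are all assumptions for illustration; this is not the TCP BTL's actual selection code.

```python
def first_address_per_interface(addresses):
    """Keep only the first address seen for each interface name.

    `addresses` is an iterable of (interface_name, ip_address) pairs in
    discovery order; the pair layout is an assumption for this sketch.
    """
    selected = {}
    for ifname, ip in addresses:
        # setdefault keeps the first address and ignores later ones.
        selected.setdefault(ifname, ip)
    return selected


# Example input: en0 carries two addresses, lo0 carries one.
discovered = [
    ("en0", "192.0.2.10"),
    ("en0", "192.0.2.11"),
    ("lo0", "127.0.0.1"),
]
chosen = first_address_per_interface(discovered)
```

With one address per interface there is nothing left to match pairwise between peers, which is where the simplification would come from; any load balancing would then have to come from multiple links rather than multiple addresses.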

@gpaulsen
Member Author

gpaulsen commented Oct 2, 2018

I've cherry-picked this to the v4.0.x branch and verified that, with only btl self,tcp, both hello_world and IMB run on macOSX with or without btl_tcp_if_include.

@bosilca bosilca mentioned this issue Oct 2, 2018
@jsquyres
Member

jsquyres commented Oct 2, 2018

I confirm that this is a problem on both master and v4.0.x. It does not happen on v3.1.x.

I'd argue that this is a blocker for v4.0.0.

@jsquyres
Member

jsquyres commented Oct 2, 2018

BTW, it's not clear yet that #5819 is the correct fix (i.e., it fixes the issue, but there's still some debate as to whether it is the correct fix or not).

@gpaulsen
Member Author

gpaulsen commented Oct 3, 2018

Fixed on master and v4.0.x (I also verified), and not an issue on v3.1.x. Closing this issue. Thanks to all who helped out on this one.

@gpaulsen gpaulsen closed this as completed Oct 3, 2018