Fix/local bind #5819
I have some concerns about this patch. Yes, it works around the immediate problem of the bind failing, but it also breaks an assumption that we added to work around different failures. It seems like we're just playing whack-a-mole and papering over more fundamental problems in our low-level selection logic. Now we can't count on a certain behavior in the connection logic, so it's even harder to understand when things break.
What is the problem with the bind failing that concerns you so much? In most cases (and OSes) the bind will work as expected, and nobody will see a difference. In all other situations I don't see why a failed bind should be a fatal error. Honestly, I think the bind was a band-aid for well-configured systems, something that is hard to find IRL. Anyway, I am looking forward to your patch.
Guys - did a quick search on this and it may be due to a recent OSX security change that prevents binding to port 0 unless the app has been "entitled" for incoming connections. Still digging a bit, but this discussion shows up in several places online. One solution I saw given:
but that seems awkward to instruct everyone to do. This fix seems cleaner to me.
@bosilca the bind() change had nothing to do with multiple IPs on the same interface. It had to do with the behavior of most modern OSes whenever there are multiple paths to the same remote endpoint, whether that is multiple interfaces with routes to each other or multiple IPs on the same interface. Basically, any time there are two interfaces and both have 0/0 routes, we were counting on a behavior that wasn't safe to count on. There are two solutions, and we (it appears) chose poorly.
The first is what Jordan did, which was to bind the socket to the source interface we intended to use for the given module to talk to the remote endpoint. The intended source IP is then definitively known at the destination, because there's only one source IP that can be used (as opposed to an unbound socket, where the OS chooses the source IP based on routes and guessing). This worked everywhere, but it appears MacOS has made it difficult to do a port 0 bind, for reasons that don't make any sense.
The second solution, which, looking back, would have been the better choice, is to send the IP as part of the hello message. It's a bit more code change on the receiver side of the connection, but it is also probably a bit more resistant to silly OS changes.
The problem with this patch is that we, the developers, now don't know whether the OS is guessing the source IP or we're actually choosing it, because there's no indication anywhere that the bind failed. So if there's another instance of the dreaded "IP address that is unexpected" error (which, by the way, can arise in two situations: the IP really wasn't expected, or we've already received all expected connections for the given IP), we don't know whether it's the OS playing routing tricks or not.
Given that bind() with port = 0 is supposed to succeed and this broken behavior appears limited to OS X, I can see two solutions from here: 1) add a configure test for bind() behavior and either always set it or never set it, based on something that is repeatable and reportable in config.log, or 2) implement the full "send the IP in the hello packet" patch. I may have time to do the first this week; the second is what we should do, but I'm not sure I'll get to it this week.
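For readers following along, the first approach described above (bind the outgoing socket to the intended source address with port 0, then connect) can be sketched roughly as follows. This is an illustrative sketch only, not the Open MPI code: `connect_from` is a hypothetical helper, it handles IPv4 only, and unlike the patch under discussion it reports a failed bind rather than ignoring it.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not Open MPI code): pin the source IPv4 address of an
 * outgoing TCP connection by binding to it with port 0, then connect to dst.
 * Returns the connected socket, or -1 on failure. */
static int connect_from(const struct sockaddr_in *src,
                        const struct sockaddr_in *dst)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    struct sockaddr_in local = *src;
    local.sin_port = 0; /* port 0: let the OS pick an ephemeral port */

    /* Pass the size of the concrete address type, not sockaddr_storage. */
    if (bind(fd, (const struct sockaddr *)&local, sizeof(local)) < 0) {
        fprintf(stderr, "bind() failed: %s (%d)\n", strerror(errno), errno);
        close(fd);
        return -1;
    }

    if (connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
        fprintf(stderr, "connect() failed: %s (%d)\n", strerror(errno), errno);
        close(fd);
        return -1;
    }
    return fd;
}
```

With the socket bound this way, the peer sees exactly the source address that was chosen, instead of whatever the OS routing table would have selected for an unbound socket.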
The small test code below has no problem with a bind() to port 0, so there is some evidence that something else is going wrong in the code and we're just papering over the problem. More investigation coming.
Isn't that the unfortunate cast on the bind call, where we go from a sockaddr_storage to a sockaddr?
I think the cast is ok, however the real underlying bug is that the size passed to bind() is wrong. We're passing |
Indeed, the larger socklen passed to bind() is the culprit. I fixed that part and also improved the output when bind() fails (though it would not have helped in this case).
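A small probe along these lines can be used to compare address lengths on a given system. It is a sketch, not the test posted in this thread: `try_bind_len` is a hypothetical name, and the result for `sizeof(struct sockaddr_storage)` is platform-dependent (this thread reports that MacOS rejects it; other systems may accept it).

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical probe: bind a fresh TCP socket to 127.0.0.1, port 0, passing
 * `len` as the address length. Returns 0 on success, otherwise the errno
 * value reported by socket() or bind(). */
static int try_bind_len(socklen_t len)
{
    struct sockaddr_storage ss;
    memset(&ss, 0, sizeof(ss));

    struct sockaddr_in *sin = (struct sockaddr_in *)&ss;
    sin->sin_family = AF_INET;
    sin->sin_port = 0; /* ephemeral port */
    sin->sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return errno;
    }

    int err = 0;
    if (bind(fd, (struct sockaddr *)&ss, len) < 0) {
        err = errno;
    }
    close(fd);
    return err;
}
```

Calling `try_bind_len(sizeof(struct sockaddr_in))` and `try_bind_len(sizeof(struct sockaddr_storage))` side by side shows whether the local OS cares about the oversized length.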
opal/mca/btl/tcp/btl_tcp_endpoint.c
CLOSE_THE_SOCKET(btl_endpoint->endpoint_sd);
return OPAL_ERROR;
}
sockaddr_addrlen) < 0) {
This isn't right; the third argument should be sizeof(struct sockaddr_in), not sizeof(struct sockaddr). Similarly, the one below should be struct sockaddr_in6.
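One way to keep the length and the address family in step is to derive the length from the family stored in the address itself. The helper below is a hypothetical sketch (the name `sockaddr_len_for_family` is not from the Open MPI source), shown only to illustrate the point of the review comment.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: return the length bind()/connect() expect for the
 * address family stored in `sa`, or 0 if the family is not handled here. */
static socklen_t sockaddr_len_for_family(const struct sockaddr *sa)
{
    switch (sa->sa_family) {
    case AF_INET:
        return (socklen_t)sizeof(struct sockaddr_in);
    case AF_INET6:
        return (socklen_t)sizeof(struct sockaddr_in6);
    default:
        return 0; /* unknown family: caller should treat this as an error */
    }
}
```

A caller holding a `struct sockaddr_storage` would cast it to `struct sockaddr *`, ask this helper for the length, and pass that value (rather than `sizeof(struct sockaddr_storage)`) as the third argument to bind().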
opal/mca/btl/tcp/btl_tcp_endpoint.c
opal_net_get_hostname((struct sockaddr*) &btl_endpoint->endpoint_btl->tcp_ifaddr),
htons(((struct sockaddr_in*)&btl_endpoint->endpoint_btl->tcp_ifaddr)->sin_port),
strerror(opal_socket_errno), opal_socket_errno);
BTL_ERROR(("bind() failed: %s (%d)", strerror(opal_socket_errno), opal_socket_errno));
I'm not sure I understand why we'd have two levels of output here. It seems better to just always print the full debugging output.
Get Brian's patch from open-mpi#5825 and his log message:
Fix a failure in binding the initiating side of a connection on MacOS. MacOS doesn't like passing the size of the storage structure (sockaddr_storage) instead of the expected size of the structure (sockaddr_in or sockaddr_in6), which was causing bind() failures. This patch simply changes the structure size to the expected size.
Add a more clear error message in debug mode.
Signed-off-by: George Bosilca <[email protected]>
Thanks!
This should fix Ralph's issue on OSX. In particular, I was not able to figure out why OSX refuses to bind to a local IP with port 0, but for some reason it does. Ignoring the failed bind and issuing the connect to the expected peer makes things right.
Fixes #5815