Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/local bind #5819

Merged
merged 2 commits into from
Oct 2, 2018
Merged

Fix/local bind #5819

merged 2 commits into from
Oct 2, 2018

Conversation

bosilca
Copy link
Member

@bosilca bosilca commented Oct 2, 2018

This should fix Ralph issue on OSX. In particular, I was not able to figure out why OSX refuses to bind on a local IP with port 0, but for some reasons it does. Ignoring the failed bind and issuing the connect to the expected peer makes things right.

Fixes #5815

@bwbarrett
Copy link
Member

I have some concerns about this patch. Yes, it works around the immediate problem of the bind failing, but also breaks an assumption that we added to work around different failures. Seems like we’re just playing wack-a-mole and papering over more fundamental problems in our low level selection logic. Now we can’t count on a certain behavior in connection logic, so it’s even harder to understand when things break.

@bosilca
Copy link
Member Author

bosilca commented Oct 2, 2018

What is the problem with the bind failing that concerns you so much ? In most cases (and OSes) the bind will work as expected, and nobody will see a difference. On all others situations I don't see why failing the bind should be a fatal error. Honestly, I think the bind was a bandaid for well configured system, something that is hard to find irl. Anyway, I am looking forward to your patch.

@rhc54
Copy link
Contributor

rhc54 commented Oct 2, 2018

Guys - did a quick search on this and it may be due to a recent OSX security change that prevents binding to socket 0 unless the app has been "entitled" for incoming connections. Still digging a bit, but this discussion shows up in several places online. One solution I saw given:

Network: Incoming Connections entitlement. That entitlement can be added
in Xcode under the App Sandbox details in the Capabilities tab of the target settings.

but that seems awkward to instruct everyone to do. This fix seems cleaner to me.

@bwbarrett
Copy link
Member

@bosilca the bind() change had nothing to do with multiple IPs on the same interface. It had to do with behaviors of most modern OSes whenever there are multiple paths to the same remote endpoint, whether it be multiple interfaces with routes to each other or multiple IPs on the same interface. Basically, any time there are two interfaces and both have 0/0 routes, we were counting on a behavior that wasn't safe to count on.

There are two solutions, and we (it appears) chose poorly. The first is what Jordan did, which was to bind the socket to the source interface we intended to use for the given module to talk to the remote endpoint. The intended source IP is then definitively known at the destination, because there's only one source IP that can be used (as opposed to an unbound socket, where the OS chooses the source IP based on routes and guessing). This worked everywhere, but it appears MacOS has made it difficult to do a port 0 bind, for reasons that don't make any sense.

The second solution, which, looking back, would have been the better choice, is to send the IP as part of the hello message. It's a bit more code change on the receiver side of the connection, but is also probably a bit more resistant to silly OS changes.

The problem with this patch is that we the developers now don't know if the OS is guessing the source IP or we're actually choosing it, because there's no indication of the bind failing, anywhere. So if there's another of the dreaded "IP address that is unexpected" error (which, by the way, can arrive in two situations: the IP really wasn't expected, or we've already received all expected connections for the given IP), we don't know whether it's the OS playing routing tricks or not.

Given that bind() with port = 0 is supposed to succeed and this broken behavior appears limited to OS X, I can see two solutions from here: 1) configure test for bind() behavior and either always set it or never set it based on something that is repeatable reportable in config.log or 2) implement the full send the IP in the hello packet patch. I may have time to do the first this week, the second is what we should do, but I'm not sure I'll get to the second this week.

@bwbarrett
Copy link
Member

The small test code below doesn't have a problem with bind() a 0 port, so there's potentially some evidence there's something else going wrong in the code and we're just papering over the problem. More investigation coming.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>

int
main(int argc, char *argv[])
{
  int rc, sd;

  struct sockaddr_in address;

  address.sin_family = AF_INET;
  address.sin_port = 0;
  address.sin_addr.s_addr = inet_addr("127.0.0.1");

  sd = socket(PF_INET, SOCK_STREAM, 0);
  if (sd < 0) {
    printf("socket(): %d %s\n", sd, strerror(errno));
    return 1;
  }

  rc = bind(sd, (struct sockaddr*)&address, sizeof(address));
  if (rc < 0) {
    printf("bind(): %d %s\n", rc, strerror(errno));
    return 1;
  }

  return 0;
}

@bosilca
Copy link
Member Author

bosilca commented Oct 2, 2018

Isn't that the unfortunate cast on the bind call where we go from a sockaddr_storage to sockaddr via a cast ?

@bwbarrett
Copy link
Member

Isn't that the unfortunate cast on the bind call where we go from a sockaddr_storage to sockaddr via a cast ?

I think the cast is ok, however the real underlying bug is that the size passed to bind() is wrong. We're passing sizeof(struct sockaddr_storagte) when we should be passing sizeof(struct sockaddr_in). With this change, MacOS is happy again. Putting together a patch that fixes the issue; will post when IPv6 build finishes. Recommend we close this PR, as it's not the underlying fix.

@bosilca
Copy link
Member Author

bosilca commented Oct 2, 2018

Indeed, the larger socklen to bind is the culprit. I fixed that part and also improved the output when bind fails (it would not have helped in this case thou).

CLOSE_THE_SOCKET(btl_endpoint->endpoint_sd);
return OPAL_ERROR;
}
sockaddr_addrlen) < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't right; the third argument should be sizeof(struct sockaddr_in), not sizeof(struct sockaddr). Similarly, the one below should be struct sockaddr_in6.

opal_net_get_hostname((struct sockaddr*) &btl_endpoint->endpoint_btl->tcp_ifaddr),
htons(((struct sockaddr_in*)&btl_endpoint->endpoint_btl->tcp_ifaddr)->sin_port),
strerror(opal_socket_errno), opal_socket_errno);
BTL_ERROR(("bind() failed: %s (%d)", strerror(opal_socket_errno), opal_socket_errno));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why we'd have two levels of output here. Seems like better to just always print the full debugging.

Get Brian's patch from open-mpi#5825 and his log message:
Fix a failure in binding the initiating side of a connection
on MacOS. MacOS doesn't like passing the size of the storage
structure (sockaddr_storage) instead of the expected size of
the structure (sockaddr_in or sockaddr_in6), which was causing
bind() failures. This patch simply changes the structure size
to the expected size.

Add a more clear error message in debug mode.

Signed-off-by: George Bosilca <[email protected]>
Signed-off-by: George Bosilca <[email protected]>
@rhc54 rhc54 merged commit 31f6c75 into open-mpi:master Oct 2, 2018
@rhc54
Copy link
Contributor

rhc54 commented Oct 2, 2018

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BTL TCP on macOSX broken in v4.0.x and master
4 participants