Host IP resolution problem w/ dask on kubernetes or dask-gateway #5765
Comments
I'm trying to expose the rabit context through dask interface. Hopefully I can get it done in this release.
Thanks @trivialfis !
Am I correct in reading that PR ( #6142 ) may address this?
We spent a fair amount of time on this. We were able to get it working on local testing with k8s, but failed on GKE. I will try resolving dask gateway issue first.
Quick note: After resolving the IP issue, dask failed at gathering data partitions on GKE.
Hi!
There are two different failure modes here, but I think the solution to them is the same, so I'm going to keep them bundled. Both of these come up using the bundled `dask` helpers, and in both cases (k8s & dask-gateway) I believe the culprit is here: https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/dask.py#L360-L367
For a dask cluster deployed with a helm chart, the scheduler pod is exposed via a kubernetes service. The service acts as a reverse proxy, providing a static route to the scheduler pod (that would survive the scheduler pod restarting).
Currently, `xgboost.dask` isn't compatible with this deployment pattern:
The failure here is because `client.scheduler.address` points to the service IP, but the underlying scheduler pod can't open a port on the service address, so it barfs.
One workaround is to run `hostname -i` via `subprocess` on the scheduler using `client.run_on_scheduler`, then reconnect to the scheduler via the pod IP.
I think this can be generalized using something like the fix proposed here (for the same issue): dask/dask-xgboost#40
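For reference, here is a rough sketch of the `hostname -i` workaround described above. The helper names `pod_ip` and `scheduler_pod_address` are made up for illustration, and the sketch assumes the scheduler pod listens on the same port that the service exposes, which may not hold for every chart.

```python
import subprocess


def pod_ip() -> str:
    """Return the first address printed by `hostname -i` on the machine running this."""
    out = subprocess.check_output(["hostname", "-i"], text=True)
    return out.split()[0]


def scheduler_pod_address(client) -> str:
    """Build a tcp:// address that points at the scheduler pod instead of the service.

    `client` is expected to be a dask.distributed.Client (or anything with the
    same `run_on_scheduler` method and `scheduler.address` attribute).
    """
    # Run the lookup inside the scheduler process, so we get the pod's own IP.
    ip = client.run_on_scheduler(pod_ip)
    # Assumption: the pod port equals the port in the service address.
    port = client.scheduler.address.rsplit(":", 1)[1]
    return f"tcp://{ip}:{port}"
```

A new `Client(scheduler_pod_address(client))` could then be passed to the `xgboost.dask` helpers, provided the pod IP is routable from wherever the client runs.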
In short, perform hostname lookup on the scheduler (or have an exposed way to choose how to resolve the host IP).
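Something along these lines is what I have in mind. This is only a sketch: `local_ip` and `resolve_tracker_host` are hypothetical names, not existing `xgboost.dask` functions, and `socket.gethostbyname(socket.gethostname())` is just one way to do the lookup (it can return a loopback address on some hosts).

```python
import socket


def local_ip() -> str:
    """Resolve this machine's hostname to an IP address."""
    return socket.gethostbyname(socket.gethostname())


def resolve_tracker_host(client, host_ip=None) -> str:
    """Pick the host IP for the tracker.

    If the caller passes `host_ip`, trust it; otherwise look the address up
    on the scheduler itself rather than parsing `client.scheduler.address`.
    """
    if host_ip is not None:
        return host_ip
    # Executes local_ip() inside the scheduler process.
    return client.run_on_scheduler(local_ip)
```

The explicit `host_ip` argument is the "exposed way to choose" part: it leaves an escape hatch for setups where the automatic lookup picks the wrong interface.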
I think that performing hostname lookup this way will also allow `xgboost.dask` to work with clusters spun up via a dask-gateway. These currently time out because the hostname parsing of `client.scheduler.address` doesn't know how to deal with an SNI-routed scheduler address (example of the address type below).