[Dask] Race condition in finding ports #5865
Comments
Hi @adfea9c0, thanks for raising this. Indeed that can happen for the reasons you describe. The optimal solution IMO would be to allow passing 0 as the port to the C++ side and have it find a random open port and acquire it immediately. I'll take a look soon to see if that'd be possible.
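For illustration, here's a minimal Python sketch of that idea (illustrative only, not LightGBM's actual C++ networking code): binding to port 0 makes the OS pick a currently-free port and hold it for that socket, so there is no gap between discovering the port and owning it.

```python
import socket

# Minimal sketch of the "pass 0 as the port" idea; this is illustrative
# Python, not the C++ code in LightGBM.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("", 0))              # the OS assigns a currently-free port
port = s.getsockname()[1]    # the port is already held by this socket
print(f"acquired port {port}")
# keep `s` open for as long as the port needs to stay reserved
s.close()
```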
+1 thanks for reporting this so we have a specific issue to track. Linking some other related things:
@jmoralez I have one idea to consider that might be a quick way to make these issues less likely (inspired by @adfea9c0's comment above). Maybe we could change the contract for the code linked here:

LightGBM/python-package/lightgbm/dask.py, line 323 in d0dfcee

All the time spent between the moment the ports are found and the moment training actually starts is time in which another process could claim one of those ports, so shrinking that window should make this race much less likely.

I mention this Python-side approach just because I think the C++ side's use of collective communications (where every worker process can talk directly to every other worker process, and therefore they all have to be initialized with that full list of worker addresses and ports) makes it hard to just have each worker pick its own port on its own.
Would it be possible to have LightGBM permanently claim the port as soon as it finds it?
Maybe, but the exact mechanism of "permanently claim" could be difficult. Right now, LightGBM's distributed training code in C++ assumes that every worker knows the full list of `ip:port` addresses of all the other workers at initialization time, and that each worker binds its own port itself when training starts.

And on the Dask side, it's important to remember that the ports are chosen by Python code before the C++ networking code is running, and that there's no existing way to hand an already-bound socket off to it.

So what would it mean to "permanently claim" a port on a worker's behalf before the C++ side is even running? That might be possible, but I think it'd be difficult to get right in a way that doesn't leak sockets or processes.

I don't say all this to discourage an attempt to do this on the C++ side, just saying that the suggestion I gave above about reducing the time between determining a set of ports to use and actually starting up training is where I'd probably start if I was working on this problem, since it's easier to reason about and requires less-invasive changes in LightGBM (at the expense of only benefiting the Dask interface).
Yeah I agree that's a good thing to try first. We just need to figure out a way to do it. I don't think sockets are serializable, so we may need to use actors or something similar. I can work on that and open a PR.
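For what it's worth, here's a hedged sketch of what an actor-based approach might look like (the `PortHolder` class and its methods are made up for illustration, not an existing LightGBM or Dask API): an actor lives in one worker process and can keep a bound socket open there, so the socket itself never has to be serialized.

```python
import socket

# Hypothetical sketch of the "actors" idea; PortHolder is not part of
# LightGBM or Dask, just an illustration.
class PortHolder:
    def __init__(self):
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._sock.bind(("", 0))          # claim a free port and keep it bound
        self.port = self._sock.getsockname()[1]

    def release(self) -> None:
        self._sock.close()                # give the port up right before training

# usage sketch (assumes an existing dask.distributed.Client named `client`):
#   holder = client.submit(PortHolder, actor=True, workers=[worker_address]).result()
#   port = holder.port       # read the claimed port from the actor
#   ...
#   holder.release()         # called just before the training code binds the port
```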
I have a much worse, Dask-specific half-idea, but maybe it could work... could we run a function to claim and never release a port on each worker...

```python
import socket
import time

def _claim_a_port_until_killed(port: int):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', port))
        while True:
            time.sleep(10)

futures = client.run(_claim_a_port_until_killed)
```

And then somehow have each `_train_part()` release its worker's claim right before training starts?

```python
def _train_part(...):
    # ... all that data-collecting code ... #
    with worker_client() as client:
        client.cancel(the_future_holding_the_port_for_this_worker)
    model.fit()
    # ...
```
(I won't be offended if you say "absolutely not, that's way too weird")
Oh I see, so when you claim a port in Python, the C++ process is not yet running, and you want to claim a port first since the C++ process expects ports to be known upon initialization?
I didn't know this -- does machine here really mean physical machine? I think I've previously had some Dask workers run on the same physical box and I don't think I ran into issues, but maybe I misremember. But if this is the case, an intermediate solution for me personally would be to a) enforce a single worker per physical machine (*), and b) just locally reserve a port for training across our network and pass that in with `local_listen_port`.

(*) I've personally noticed LightGBM benefits from few workers with many threads rather than many workers with few threads anyway.
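If it helps, a hedged sketch of that intermediate setup, assuming one Dask worker per machine and a port reserved cluster-wide for LightGBM (13000 here is made up; `client`, `dX`, and `dy` are assumed to already exist):

```python
import lightgbm as lgb

# Sketch of the workaround: every worker machine binds the same pre-agreed
# port, so no random port search (and no race) is needed.
# `client` is an existing dask.distributed.Client; dX/dy are Dask collections.
dask_reg = lgb.DaskLGBMRegressor(
    client=client,
    local_listen_port=13000,  # a port reserved for LightGBM on every machine
    n_estimators=100,
)
dask_reg.fit(dX, dy)
```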
Correct.
Sorry about that! I misspoke and just edited that comment. You can run multiple distributed training processes on the same physical machine, but not within the same process. It's more of a problem on a single-physical-machine setup using something like a multi-process `LocalCluster`.
Reserving a port ahead of time in your cluster should totally eliminate the risk of hitting this race condition. But to be clear, it isn't required that it be via `local_listen_port`. To have the most control, you can pass an explicit list of worker addresses and ports with the `machines` parameter.
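For example, a hedged sketch using the `machines` parameter (the addresses and ports below are made up): each entry is an `ip:port` pair, one per worker, so LightGBM only ever tries to bind ports you chose yourself.

```python
import lightgbm as lgb

# Sketch: explicitly pin each worker to a known ip:port pair.
# `client` is an existing dask.distributed.Client; dX/dy are Dask collections.
dask_reg = lgb.DaskLGBMRegressor(
    client=client,
    machines="10.0.1.1:13000,10.0.1.2:13000,10.0.1.3:13000",
    n_estimators=100,
)
dask_reg.fit(dX, dy)
```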
What is the reason ports need to be selected before distributing the data? Is it because you want the params and training data to go to the workers together in a single task?
Ports need to be decided and broadcast to all workers before distributed training starts, because workers use worker-to-worker communication, as I explained in #5865 (comment). Because they use direct worker-to-worker communication, not some central "driver" process, finding a port for a given worker process can't be done by that worker process itself... it'd have no way to communicate that information to the other workers.
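As a conceptual sketch of why that forces ports to be chosen up front (the dictionary and string handling below are illustrative, not LightGBM's internal code): every worker has to receive the same complete list of `ip:port` pairs before any of them starts.

```python
# Illustrative only: a central mapping from Dask worker address to the port
# LightGBM will use on that worker...
worker_to_port = {
    "tcp://10.0.1.1:35001": 13000,
    "tcp://10.0.1.2:35001": 13001,
}

# ...flattened into one machines-style string that every worker must receive
# before training starts, which is why all ports have to be known in advance.
machines = ",".join(
    f"{addr.split('://')[-1].split(':')[0]}:{port}"
    for addr, port in worker_to_port.items()
)
print(machines)  # "10.0.1.1:13000,10.0.1.2:13001"
```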
Description
Dask LightGBM will sometimes try to bind to ports that were previously free, but are now used by a different program.
Specifically, it seems that the Python code in DaskLGBMRegressor [1] finds open ports, saves the port numbers, and then immediately closes them. After that, the C++ layer tries to reopen the port [2]. This can go wrong when another program binds to the port between these two steps.
I would say most of my runs succeed, but I've run into the 'LightGBMError: Binding port blah failed' error a handful of times now, and I'm fairly confident the above race condition is the issue.
[1] LightGBM/python-package/lightgbm/dask.py, line 86 in d0dfcee
[2] LightGBM/src/network/linkers_socket.cpp, line 128 in d0dfcee
Reproducible example
It's kind of hard to reproduce this reliably since it's effectively a race condition. I hope my description of the issue suffices, let me know if I can do more to help.
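For what it's worth, here's a minimal Python sketch of the pattern I'm describing (illustrative only, not the actual LightGBM code); it just shows where the window for the race is:

```python
import socket

# Step 1: "find" a free port by binding and then immediately releasing it
# (this mirrors the pattern described above, not LightGBM's exact code).
def find_random_open_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        port = s.getsockname()[1]
    return port

port = find_random_open_port()

# ...window of vulnerability: any other program may bind `port` here...

# Step 2: later, training tries to re-bind the same port and can now fail
# with an error like "Binding port ... failed" if something else took it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    listener.bind(("", port))
finally:
    listener.close()
```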
Environment info
I'm using LightGBM 3.3.2 on Dask.