Thank you for this report! I have seen similar issues on Mac (but not in Linux environments), but hadn't been able to reproduce them cleanly or describe them this clearly.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
How are you using LightGBM?
We are using LightGBM in our project Mars to train models in a distributed fashion. When we test our modules with separate processes to mock what happens in a distributed environment, sometimes two processes cannot connect to each other and LightGBM retries the connection in an infinite loop.
We discovered that all ports are opened successfully. The cause of the connection failure is in
LightGBM/src/network/linkers_socket.cpp
Lines 200 to 210 in ac706e1
When a connection attempt fails, the same socket handle is reused for the next attempt, the OS reports a bad file descriptor, and no later connection attempt can succeed.
We created a PR that fixes this by recreating the socket handle on every connection attempt.
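For readers who want to see the idea in isolation, here is a minimal sketch of "recreate the socket handle on every connection attempt". It is written in Python, not in LightGBM's C++, and it is not the code from the PR. The function name connect_with_retry and its parameters are hypothetical, and the bounded number of attempts is an assumption added so the sketch terminates. The reasoning behind it: POSIX leaves the state of a socket unspecified after a failed connect(), so closing the handle and opening a new one is the portable way to retry.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=0.1, timeout=1.0):
    """Connect to (host, port), using a fresh socket for every attempt.

    Hypothetical helper for illustration only; not part of LightGBM.
    """
    last_error = None
    for _ in range(attempts):
        # New handle per attempt: the old one is never reused after a failure.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            return sock  # caller owns the connected socket
        except OSError as exc:
            last_error = exc
            sock.close()  # discard the failed handle before retrying
            time.sleep(delay)
    raise ConnectionError(f"could not connect to {host}:{port}") from last_error
```

A caller would use it as sock = connect_with_retry("127.0.0.1", port) and close the returned socket when done. Reusing a single handle across attempts may appear to work on some platforms, which would be consistent with the issue showing up on macOS but not on Linux.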
Choose one of the following components
Environment info
Operating System: MacOS 10.15.7
CPU/GPU model: Intel Core i7
Python version: 3.8.5
LightGBM version or commit hash: ac706e1
Error message and / or logs
Reproducible example(s)
We made a minimal example to demonstrate the issue.