lightgbm.dask hangs after worker restarting #5920
Comments
Thanks for using LightGBM. Unfortunately, LightGBM distributed training is not currently resilient to workers being lost during training. See #3775 for some details on that. It's a feature we'd love to add in the future, so if you are familiar with C++, Python, TCP, and collective communication patterns we'd welcome contributions. Otherwise, you will just have to ensure your workers have sufficient memory to survive the training process, and subscribe to #3775 to be notified if/when it's addressed.
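As a rough illustration of "ensure your workers have sufficient memory", the sketch below sizes each worker's memory limit explicitly before training. It is not taken from the original report: the helper names `per_worker_memory_limit`, `start_cluster`, and `train` are hypothetical, the 20% headroom is an arbitrary assumption, and a `LocalCluster` stands in for whatever cluster setup is actually in use. A generous limit only makes a restart less likely; it does not make training recover from one.

```python
def per_worker_memory_limit(total_bytes: int, n_workers: int, headroom: float = 0.2) -> int:
    """Split memory evenly across workers, holding back a fraction as headroom.

    Hypothetical helper: the headroom fraction is an assumption, not a
    LightGBM or Dask recommendation.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable = int(total_bytes * (1 - headroom))
    return usable // n_workers


def start_cluster(total_bytes: int, n_workers: int):
    """Start a local cluster whose workers each get an explicit memory limit."""
    # Imported lazily so the helper above works without dask installed.
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(
        n_workers=n_workers,
        threads_per_worker=1,
        memory_limit=per_worker_memory_limit(total_bytes, n_workers),
    )
    return Client(cluster)


def train(client, X, y):
    """Fit a Dask LightGBM regressor; X and y are assumed to be Dask collections."""
    import lightgbm as lgb

    model = lgb.DaskLGBMRegressor(client=client)
    return model.fit(X, y)
```

For example, `start_cluster(64 * 2**30, 4)` would give each of four workers roughly 12.8 GiB. Whether that is enough depends on the dataset and on how much of it each worker holds during training, so watching the dashboard during a shorter trial run is still worthwhile.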
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Description
The code for the dask workers
The code to train the LightGBM model is very simple
After running for about 30 minutes, the program hangs. The logs are:
and the dashboard looks like this
From the dashboard, it looks like the two workers have been successfully restarted, and the memory is now within the limit. Why can't the training proceed further?
Environment info
lightgbm: 3.3.5
dask: 2022.11.1
python: 3.8.10
OS: ubuntu 20.04