Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dask] make Dask training resilient to worker restarts during network setup #3775

Closed
jameslamb opened this issue Jan 17, 2021 · 1 comment
Closed

Comments

@jameslamb
Copy link
Collaborator

jameslamb commented Jan 17, 2021

Summary

LightGBM distributed training requires that all workers participating in training know about each other (https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#preparation). This information is given via the parameter machines.

LightGBM's Dask module generates a list of workers by checking where pieces of the input data are stored.

who_has = client.who_has(parts)

It then uses the addresses in that list to set up the machines parameters. This step can take a few seconds to a few minutes (#3766 (comment)), and if one of the workers is restarted in between when that list is generated and when training starts, you can get a hard-to-understand KeyError like this:

KeyError: 'tcp://10.10.0.20:44284'

Improving this experience would mean one or both of these:

  • refactoring python-package/lightgbm/dask.py to reduce the risk of this, maybe by adding error handling for the case where workers are restarted
  • detecting this condition and adding a more informative error message

Motivation

Adding this feature would make the Dask interface a bit more stable, reducing the need for users to retry training or try to debug low-level Dask details.

Notes for Reviewers

I added "during network setup" to the title here, because I think that making it possible for LightGBM training to continue if a worker disappears during training is a much bigger task and not limited to Dask. @guolinke, I'd love your thoughts on this when you have time.

References

This issue was originally posted as dask/dask-lightgbm#24, and more background is available there. Moving it here as part of dask/community#104.

@jameslamb
Copy link
Collaborator Author

Closing this, as we use #2302 to track feature requests. Leave a comment below if you'd like to contribute this feature, and we'll be happy to re-open it!

jameslamb referenced this issue Jan 17, 2021
…twork (fixes #3753) (#3766)

* starting work

* fixed port-binding issue on localhost

* minor cleanup

* updates

* getting closer

* definitely working for LocalCluster

* it works, it works

* docs

* add tests

* removing testing-only files

* linting

* Apply suggestions from code review

Co-authored-by: Nikita Titov <[email protected]>

* remove duplicated code

* remove unnecessary listen()

Co-authored-by: Nikita Titov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant