[dask] [python] Store co-local data parts as dicts instead of lists #3853
Conversation
This is great, thank you! I just have one very small nitpicky style thing.
This is much cleaner and I think it will help reduce the risk of mistakes in the future. Thanks so much!
Awesome! Thanks! Code is much cleaner now.
Addresses #3795.

The `_train` function in `dask.py` places delayed "parts", i.e. `data`, `label`, `sample_weight`, and `group` data, into lists (which get distributed to workers) to enforce co-locality of data. Each worker is distributed a `list_of_parts`, a list of lists. Once a worker receives its list of parts, it infers the `data`, `label`, and then optionally `sample_weight` and `group` from each `list_of_parts` by assuming that `data` is in position 0, `label` is in position 1, and, if the model is not a DaskLGBMRanker, the second position is reserved for `sample_weight`.

This is an issue of readability + scalability: inferring the presence of `sample_weight` and `group` from the length of each sublist of `list_of_parts` in conjunction with the estimator type is hard to understand and doesn't lend itself well to storing additional parts like `eval_sets`. Instead, let's just store individual sets of parts as dicts instead of lists. Now each worker will receive its portion of the overall dataset as a list of dicts instead of a list of lists.
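
For illustration, here is a minimal sketch of the difference between the two layouts. The dict keys and variable names below are hypothetical stand-ins, not the exact code in `dask.py`:

```python
import numpy as np

# Toy stand-ins for one worker's local chunk of the dataset.
data = np.random.rand(10, 4)
label = np.random.randint(0, 2, size=10)
sample_weight = np.ones(10)

# Before: each co-located part is a positional list, so a worker must
# infer what each slot holds from the list's length and the estimator type.
part_as_list = [data, label, sample_weight]
X = part_as_list[0]
y = part_as_list[1]
w = part_as_list[2] if len(part_as_list) > 2 else None  # fragile positional logic

# After: each part is a dict, so optional fields are looked up by name
# and missing ones simply come back as None.
part_as_dict = {"data": data, "label": label, "weight": sample_weight}
X = part_as_dict["data"]
y = part_as_dict["label"]
w = part_as_dict.get("weight")  # None if sample_weight was not provided
g = part_as_dict.get("group")   # None unless training a DaskLGBMRanker
```

With the dict layout, supporting an additional optional part (e.g. something like `eval_sets`) only means adding another key, rather than another length-plus-estimator-type convention.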