[dask] [python] Store co-local data parts as dicts instead of lists #3853
Conversation
This is great, thank you! I just have one very small nitpicky style thing.
This is much cleaner and I think it will help reduce the risk of mistakes in the future. Thanks so much!
Awesome! Thanks! Code is much cleaner now.
Addresses #3795.

The `_train` function in `dask.py` places delayed "parts", i.e. `data`, `label`, `sample_weight`, and `group` data, into lists (which get distributed to workers) to enforce co-locality of data. Each worker is distributed a `list_of_parts`, a list of lists. Once a worker receives its list of parts, it infers the `data`, `label`, and then optionally `sample_weight` and `group` from each `list_of_parts` by assuming that `data` is in position 0, `label` is in position 1, and, if the model is not a DaskLGBMRanker, the second position is reserved for `sample_weight`.

This is an issue of readability + scalability: inferring the presence of `sample_weight` and `group` from the length of each sublist of `list_of_parts` in conjunction with the estimator type is hard to understand and doesn't lend itself well to storing additional parts like `eval_sets`. Instead, let's just store individual sets of parts as dicts instead of lists. Now each worker will receive its portion of the overall dataset as a list of dicts instead of a list of lists.
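
For illustration, here is a minimal sketch of the difference between the two layouts. The dict keys and variable names below are hypothetical stand-ins, not the exact code in `dask.py`:

```python
import numpy as np

# Toy stand-ins for one worker's local chunk of the dataset.
data = np.random.rand(10, 4)
label = np.random.randint(0, 2, size=10)
sample_weight = np.ones(10)

# Before: each co-located part is a positional list, so a worker must
# infer what each slot holds from the list's length and the estimator type.
part_as_list = [data, label, sample_weight]
X = part_as_list[0]
y = part_as_list[1]
w = part_as_list[2] if len(part_as_list) > 2 else None  # fragile positional logic

# After: each part is a dict, so optional fields are looked up by name
# and missing ones simply come back as None.
part_as_dict = {"data": data, "label": label, "weight": sample_weight}
X = part_as_dict["data"]
y = part_as_dict["label"]
w = part_as_dict.get("weight")  # None if sample_weight was not provided
g = part_as_dict.get("group")   # None unless training a DaskLGBMRanker
```

With the dict layout, supporting an additional optional part (e.g. something like `eval_sets`) only means adding another key, rather than another length-plus-estimator-type convention.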