
[python-package] Non-unique data error when doing hyper-parameter search with DaskLGBMClassifier #6091

Closed
kunshouout opened this issue Sep 12, 2023 · 6 comments


@kunshouout

I am trying to do some hyper-parameter searching using Dask, and my code looks like this:

data: dask.dataframe.DataFrame = load_data()
model = DaskLGBMClassifier(**model_param)

for i in range(100):
    model.fit(data)

I have two Dask workers, one of which is on another machine. After successfully fitting 4 models, the fifth model failed, and I found the following exception in one of the worker logs:

2023-09-11 19:45:32,443 - distributed.worker - WARNING - Compute Failed
Key:       ('getitem-939a22691d305f88a3bf008879bded28', 108)
Function:  subgraph_callable-a336435b-0267-4d5a-b38e-a553a4a7
args:      ('feature', slice(datetime.date(2022, 3, 15), datetime.date(2022, 6, 1), None), slice(None, None, None), 'try_loc-9cf4266fd991474fec96fcb4c7de20aa',                        feature             ...  label instrument
                         feat1    feat2  ... future  code
datetime                                   ...                  
2021-01-04 09:36:04  -0.339260  -0.326822  ...      0   c640
2021-01-04 09:36:07   0.104004   0.104004  ...      0   c640
...                        ...        ...  ...    ...        ...
2021-12-30 09:57:46 -40.941887 -26.143145  ...      0   c640
kwargs:    {}
Exception: "ValueError('cannot handle a non-unique multi-index!')"

For my data, the index has one level, but the columns have two levels. The other worker warns about low memory.

What confuses me is: if there really were duplication in the data, how did the first four trials succeed? I also checked the subset of the data with code c640, and there is no duplication.
What could be the reason? Please help.
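For reference, a quick pandas-level way to check an index for duplicates (illustrative data, not the real dataset):

```python
import pandas as pd

# Three timestamps, one of which repeats, to illustrate the check.
idx = pd.to_datetime([
    "2021-01-04 09:36:04",
    "2021-01-04 09:36:07",
    "2021-01-04 09:36:07",
])
s = pd.Series([1.0, 2.0, 3.0], index=idx)

print(s.index.is_unique)           # False: 09:36:07 appears twice
print(s.index.duplicated().sum())  # 1 duplicated entry
```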

Environment info

LightGBM version or commit hash:
3.3.5.99

@jameslamb jameslamb changed the title Non-unique data error when doing hyper-parameter search with DaskLGBMClassifier [python-package] Non-unique data error when doing hyper-parameter search with DaskLGBMClassifier Sep 12, 2023
@jameslamb
Collaborator

Thanks for using LightGBM.

With just the information you've provided, I think it'll be difficult for us to debug this. I've personally never seen this using lightgbm.dask before, and I don't believe that lightgbm.dask creates any multi-indexes itself.

Does the same error show up if you switch to using LocalCluster? Maybe this is a weird side effect of a worker being lost during training (e.g. #3775).

Could you please create a minimal, reproducible example trying to replicate this? Or at least, which creates a Dask dataframe with the same dimensions and indices as your real training data, even if it doesn't reproduce the problem?
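A pandas sketch of a frame with the shape shown in the traceback (datetime index, two-level columns) might look like this; all names are illustrative, and `dask.dataframe.from_pandas` could wrap it into a Dask dataframe:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the training data: a datetime index and
# two-level columns ("feature"/"label" on the first level).
idx = pd.date_range("2021-01-04 09:36:04", periods=6, freq="3s", name="datetime")
cols = pd.MultiIndex.from_tuples(
    [("feature", "feat1"), ("feature", "feat2"), ("label", "future")]
)
df = pd.DataFrame(
    np.random.default_rng(0).normal(size=(6, 3)), index=idx, columns=cols
)

# Selecting one top-level column group, as the report describes:
feats = df.loc[:, "feature"]
print(feats.shape)  # (6, 2)
```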

@jameslamb
Collaborator

Linking this possibly-related thread: dask/dask#4845

@kunshouout
Author

I tried again with a smaller dataset, the same code, and the same Dask configuration, and it works fine. It seems this problem occurs only when the dataset is too large (two Dask workers with a 74.5 GB memory limit each, and the total "bytes stored" reaches 80 GB while running). That makes it quite hard to reproduce. I guess the error is memory-related.

After reviewing my code, I think I figured out which line triggers the exception. A more detailed version of the code is:

def fetch_data(data_storage, selector, col_set):
    return data_storage.loc[selector, col_set]

data: dask.dataframe.DataFrame = load_data()
data = data.persist()

for i in range(100):
    model = DaskLGBMClassifier(**model_param)

    train_feat = fetch_data(data, slice("2021-03-15", "2022-03-15"), "feature")
    train_label = fetch_data(data, slice("2021-03-15", "2022-03-15"), "label")
    # This is where the exception is reported
    valid_feat = fetch_data(data, slice("2022-03-15", "2022-06-01"), "feature")
    valid_label = fetch_data(data, slice("2022-03-15", "2022-06-01"), "label")
    model.fit(train_feat, train_label, eval_set=[(valid_feat, valid_label)])

The data looks like this:

                        feature           ...  label instrument
                         feat1    feat2  ... future  code
datetime                                   ...                  
2021-01-04 09:36:04  -0.339260  -0.326822  ...      0   c640
2021-01-04 09:36:07   0.104004   0.104004  ...      0   c640
...                        ...        ...  ...    ...        ...
2021-12-30 09:57:46 -40.941887 -26.143145  ...      0   c640

dask.DataFrame does not support a multi-index on axis=0 (rows), but it is fine to use a multi-index on axis=1 (columns).
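As a pandas-only illustration of the `fetch_data` pattern above — a row slice on the datetime index combined with a top-level column label — using hypothetical names and values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: single-level datetime index, two-level columns.
idx = pd.date_range("2021-03-15", periods=5, freq="D", name="datetime")
cols = pd.MultiIndex.from_tuples([("feature", "feat1"), ("label", "future")])
df = pd.DataFrame(np.arange(10.0).reshape(5, 2), index=idx, columns=cols)

# fetch_data(df, selector, col_set) boils down to df.loc[selector, col_set]:
subset = df.loc["2021-03-16":"2021-03-18", "feature"]
print(subset.shape)  # (3, 1): three rows, one "feature" column
```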

I hope this helps to identify the problem.

As a remedy, I will just try to reduce the size of the data.

@jmoralez
Collaborator

jmoralez commented Oct 4, 2023

Hey @kunshouout, seems like the problem isn't related to LightGBM though, is it?


github-actions bot commented Nov 4, 2023

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

@github-actions github-actions bot closed this as completed Nov 4, 2023

github-actions bot commented Nov 6, 2024

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2024