
[python-package] Non-unique data error when doing hyper-parameter search with DaskLGBMClassifier #6091

Closed
kunshouout opened this issue Sep 12, 2023 · 6 comments


@kunshouout

I am trying to do some hyper-parameter searching using Dask, and my code looks like this:

data: dask.dataframe.DataFrame = load_data()
model = DaskLGBMClassifier(**model_param)

for i in range(100):
    model.fit(data)

I have two Dask workers, one of which is on another machine. After successfully fitting 4 models, the fifth model failed, and I found the following exception in one of the worker logs:

2023-09-11 19:45:32,443 - distributed.worker - WARNING - Compute Failed
Key:       ('getitem-939a22691d305f88a3bf008879bded28', 108)
Function:  subgraph_callable-a336435b-0267-4d5a-b38e-a553a4a7
args:      ('feature', slice(datetime.date(2022, 3, 15), datetime.date(2022, 6, 1), None), slice(None, None, None), 'try_loc-9cf4266fd991474fec96fcb4c7de20aa',                        feature             ...  label instrument
                         feat1    feat2  ... future  code
datetime                                   ...                  
2021-01-04 09:36:04  -0.339260  -0.326822  ...      0   c640
2021-01-04 09:36:07   0.104004   0.104004  ...      0   c640
...                        ...        ...  ...    ...        ...
2021-12-30 09:57:46 -40.941887 -26.143145  ...      0   c640
kwargs:    {}
Exception: "ValueError('cannot handle a non-unique multi-index!')"

For my data, the index has one level, but the columns have two levels. The other worker warns about low memory.

What confuses me is: if there really were duplication in the data, how did the first four trials succeed? I also checked the subset of the data with code c640, and there is no duplication.
What could be the reason? Please help.
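For reference, a quick pandas-level way to check an index for duplicates (illustrative data, not the real dataset):

```python
import pandas as pd

# Three timestamps, one of which repeats, to illustrate the check.
idx = pd.to_datetime([
    "2021-01-04 09:36:04",
    "2021-01-04 09:36:07",
    "2021-01-04 09:36:07",
])
s = pd.Series([1.0, 2.0, 3.0], index=idx)

print(s.index.is_unique)           # False: 09:36:07 appears twice
print(s.index.duplicated().sum())  # 1 duplicated entry
```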

Environment info

LightGBM version or commit hash:
3.3.5.99

@jameslamb jameslamb changed the title Non-unique data error when doing hyper-parameter search with DaskLGBMClassifier [python-package] Non-unique data error when doing hyper-parameter search with DaskLGBMClassifier Sep 12, 2023
@jameslamb
Collaborator

Thanks for using LightGBM.

With just the information you've provided, I think it'll be difficult for us to debug this. I've personally never seen this using lightgbm.dask before, and I don't believe that lightgbm.dask creates any multi-indexes itself.

Does the same error show up if you switch to using LocalCluster? Maybe this is a weird side effect of a worker being lost during training (e.g. #3775).

Could you please create a minimal, reproducible example trying to replicate this? Or at least, which creates a Dask dataframe with the same dimensions and indices as your real training data, even if it doesn't reproduce the problem?
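A pandas sketch of a frame with the shape shown in the traceback (datetime index, two-level columns) might look like this; all names are illustrative, and `dask.dataframe.from_pandas` could wrap it into a Dask dataframe:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the training data: a datetime index and
# two-level columns ("feature"/"label" on the first level).
idx = pd.date_range("2021-01-04 09:36:04", periods=6, freq="3s", name="datetime")
cols = pd.MultiIndex.from_tuples(
    [("feature", "feat1"), ("feature", "feat2"), ("label", "future")]
)
df = pd.DataFrame(
    np.random.default_rng(0).normal(size=(6, 3)), index=idx, columns=cols
)

# Selecting one top-level column group, as the report describes:
feats = df.loc[:, "feature"]
print(feats.shape)  # (6, 2)
```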

@jameslamb
Collaborator

Linking this possibly-related thread: dask/dask#4845

@kunshouout
Author

I tried again with a smaller dataset, the same code, and the same Dask configuration, and it works fine. It seems this problem occurs only when the dataset is too large (two Dask workers with a 74.5 GB memory limit each, and the total "bytes stored" reaches 80 GB while running). That makes it quite hard to reproduce. I guess the error is memory-related.

After reviewing my code, I think I figured out which line triggers the exception. A more detailed version of the code is:

def fetch_data(data_storage, selector, col_set):
    return data_storage.loc[selector, col_set]

data: dask.dataframe.DataFrame = load_data()
data = data.persist()

for i in range(100):
    model = DaskLGBMClassifier(**model_param)

    train_feat = fetch_data(data, slice("2021-03-15", "2022-03-15"), "feature")
    train_label = fetch_data(data, slice("2021-03-15", "2022-03-15"), "label")
    # This is where the exception is reported
    valid_feat = fetch_data(data, slice("2022-03-15", "2022-06-01"), "feature")
    valid_label = fetch_data(data, slice("2022-03-15", "2022-06-01"), "label")
    model.fit(train_feat, train_label, eval_set=[(valid_feat, valid_label)])

The data looks like this:

                        feature           ...  label instrument
                         feat1    feat2  ... future  code
datetime                                   ...                  
2021-01-04 09:36:04  -0.339260  -0.326822  ...      0   c640
2021-01-04 09:36:07   0.104004   0.104004  ...      0   c640
...                        ...        ...  ...    ...        ...
2021-12-30 09:57:46 -40.941887 -26.143145  ...      0   c640

dask.DataFrame does not support a multi-index on axis=0 (rows), but it is fine to use a multi-index on axis=1 (columns).
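As a pandas-only illustration of the `fetch_data` pattern above — a row slice on the datetime index combined with a top-level column label — using hypothetical names and values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: single-level datetime index, two-level columns.
idx = pd.date_range("2021-03-15", periods=5, freq="D", name="datetime")
cols = pd.MultiIndex.from_tuples([("feature", "feat1"), ("label", "future")])
df = pd.DataFrame(np.arange(10.0).reshape(5, 2), index=idx, columns=cols)

# fetch_data(df, selector, col_set) boils down to df.loc[selector, col_set]:
subset = df.loc["2021-03-16":"2021-03-18", "feature"]
print(subset.shape)  # (3, 1): three rows, one "feature" column
```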

I hope this helps to identify the problem.

As a remedy, I will just try to reduce the size of the data.

@jmoralez
Collaborator

jmoralez commented Oct 4, 2023

Hey @kunshouout, seems like the problem isn't related to LightGBM though, is it?


github-actions bot commented Nov 4, 2023

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

@github-actions github-actions bot closed this as completed Nov 4, 2023

github-actions bot commented Nov 6, 2024

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2024