[python-package] Non-unique data error when doing hyper-parameter search with DaskLGBMClassifier #6091
Comments
Thanks for using LightGBM. With just the information you've provided, I think it'll be difficult for us to debug this. I've personally never seen this using [...]. Does the same error show up if you switch to using [...]? Could you please create a minimal, reproducible example trying to replicate this? Or at least one which creates a Dask DataFrame with the same dimensions and indices as your real training data, even if it doesn't reproduce the problem?
Linking this possibly-related thread: dask/dask#4845
I tried again with a smaller dataset, the same code, and the same Dask configuration, and it works fine. It seems this problem occurs only when the dataset is too large (two Dask workers with a memory limit of 74.5 GB each, while the total "bytes stored" reached 80 GB during the run), so it is quite hard to reproduce. I guess the error is memory-related. After reviewing my code, I think I figured out which line triggered the exception. A more detailed version of the code is:
The data looks like this:
dask.DataFrame does not support a multi-index on axis=0, but it's fine to use a multi-index on axis=1. I hope this helps to identify the problem. As a remedy, I will just try to reduce the size of the data.
Hey @kunshouout, it seems like the problem isn't related to LightGBM then, is it?
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this one.
I am trying to do some hyper-parameter searching using Dask, and my code looks like this:
I have 2 Dask workers, one of which is on another machine. After successfully fitting 4 models, the fifth model failed with [...], and I found the following exception in one of the worker logs:
For my data, the index has one level, but the columns have two levels. The other worker warns about low memory.
What confuses me is that, if there were really duplication in the data, how come the first four trials succeeded? Plus, I checked the subset data with code c640, and there's no duplication.
What could be the reason? Please help.
Environment info
LightGBM version or commit hash:
3.3.5.99