Failed to load Dataset.subset() back after Dataset.save_binary() #5402
Comments
I'm sure the example reproduces it. To help troubleshoot further, I created a new virtualenv and installed only lightgbm==3.3.2.
Environments
Code
Now I use range(4*3) to create deterministic toy data and try to take three rows (indices 1 to 3) as a subset. Here's the new minimal code:
import lightgbm as lgb
import numpy as np
import os

print('lightgbm version: ', lgb.__version__)
print('numpy version: ', np.__version__)

def subset(data):
    # Remove binary files left over from a previous run
    files = ['train.bin', 'subset.bin']
    for file in files:
        if os.path.exists(file):
            os.remove(file)
    # Build the full dataset and save it in binary format
    ds = lgb.Dataset(data, params={'data_random_seed': 0}).construct()
    ds.save_binary('train.bin')
    # Load the binary file, create a subset, and save the subset
    ds = lgb.Dataset('train.bin').construct()
    subset = ds.subset([1,2,3]).construct()
    print(f'Got {subset.num_data()} samples from {ds.num_data()} samples')
    subset.save_binary('subset.bin')
    # Loading the saved subset is the step that fails
    subset = lgb.Dataset('subset.bin').construct()

num_rows = 4
num_cols = 3
data = np.array(range(num_rows * num_cols)).reshape(num_rows, num_cols)
print('Data:')
print(data)
print(f'\nSubset of {data.shape} data')
subset(data)
Output
An interesting thing I found is that the error message can differ between runs; that's why I tried to fix the
Possible Output 1
Possible Output 2
Possible root cause
I guess the problem is that when LightGBM creates the subset and saves it to binary, it doesn't clean and initialize the headers correctly.
Evidence
I reran the script twice, renamed the
hex(32597) = 0x7f55, as shown in the later header.
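The header comparison described above can be reproduced with a small byte-level diff. This is a minimal sketch and not part of the original report: the helper names `diff_bytes` and `diff_files` are hypothetical, and it assumes only that two runs of the script produced two binary files to compare. It lists the offsets at which the files differ, which is enough to see whether the differences cluster near the start of the file.

```python
from pathlib import Path


def diff_bytes(a: bytes, b: bytes, limit: int = 32):
    """Return up to `limit` tuples (offset, byte_a, byte_b) where a and b differ.

    Only the common prefix is compared; a length mismatch is not reported here.
    """
    diffs = []
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            diffs.append((offset, x, y))
            if len(diffs) >= limit:
                break
    return diffs


def diff_files(path_a, path_b, limit: int = 32):
    """Print the differing offsets of two files (hypothetical helper)."""
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    if len(a) != len(b):
        print(f"sizes differ: {len(a)} vs {len(b)} bytes")
    for offset, x, y in diff_bytes(a, b, limit):
        print(f"offset 0x{offset:06x}: {x:02x} != {y:02x}")


# The value quoted in the evidence, converted to hex
print(hex(32597))  # 0x7f55
```

A call such as `diff_files('subset_run1.bin', 'subset_run2.bin')` (file names are placeholders) would then show where two saved subsets diverge. If deterministic input yields different header bytes across runs, that is consistent with fields being written without initialization.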
I tried to fix it by copying these lines: https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L742-L746
max_bin_ = dataset->max_bin_;
min_data_in_bin_ = dataset->min_data_in_bin_;
bin_construct_sample_cnt_ = dataset->bin_construct_sample_cnt_;
use_missing_ = dataset->use_missing_;
zero_as_missing_ = dataset->zero_as_missing_;
to L736, the end of
Thanks @YuriWu! I was able to reproduce the error and verified that the fix you proposed does solve the issue. @shiyu1994 @guolinke can you check the proposed fix here?
The fix looks good to me! Thank you @YuriWu
@YuriWu would you like to make a PR that includes your fix and a small test?
@jmoralez Sorry, due to the IP policy of my organization, I'm not allowed to make a PR to open source projects.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Description
I'd like to split a large dataset into several subsets and use them later, via the Dataset.subset() API.
I really like the LightGBM binary dataset format because it saves memory and disk space, so I tried to use it everywhere. But I found that the following workflow doesn't work:
Reproducible example
Error message:
Environment info
LightGBM 3.3.2
Command(s) you used to install LightGBM
Other environments: