
Failed to load Dataset.subset() back after Dataset.save_binary() #5402

Closed
YuriWu opened this issue Aug 4, 2022 · 8 comments · Fixed by #5416
YuriWu (Author) commented Aug 4, 2022

Description

I'd like to split a large dataset into several subsets via the Dataset.subset() API and use them later.
I really like the LightGBM binary dataset format, which saves memory and disk space, so I tried to use it everywhere. But I found that the following workflow doesn't work:

  1. Construct and save a binary dataset.
  2. Load a subset from the binary dataset.
  3. Save the subset.
  4. [Failed] Load the subset back.

Reproducible example

import lightgbm as lgb
import numpy as np

# Create and save the data
data = np.random.random((100,10))
ds = lgb.Dataset(data).construct()
ds.save_binary('train.bin')

# Load, create, and save a subset
ds = lgb.Dataset('train.bin')
subset = ds.subset([1,2,3,5,8]).construct()
print(f'Got {subset.num_data()} samples from {ds.num_data()} samples')
subset.save_binary('subset.bin')

# Loading the subset back fails
subset = lgb.Dataset('subset.bin').construct()

Error message:

[LightGBM] [Info] Saving data to binary file train.bin
[LightGBM] [Info] Load from binary file train.bin
Got 5 samples from 100 samples
[LightGBM] [Info] Saving data to binary file subset.bin
[LightGBM] [Info] Load from binary file subset.bin
[LightGBM] [Fatal] Dataset max_bin 140839269 != config 255
Traceback (most recent call last):
  File "load_bin.py", line 16, in <module>
    subset = lgb.Dataset('subset.bin').construct()
  File "D:\yuri_env\lib\site-packages\lightgbm\basic.py", line 1815, in construct
    self._lazy_init(self.data, label=self.label,
  File "D:\yuri_env\lib\site-packages\lightgbm\basic.py", line 1528, in _lazy_init
    _safe_call(_LIB.LGBM_DatasetCreateFromFile(
  File "D:\yuri_env\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Dataset max_bin 140839269 != config 255

Environment info

LightGBM 3.3.2

Command(s) you used to install LightGBM

pip install lightgbm

Other environments:

  • Python 3.8.8
  • numpy 1.21.5
jmoralez (Collaborator) commented Aug 7, 2022

Hi @YuriWu, thank you for your interest in LightGBM. I'm not able to reproduce this error. Are you sure this example reproduces it?

There is a known error about loading a dataset from a file with non-default parameters, documented in #4904; do you think that's what you're running into?

YuriWu (Author) commented Aug 8, 2022

I'm sure the example reproduces it. To help troubleshoot further, I created a fresh virtualenv and installed only lightgbm==3.3.2.
Then I ran the code with python test.py.

Environment

Package       Version
------------- -------
joblib        1.1.0
lightgbm      3.3.2
numpy         1.19.5
pip           21.3.1
scikit-learn  0.24.2
scipy         1.5.4
setuptools    59.6.0
threadpoolctl 3.1.0
wheel         0.37.1

Code

Now I use range(4*3) to create deterministic toy data and take rows [1, 2, 3] as a subset. Here's the new minimal code:

import lightgbm as lgb
import numpy as np
import os

print('lightgbm version: ', lgb.__version__)
print('numpy version: ', np.__version__)

def subset(data):
    # Clean up if exists
    files = ['train.bin', 'subset.bin']
    for file in files:
        if os.path.exists(file):
            os.remove(file)

    ds = lgb.Dataset(data, params={'data_random_seed': 0}).construct()
    ds.save_binary('train.bin')
    
    # Load, create, and save a subset
    ds = lgb.Dataset('train.bin').construct()
    subset = ds.subset([1,2,3]).construct()
    print(f'Got {subset.num_data()} samples from {ds.num_data()} samples')
    subset.save_binary('subset.bin')

    # Loading the subset back fails
    subset = lgb.Dataset('subset.bin').construct()

num_rows = 4
num_cols = 3
data = np.array(range( num_rows*num_cols )).reshape(num_rows, num_cols)
print('Data:')
print(data)
print(f'\nSubset of {data.shape} data')
subset(data)

Output

An interesting thing I found is that the error message can differ between runs, which is why I tried fixing data_random_seed.
Here are two possible outputs; they differ only in the N of max_bin {N} != config 255.

Possible Output 1

$ python test.py
lightgbm version:  3.3.2
numpy version:  1.19.5
Data:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

Subset of (4, 3) data
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
[LightGBM] [Info] Saving data to binary file train.bin
[LightGBM] [Info] Load from binary file train.bin
Got 3 samples from 4 samples
[LightGBM] [Info] Saving data to binary file subset.bin
[LightGBM] [Info] Load from binary file subset.bin
[LightGBM] [Fatal] Dataset max_bin 0 != config 255
Traceback (most recent call last):
  File "test.py", line 31, in <module>
    subset(data)
  File "test.py", line 25, in subset
    subset = lgb.Dataset('subset.bin').construct()
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1819, in construct
    categorical_feature=self.categorical_feature, params=self.params)
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1532, in _lazy_init
    ctypes.byref(self.handle)))
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Dataset max_bin 0 != config 255

Possible Output 2

$ python test.py
lightgbm version:  3.3.2
numpy version:  1.19.5
Data:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

Subset of (4, 3) data
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
[LightGBM] [Info] Saving data to binary file train.bin
[LightGBM] [Info] Load from binary file train.bin
Got 3 samples from 4 samples
[LightGBM] [Info] Saving data to binary file subset.bin
[LightGBM] [Info] Load from binary file subset.bin
[LightGBM] [Fatal] Dataset max_bin 32709 != config 255
Traceback (most recent call last):
  File "test.py", line 33, in <module>
    subset(data)
  File "test.py", line 25, in subset
    subset = lgb.Dataset('subset.bin').construct()
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1819, in construct
    categorical_feature=self.categorical_feature, params=self.params)
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1532, in _lazy_init
    ctypes.byref(self.handle)))
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Dataset max_bin 32709 != config 255

Possible root cause

My guess is that when LightGBM creates the subset and saves it to binary, it doesn't clean and initialize the header fields correctly.

Evidence

I reran the script twice, renamed the resulting subset.bin to subset_0.bin and subset_1.bin, and inspected the binary contents. They differ in their headers:

xxd subset_0.bin # LightGBMError: Dataset max_bin 0 != config 255
0000000: 5f5f 5f5f 5f5f 4c69 6768 7447 424d 5f42  ______LightGBM_B
0000010: 696e 6172 795f 4669 6c65 5f54 6f6b 656e  inary_File_Token
0000020: 5f5f 5f5f 5f5f 0a00 c800 0000 0000 0000  ______..........
0000030: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000040: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000080: ffff ffff ffff ffff ffff ffff 0000 0000  ................
0000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000a0: ffff ffff ffff ffff ffff ffff 0000 0000  ................
00000b0: 0800 0000 0000 0000 436f 6c75 6d6e 5f30  ........Column_0
00000c0: 0800 0000 0000 0000 436f 6c75 6d6e 5f31  ........Column_1
00000d0: 0800 0000 0000 0000 436f 6c75 6d6e 5f32  ........Column_2
00000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000f0: 0000 0000 0000 0000 2800 0000 0000 0000  ........(.......
0000100: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000110: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000120: 0000 0000 0000 0000                      ........
xxd subset_1.bin # LightGBMError: Dataset max_bin 32597 != config 255
0000000: 5f5f 5f5f 5f5f 4c69 6768 7447 424d 5f42  ______LightGBM_B
0000010: 696e 6172 795f 4669 6c65 5f54 6f6b 656e  inary_File_Token
0000020: 5f5f 5f5f 5f5f 0a00 c800 0000 0000 0000  ______..........
0000030: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000040: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000050: 557f 0000 0000 0000 0000 0000 0000 0000  U...............
0000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000080: ffff ffff ffff ffff ffff ffff 0000 0000  ................
0000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000a0: ffff ffff ffff ffff ffff ffff 0000 0000  ................
00000b0: 0800 0000 0000 0000 436f 6c75 6d6e 5f30  ........Column_0
00000c0: 0800 0000 0000 0000 436f 6c75 6d6e 5f31  ........Column_1
00000d0: 0800 0000 0000 0000 436f 6c75 6d6e 5f32  ........Column_2
00000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000f0: 0000 0000 0000 0000 2800 0000 0000 0000  ........(.......
0000100: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000110: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000120: 0000 0000 0000 0000                      ........

hex(32597) = 0x7f55, which appears little-endian (55 7f) at offset 0x50 of the second header.
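To double-check, the garbage value can be decoded directly from the dump with only the standard library (a small sketch; the offset 0x50 and the 8-byte little-endian layout are read off the hex dumps above, not from any LightGBM documentation):

```python
import struct

# First 8 bytes at offset 0x50 of subset_1.bin, copied from the dump
# above; this field appears to hold the uninitialized max_bin value.
header_field = bytes.fromhex("557f000000000000")

# Decode as a little-endian 64-bit integer.
(max_bin,) = struct.unpack("<q", header_field)
print(max_bin)  # 32597, i.e. 0x7f55, matching the error message
```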

YuriWu (Author) commented Aug 8, 2022

I tried to fix it by copying these lines: https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L742-L746

  max_bin_ = dataset->max_bin_;
  min_data_in_bin_ = dataset->min_data_in_bin_;
  bin_construct_sample_cnt_ = dataset->bin_construct_sample_cnt_;
  use_missing_ = dataset->use_missing_;
  zero_as_missing_ = dataset->zero_as_missing_;

to the end of Dataset::CopyFeatureMapperFrom (L736).
After recompiling the .so, this seems to fix the problem.
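For anyone who can't rebuild the library, here is a purely diagnostic byte-level hack, not a supported fix: based only on the hex dumps above, the fields that differ between a good and a bad header appear to sit at offsets 0x50-0x7f, so one could copy that region from a correctly written header (train.bin) into the broken one (subset.bin). The offset range is my assumption from the dumps, not a documented format:

```python
# Hypothetical workaround sketch: overwrite the (apparently
# uninitialized) parameter region of a broken header with the bytes
# from a header that was written correctly. Offsets are assumptions
# taken from the xxd dumps above.
PARAM_START, PARAM_END = 0x50, 0x80

def patch_header(broken: bytes, reference: bytes) -> bytes:
    """Return `broken` with bytes [PARAM_START, PARAM_END) replaced by
    the corresponding bytes of `reference`."""
    return (broken[:PARAM_START]
            + reference[PARAM_START:PARAM_END]
            + broken[PARAM_END:])

# Example on synthetic byte strings (real use would read the .bin files):
bad = bytes(0x50) + b"\x55\x7f" + bytes(0x100 - 0x52)  # garbage at 0x50
good = bytes(0x100)                                    # clean header
patched = patch_header(bad, good)
print(patched == good)  # True: the garbage bytes were overwritten
```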

jmoralez (Collaborator) commented Aug 8, 2022

Thanks @YuriWu! I was able to reproduce the error and verified the fix you proposed indeed solves the issue. @shiyu1994 @guolinke can you check the proposed fix here?

guolinke (Collaborator) commented Aug 9, 2022

The fix looks good to me! Thank you @YuriWu.
By the way, should we add a test for it?

jmoralez (Collaborator) commented Aug 9, 2022

@YuriWu would you like to make a PR that includes your fix and a small test?

YuriWu (Author) commented Aug 11, 2022

@jmoralez Sorry, due to the IP policy of my organization, I'm not allowed to make PRs to open-source projects.
Please test my proposed fix and merge it if it looks good.

@jameslamb jameslamb added the bug label Aug 16, 2022
StrikerRUS pushed a commit that referenced this issue Aug 28, 2022
)

* include parameters from reference dataset on copy

* lint

* set non-default parameters
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023