
Failed to load Dataset.subset() back after Dataset.save_binary() #5402

Closed
YuriWu opened this issue Aug 4, 2022 · 8 comments · Fixed by #5416
YuriWu (Author) commented Aug 4, 2022

Description

I'd like to split a large dataset into several subsets via the Dataset.subset() API and use them later.
I really like the LightGBM binary dataset format, which saves memory and disk space, so I tried to use it everywhere. But I found that the following workflow doesn't work:

  1. Construct and save a binary dataset.
  2. Load a subset from the binary dataset.
  3. Save the subset.
  4. [Failed] Load the subset back.

Reproducible example

import lightgbm as lgb
import numpy as np

# Create and save the data
data = np.random.random((100,10))
ds = lgb.Dataset(data).construct()
ds.save_binary('train.bin')

# Load, create, and save a subset
ds = lgb.Dataset('train.bin')
subset = ds.subset([1,2,3,5,8]).construct()
print(f'Got {subset.num_data()} samples from {ds.num_data()} samples')
subset.save_binary('subset.bin')

# Loading the subset back fails
subset = lgb.Dataset('subset.bin').construct()

Error message:

[LightGBM] [Info] Saving data to binary file train.bin
[LightGBM] [Info] Load from binary file train.bin
Got 5 samples from 100 samples
[LightGBM] [Info] Saving data to binary file subset.bin
[LightGBM] [Info] Load from binary file subset.bin
[LightGBM] [Fatal] Dataset max_bin 140839269 != config 255
Traceback (most recent call last):
  File "load_bin.py", line 16, in <module>
    subset = lgb.Dataset('subset.bin').construct()
  File "D:\yuri_env\lib\site-packages\lightgbm\basic.py", line 1815, in construct
    self._lazy_init(self.data, label=self.label,
  File "D:\yuri_env\lib\site-packages\lightgbm\basic.py", line 1528, in _lazy_init
    _safe_call(_LIB.LGBM_DatasetCreateFromFile(
  File "D:\yuri_env\lib\site-packages\lightgbm\basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Dataset max_bin 140839269 != config 255

Environment info

LightGBM 3.3.2

Command(s) you used to install LightGBM

pip install lightgbm

Other environments:

  • Python 3.8.8
  • numpy 1.21.5
jmoralez (Collaborator) commented Aug 7, 2022

Hi @YuriWu, thank you for your interest in LightGBM. I'm not able to reproduce this error. Are you sure this example reproduces it?

There is a known error about loading a dataset from a file with non-default parameters, documented in #4904; do you think that's what you're running into?

YuriWu (Author) commented Aug 8, 2022

I'm sure the example reproduces it. To help troubleshoot further, I created a fresh virtualenv and installed only lightgbm==3.3.2.
Then I ran the code with python test.py.

Environment

Package       Version
------------- -------
joblib        1.1.0
lightgbm      3.3.2
numpy         1.19.5
pip           21.3.1
scikit-learn  0.24.2
scipy         1.5.4
setuptools    59.6.0
threadpoolctl 3.1.0
wheel         0.37.1

Code

Now I use range(4*3) to create deterministic toy data and take rows [1, 2, 3] as a subset. Here's the new minimal code:

import lightgbm as lgb
import numpy as np
import os

print('lightgbm version: ', lgb.__version__)
print('numpy version: ', np.__version__)

def subset(data):
    # Clean up if exists
    files = ['train.bin', 'subset.bin']
    for file in files:
        if os.path.exists(file):
            os.remove(file)

    ds = lgb.Dataset(data, params={'data_random_seed': 0}).construct()
    ds.save_binary('train.bin')
    
    # Load, create, and save a subset
    ds = lgb.Dataset('train.bin').construct()
    subset = ds.subset([1,2,3]).construct()
    print(f'Got {subset.num_data()} samples from {ds.num_data()} samples')
    subset.save_binary('subset.bin')

    # Loading the subset back fails
    subset = lgb.Dataset('subset.bin').construct()

num_rows = 4
num_cols = 3
data = np.array(range( num_rows*num_cols )).reshape(num_rows, num_cols)
print('Data:')
print(data)
print(f'\nSubset of {data.shape} data')
subset(data)

Output

An interesting thing I found is that the error message can differ between runs, which is why I tried fixing data_random_seed.
Here are two possible outputs; they differ only in the N of max_bin {N} != config 255.

Possible Output 1

$ python test.py
lightgbm version:  3.3.2
numpy version:  1.19.5
Data:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

Subset of (4, 3) data
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
[LightGBM] [Info] Saving data to binary file train.bin
[LightGBM] [Info] Load from binary file train.bin
Got 3 samples from 4 samples
[LightGBM] [Info] Saving data to binary file subset.bin
[LightGBM] [Info] Load from binary file subset.bin
[LightGBM] [Fatal] Dataset max_bin 0 != config 255
Traceback (most recent call last):
  File "test.py", line 31, in <module>
    subset(data)
  File "test.py", line 25, in subset
    subset = lgb.Dataset('subset.bin').construct()
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1819, in construct
    categorical_feature=self.categorical_feature, params=self.params)
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1532, in _lazy_init
    ctypes.byref(self.handle)))
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Dataset max_bin 0 != config 255

Possible Output 2

$ python test.py
lightgbm version:  3.3.2
numpy version:  1.19.5
Data:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

Subset of (4, 3) data
[LightGBM] [Warning] There are no meaningful features, as all feature values are constant.
[LightGBM] [Info] Saving data to binary file train.bin
[LightGBM] [Info] Load from binary file train.bin
Got 3 samples from 4 samples
[LightGBM] [Info] Saving data to binary file subset.bin
[LightGBM] [Info] Load from binary file subset.bin
[LightGBM] [Fatal] Dataset max_bin 32709 != config 255
Traceback (most recent call last):
  File "test.py", line 33, in <module>
    subset(data)
  File "test.py", line 25, in subset
    subset = lgb.Dataset('subset.bin').construct()
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1819, in construct
    categorical_feature=self.categorical_feature, params=self.params)
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 1532, in _lazy_init
    ctypes.byref(self.handle)))
  File "/workspace/yuriwu/env_lgb/lib/python3.6/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Dataset max_bin 32709 != config 255

Possible root cause

My guess is that when LightGBM creates the subset and saves it to binary, it doesn't clean and initialize the header fields correctly.

Evidence

I reran the script twice, renamed the resulting subset.bin to subset_0.bin and subset_1.bin, and inspected the binary contents. They differ in their headers:

xxd subset_0.bin # LightGBMError: Dataset max_bin 0 != config 255
0000000: 5f5f 5f5f 5f5f 4c69 6768 7447 424d 5f42  ______LightGBM_B
0000010: 696e 6172 795f 4669 6c65 5f54 6f6b 656e  inary_File_Token
0000020: 5f5f 5f5f 5f5f 0a00 c800 0000 0000 0000  ______..........
0000030: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000040: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000080: ffff ffff ffff ffff ffff ffff 0000 0000  ................
0000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000a0: ffff ffff ffff ffff ffff ffff 0000 0000  ................
00000b0: 0800 0000 0000 0000 436f 6c75 6d6e 5f30  ........Column_0
00000c0: 0800 0000 0000 0000 436f 6c75 6d6e 5f31  ........Column_1
00000d0: 0800 0000 0000 0000 436f 6c75 6d6e 5f32  ........Column_2
00000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000f0: 0000 0000 0000 0000 2800 0000 0000 0000  ........(.......
0000100: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000110: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000120: 0000 0000 0000 0000                      ........
xxd subset_1.bin # LightGBMError: Dataset max_bin 32597 != config 255
0000000: 5f5f 5f5f 5f5f 4c69 6768 7447 424d 5f42  ______LightGBM_B
0000010: 696e 6172 795f 4669 6c65 5f54 6f6b 656e  inary_File_Token
0000020: 5f5f 5f5f 5f5f 0a00 c800 0000 0000 0000  ______..........
0000030: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000040: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000050: 557f 0000 0000 0000 0000 0000 0000 0000  U...............
0000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000080: ffff ffff ffff ffff ffff ffff 0000 0000  ................
0000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000a0: ffff ffff ffff ffff ffff ffff 0000 0000  ................
00000b0: 0800 0000 0000 0000 436f 6c75 6d6e 5f30  ........Column_0
00000c0: 0800 0000 0000 0000 436f 6c75 6d6e 5f31  ........Column_1
00000d0: 0800 0000 0000 0000 436f 6c75 6d6e 5f32  ........Column_2
00000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000f0: 0000 0000 0000 0000 2800 0000 0000 0000  ........(.......
0000100: 0300 0000 0000 0000 0000 0000 0000 0000  ................
0000110: 0000 0000 0000 0000 0000 0000 0000 0000  ................
0000120: 0000 0000 0000 0000                      ........

hex(32597) = 0x7f55, which appears little-endian (55 7f) at offset 0x50 of the second header.
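To double-check, the garbage value can be decoded directly from the dump with only the standard library (a small sketch; the offset 0x50 and the 8-byte little-endian layout are read off the hex dumps above, not from any LightGBM documentation):

```python
import struct

# First 8 bytes at offset 0x50 of subset_1.bin, copied from the dump
# above; this field appears to hold the uninitialized max_bin value.
header_field = bytes.fromhex("557f000000000000")

# Decode as a little-endian 64-bit integer.
(max_bin,) = struct.unpack("<q", header_field)
print(max_bin)  # 32597, i.e. 0x7f55, matching the error message
```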

YuriWu (Author) commented Aug 8, 2022

I tried to fix it by copying these lines: https://github.com/microsoft/LightGBM/blob/master/src/io/dataset.cpp#L742-L746

  max_bin_ = dataset->max_bin_;
  min_data_in_bin_ = dataset->min_data_in_bin_;
  bin_construct_sample_cnt_ = dataset->bin_construct_sample_cnt_;
  use_missing_ = dataset->use_missing_;
  zero_as_missing_ = dataset->zero_as_missing_;

to the end of Dataset::CopyFeatureMapperFrom (L736).
After recompiling the .so, this seems to fix the problem.
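For anyone who can't rebuild the library, here is a purely diagnostic byte-level hack, not a supported fix: based only on the hex dumps above, the fields that differ between a good and a bad header appear to sit at offsets 0x50-0x7f, so one could copy that region from a correctly written header (train.bin) into the broken one (subset.bin). The offset range is my assumption from the dumps, not a documented format:

```python
# Hypothetical workaround sketch: overwrite the (apparently
# uninitialized) parameter region of a broken header with the bytes
# from a header that was written correctly. Offsets are assumptions
# taken from the xxd dumps above.
PARAM_START, PARAM_END = 0x50, 0x80

def patch_header(broken: bytes, reference: bytes) -> bytes:
    """Return `broken` with bytes [PARAM_START, PARAM_END) replaced by
    the corresponding bytes of `reference`."""
    return (broken[:PARAM_START]
            + reference[PARAM_START:PARAM_END]
            + broken[PARAM_END:])

# Example on synthetic byte strings (real use would read the .bin files):
bad = bytes(0x50) + b"\x55\x7f" + bytes(0x100 - 0x52)  # garbage at 0x50
good = bytes(0x100)                                    # clean header
patched = patch_header(bad, good)
print(patched == good)  # True: the garbage bytes were overwritten
```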

jmoralez (Collaborator) commented Aug 8, 2022

Thanks @YuriWu! I was able to reproduce the error and verified the fix you proposed indeed solves the issue. @shiyu1994 @guolinke can you check the proposed fix here?

guolinke (Collaborator) commented Aug 9, 2022

The fix looks good to me! Thank you @YuriWu.
By the way, should we add a test for it?

jmoralez (Collaborator) commented Aug 9, 2022

@YuriWu would you like to make a PR that includes your fix and a small test?

YuriWu (Author) commented Aug 11, 2022

@jmoralez Sorry, due to the IP policy of my organization, I'm not allowed to make PRs to open-source projects.
Please test my proposed fix and merge it if it looks good.

@jameslamb jameslamb added the bug label Aug 16, 2022
StrikerRUS pushed a commit that referenced this issue Aug 28, 2022
)

* include parameters from reference dataset on copy

* lint

* set non-default parameters
@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023