bin size 257 cannot run on GPU #4082

Open
Tracked by #5153
pseudotensor opened this issue Mar 18, 2021 · 25 comments
@pseudotensor

pseudotensor commented Mar 18, 2021

I know there are a couple of other issues that mention this problem, but they have gotten messy with suggestions that it's related to the categorical_feature setting and other things. Here is a clean MRE.

d9a96c9

lgb257.pkl.zip

import pickle
# The file name must be a string literal
model, X, y, kwargs = pickle.load(open("lgb257.pkl", "rb"))
model.fit(X, y, **kwargs)

FYI a model.get_params() shows:

params = {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'gain',
          'learning_rate': 0.5, 'max_depth': 6, 'min_child_samples': 1, 'min_child_weight': 1.0, 'min_split_gain': 0.0,
          'n_estimators': 100, 'n_jobs': 8, 'num_leaves': 64, 'objective': 'binary', 'random_state': 1234,
          'reg_alpha': 0.0, 'reg_lambda': 1.0, 'silent': True, 'subsample': 0.7, 'subsample_for_bin': 200000,
          'subsample_freq': 1, 'pred_gap': None, 'pred_periods': None, 'max_bin': 255, 'scale_pos_weight': 1.0,
          'max_delta_step': 0.0, 'min_data_in_bin': 1, 'seed': 1234, 'early_stopping_limit': None, 'device_type': 'gpu',
          'gpu_device_id': 0, 'gpu_platform_id': 0, 'gpu_use_dp': True, 'feature_fraction_seed': 1235,
          'bagging_seed': 1236, 'num_threads': 8, 'num_class': 1, 'verbose': -1, 'categorical_feature': ''}

and FYI here is kwargs:

[image: screenshot of kwargs]

[LightGBM] [Warning] num_threads is set=8, n_jobs=8 will be ignored. Current value: num_threads=8
[LightGBM] [Warning] seed is set=1234, random_state=1234 will be ignored. Current value: seed=1234
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py:1586: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py:1590: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is []
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py:1108: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
  .format(key))
[LightGBM] [Fatal] bin size 257 cannot run on GPU
Traceback (most recent call last):
  File "/home/jon/h2oai.fullcondatest/h2oaicore/lgb257.py", line 18, in <module>
    model.fit(X, y, **kwargs)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 867, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 637, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 230, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2104, in __init__
    ctypes.byref(self.handle)))
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 52, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 257 cannot run on GPU

Running

model.fit(X, y)

fails the same way, but I'm unsure whether the sklearn API then uses 'auto' for categorical_feature.

@pseudotensor
Author

pseudotensor commented Mar 18, 2021

Here is a more minimal MRE:

import pickle
import lightgbm as lgb
X, y = pickle.load(open("lgb257b.pkl", "rb"))

params = dict(categorical_feature='', device_type='gpu', gpu_device_id=0, gpu_platform_id=0, min_data_in_bin=1, max_bin=255)
model = lgb.LGBMClassifier(**params)
model.fit(X, y, categorical_feature='')

FYI gpu_use_dp=True or False has no effect.

That is, I iterated through all parameters: the failure requires running on GPU (of course) together with min_data_in_bin=1. A value of 2 also fails, but 10 does not. So lgb is not respecting max_bin=255 even for numeric features.

lgb257b.pkl.zip

If this is a user error, I recommend that lgb listen primarily to max_bin. E.g. when doing a hyperparameter search, fatal failures are not fun to handle. It would be best if lgb did the reasonable thing.
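
For hyperparameter searches, one stopgap is to catch the fatal error and retry the same configuration on CPU, which several commenters in this thread report works. This is only a minimal sketch: `fit_with_cpu_fallback` and `make_model` are hypothetical names, not part of LightGBM, and the helper matches on the error text seen in the traceback above rather than on a specific exception class.

```python
def fit_with_cpu_fallback(make_model, X, y, params):
    """Fit on the requested device; retry on CPU if the GPU bin-size error is raised.

    make_model is any callable that builds an estimator from keyword
    parameters (for example an sklearn-style LightGBM class). The estimator's
    fit() is assumed to return the fitted estimator.
    """
    try:
        return make_model(**params).fit(X, y)
    except Exception as err:
        # Match on the message text so this sketch does not import lightgbm.
        # In real code, catching lightgbm.basic.LightGBMError (the class shown
        # in the traceback) would be narrower.
        if "cannot run on GPU" not in str(err):
            raise
        # Same parameters, but forced onto the CPU.
        cpu_params = dict(params, device_type="cpu")
        return make_model(**cpu_params).fit(X, y)
```

The retry keeps every other parameter unchanged, so a search can record that the trial ran on CPU instead of aborting.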

@pseudotensor
Author

Hi, any thoughts? Seems like a clear MRE, but it's been 5 days and no response. Thanks.

@pseudotensor
Author

@guolinke ?

@pseudotensor
Author

pseudotensor commented Jul 29, 2021

  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/sklearn.py", line 712, in fit
    self._Booster = train(params, train_set,
  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/engine.py", line 235, in train
    booster = Booster(params=params, train_set=train_set)
  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/basic.py", line 2528, in __init__
    _safe_call(_LIB.LGBM_BoosterCreate(
  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 258 cannot run on GPU

Again, no categorical handling enabled etc.

This is on master as of last night.

@jameslamb jameslamb added the bug label Jul 29, 2021
@arnocandel

@guolinke reminder - still the dominant failure mode for LightGBM in Driverless AI

@guolinke
Collaborator

I think the old GPU/CUDA version will be abandoned.
Also cc @shiyu1994 to follow up on this issue.

@shiyu1994
Collaborator

@arnocandel We are working on a brand new CUDA version. Please follow #4630 and #4528 for the latest progress.

@pseudotensor
Author

@shiyu1994 and @guolinke, hi. Looking at those 2 PRs made me realize that perhaps the current CUDA mode (as opposed to OpenCL) is incomplete. E.g. you mention categorical handling as being added to the CUDA version in the PR. Is that correct?

More generally, is the CUDA version incomplete in various ways that are documented? Or does it have (or will have) full parity?

If I run with CUDA version with categorical handling it seems to run fine, but maybe it's not doing what I choose even though I pass categorical_feature?

@shiyu1994
Collaborator

@pseudotensor The current CUDA version is doing the correct thing; it can handle categorical features normally. The only problem is that the current implementation only does histogram construction on GPU, so GPU utilization can be low.

Support for categorical features has not yet been added in the first part of our new CUDA version #4630, but will be added later.

@arnocandel

arnocandel commented Nov 30, 2021

Here's another minimal repro, in case it helps

lgb.bin257.pkl.zip

import pickle
import lightgbm as lgb
print(lgb.__version__)

from lightgbm.sklearn import LGBMRegressor
with open("lgb.bin257.pkl", "rb") as f:
    X, y = pickle.load(f)
    model = LGBMRegressor(max_bin=252, device_type='gpu')
    model.fit(X, y)
    print("OK1")

    model = LGBMRegressor(max_bin=253, device_type='gpu')
    model.fit(X, y)
    print("OK2")

The first one passes and the second one fails; not sure where 257 comes from:

3.2.1.99
OK1
[LightGBM] [Fatal] bin size 257 cannot run on GPU
Traceback (most recent call last):
  File "/nfs4/lgb_prefit_1c95733f-58d6-4a61-969f-b2331e03e895.py", line 13, in <module>
    model.fit(X, y)
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/sklearn.py", line 851, in fit
    super().fit(X, y, sample_weight=sample_weight, init_score=init_score,
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/sklearn.py", line 714, in fit
    self._Booster = train(params, train_set,
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/engine.py", line 260, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/basic.py", line 2537, in __init__
    _safe_call(_LIB.LGBM_BoosterCreate(
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 257 cannot run on GPU

Process finished with exit code 1
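
To pin down the exact threshold on a given dataset rather than testing values by hand, a binary search over max_bin is one option. The sketch below is hedged: `largest_working_max_bin` and `fits_ok` are hypothetical names, and it assumes that if one max_bin value works then every smaller value works too, which this thread does not confirm.

```python
def largest_working_max_bin(fits_ok, low=2, high=255):
    """Return the largest max_bin in [low, high] for which fits_ok(max_bin) is True.

    fits_ok is a caller-supplied predicate, e.g. one that builds a model with
    the given max_bin on GPU, calls fit(), and returns False if the
    "cannot run on GPU" error is raised. Returns None if even `low` fails.
    """
    if not fits_ok(low):
        return None
    # Invariant: fits_ok(low) is True; everything above `high` is ruled out.
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if fits_ok(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

With the repro above, the predicate would wrap the `LGBMRegressor(max_bin=..., device_type='gpu')` fit call in a try/except.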

@jameslamb
Collaborator

Thanks very much @arnocandel !

But are you able to provide a reproducible example starting from raw data in a text-based format, generated from scratch with pandas / numpy / scipy code, or using a widely-distributed dataset like those available in sklearn.datasets?

I personally don't ever load pickle files whose origin I don't know, and I expect others wanting to contribute to fixing this issue might share that hesitation.

From https://docs.python.org/3/library/pickle.html

Warning The pickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
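
One way to share data as text instead of pickle is to write it out as CSV with a header row. The sketch below uses only the standard library; `make_synthetic` and `write_matrix_csv` are hypothetical helpers, and the synthetic values are placeholders that are not known to trigger the error.

```python
import csv
import random


def make_synthetic(n_rows, n_cols, seed=0):
    """Build a reproducible list of rows of floats in [0, 1) from a fixed seed."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]


def write_matrix_csv(path, rows, prefix="f"):
    """Write equal-length numeric rows to a CSV file with a header row."""
    n_cols = len(rows[0])
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Header names are f0, f1, ... by default.
        writer.writerow([f"{prefix}{i}" for i in range(n_cols)])
        writer.writerows(rows)
```

For example, `write_matrix_csv("X.csv", make_synthetic(100, 5))` produces a file that anyone can inspect in a text editor before loading it.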

@arnocandel

arnocandel commented Jan 5, 2022

@jameslamb - ok
use this instead: X_y.zip

import pandas as pd
X=pd.read_csv("X.csv").values
y=pd.read_csv("y.csv").values.ravel()

@lewis-morris

I'm having the same issue over here!

bin size 257 cannot run on GPU

@arnocandel

arnocandel commented Feb 24, 2022

@jameslamb - were you able to check with the above two .csv files for X and y?

Here is the full thing for simplicity:
https://github.com/microsoft/LightGBM/files/7817145/X_y.zip

import lightgbm as lgb
print(lgb.__version__)
import pandas as pd
X=pd.read_csv("X.csv").values
y=pd.read_csv("y.csv").values.ravel()

from lightgbm.sklearn import LGBMRegressor
model = LGBMRegressor(max_bin=252, device_type='gpu')
model.fit(X, y)
print("OK1")

model = LGBMRegressor(max_bin=253, device_type='gpu')
model.fit(X, y)
print("OK2")

@jameslamb
Collaborator

were you able to check with above two .csv files for X and y

I was not. If you're subscribed to this issue, you'll be notified when someone picks this up or has new information to share.

@jiluojiluo

This is a bug in LightGBM on GPU; on CPU it works fine.

@ahmedshahriar

Any update so far on this issue?

@lilianabs

I'm having the same issue :(

@chixujohnny

same issue too :(

@holma91

holma91 commented Apr 12, 2024

Still have this issue.

@matousfamera

I have the same issue

"LightGBMError: bin size 1973 cannot run on GPU."

It runs alright using CPU.

@shiyu1994
Collaborator

For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version which should be compiled with -DUSE_CUDA=ON.
https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.

@cocoderss

For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version which should be compiled with -DUSE_CUDA=ON. https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.

I have followed these instructions to install the CUDA version instead of the GPU version, but I still have the same issue:
LightGBMError: bin size XXX cannot run on GPU.

For more info, I am running on a Linux server with CUDA 12.1 and an A100. Let me know if more info is needed to fix this issue.

@wil70

wil70 commented Aug 13, 2024

Same issue with the GPU version on Windows; works fine on CPU
[LightGBM] [Fatal] bin size 260 cannot run on GPU

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
[LightGBM] [Warning] Parameter two_round works only in case of loading data directly from text file. It will be ignored when loading from binary file.
[LightGBM] [Info] Finished loading data in 286.006354 seconds
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 278556290
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Fatal] bin size 260 cannot run on GPU
Met Exceptions:
bin size 260 cannot run on GPU

@cocoderss

For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version which should be compiled with -DUSE_CUDA=ON. https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.

I have realized that after compiling lightgbm with the CUDA option and then running sudo sh ./build-python.sh install --precompile to install it, as described in the documentation, the install defaults to the pip repo version. I have not verified this by inspecting the build-python.sh script, but my workaround was to build the pip wheel package myself. This solves the issue, and specifying device_type=cuda then works as expected.
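
A quick way to see which copy of a package Python will actually import is to print its installed version and file location. This is a minimal sketch using only the standard library (Python 3.8+); `describe_install` is a hypothetical helper, and it cannot by itself tell whether a build was compiled with CUDA support.

```python
import importlib.util
from importlib import metadata


def describe_install(package):
    """Return (version, location) for a package; each is None if not found."""
    try:
        # Version recorded in the installed distribution's metadata.
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    # Locate the module on the import path without importing it.
    spec = importlib.util.find_spec(package)
    location = spec.origin if spec is not None else None
    return version, location


print(describe_install("lightgbm"))
```

If the location points into site-packages from a pip download rather than the local build tree, the locally compiled library is probably not the one being loaded.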

On a side note, the main issue of CUDA memory still persists. It relates to a categorical feature that has too many unique values (I tested by omitting that feature, and training then works fine on gpu, cuda and cpu). With that feature included, the gpu version raises LightGBMError: bin size XXX cannot run on GPU; the CPU version works but takes a very long time; and the cuda version produces the errors below (optuna study with multiple workers).

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/treelearner/cuda/cuda_best_split_finder.cu 2066

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_tree.cpp 37

[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

terminate called after throwing an instance of 'std::runtime_error'
[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

  what():  [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_tree.cpp 37

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

So it seems there is a limitation in how the cuda/gpu implementations handle categorical features, and it requires a fix.
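
To find which columns might be responsible before training, one rough check is to count distinct values per column and flag the large ones. This is a hedged sketch: `high_cardinality_columns` is a hypothetical helper, the default `limit=256` is an assumption inferred from the "bin size 257" messages in this thread rather than a documented constant, and a distinct-value count is only a proxy for the bin count LightGBM actually computes.

```python
def high_cardinality_columns(rows, limit=256):
    """Return {column_index: n_unique} for columns with more than `limit` distinct values.

    rows is a list of equal-length rows (lists or tuples of hashable values).
    """
    n_cols = len(rows[0])
    seen = [set() for _ in range(n_cols)]
    for row in rows:
        for j, value in enumerate(row):
            seen[j].add(value)
    # Keep only the columns whose distinct-value count exceeds the limit.
    return {j: len(s) for j, s in enumerate(seen) if len(s) > limit}
```

Columns flagged this way are candidates for dropping or re-encoding when testing whether the error goes away.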
