bin size 257 cannot run on GPU #4082

pseudotensor · 2021-03-18T17:46:34Z

I know there are a couple of other issues that mention this problem, but they have gotten messy with suggestions that it's related to the categorical_feature setting and other stuff. Here is a clean MRE.

d9a96c9

lgb257.pkl.zip

import pickle
# The filename must be a quoted string
model, X, y, kwargs = pickle.load(open("lgb257.pkl", "rb"))
model.fit(X, y, **kwargs)

FYI a model.get_params() shows:

params = {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 0.8, 'importance_type': 'gain',
          'learning_rate': 0.5, 'max_depth': 6, 'min_child_samples': 1, 'min_child_weight': 1.0, 'min_split_gain': 0.0,
          'n_estimators': 100, 'n_jobs': 8, 'num_leaves': 64, 'objective': 'binary', 'random_state': 1234,
          'reg_alpha': 0.0, 'reg_lambda': 1.0, 'silent': True, 'subsample': 0.7, 'subsample_for_bin': 200000,
          'subsample_freq': 1, 'pred_gap': None, 'pred_periods': None, 'max_bin': 255, 'scale_pos_weight': 1.0,
          'max_delta_step': 0.0, 'min_data_in_bin': 1, 'seed': 1234, 'early_stopping_limit': None, 'device_type': 'gpu',
          'gpu_device_id': 0, 'gpu_platform_id': 0, 'gpu_use_dp': True, 'feature_fraction_seed': 1235,
          'bagging_seed': 1236, 'num_threads': 8, 'num_class': 1, 'verbose': -1, 'categorical_feature': ''}

and FYI here is kwargs:

[LightGBM] [Warning] num_threads is set=8, n_jobs=8 will be ignored. Current value: num_threads=8
[LightGBM] [Warning] seed is set=1234, random_state=1234 will be ignored. Current value: seed=1234
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py:1586: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py:1590: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is []
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py:1108: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
  .format(key))
[LightGBM] [Fatal] bin size 257 cannot run on GPU
Traceback (most recent call last):
  File "/home/jon/h2oai.fullcondatest/h2oaicore/lgb257.py", line 18, in <module>
    model.fit(X, y, **kwargs)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 867, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/sklearn.py", line 637, in fit
    callbacks=callbacks, init_model=init_model)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/engine.py", line 230, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 2104, in __init__
    ctypes.byref(self.handle)))
  File "/home/jon/minicondadai/lib/python3.6/site-packages/lightgbm_gpu/basic.py", line 52, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 257 cannot run on GPU

Running

model.fit(X, y)

fails the same way, but I'm unsure whether the sklearn API then uses 'auto' for categorical_feature.


pseudotensor · 2021-03-18T18:06:33Z

Here is more minimal MRE:

import pickle
import lightgbm as lgb
X, y = pickle.load(open("lgb257b.pkl", "rb"))

params = dict(categorical_feature='', device_type='gpu', gpu_device_id=0, gpu_platform_id=0, min_data_in_bin=1, max_bin=255)
model = lgb.LGBMClassifier(**params)
model.fit(X, y, categorical_feature='')

FYI gpu_use_dp=True or False has no effect.

That is, I iterated through all the parameters: the key to the failure is (of course) running on GPU, but also min_data_in_bin=1. A value of 2 also fails, but 10 does not. So lgb is not respecting the max_bin of 255, even for numeric values.

lgb257b.pkl.zip

If this is a user error, I recommend that lgb primarily respect max_bin. E.g. when doing a hyperparameter search, fatal failures are not fun to handle. It would be best if lgb did the reasonable thing.
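Until the library handles this itself, one way to keep a hyperparameter search alive is to retry on CPU when this specific error appears. The following is a minimal sketch, not a LightGBM feature: `fit_with_cpu_fallback` and `GPU_BIN_ERROR` are hypothetical names, and the matching relies only on the error text quoted in this thread.

```python
from typing import Any, Callable, Dict, Tuple

# Substring of the fatal message quoted in this thread (assumption: the wording is stable).
GPU_BIN_ERROR = "cannot run on GPU"


def fit_with_cpu_fallback(
    make_model: Callable[..., Any],
    X: Any,
    y: Any,
    params: Dict[str, Any],
) -> Tuple[Any, str]:
    """Fit on GPU first; retry on CPU only for the bin-size error.

    Returns the fitted model and the device that was actually used.
    """
    gpu_params = dict(params, device_type="gpu")
    try:
        model = make_model(**gpu_params)
        model.fit(X, y)
        return model, "gpu"
    except Exception as exc:  # LightGBMError is a plain Exception subclass
        if GPU_BIN_ERROR not in str(exc):
            raise  # unrelated failure: do not hide it

    # Same parameters, but trained on CPU.
    cpu_params = dict(params, device_type="cpu")
    model = make_model(**cpu_params)
    model.fit(X, y)
    return model, "cpu"
```

With the parameters above, this might be called as `model, device = fit_with_cpu_fallback(lgb.LGBMClassifier, X, y, params)`. Returning the device makes it easy to log how often the fallback was taken.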

pseudotensor · 2021-03-23T06:27:14Z

Hi, any thoughts? Seems like a clear MRE, but it's been 5 days and no response. Thanks.

pseudotensor · 2021-04-02T17:49:18Z

@guolinke ?

pseudotensor · 2021-07-29T21:41:23Z

  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/sklearn.py", line 712, in fit
    self._Booster = train(params, train_set,
  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/engine.py", line 235, in train
    booster = Booster(params=params, train_set=train_set)
  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/basic.py", line 2528, in __init__
    _safe_call(_LIB.LGBM_BoosterCreate(
  File "/opt/h2oai/dai/python/lib/python3.8/site-packages/lightgbm_gpu/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 258 cannot run on GPU

Again, no categorical handling enabled etc.

This is on master as of last night.

arnocandel · 2021-10-19T15:59:45Z

@guolinke reminder - still the dominant failure mode for LightGBM in Driverless AI

guolinke · 2021-10-20T13:23:27Z

I think the old GPU/CUDA version will be abandoned.
also cc @shiyu1994 to follow up on this issue.

shiyu1994 · 2021-10-20T13:59:29Z

@arnocandel We are working on a brand new CUDA version. Please follow #4630 and #4528 for the latest progress.

pseudotensor · 2021-10-20T17:07:35Z

@shiyu1994 and @guolinke. Hi, looking at those 2 PRs made me realize that perhaps the current CUDA mode (as opposed to OpenCL) is incomplete. E.g. you mention categorical handling as being added to the CUDA version in the PR. Is that correct?

More generally, is the CUDA version incomplete in various ways that are documented? Or does it have (or will have) full parity?

If I run with CUDA version with categorical handling it seems to run fine, but maybe it's not doing what I choose even though I pass categorical_feature?

shiyu1994 · 2021-10-21T03:48:40Z

@pseudotensor The current CUDA version is doing the correct thing; it can handle categorical features normally. The only problem is that the current implementation only does histogram construction on GPU, so GPU utilization can be low.

Support for categorical features has not yet been added in the first part of our new CUDA version, #4630, but will be added later.

arnocandel · 2021-11-30T19:21:52Z

Here's another minimal repro, in case it helps

lgb.bin257.pkl.zip

import pickle
import lightgbm as lgb
print(lgb.__version__)

from lightgbm.sklearn import LGBMRegressor
with open("lgb.bin257.pkl", "rb") as f:
    X, y = pickle.load(f)
    model = LGBMRegressor(max_bin=252, device_type='gpu')
    model.fit(X, y)
    print("OK1")

    model = LGBMRegressor(max_bin=253, device_type='gpu')
    model.fit(X, y)
    print("OK2")

The first one passes, the second one fails; not sure where 257 comes from:

3.2.1.99
OK1
[LightGBM] [Fatal] bin size 257 cannot run on GPU
Traceback (most recent call last):
  File "/nfs4/lgb_prefit_1c95733f-58d6-4a61-969f-b2331e03e895.py", line 13, in <module>
    model.fit(X, y)
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/sklearn.py", line 851, in fit
    super().fit(X, y, sample_weight=sample_weight, init_score=init_score,
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/sklearn.py", line 714, in fit
    self._Booster = train(params, train_set,
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/engine.py", line 260, in train
    booster = Booster(params=params, train_set=train_set)
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/basic.py", line 2537, in __init__
    _safe_call(_LIB.LGBM_BoosterCreate(
  File "/home/arno/minicondadai_py38/lib/python3.8/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: bin size 257 cannot run on GPU

Process finished with exit code 1
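To locate the threshold on other datasets without trying every value by hand, the boundary can be bisected. This is a minimal sketch under a stated assumption: if a given max_bin fits, every smaller value fits too (consistent with 252 passing and 253 failing here, but not otherwise verified). `largest_passing_max_bin` and the `fits` callback are hypothetical names.

```python
from typing import Callable, Optional


def largest_passing_max_bin(
    fits: Callable[[int], bool], low: int = 2, high: int = 255
) -> Optional[int]:
    """Binary search for the largest max_bin in [low, high] where fits() is True.

    Assumes monotonic behaviour: if fits(b) is True, fits(b') is True for b' < b.
    Returns None when even the smallest candidate fails.
    """
    if not fits(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            low = mid  # mid works, the answer is mid or larger
        else:
            high = mid - 1  # mid fails, the answer is strictly smaller
    return low
```

Here `fits` would wrap the repro above: build `LGBMRegressor(max_bin=b, device_type='gpu')`, call `fit(X, y)` inside a try/except, and return whether it succeeded.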

jameslamb · 2021-12-28T05:24:44Z

Thanks very much @arnocandel !

But are you able to provide a reproducible example starting from raw data in a text-based format, generated from scratch with pandas / numpy / scipy code, or using a widely-distributed dataset like those available in sklearn.datasets?

I personally don't ever load pickle files whose origin I don't know, and I expect others wanting to contribute to fixing this issue might share that hesitation.

From https://docs.python.org/3/library/pickle.html

Warning The pickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

arnocandel · 2022-01-05T18:58:01Z

@jameslamb - ok
use this instead: X_y.zip

import pandas as pd
X=pd.read_csv("X.csv").values
y=pd.read_csv("y.csv").values.ravel()

lewis-morris · 2022-01-12T08:59:24Z

I'm having the same issue over here!

bin size 257 cannot run on GPU

arnocandel · 2022-02-24T02:01:14Z

@jameslamb - were you able to check with above two .csv files for X and y?

Here is the full thing for simplicity:
https://github.com/microsoft/LightGBM/files/7817145/X_y.zip

import lightgbm as lgb
print(lgb.__version__)
import pandas as pd
X=pd.read_csv("X.csv").values
y=pd.read_csv("y.csv").values.ravel()

from lightgbm.sklearn import LGBMRegressor
model = LGBMRegressor(max_bin=252, device_type='gpu')
model.fit(X, y)
print("OK1")

model = LGBMRegressor(max_bin=253, device_type='gpu')
model.fit(X, y)
print("OK2")

jameslamb · 2022-02-24T04:16:10Z

were you able to check with above two .csv files for X and y

I was not. If you're subscribed to this issue, you'll be notified when someone picks this up or has new information to share.

jiluojiluo · 2022-07-11T10:18:49Z

This is a bug in LightGBM on GPU; on CPU it works fine.

ahmedshahriar · 2023-02-12T04:09:25Z

Any update so far on this issue?

lilianabs · 2023-02-12T17:58:20Z

I'm having the same issue :(

chixujohnny · 2023-03-16T14:06:47Z

same issue too :(

holma91 · 2024-04-12T18:14:18Z

Still have this issue.

matousfamera · 2024-05-08T15:11:09Z

I have the same issue

"LightGBMError: bin size 1973 cannot run on GPU."

It runs alright using CPU.

shiyu1994 · 2024-05-08T15:44:37Z

For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version which should be compiled with -DUSE_CUDA=ON.
https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.

cocoderss · 2024-08-10T13:04:24Z

For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version which should be compiled with -DUSE_CUDA=ON. https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.

I have followed these instructions to install the CUDA version instead of the GPU version, but I still have the same issue:
LightGBMError: bin size XXX cannot run on GPU.

For more info, I am running on a Linux server with CUDA 12.1 and an A100. Let me know if more info is needed to fix this issue.

wil70 · 2024-08-13T21:48:42Z

Same issue with GPU version on windows, works fine on CPU
[LightGBM] [Fatal] bin size 260 cannot run on GPU

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
[LightGBM] [Warning] Parameter two_round works only in case of loading data directly from text file. It will be ignored when loading from binary file.
[LightGBM] [Info] Finished loading data in 286.006354 seconds
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 278556290
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Fatal] bin size 260 cannot run on GPU
Met Exceptions:
bin size 260 cannot run on GPU

cocoderss · 2024-08-23T14:26:14Z

For everyone who encounters this issue with the -DUSE_GPU=ON version of LightGBM, please check our latest GPU version which should be compiled with -DUSE_CUDA=ON. https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#build-cuda-version. Thanks.

I have realized that after compiling lightgbm with the cuda option and then running sudo sh ./build-python.sh install --precompile to install it, as described in the documentation, it defaults to installing the pip repo version. I have not verified that by inspecting the build-python.sh script, but my workaround was to build the pip wheel package myself. This solves the issue, and specifying device_type=cuda then works as expected.
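A quick way to see which copy of the package Python actually imports is to look at its reported version and file location. The sketch below uses only the standard library; `describe_install` is a hypothetical helper, and the output does not by itself prove that a build has CUDA enabled. It only shows whether the imported package is the one you built.

```python
import importlib.util
from importlib import metadata
from typing import Dict, Optional


def describe_install(package: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the installed version and import location of a package.

    Returns None if the package cannot be imported at all.
    """
    spec = importlib.util.find_spec(package)
    if spec is None:
        return None
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # importable, but no distribution metadata found
    return {"version": version, "location": spec.origin}


if __name__ == "__main__":
    # Compare the printed location with the directory the self-built wheel was installed to.
    print(describe_install("lightgbm"))
```

If the location points at a site-packages copy that predates the local build, the pip repo version is probably still the one being imported.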

On a side note, the main issue of cuda memory still persists. It relates to a categorical feature that has too many unique values (I tested by omitting that feature, and it works fine on gpu, cuda and cpu). When that feature is included, the gpu version gives LightGBMError: bin size XXX cannot run on GPU; the CPU version works but takes a very long time; and the cuda version gives the errors below (optuna study with multiple workers).

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/treelearner/cuda/cuda_best_split_finder.cu 2066

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_tree.cpp 37

[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

terminate called after throwing an instance of 'std::runtime_error'
[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

  what():  [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_tree.cpp 37

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

[LightGBM] [Warning] [CUDA] an illegal memory access was encountered /home/user/notebooks/jupiter/tmp/src/io/cuda/cuda_column_data.cpp 67

So it seems that there is a limitation in the implementation when it comes to categorical features on cuda/gpu, that requires a fix.
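To spot such features before training, the number of distinct values per column can be counted and compared with the apparent limit. This is a rough sketch under assumptions: the limit of 256 is inferred from the error messages in this thread (257, 258 and 260 all fail), and the distinct count is only a proxy for the number of bins LightGBM will build, most relevant for categorical columns. `high_cardinality_columns` and `GPU_BIN_LIMIT` are hypothetical names.

```python
from typing import Dict, Iterable, Sequence, Set

# Assumption inferred from the errors in this thread: more than 256 bins fails on GPU.
GPU_BIN_LIMIT = 256


def high_cardinality_columns(
    rows: Iterable[Sequence[object]], limit: int = GPU_BIN_LIMIT
) -> Dict[int, int]:
    """Map column index to distinct-value count for columns exceeding `limit`."""
    distinct: Dict[int, Set[object]] = {}
    for row in rows:
        for idx, value in enumerate(row):
            distinct.setdefault(idx, set()).add(value)
    return {idx: len(vals) for idx, vals in distinct.items() if len(vals) > limit}
```

Columns reported here are candidates to drop, re-encode, or group into fewer categories before trying the gpu or cuda device again.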

jameslamb added the bug label Jul 29, 2021

jameslamb mentioned this issue Oct 29, 2021

Add transform support for LightGBM by open source FreeForm2 library #4733

Closed

jameslamb mentioned this issue Nov 5, 2021

[dask] [gpu] Distributed training is VERY slow #4761

Closed

jameslamb mentioned this issue Apr 14, 2022

[RFC] 4.0.0 Release #5153

Closed

60 tasks

pseudotensor mentioned this issue Jun 5, 2022

LightGBM: Sklearn and Native API equivalence again, leads to very bad scores with sklearn API #5268

Open

CVPaul mentioned this issue Aug 4, 2023

[Bug] LightGBMError: bin size 257 cannot run on GPU #3339

Closed
