Check failed: (best_split_info.left_count) > (0) #4946

Open · Tracked by #5153
chixujohnny opened this issue Jan 13, 2022 · 28 comments
@chixujohnny

Hi, I found a bug when training with large X_train.

lgb-gpu version: 3.3.2
CUDA=11.1
CentOS
ram=2TB
GPU=A100-40G

When X_train is larger than about (18,000,000, 1000), lightgbm-gpu fails with:
[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at LightGBM/src/treelearner/serial_tree_learner.cpp, line 686

With LightGBM 3.2.1 I had the same problem as #4480: once GPU memory usage exceeds about 8.3 GB, I get a Memory Object Allocation Failure.

With version 3.3.2, LightGBM never loads more than about 17 GB of GPU memory (the GPU has 40 GB). Something seems to go wrong in the tree-splitting step, and it only happens once more than roughly 17 GB of GPU memory is in use.

A colleague of mine has the same problem with version 3.3.1.

@jameslamb added the bug label Jan 13, 2022
@jameslamb
Collaborator

Thanks very much for using LightGBM!

I suspect that this is the same as other issues that have been reported (e.g. #4739, #3679), but it's hard to say without more details.

Are you able to provide any (and hopefully all) of the following?

  • specific commands you used to install LightGBM
  • the most minimal possible complete version of the code you're running which can reproduce this issue

That would be very helpful. Without such information and with only an error message, significant investigation will probably be required to figure out why you encountered this error.

@chixujohnny
Author

Thanks for your reply!

I built from source with this command:
mkdir build ; cd build ; cmake -DUSE_GPU=1 .. && make -j

Sorry, I can't paste my code because it is on my company machine. I can't take screenshots or copy it, but I can describe it:

import numpy as np
from lightgbm import LGBMRegressor

X = np.random.rand(18000000, 1000).astype('float32')
y = np.random.rand(18000000).astype('float32')
model = LGBMRegressor(**params)  # params shared in a later comment
model.fit(X, y)

@jameslamb
Collaborator

That's ok, completely understand that the code might be sensitive.

Are you able to share the values for params? Configuration you're using might give us some clues to help narrow this down.

@chixujohnny
Author

params = {
    'n_estimators': 500,
    'learning_rate': 0.05,
    'subsample_freq': 6,
    'subsample': 0.91,
    'colsample_bytree': 0.83,
    'colsample_bynode': 0.78,
    'num_leaves': 64,
    'max_depth': 8,
    'reg_alpha': 9,
    'reg_lambda': 3.5,
    'min_child_samples': 200,
    'min_child_weight': 88,
    'max_bin': 71,
    'enable_sparse': False,
    'device_type': "gpu",
    'gpu_use_dp': False
}

@chixujohnny
Author

That's ok, completely understand that the code might be sensitive.

Are you able to share the values for params? Configuration you're using might give us some clues to help narrow this down.

Hello, could you find a GPU with more than 17 GB of memory (such as a V100-32G or A100-40G), generate a random dataset, and try to reproduce my code?

@jameslamb
Collaborator

I personally don't have easy access to hardware like that. I might try at some point to get a VM from a cloud provider and work on some of the open GPU-specific issues in this project, but can't commit to that. Maybe some other maintainer or contributor will be able to help you.

@lironle6

That's ok, completely understand that the code might be sensitive.
Are you able to share the values for params? Configuration you're using might give us some clues to help narrow this down.

Hello, could you find a GPU with more than 17 GB of memory (such as a V100-32G or A100-40G), generate a random dataset, and try to reproduce my code?

I have an A100-40G and the above code fails.
Interestingly, when I run it on 9M rows instead of 18M, it doesn't fail. A quick calculation (worked out below) shows that 9M rows is just below the 40 GB mark, so this might be related, or might not.
In my own code I get this error unless I lower num_leaves, and I have to tweak it again if I change max_bin, in case that helps.
This is also company property so won't be able to share the data or code.
Other than that, will help any way I can.
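
For scale, here is the quick calculation mentioned above. This is only a back-of-the-envelope sketch of the raw float32 feature matrix; LightGBM's actual GPU memory use depends on max_bin and its internal binned representation, so treat it as a rough bound rather than a measurement.

# rough size of the raw float32 feature matrix, in GB
cols, bytes_per_value = 1000, 4                      # 1000 features, 4 bytes per float32
print( 9_000_000 * cols * bytes_per_value / 1e9)     # ~36 GB, just under the 40 GB card
print(18_000_000 * cols * bytes_per_value / 1e9)     # ~72 GB, well beyond it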

@pavlexander

I shared some details on the issue in another thread: #2793 (comment)

@habemus-papadum

habemus-papadum commented Mar 3, 2023

I was able to reproduce this.

Details are below, but in short, running @chixujohnny's sample code:

[LightGBM] [Info] Number of data points in the train set: 18000000, number of used features: 1000
[LightGBM] [Info] Using requested OpenCL platform 0 device 7
[LightGBM] [Info] Using GPU Device: NVIDIA A100-SXM4-40GB, Vendor: NVIDIA Corporation

dies with
[LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) at /home/me/src/LightGBM/src/treelearner/serial_tree_learner.cpp, line 682

If I switch from OpenCL to CUDA ('device_type': 'cuda'), then CUDA dies reporting out of memory.
Note: while CUDA does report out of memory, execution continues with repeated fatal "out of memory" log messages and finally ends in a segfault, so the CUDA tree learner also needs more robust error checking.

Details (for the curious):

# this recreates the error
import numpy as np
from lightgbm import LGBMRegressor
import sys

params = {
    'n_estimators': 500,
    'learning_rate': 0.05,
    'subsample_freq': 6,
    'subsample': 0.91,
    'colsample_bytree': 0.83,
    'colsample_bynode': 0.78,
    'num_leaves': 64,
    'max_depth': 8,
    'reg_alpha': 9,
    'reg_lambda': 3.5,
    'min_child_samples': 200,
    'min_child_weight': 88,
    'max_bin': 71,
    'enable_sparse': False,
    'device_type': "gpu",
    'gpu_use_dp': False,
    'gpu_platform_id': 0,
    'gpu_device_id': 7,
}

n = 18*1000*1000
m = 1000
if len(sys.argv) == 2:
    n = int(sys.argv[1])
samples = n*m  # compute after the optional override of n
print(f"N: {n:,}, Samples: {samples:,}")
# pre-generated data (see below)
X = np.memmap('X.mmap', dtype='float32', mode='r', shape=(n,m))
y = np.memmap('y.mmap', dtype='float32', mode='r', shape=(n,))

model = LGBMRegressor(**params)
model.fit(X, y)

# building LightGBM
#build boost
wget https://boostorg.jfrog.io/artifactory/main/release/1.81.0/source/boost_1_81_0.tar.gz
tar xvf boost_1_81_0.tar.gz
cd boost_1_81_0
./bootstrap.sh --prefix=/home/me/install
./b2 install

#build LightGBM
# version
#commit e4231205a3bac13662a81db9433ddaea8924fbce (HEAD -> master, origin/master, origin/HEAD)
#Author: James Lamb <[email protected]>
#Date:   Tue Feb 28 23:35:20 2023 -0600
#
 #   [python-package] use keyword arguments in predict() calls (#5755)

cmake .. -DUSE_CUDA=YES -DUSE_GPU=YES  -DBOOST_ROOT:PATHNAME=/home/me/install/

#install python
conda create -n LGBM python=3.9 numpy scipy scikit-learn
conda activate LGBM
export PYTHONPATH=$(pwd)/python-package
# data generation
# why? I hit some weird issue where generating random numbers slows from ~3 ns per sample for a few million samples
# to ~200 ns per sample at ~18e9 samples. The issue is mitigated by generating into memmapped memory (~10 ns per sample).
# mystery for another day....

import numpy as np
from lightgbm import LGBMRegressor
import sys
from time import time_ns

n = 18*1000*1000
m = 1000
if len(sys.argv) == 2:
    n = int(sys.argv[1])
samples = n*m  # compute after the optional override of n
print(f"N: {n:,}, Samples: {samples:,}")

rng = np.random.default_rng()

print("generating data...")
s = time_ns()
# generate directly into memory-mapped storage (generating into a plain
# in-RAM array first turned out to be much slower, see the note above)
X = np.memmap('X.mmap', dtype='float32', mode='w+', shape=(n,m))
rng.random((n, m), 'float32', out=X)
e = time_ns()
print(f"  {(e-s)/samples:.2f} ns per sample, {(e-s)/1e9:.3f} s elapsed")
print("flushing")
X.flush()

y = np.memmap('y.mmap', dtype='float32', mode='w+', shape=(n,))
rng.random((n, ), 'float32', out=y)
y.flush()

@habemus-papadum

Hi @jameslamb and others,
I've attached a log that demonstrates the [LightGBM] [Fatal] Check failed: (best_split_info.left_count) > (0) error.
(with GPU_DEBUG=5, 'verbose': 3, 'seed': 1234, 'gpu_use_dp': True; full script below).

It is possible for me to deterministically recreate this error.

I was wondering if you had any pointers about how to go about debugging this further.

  • Is left_count==0 usually indicative of a NaN or Inf somewhere?
  • GPUTreeLearner::ConstructHistograms and SerialTreeLearner::FindBestSplitsFromHistograms seem to be the places to look, but I'm not entirely sure what to look for...

Thanks in advance for any advice!

log3_trimmed.txt

@habemus-papadum

Code for the above:

import numpy as np
from lightgbm import LGBMRegressor
import sys
from time import time_ns

params = {
    'n_estimators': 500,
    'learning_rate': 0.05,
    'subsample_freq': 6,
    'subsample': 0.91,
    'colsample_bytree': 0.83,
    'colsample_bynode': 0.78,
    'num_leaves': 64,
    'max_depth': 8,
    'reg_alpha': 9,
    'reg_lambda': 3.5,
    'min_child_samples': 200,
    'min_child_weight': 88,
    'max_bin': 71,
    'enable_sparse': False,
    'device_type': "gpu",
    'gpu_use_dp': True,
    'gpu_platform_id': 0,
    'gpu_device_id': 7,
    'seed': 1234,
    'verbose': 3
}

n = 18*1000*1000
m = 1000
if len(sys.argv) == 2:
    n = int(sys.argv[1])

samples = n*m
print(f"N: {n:,}, Samples: {samples:,}")
X = np.memmap('/raid/scratch/nehalp/X64.mmap', dtype='float64', mode='r', shape=(n,m))
y = np.memmap('/raid/scratch/nehalp/y64.mmap', dtype='float64', mode='r', shape=(n,))

model = LGBMRegressor(**params)
print("Calling fit...")
model.fit(X, y)

@tolleybot

I was able to successfully run the large dataset with a change to src/treelearner/ocl/histogram256.cl. I wanted to see if there was some sort of type difference between C++ and OpenCL.
The kernel is defined as follows (with #ifdef constants removed from this post for clarity):

__kernel void histogram256(
    __global const uchar4* feature_data_base,
    __constant const uchar4* restrict feature_masks __attribute__((max_constant_size(65536))),
    const data_size_t feature_size,
    __global const data_size_t* data_indices,
    const data_size_t num_data,
    const score_t const_hessian,
    __global const score_t* ordered_gradients, // <----- change to: __global const * ordered_gradients
    __global char* restrict output_buf,
    __global volatile int * sync_counters,
    __global acc_type* restrict hist_buf_base
)

However, if you redeclare ordered_gradients as __global const * ordered_gradients, the type is filled in from context, and the large training set runs. At first, I thought score_t was defined differently in the OpenCL code and the C++ code, but I verified that both are floats.
To validate the results, I ran two smaller dummy datasets with ordered_gradients declared explicitly and not, compared the resulting model files, and found that they were identical.

It's not yet clear to me why the change allows the program to finish and I am investigating this.
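
As a concrete illustration of the validation step described above, here is a minimal sketch (an assumed setup, not the exact script that was used): train on the same small random dataset with a fixed seed once per build of histogram256.cl, save each model as text, and diff the resulting files.

import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X = rng.random((10_000, 50), dtype=np.float32)
y = rng.random(10_000, dtype=np.float32)

model = LGBMRegressor(n_estimators=20, device_type='gpu', seed=1234)
model.fit(X, y)
# run once per build, writing a different file name each time, then compare
# the two text files (for example with filecmp.cmp or `diff`)
model.booster_.save_model('model_this_build.txt')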

@jameslamb
Collaborator

Thank you so much for the help! Whenever you feel you've identified the root cause, if you'd like to open a pull request we'd appreciate it, and can help with the contribution and testing process.

@Bhuvanamitra

I have the same issue. I changed the histogram256.cl file as advised in this thread, but the issue still persists.
Surprisingly, if I add 'is_unbalance=true' to my config, it trains the model without any problem.
I am running a huge dataset of 14 GB on GPU.
Can someone explain why adding 'is_unbalance=true' made this work?

@tolleybot

That's interesting. I'll see if I can mimic that later this week when I have more bandwidth. I did trace down where that parameter is used: binary_objective.hpp, line 93:

if (is_unbalance_ && cnt_positive > 0 && cnt_negative > 0) {
  if (cnt_positive > cnt_negative) {
    label_weights_[1] = 1.0f;
    label_weights_[0] = static_cast<double>(cnt_positive) / cnt_negative;
  } else {
    label_weights_[1] = static_cast<double>(cnt_negative) / cnt_positive;
    label_weights_[0] = 1.0f;
  }
}
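
For readers who don't follow C++, here is a rough Python restatement of that branch (an illustration of the weighting logic only, not LightGBM's actual code): when is_unbalance is set and both classes are present, the majority class keeps weight 1.0 and the minority class is up-weighted by the ratio of the class counts.

def unbalance_label_weights(cnt_positive, cnt_negative):
    # mirrors the C++ branch above
    if cnt_positive > 0 and cnt_negative > 0:
        if cnt_positive > cnt_negative:
            return {1: 1.0, 0: cnt_positive / cnt_negative}
        return {1: cnt_negative / cnt_positive, 0: 1.0}
    # outside the guard the C++ code leaves the weights untouched (1.0 here for illustration)
    return {1: 1.0, 0: 1.0}

print(unbalance_label_weights(900, 100))  # {1: 1.0, 0: 9.0}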

@tolleybot

I still get the same error without modifying the cl file and using this config. Let me know if yours differs @Bhuvanamitra :

task = train
objective = regression
boosting_type = gbdt
metric = l2
num_leaves = 64
max_depth = 8
learning_rate = 0.05
n_estimators = 500
subsample_freq = 6
subsample = 0.91
colsample_bytree = 0.83
colsample_bynode = 0.78
min_child_samples = 200
min_child_weight = 88
max_bin = 71
enable_sparse = false
device_type = gpu
gpu_use_dp = false
gpu_platform_id = 0
gpu_device_id = 7
is_unbalance=true

@tolleybot

tolleybot commented Jun 5, 2023

This error seems to occur during the tree building process, indicating a situation where a split was found but the left child of the split doesn't contain any data.

As a workaround, I found that setting min_split_gain to 1 avoids this issue. However, I understand that this solution might not be suitable for all use cases, as it could potentially make the model more conservative about creating new splits, which might not yield the best results in all scenarios.

I wanted to bring this to your attention and see if there might be a more general solution to this issue. Any insights or suggestions would be greatly appreciated.
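
For concreteness, a minimal sketch of that workaround, reusing a few parameters from the original report (illustrative values only; as noted above, raising min_split_gain makes the model more conservative about splitting, so it may not suit every use case):

from lightgbm import LGBMRegressor

model = LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=64,
    max_depth=8,
    max_bin=71,
    device_type='gpu',
    min_split_gain=1.0,  # workaround reported above: reject low-gain splits
)
# model.fit(X, y)  # X, y as in the reproduction scripts earlier in the thread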

@mglowacki100

Hi @tolleybot
I'm able to reproduce this issue with the example from AutoGluon (https://auto.gluon.ai/stable/tutorials/tabular/tabular-quick-start.html) after manually upgrading lightgbm to version 4.0.0. I ran it on Google Colab CPU.
Unfortunately, the min_split_gain trick doesn't work here, as I got this error:

Fitting model: LightGBM ...
	Warning: Exception caused LightGBM to fail during training... Skipping this model.
		Check failed: (best_split_info.right_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 855 .

Detailed Traceback:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/autogluon/core/trainer/abstract_trainer.py", line 1733, in _train_and_save
    model = self._train_single(X, y, model, X_val, y_val, total_resources=total_resources, **model_fit_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/autogluon/core/trainer/abstract_trainer.py", line 1684, in _train_single
    model = model.fit(X=X, y=y, X_val=X_val, y_val=y_val, total_resources=total_resources, **model_fit_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/autogluon/core/models/abstract/abstract_model.py", line 829, in fit
    out = self._fit(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/autogluon/tabular/models/lgb/lgb_model.py", line 194, in _fit
    self.model = train_lgb_model(early_stopping_callback_kwargs=early_stopping_callback_kwargs, **train_params)
  File "/usr/local/lib/python3.10/dist-packages/autogluon/tabular/models/lgb/lgb_utils.py", line 124, in train_lgb_model
    return lgb.train(**train_params)
  File "/usr/local/lib/python3.10/dist-packages/lightgbm/engine.py", line 266, in train
    booster.update(fobj=fobj)
  File "/usr/local/lib/python3.10/dist-packages/lightgbm/basic.py", line 3557, in update
    _safe_call(_LIB.LGBM_BoosterUpdateOneIter(
  File "/usr/local/lib/python3.10/dist-packages/lightgbm/basic.py", line 237, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /__w/1/s/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 855 .

No base models to train on, skipping auxiliary stack level 2...

Note that this time it complains about right_count rather than left_count.

@WatsonCao

Same issue. Is there any way to avoid it, for example by changing some parameters?

@dsilverberg95

I had this problem and resolved it by setting min_child_samples to 1 and min_child_weight to 1/X_train.shape[0]. I received the error when they were both set to 0.
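
In code, that suggestion looks roughly like the following sketch (as the reply below notes, it did not fix the GPU failure for the original reporter):

from lightgbm import LGBMRegressor

n_rows = 18_000_000  # e.g. X_train.shape[0] for your training matrix
model = LGBMRegressor(
    min_child_samples=1,
    min_child_weight=1.0 / n_rows,
    device_type='gpu',
)
# model.fit(X_train, y_train)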

@chixujohnny
Author

I had this problem and resolved it by setting min_child_samples to 1 and min_child_weight to 1/X_train.shape[0]. I received the error when they were both set to 0.

I tried setting min_child_samples to 1 and min_child_weight to 1/X_train.shape[0]. That doesn't solve this error for me.

@wqxl309

wqxl309 commented Jan 23, 2024

Has this problem been fixed in version 4.0.0 or any later version?

@flexlev

flexlev commented Feb 21, 2024

I still have this issue using version 4.3.0. Anyone found a solution? Anything I can do to help solve the problem?

@wqxl309

wqxl309 commented Feb 29, 2024

I still have this issue using version 4.3.0. Anyone found a solution? Anything I can do to help solve the problem?

I updated from 3.3.2 (the version where I had this problem) to 4.3.0 and did not recompile anything (so my previous GPU-related build came from 3.3.2, built with cmake). I just ran the previous code directly and the problem disappeared; not sure what happened exactly.

@chixujohnny
Author

@jameslamb
Hello, this issue has been troubling me for many years. Currently my training data far exceeds 64 GB, yet I can only sample a 64 GB subset, which inevitably limits the model's generalization to some extent. Could you please prioritize a fix for this? Many people have reported this problem, but it remains unresolved despite version updates.

@chixujohnny
Author

I still have this issue using version 4.3.0. Anyone found a solution? Anything I can do to help solve the problem?

I updated from 3.3.2 (the version where I had this problem) to 4.3.0 and did not recompile anything (so my previous GPU-related build came from 3.3.2, built with cmake). I just ran the previous code directly and the problem disappeared; not sure what happened exactly.

I will give it a try to see whether it resolves the issue and get back to you soon. This problem has truly been a long-standing source of frustration for me.

@jameslamb
Collaborator

Do you have an NVIDIA GPU? If so, please try the CUDA version of LightGBM instead.

Instructions for that build:

To use it, pass {"device": "cuda"} in params.

That version is more actively maintained and faster, and might not suffer from this issue.

Many people have reported this problem, but it remains unresolved despite version updates.

The OpenCL-based GPU version of LightGBM is effectively unmaintained right now.

For those "many people" watching this, here's how you could help:

  1. provide a clear, minimal, reproducible example that always triggers this error
  2. if you understand OpenCL, please come work on updating the -DUSE_GPU=ON build of LightGBM here, and investigate this issue
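
For anyone trying that switch, a minimal sketch (assuming an NVIDIA GPU and a LightGBM build compiled with CUDA support, e.g. cmake -DUSE_CUDA=ON; see the official installation guide for the authoritative build steps):

from lightgbm import LGBMRegressor

# "device" is an alias of "device_type"; "cuda" selects the CUDA tree learner
# instead of the OpenCL-based learner used with device_type="gpu"
model = LGBMRegressor(device='cuda', n_estimators=500, num_leaves=64, max_depth=8)
# model.fit(X, y)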

@chixujohnny
Author

Do you have an NVIDIA GPU? If so, please try the CUDA version of LightGBM instead.

Instructions for that build:

To use it, pass {"device": "cuda"} in params.

That version is more actively maintained and faster, and might not suffer from this issue.

Many people have reported this problem, but it remains unresolved despite version updates.

The OpenCL-based GPU version of LightGBM is effectively unmaintained right now.

For those "many people" watching this, here's how you could help:

  1. provide a clear, minimal, reproducible example that always triggers this error

  2. if you understand OpenCL, please come work on updating the -DUSE_GPU=ON build of LightGBM here, and investigate this issue

Thank you very much, I'll give it a try
