[CUDA] illegal memory access when using CUDA and large max_bin and large dataset #6512

Open · LZhen0711 opened this issue Jul 1, 2024 · 5 comments

@LZhen0711
Description

When using the CUDA histogram implementation built from the master branch, the simple Python code below reports an illegal memory access error if a large max_bin size is used.

Reproducible example

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import lightgbm as lgbm
X,y = make_regression(n_samples=4000000, n_features=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = lgbm.LGBMRegressor(device="cuda", max_bin=300)
model.fit(X_train, y_train)

It reports the following error:

[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Info] Total Bins 15000
[LightGBM] [Info] Number of data points in the train set: 3000000, number of used features: 50
[LightGBM] [Info] Start training from score 0.023500
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /usr/local/src/lightgbm/LightGBM/src/treelearner/cuda/cuda_data_partition.cu 987

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /usr/local/src/lightgbm/LightGBM/src/io/cuda/cuda_tree.cpp 37

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] an illegal memory access was encountered /usr/local/src/lightgbm/LightGBM/src/io/cuda/cuda_tree.cpp 37

Aborted
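
Not part of the original report, but a debugging aid that may help narrow this down: CUDA errors from asynchronous kernel launches are often only reported at a later synchronization point, so the file/line in the log above may not be where the illegal access actually happens. Forcing synchronous launches with the standard CUDA_LAUNCH_BLOCKING environment variable (set before any CUDA context is created) can surface the failure closer to the offending kernel. A sketch of the same reproducer with that variable set:

# Debugging sketch (assumption: setting the variable at the top of the script is early
# enough, since LightGBM only initializes CUDA once training starts).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # force synchronous CUDA kernel launches

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import lightgbm as lgbm

X, y = make_regression(n_samples=4_000_000, n_features=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = lgbm.LGBMRegressor(device="cuda", max_bin=300)
model.fit(X_train, y_train)  # expected to hit the illegal memory access reported above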

Environment info

GPU: NVIDIA GeForce RTX 3060
Python: 3.12.4
LightGBM version or commit hash: master branch

Command(s) you used to install LightGBM (the image was built with the following Dockerfile):

# FROM nvidia/cuda:8.0-cudnn5-devel
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04


#################################################################################################################
#           Global
#################################################################################################################
# apt-get to skip any interactive post-install configuration steps with DEBIAN_FRONTEND=noninteractive and apt-get install -y

ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ARG DEBIAN_FRONTEND=noninteractive

#################################################################################################################
#           Global Path Setting
#################################################################################################################

ENV CUDA_HOME /usr/local/cuda
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/usr/local/lib

ENV OPENCL_LIBRARIES /usr/local/cuda/lib64
ENV OPENCL_INCLUDE_DIR /usr/local/cuda/include

#################################################################################################################
#           TINI
#################################################################################################################

# Install tini
ENV TINI_VERSION v0.14.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini

#################################################################################################################
#           SYSTEM
#################################################################################################################
# update: downloads the package lists from the repositories and "updates" them to get information on the newest versions of packages and their
# dependencies. It will do this for all repositories and PPAs.

RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
curl \
bzip2 \
ca-certificates \
libglib2.0-0 \
libxext6 \
libsm6 \
libxrender1 \
git \
vim \
mercurial \
subversion \
cmake \
libboost-dev \
libboost-system-dev \
libboost-filesystem-dev \
gcc \
g++

# Add OpenCL ICD files for LightGBM
RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

#################################################################################################################
#           CONDA
#################################################################################################################

ARG CONDA_DIR=/opt/miniforge
# add to path
ENV PATH $CONDA_DIR/bin:$PATH

# Install miniforge
RUN echo "export PATH=$CONDA_DIR/bin:"'$PATH' > /etc/profile.d/conda.sh && \
    curl -sL https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -o ~/miniforge.sh && \
    /bin/bash ~/miniforge.sh -b -p $CONDA_DIR && \
    rm ~/miniforge.sh

RUN conda config --set always_yes yes --set changeps1 no && \
    conda create -y -q -n py3 numpy scipy scikit-learn jupyter notebook ipython pandas matplotlib

#################################################################################################################
#           LightGBM
#################################################################################################################

RUN cd /usr/local/src && mkdir lightgbm && cd lightgbm && \
    git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && \
    mkdir build && cd build && cmake -DUSE_CUDA=1 .. && make -j4 && cd ..

ENV PATH /usr/local/src/lightgbm/LightGBM:${PATH}

RUN /bin/bash -c "source activate py3 && cd /usr/local/src/lightgbm/LightGBM && sh ./build-python.sh install --precompile && source deactivate"

#################################################################################################################
#           System CleanUp
#################################################################################################################
# apt-get autoremove: used to remove packages that were automatically installed to satisfy dependencies for some package and that are no more needed.
# apt-get clean: removes the aptitude cache in /var/cache/apt/archives. You'd be amazed how much is in there! the only drawback is that the packages
# have to be downloaded again if you reinstall them.

RUN apt-get autoremove -y && apt-get clean && \
    rm -rf /var/lib/apt/lists/* && \
    conda clean -a -y

#################################################################################################################
#           JUPYTER
#################################################################################################################

# password: keras
# password key: --NotebookApp.password='sha1:98b767162d34:8da1bc3c75a0f29145769edc977375a373407824'

# Add a notebook profile.
RUN mkdir -p -m 700 ~/.jupyter/ && \
    echo "c.NotebookApp.ip = '*'" >> ~/.jupyter/jupyter_notebook_config.py

VOLUME /home
WORKDIR /home

# IPython
EXPOSE 8888

ENTRYPOINT [ "/tini", "--" ]
CMD /bin/bash -c "source activate py3 && jupyter notebook --allow-root --no-browser --NotebookApp.password='sha1:98b767162d34:8da1bc3c75a0f29145769edc977375a373407824' && source deactivate"




@jameslamb jameslamb added the bug label Jul 1, 2024
@Ryednap

Ryednap commented Jul 17, 2024

I am also encountering similar issues when using a large dataset with CUDA. I have verified this behavior on at least 3 different machines. Every time, I get similar logs before the Python script or notebook crashes.
[screenshot of the error log omitted]

In my case, I have a dataset with 11 million rows, close to 1 GB in size. I am unsure whether large bins are the reason, because it crashes even with default settings. Here's my setup:

fixed_params = {
    "objective": "binary",
    "metric": "auc",
    "boosting_type": "gbdt",
    "data_sample_strategy": "bagging",
    "num_iterations": 5000,
    "device_type": "cuda",
    "random_state": 6241,
    "force_row_wise": True,
    "bagging_seed": 113,
    "early_stopping_rounds": 100,
    "verbose": 2,
}
# train_pool / valid_pool are lightgbm.Dataset objects built from the data described above
gbm = lightgbm.train(
    fixed_params,
    train_pool,
    valid_sets=[valid_pool],
    valid_names=['valid'],
)

Here's the LGBM log before it crashes
[screenshot of the LightGBM log omitted]

Here is my environment info:

  1. Driver Version: 535.104.05, CUDA Version: 12.2
  2. lightgbm==4.4.0, but I have verified that the behavior is the same in v4.2.0.
  3. T4 GPU on Colab with 15 GB of GPU RAM.

@NotOneRing

(quoting Ryednap's comment above in full)

May I have your dataset? I want to try to track down and fix this issue.

@yuhorun

yuhorun commented Nov 15, 2024

I have this problem too. My training data is about 500M, and I set up the environment following the steps in the official documentation. I switched to the CUDA build because the GPU (OpenCL) version told me I was out of memory, and now the CUDA version reports this error. How should I handle it?

Versions:
lightgbm: 4.5.0.99
ubuntu: 24
python: 3.10
cuda: 12.2
gpu: 2080ti 12gb

Error:
[flaml.automl.logger: 11-15 12:16:27] {1739} INFO - Evaluation method: cv
[flaml.automl.logger: 11-15 12:16:27] {1838} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.logger: 11-15 12:16:27] {1955} INFO - List of ML learners in AutoML Run: ['gpulgbm']
[flaml.automl.logger: 11-15 12:16:27] {2258} INFO - iteration 0, current learner gpulgbm
[flaml.automl.logger: 11-15 12:16:29] {2393} INFO - Estimated sufficient time budget=19567s. Estimated necessary time budget=20s.
[flaml.automl.logger: 11-15 12:16:29] {2442} INFO - at 3.5s, estimator gpulgbm's best error=0.2830, best estimator gpulgbm's best error=0.2830
[flaml.automl.logger: 11-15 12:16:29] {2258} INFO - iteration 1, current learner gpulgbm
[flaml.automl.logger: 11-15 12:16:31] {2442} INFO - at 5.1s, estimator gpulgbm's best error=0.2830, best estimator gpulgbm's best error=0.2830
[flaml.automl.logger: 11-15 12:16:31] {2258} INFO - iteration 2, current learner gpulgbm
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/yuhr/桌面/LightGBM/src/treelearner/cuda/cuda_data_partition.cu 987

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/yuhr/桌面/LightGBM/src/io/cuda/cuda_tree.cpp 37

terminate called after throwing an instance of 'std::runtime_error'
what(): [CUDA] an illegal memory access was encountered /home/yuhr/桌面/LightGBM/src/io/cuda/cuda_tree.cpp 37

@NotOneRing

(quoting yuhorun's comment above in full)

I'm currently running the reproducible example from the issue description on a server with a GPU and CUDA, and I will try to locate and fix this problem in the near future.

@shiyu1994
Collaborator

Thanks for reporting this issue. With large max_bin values, the kernels that construct histograms in global memory may be used, and that code path has not been tested as heavily. I'm debugging this.
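
Building on that explanation, a possible mitigation to try while this is being debugged (a sketch only, not a confirmed fix) is to keep max_bin at its default of 255, which may avoid the large-bin histogram path described above; note that Ryednap reported crashes even with default settings on a very large dataset, so this may not help in every case.

# Possible mitigation sketch (unconfirmed): stay on the default bin count so the
# global-memory histogram kernels mentioned above are less likely to be used.
import lightgbm as lgbm

model = lgbm.LGBMRegressor(
    device="cuda",
    max_bin=255,  # default value; the crash in this issue was reproduced with max_bin=300
)
# ...then fit on your data as usual, e.g. model.fit(X_train, y_train)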

@shiyu1994 shiyu1994 self-assigned this Nov 21, 2024