[CUDA] illegal memory access when using CUDA and large max_bin and large dataset #6512

Open · LZhen0711 opened this issue Jul 1, 2024 · 5 comments

@LZhen0711
Description

When using the CUDA histogram implementation built from the master branch, the simple Python code below reports an illegal memory access error if a large max_bin size is used.

Reproducible example

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import lightgbm as lgbm
X,y = make_regression(n_samples=4000000, n_features=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = lgbm.LGBMRegressor(device="cuda", max_bin=300)
model.fit(X_train, y_train)

It reports the following error:

[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Info] Total Bins 15000
[LightGBM] [Info] Number of data points in the train set: 3000000, number of used features: 50
[LightGBM] [Info] Start training from score 0.023500
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /usr/local/src/lightgbm/LightGBM/src/treelearner/cuda/cuda_data_partition.cu 987

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /usr/local/src/lightgbm/LightGBM/src/io/cuda/cuda_tree.cpp 37

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] an illegal memory access was encountered /usr/local/src/lightgbm/LightGBM/src/io/cuda/cuda_tree.cpp 37

Aborted
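
Not part of the original report, but a debugging aid that may help narrow this down: CUDA errors from asynchronous kernel launches are often only reported at a later synchronization point, so the file/line in the log above may not be where the illegal access actually happens. Forcing synchronous launches with the standard CUDA_LAUNCH_BLOCKING environment variable (set before any CUDA context is created) can surface the failure closer to the offending kernel. A sketch of the same reproducer with that variable set:

# Debugging sketch (assumption: setting the variable at the top of the script is early
# enough, since LightGBM only initializes CUDA once training starts).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # force synchronous CUDA kernel launches

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import lightgbm as lgbm

X, y = make_regression(n_samples=4_000_000, n_features=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = lgbm.LGBMRegressor(device="cuda", max_bin=300)
model.fit(X_train, y_train)  # expected to hit the illegal memory access reported above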

Environment info

GPU: NVIDIA GeForce RTX 3060
Python: 3.12.4
LightGBM version or commit hash: master branch

Command(s) you used to install LightGBM (the image was built with the following Dockerfile):

# FROM nvidia/cuda:8.0-cudnn5-devel
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04


#################################################################################################################
#           Global
#################################################################################################################
# apt-get to skip any interactive post-install configuration steps with DEBIAN_FRONTEND=noninteractive and apt-get install -y

ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
ARG DEBIAN_FRONTEND=noninteractive

#################################################################################################################
#           Global Path Setting
#################################################################################################################

ENV CUDA_HOME /usr/local/cuda
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:${CUDA_HOME}/lib64
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/usr/local/lib

ENV OPENCL_LIBRARIES /usr/local/cuda/lib64
ENV OPENCL_INCLUDE_DIR /usr/local/cuda/include

#################################################################################################################
#           TINI
#################################################################################################################

# Install tini
ENV TINI_VERSION v0.14.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini

#################################################################################################################
#           SYSTEM
#################################################################################################################
# update: downloads the package lists from the repositories and "updates" them to get information on the newest versions of packages and their
# dependencies. It will do this for all repositories and PPAs.

RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
curl \
bzip2 \
ca-certificates \
libglib2.0-0 \
libxext6 \
libsm6 \
libxrender1 \
git \
vim \
mercurial \
subversion \
cmake \
libboost-dev \
libboost-system-dev \
libboost-filesystem-dev \
gcc \
g++

# Add OpenCL ICD files for LightGBM
RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd

#################################################################################################################
#           CONDA
#################################################################################################################

ARG CONDA_DIR=/opt/miniforge
# add to path
ENV PATH $CONDA_DIR/bin:$PATH

# Install miniforge
RUN echo "export PATH=$CONDA_DIR/bin:"'$PATH' > /etc/profile.d/conda.sh && \
    curl -sL https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -o ~/miniforge.sh && \
    /bin/bash ~/miniforge.sh -b -p $CONDA_DIR && \
    rm ~/miniforge.sh

RUN conda config --set always_yes yes --set changeps1 no && \
    conda create -y -q -n py3 numpy scipy scikit-learn jupyter notebook ipython pandas matplotlib

#################################################################################################################
#           LightGBM
#################################################################################################################

RUN cd /usr/local/src && mkdir lightgbm && cd lightgbm && \
    git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && \
    mkdir build && cd build && cmake -DUSE_CUDA=1 .. && make -j4 && cd ..

ENV PATH /usr/local/src/lightgbm/LightGBM:${PATH}

RUN /bin/bash -c "source activate py3 && cd /usr/local/src/lightgbm/LightGBM && sh ./build-python.sh install --precompile && source deactivate"

#################################################################################################################
#           System CleanUp
#################################################################################################################
# apt-get autoremove: used to remove packages that were automatically installed to satisfy dependencies for some package and that are no more needed.
# apt-get clean: removes the aptitude cache in /var/cache/apt/archives. You'd be amazed how much is in there! the only drawback is that the packages
# have to be downloaded again if you reinstall them.

RUN apt-get autoremove -y && apt-get clean && \
    rm -rf /var/lib/apt/lists/* && \
    conda clean -a -y

#################################################################################################################
#           JUPYTER
#################################################################################################################

# password: keras
# password key: --NotebookApp.password='sha1:98b767162d34:8da1bc3c75a0f29145769edc977375a373407824'

# Add a notebook profile.
RUN mkdir -p -m 700 ~/.jupyter/ && \
    echo "c.NotebookApp.ip = '*'" >> ~/.jupyter/jupyter_notebook_config.py

VOLUME /home
WORKDIR /home

# IPython
EXPOSE 8888

ENTRYPOINT [ "/tini", "--" ]
CMD /bin/bash -c "source activate py3 && jupyter notebook --allow-root --no-browser --NotebookApp.password='sha1:98b767162d34:8da1bc3c75a0f29145769edc977375a373407824' && source deactivate"




@jameslamb jameslamb added the bug label Jul 1, 2024
@Ryednap

Ryednap commented Jul 17, 2024

I am also encountering similar issues when using a large dataset with CUDA. I have verified this behavior on at least 3 different machines. Every time, I get similar logs before the Python script or notebook crashes.
[screenshot of the error log omitted]

In my case, I have a dataset with 11 million rows, close to 1 GB in size. I am unsure whether large bins are the reason, because it crashes even with default settings. Here's my setup:

fixed_params = {
    "objective": "binary",
    "metric": "auc",
    "boosting_type": "gbdt",
    "data_sample_strategy": "bagging",
    "num_iterations": 5000,
    "device_type": "cuda",
    "random_state": 6241,
    "force_row_wise": True,
    "bagging_seed": 113,
    "early_stopping_rounds": 100,
    "verbose": 2,
}
# train_pool / valid_pool are lightgbm.Dataset objects built from the data described above
gbm = lightgbm.train(
    fixed_params,
    train_pool,
    valid_sets=[valid_pool],
    valid_names=['valid'],
)

Here's the LGBM log before it crashes
[screenshot of the LightGBM log omitted]

Here is my environment info:

  1. Driver Version: 535.104.05, CUDA Version: 12.2
  2. lightgbm==4.4.0, but I have verified that the behavior is the same in v4.2.0.
  3. T4 GPU on Colab with 15 GB of GPU RAM.

@NotOneRing

(quoting Ryednap's comment above in full)

May I have your dataset? I want to try to track down and fix this issue.

@yuhorun

yuhorun commented Nov 15, 2024

I have this problem too. My training data is about 500M, and I set up the environment following the steps in the official documentation. I switched to the CUDA build because the GPU (OpenCL) version told me I was out of memory, and now the CUDA version reports this error. How should I handle it?

Versions:
lightgbm: 4.5.0.99
ubuntu: 24
python: 3.10
cuda: 12.2
gpu: 2080ti 12gb

Error:
[flaml.automl.logger: 11-15 12:16:27] {1739} INFO - Evaluation method: cv
[flaml.automl.logger: 11-15 12:16:27] {1838} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.logger: 11-15 12:16:27] {1955} INFO - List of ML learners in AutoML Run: ['gpulgbm']
[flaml.automl.logger: 11-15 12:16:27] {2258} INFO - iteration 0, current learner gpulgbm
[flaml.automl.logger: 11-15 12:16:29] {2393} INFO - Estimated sufficient time budget=19567s. Estimated necessary time budget=20s.
[flaml.automl.logger: 11-15 12:16:29] {2442} INFO - at 3.5s, estimator gpulgbm's best error=0.2830, best estimator gpulgbm's best error=0.2830
[flaml.automl.logger: 11-15 12:16:29] {2258} INFO - iteration 1, current learner gpulgbm
[flaml.automl.logger: 11-15 12:16:31] {2442} INFO - at 5.1s, estimator gpulgbm's best error=0.2830, best estimator gpulgbm's best error=0.2830
[flaml.automl.logger: 11-15 12:16:31] {2258} INFO - iteration 2, current learner gpulgbm
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/yuhr/桌面/LightGBM/src/treelearner/cuda/cuda_data_partition.cu 987

[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered /home/yuhr/桌面/LightGBM/src/io/cuda/cuda_tree.cpp 37

terminate called after throwing an instance of 'std::runtime_error'
what(): [CUDA] an illegal memory access was encountered /home/yuhr/桌面/LightGBM/src/io/cuda/cuda_tree.cpp 37

@NotOneRing

(quoting yuhorun's comment above in full)

I'm currently running the reproducible example from the issue description on a server with a GPU and CUDA, and I will try to locate and fix this problem in the near future.

@shiyu1994
Collaborator

Thanks for reporting this issue. With large max_bin values, the kernels that construct histograms in global memory may be used, and that code path has not been tested as heavily. I'm debugging this.
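
Building on that explanation, a possible mitigation to try while this is being debugged (a sketch only, not a confirmed fix) is to keep max_bin at its default of 255, which may avoid the large-bin histogram path described above; note that Ryednap reported crashes even with default settings on a very large dataset, so this may not help in every case.

# Possible mitigation sketch (unconfirmed): stay on the default bin count so the
# global-memory histogram kernels mentioned above are less likely to be used.
import lightgbm as lgbm

model = lgbm.LGBMRegressor(
    device="cuda",
    max_bin=255,  # default value; the crash in this issue was reproduced with max_bin=300
)
# ...then fit on your data as usual, e.g. model.fit(X_train, y_train)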

@shiyu1994 shiyu1994 self-assigned this Nov 21, 2024