
cuBLAS API failed with status 15 - Error #174

Open
rmivdc opened this issue Mar 26, 2023 · 27 comments

@rmivdc

rmivdc commented Mar 26, 2023

Hi,
When launching finetune.py I'm encountering the error in the title.
I'm using Fedora 36 with CUDA 12 and Python 3.10.10. Initialization seems to begin like so:

CUDA SETUP: CUDA runtime path found: /usr/local/cuda-12.0/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 120

and then later, after loading some files:

Loading cached split indices for dataset at /home/rmivdc/.cache/huggingface/datasets/json/default-fac87d4e05e14783/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-e521db28b6879419.arrow and /home/rmivdc/.cache/huggingface/datasets/json/default-fac87d4e05e14783/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-eb712e2459ca28b6.arrow
/home/rmivdc/.local/lib/python3.10/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
0%| | 0/1170 [00:00<?, ?it/s]cuBLAS API failed with status 15
A: torch.Size([2048, 4096]), B: torch.Size([4096, 4096]), C: (2048, 4096); (lda, ldb, ldc): (c_int(65536), c_int(131072), c_int(65536)); (m, n, k): (c_int(2048), c_int(4096), c_int(4096))

Am I using the wrong library versions?
Thanks for your help.
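For anyone trying to answer the "which lib versions" question, here is a small stdlib-only sketch (the helper name `report_versions` is hypothetical, not from this repo) that looks up the installed versions of the packages involved without importing them:

```python
# Hypothetical helper: report installed versions of the packages
# discussed in this thread, without importing any of them.
from importlib import metadata

def report_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

print(report_versions(["torch", "bitsandbytes", "transformers", "accelerate"]))
```

Pasting that dict into a comment is less error-prone than transcribing `pip list` by hand.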

@loganlebanoff

I ran into this issue as well with torch==2.0. Uninstalling it and reinstalling torch==1.13.1 seemed to fix the issue.

@rmivdc
Author

rmivdc commented Mar 27, 2023

Thanks! That version fixed it.
EDIT: at least when running on CPU; running on GPU still throws the error.

@rmivdc rmivdc closed this as completed Mar 27, 2023
@rmivdc rmivdc reopened this Mar 27, 2023
@loganlebanoff

The error went away for me on GPU

@rmivdc
Author

rmivdc commented Mar 27, 2023

The error went away for me on GPU

May I ask which CUDA version / NVIDIA driver version you're using, and the versions of your:

accelerate
appdirs
bitsandbytes
black
black[jupyter]
datasets
fire
gradio

pip packages? (if not the latest)

Thanks!

@loganlebanoff

CUDA 11.7. Also, I used conda to install PyTorch with CUDA (conda install pytorch=1.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia).

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
$ nvidia-smi
Mon Mar 27 20:19:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB           On | 00000000:05:00.0 Off |                    0 |
| N/A   29C    P0               63W / 400W|   7429MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB           On | 00000000:06:00.0 Off |                    0 |
| N/A   26C    P0               63W / 400W|   7717MiB / 81920MiB |      0%      Default |
$ pip list
Package             Version     Editable project location
------------------- ----------- ------------------------------
accelerate          0.18.0
aiofiles            23.1.0
aiohttp             3.8.4
aiosignal           1.3.1
altair              4.2.2
anyio               3.6.2
appdirs             1.4.4
async-timeout       4.0.2
attrs               22.2.0
bitsandbytes        0.37.2
certifi             2022.12.7
charset-normalizer  3.1.0
click               8.1.3
contourpy           1.0.7
cycler              0.11.0
datasets            2.10.1
deepspeed           0.8.3
defusedxml          0.7.1
dill                0.3.6
entrypoints         0.4
fastapi             0.95.0
ffmpy               0.3.0
filelock            3.10.6
fire                0.5.0
flit_core           3.8.0
fonttools           4.39.2
frozenlist          1.3.3
fsspec              2023.3.0
Glances             3.3.1.1
gradio              3.23.0
h11                 0.14.0
hjson               3.1.0
httpcore            0.16.3
httptools           0.5.0
httpx               0.23.3
huggingface-hub     0.13.3
idna                3.4
importlib-resources 5.12.0
Jinja2              3.1.2
jmespath            1.0.1
jsonschema          4.17.3
kiwisolver          1.4.4
linkify-it-py       2.0.0
loralib             0.1.1
markdown-it-py      2.2.0
MarkupSafe          2.1.2
matplotlib          3.7.1
mdit-py-plugins     0.3.3
mdurl               0.1.2
multidict           6.0.4
multiprocess        0.70.14
ninja               1.11.1
numpy               1.24.2
openai              0.27.2
orjson              3.8.8
packaging           23.0
pandas              1.5.3
peft                0.3.0.dev0  /home/fsuser/peft
Pillow              9.4.0
pip                 23.0.1
psutil              5.9.4
py-cpuinfo          9.0.0
pyarrow             11.0.0
pydantic            1.10.7
pydub               0.25.1
pyparsing           3.0.9
pyrsistent          0.19.3
python-dateutil     2.8.2
python-dotenv       1.0.0
python-multipart    0.0.6
pytz                2023.2
PyYAML              6.0
regex               2023.3.23
requests            2.28.2
responses           0.18.0
rfc3986             1.5.0
semantic-version    2.10.0
sentencepiece       0.1.97
setuptools          65.6.3
six                 1.16.0
sniffio             1.3.0
starlette           0.26.1
termcolor           2.2.0
tokenizers          0.13.2
toolz               0.12.0
torch               1.13.1
tqdm                4.65.0
transformers        4.28.0.dev0 /home/fsuser/transformers_main
typing_extensions   4.4.0
uc-micro-py         1.0.1
ujson               5.7.0
urllib3             1.26.15
uvicorn             0.21.1
uvloop              0.17.0
watchfiles          0.18.1
websockets          10.4
wheel               0.38.4
xxhash              3.2.0
yarl                1.8.2
zipp                3.15.0

@leehanchung

leehanchung commented Apr 1, 2023

CUDA 12 is not compatible with PyTorch 2.0.

https://github.com/pytorch/pytorch/blob/master/RELEASE.md#release-compatibility-matrix

Following is the Release Compatibility Matrix for PyTorch releases:

PyTorch version   Python          Stable CUDA                 Experimental CUDA
2.0               >=3.8, <=3.11   CUDA 11.7, CUDNN 8.5.0.96   CUDA 11.8, CUDNN 8.7.0.84
1.13              >=3.7, <=3.10   CUDA 11.6, CUDNN 8.3.2.44   CUDA 11.7, CUDNN 8.5.0.96
1.12              >=3.7, <=3.10   CUDA 11.3, CUDNN 8.3.2.44   CUDA 11.6, CUDNN 8.3.2.44

Also, Python 3.11 is not compatible either; the max version is 3.10.
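The matrix above can be turned into a quick sanity check before launching a run; a minimal sketch (table values transcribed from the PyTorch RELEASE.md linked above, function name hypothetical):

```python
# Compatibility matrix from the PyTorch RELEASE.md, as a lookup table.
SUPPORTED_CUDA = {
    "2.0":  {"stable": "11.7", "experimental": "11.8"},
    "1.13": {"stable": "11.6", "experimental": "11.7"},
    "1.12": {"stable": "11.3", "experimental": "11.6"},
}

def cuda_supported(torch_version, cuda_version):
    """True if this torch release lists the CUDA version (stable or experimental)."""
    return cuda_version in SUPPORTED_CUDA.get(torch_version, {}).values()

print(cuda_supported("2.0", "11.8"))  # → True (experimental)
print(cuda_supported("2.0", "12.0"))  # → False: CUDA 12 is not supported
```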

@mudomau

mudomau commented Apr 2, 2023

Getting the same issue here trying to run inference on the google t5-xl model.

Error:

cuBLAS API failed with status 15
A: torch.Size([1, 2048]), B: torch.Size([2048, 2048]), C: (1, 2048); (lda, ldb, ldc): (c_int(32), c_int(65536), c_int(32)); (m, n, k): (c_int(1), c_int(2048), c_int(2048))
...
 File "/home/mau/.conda/envs/test/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/mau/.conda/envs/test/lib/python3.9/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

I've tried all the fixes proposed here but no luck.

Environment packages:

_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
accelerate 0.18.0 pypi_0 pypi
bitsandbytes 0.37.2 pypi_0 pypi
blas 1.0 mkl
brotlipy 0.7.0 py39h27cfd23_1003
bzip2 1.0.8 h7b6447c_0
ca-certificates 2023.01.10 h06a4308_0 anaconda
certifi 2022.12.7 py39h06a4308_0 anaconda
cffi 1.15.1 py39h5eee18b_3
charset-normalizer 2.0.4 pyhd3eb1b0_0
cryptography 39.0.1 py39h9ce1e76_0
cuda-cudart 11.7.99 0 nvidia
cuda-cupti 11.7.101 0 nvidia
cuda-libraries 11.7.1 0 nvidia
cuda-nvrtc 11.7.99 0 nvidia
cuda-nvtx 11.7.91 0 nvidia
cuda-runtime 11.7.1 0 nvidia
cudatoolkit 11.3.1 h2bc3f7f_2 anaconda
ffmpeg 4.3 hf484d3e_0 pytorch
filelock 3.10.7 pypi_0 pypi
flit-core 3.8.0 py39h06a4308_0
freetype 2.12.1 h4a9f257_0
giflib 5.2.1 h5eee18b_3
gmp 6.2.1 h295c915_3
gnutls 3.6.15 he1e5248_0
huggingface-hub 0.13.3 pypi_0 pypi
idna 3.4 py39h06a4308_0
intel-openmp 2021.4.0 h06a4308_3561
jpeg 9e h5eee18b_1
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libcublas 11.10.3.66 0 nvidia
libcufft 10.7.2.124 h4fbf590_0 nvidia
libcufile 1.6.0.25 0 nvidia
libcurand 10.3.2.56 0 nvidia
libcusolver 11.4.0.1 0 nvidia
libcusparse 11.7.4.91 0 nvidia
libdeflate 1.17 h5eee18b_0
libffi 3.4.2 h6a678d5_6
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libiconv 1.16 h7f8727e_2
libidn2 2.3.2 h7f8727e_0
libnpp 11.7.4.75 0 nvidia
libnvjpeg 11.8.0.2 0 nvidia
libpng 1.6.39 h5eee18b_0
libstdcxx-ng 11.2.0 h1234567_1
libtasn1 4.16.0 h27cfd23_0
libtiff 4.5.0 h6a678d5_2
libunistring 0.9.10 h27cfd23_0
libwebp 1.2.4 h11a3e52_1
libwebp-base 1.2.4 h5eee18b_1
lz4-c 1.9.4 h6a678d5_0
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py39h7f8727e_0
mkl_fft 1.3.1 py39hd3c417c_0
mkl_random 1.2.2 py39h51133e4_0
ncurses 6.4 h6a678d5_0
nettle 3.7.3 hbbd107a_1
numpy 1.23.5 py39h14f4228_0
numpy-base 1.23.5 py39h31eccc5_0
openh264 2.1.1 h4ff587b_0
openssl 1.1.1t h7f8727e_0
packaging 23.0 pypi_0 pypi
pillow 9.4.0 py39h6a678d5_0
pip 23.0.1 py39h06a4308_0
psutil 5.9.4 pypi_0 pypi
pycparser 2.21 pyhd3eb1b0_0
pyopenssl 23.0.0 py39h06a4308_0
pysocks 1.7.1 py39h06a4308_0
python 3.9.16 h7a1cb2a_2
pytorch 1.13.1 py3.9_cuda11.7_cudnn8.5.0_0 pytorch
pytorch-cuda 11.7 h778d358_3 pytorch
pytorch-mutex 1.0 cuda pytorch
pyyaml 6.0 pypi_0 pypi
readline 8.2 h5eee18b_0
regex 2023.3.23 pypi_0 pypi
requests 2.28.1 py39h06a4308_1
sentencepiece 0.1.97 pypi_0 pypi
setuptools 65.6.3 py39h06a4308_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.41.1 h5eee18b_0
tk 8.6.12 h1ccaba5_0
tokenizers 0.13.2 pypi_0 pypi
torchaudio 0.13.1 py39_cu117 pytorch
torchvision 0.14.1 py39_cu117 pytorch
tqdm 4.65.0 pypi_0 pypi
transformers 4.28.0.dev0 pypi_0 pypi
typing_extensions 4.4.0 py39h06a4308_0
tzdata 2022g h04d1e81_0
urllib3 1.26.15 py39h06a4308_0
wheel 0.38.4 py39h06a4308_0
xz 5.2.10 h5eee18b_1
zlib 1.2.13 h5eee18b_0
zstd 1.5.4 hc292b87_0

@rmivdc
Author

rmivdc commented Apr 2, 2023

@mudomau
Do you have the same issue with "decapoda-research/llama-7b-hf" ?

I'm encountering a different error now, but the Dockerfile uploaded 3 days ago fixed that cuBLAS error for me.

@samuelcardoso

Same problem here.

trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
A: torch.Size([5120, 4096]), B: torch.Size([4096, 4096]), C: (5120, 4096); (lda, ldb, ldc): (c_int(163840), c_int(131072), c_int(163840)); (m, n, k): (c_int(5120), c_int(4096), c_int(4096))
cuBLAS API failed with status 15
error detected
$ nvidia-smi
Tue Apr 11 21:25:11 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:08:00.0  On |                  N/A |
|  0%   53C    P8    18W / 220W |   1020MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1292      G   /usr/lib/xorg/Xorg                460MiB |
|    0   N/A  N/A      1577      G   /usr/bin/gnome-shell              172MiB |
|    0   N/A  N/A      3884      G   ...RendererForSitePerProcess       86MiB |
|    0   N/A  N/A      5441      G   ...983706979455292193,131072      249MiB |
+-----------------------------------------------------------------------------+

$ /usr/local/cuda-11.6/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0
$ pip list
Package                  Version
------------------------ -------------
accelerate               0.18.0
aiofiles                 23.1.0
aiohttp                  3.8.4
aiosignal                1.3.1
altair                   4.2.2
anyio                    3.6.2
appdirs                  1.4.4
apturl                   0.5.2
asttokens                2.2.1
async-timeout            4.0.2
attrs                    22.2.0
backcall                 0.2.0
bitsandbytes             0.37.2
black                    23.3.0
blinker                  1.4
Brlapi                   0.8.3
certifi                  2020.6.20
chardet                  4.0.0
charset-normalizer       3.1.0
click                    8.0.3
cmake                    3.26.3
colorama                 0.4.4
command-not-found        0.3
contourpy                1.0.7
cryptography             3.4.8
cupshelpers              1.0
cycler                   0.11.0
datasets                 2.11.0
dbus-python              1.2.18
decorator                5.1.1
defer                    1.0.6
dill                     0.3.6
distro                   1.7.0
distro-info              1.1build1
entrypoints              0.4
executing                1.2.0
fastapi                  0.95.0
ffmpy                    0.3.0
filelock                 3.11.0
fire                     0.5.0
fonttools                4.39.3
frozenlist               1.3.3
fsspec                   2023.4.0
GPUtil                   1.4.0
gradio                   3.25.0
gradio_client            0.0.10
h11                      0.14.0
httpcore                 0.17.0
httplib2                 0.20.2
httpx                    0.24.0
huggingface-hub          0.13.4
idna                     3.3
importlib-metadata       4.6.4
ipython                  8.12.0
jedi                     0.18.2
jeepney                  0.7.1
Jinja2                   3.1.2
jsonschema               4.17.3
keyring                  23.5.0
kiwisolver               1.4.4
language-selector        0.1
launchpadlib             1.10.16
lazr.restfulclient       0.14.4
lazr.uri                 1.0.6
linkify-it-py            2.0.0
lit                      16.0.1
llvmlite                 0.39.1
loralib                  0.1.1
louis                    3.20.0
macaroonbakery           1.3.1
markdown-it-py           2.2.0
MarkupSafe               2.1.2
matplotlib               3.7.1
matplotlib-inline        0.1.6
mdit-py-plugins          0.3.3
mdurl                    0.1.2
more-itertools           8.10.0
mpmath                   1.3.0
multidict                6.0.4
multiprocess             0.70.14
mypy-extensions          1.0.0
netifaces                0.11.0
networkx                 3.1
numba                    0.56.4
numpy                    1.23.5
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
oauthlib                 3.2.0
olefile                  0.46
orjson                   3.8.10
packaging                23.0
pandas                   2.0.0
parso                    0.8.3
pathspec                 0.11.1
peft                     0.3.0.dev0
pexpect                  4.8.0
pickleshare              0.7.5
Pillow                   9.0.1
pip                      22.0.2
platformdirs             3.2.0
prompt-toolkit           3.0.38
protobuf                 3.12.4
psutil                   5.9.4
ptyprocess               0.7.0
pure-eval                0.2.2
pyarrow                  11.0.0
pycairo                  1.20.1
pycups                   2.0.1
pydantic                 1.10.7
pydub                    0.25.1
Pygments                 2.15.0
PyGObject                3.42.1
PyJWT                    2.3.0
pymacaroons              0.13.0
PyNaCl                   1.5.0
pynvml                   11.5.0
pyparsing                2.4.7
pyRFC3339                1.1
pyrsistent               0.19.3
python-apt               2.4.0+ubuntu1
python-dateutil          2.8.2
python-debian            0.1.43ubuntu1
python-multipart         0.0.6
pytz                     2022.1
pyxdg                    0.27
PyYAML                   5.4.1
regex                    2023.3.23
reportlab                3.6.8
requests                 2.25.1
responses                0.18.0
rich                     13.3.3
screen-resolution-extra  0.0.0
SecretStorage            3.3.1
semantic-version         2.10.0
sentencepiece            0.1.97
setuptools               59.6.0
six                      1.16.0
sniffio                  1.3.0
stack-data               0.6.2
starlette                0.26.1
sympy                    1.11.1
systemd-python           234
termcolor                2.2.0
tokenize-rt              5.0.0
tokenizers               0.13.3
tomli                    2.0.1
toolz                    0.12.0
torch                    1.13.1+cu116
torchaudio               0.13.1+cu116
torchvision              0.14.1+cu116
tqdm                     4.65.0
traitlets                5.9.0
transformers             4.28.0.dev0
triton                   2.0.0
typing_extensions        4.5.0
tzdata                   2023.3
ubuntu-advantage-tools   8001
ubuntu-drivers-common    0.0.0
uc-micro-py              1.0.1
ufw                      0.36.1
unattended-upgrades      0.1
urllib3                  1.26.5
uvicorn                  0.21.1
wadllib                  1.3.6
wcwidth                  0.2.6
websockets               11.0.1
wheel                    0.37.1
xdg                      5
xkit                     0.0.0
xxhash                   3.2.0
yarl                     1.8.2
zipp                     1.0.0

@arvindsun

arvindsun commented Apr 15, 2023

I am running into the same issue as well on a H100:

torch 1.13.1, bitsandbytes==0.38.1, cuda 11.8, python 3.10, cublas 11.11.3.6


    result = super().forward(x)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 320, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 500, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1436, in igemmlt
    raise Exception('cublasLt ran into an error!')
> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

@SVEEu

SVEEu commented Apr 27, 2023

I hit the same issue when finetuning the 30B and 65B models, even on different clouds.

For the 65B model it occurs randomly, roughly 70% of the time; for the 30B model it occurs every time.

@Malfaro43

Malfaro43 commented May 12, 2023

I am running into the same issue as well on a H100:

torch 1.13.1, bitsandbytes==0.38.1, cuda 11.8, python 3.10, cublas 11.11.3.6

    result = super().forward(x)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 320, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 500, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/arvind/.local/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1436, in igemmlt
    raise Exception('cublasLt ran into an error!')

> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

@arvindsun Have you fixed this? I'm also running into this issue when using an H100 on Lambda Labs.

@daniel-furman

Getting the same error on an H100 on Lambda Labs

@jonataslaw

Getting the same error on an H100 on Lambda Labs too

@leehanchung

Getting the same error on an H100 on Lambda Labs too

Try to run it w/o 8-bit mode since you are on H100

@jonataslaw

Getting the same error on an H100 on Lambda Labs too

Try to run it w/o 8-bit mode since you are on H100

I tried it.

The Lambda H100 instances have CUDA 11.8 and PyTorch 2.0.1 compiled against 11.7, which is not compatible. The bitsandbytes version also has a problem, and you need to rename the CUDA version you are using.

I also tried installing CUDA 12 to use the latest torch, but strangely the installation aborts every time, so I gave up testing on the H100 after spending 3 hours trying to configure it. I'll try another RunPod instance; locally I could train for 3 epochs, but I need more compute to train for 10, and my RTX 4090 would take weeks.

@zubair-ahmed-ai

Facing the same error on a Lambda Labs H100 instance trying to load Falcon-40B in 8-bit. What's the solution?

@jonataslaw

jonataslaw commented Jun 5, 2023

Export these variables:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Install a compatible CUDA (11.7 does not support the H100):

sudo apt install cuda-nvcc-11-8 libcusparse-11-8 libcusparse-dev-11-8 libcublas-dev-11-8 libcublas-11-8 libcusolver-dev-11-8 libcusolver-11-8

Remove the old CUDA:

apt remove cuda-nvcc-11-7

Install a compatible PyTorch:

pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.0 --extra-index-url https://download.pytorch.org/whl/cu118
pip install pytorch-lightning==1.9.0

If you will use DeepSpeed for CPU offload (it makes training faster), you need:

pip install deepspeed==0.7.0

Edit these files (with vim, nano, or over SFTP), replacing the import of inf from torch._six with an import from math:

/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/utils.py
/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py
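The change itself is one line per file; a sketch of the before/after (assuming the files still carry the old torch._six import, which was removed in newer torch):

```python
# Before (breaks on newer torch, where the private torch._six module was removed):
#   from torch._six import inf
# After:
from math import inf

# math.inf is a drop-in replacement for the old torch._six.inf:
assert inf == float("inf")
```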

@Thytu

Thytu commented Jun 8, 2023

Facing the same error on a Lambda Labs H100 instance trying to load Falcon-40B in 8-bit. What's the solution?

Ended up moving back to an A100 😅

@daniel-furman

Has anyone else tried and confirmed the efficacy of @jonataslaw's solution two comments above? Will test myself over the weekend.

@daniel-furman

daniel-furman commented Jun 14, 2023

I was able to solve this error with the conda install approach found here: bitsandbytes-foundation/bitsandbytes#85

# jupyter setup
wget http://repo.continuum.io/archive/Anaconda3-2023.03-1-Linux-x86_64.sh
bash Anaconda3-2023.03-1-Linux-x86_64.sh
source ~/.bashrc

conda create --name cap
conda activate cap
conda install pip
conda install cudatoolkit
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

git clone https://github.com/timdettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=118 make cuda11x
python setup.py install

pip install scipy
python -m bitsandbytes
# should be a successful build

@huawei-lin

huawei-lin commented Jul 24, 2023

I hit this issue on an H100 GPU and fixed it by changing load_in_8bit=True to load_in_8bit=False on line 114 of finetune.py.

@zubair-ahmed-ai

zubair-ahmed-ai commented Aug 3, 2023

@daniel-furman

I was able to solve this error with the conda install approach found here: TimDettmers/bitsandbytes#85

# jupyter setup
wget http://repo.continuum.io/archive/Anaconda3-2023.03-1-Linux-x86_64.sh
bash Anaconda3-2023.03-1-Linux-x86_64.sh
source ~/.bashrc

conda create --name cap
conda activate cap
conda install pip
conda install cudatoolkit
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

git clone https://github.com/timdettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=118 make cuda11x
python setup.py install

pip install scipy
python -m bitsandbytes
# should be a successful build

Sadly, it gave me the error below:

Downloading (…)fetensors.index.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36.2k/36.2k [00:00<00:00, 10.6MB/s]
Downloading (…)of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.96G/9.96G [03:00<00:00, 55.3MB/s]
Downloading (…)of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.86G/9.86G [02:57<00:00, 55.4MB/s]
Downloading (…)of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.86G/9.86G [02:57<00:00, 55.4MB/s]
Downloading (…)of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36G/1.36G [00:24<00:00, 55.2MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [09:22<00:00, 140.63s/it]

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/ubuntu/miniconda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/home/ubuntu/miniconda/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/home/ubuntu/miniconda/envs/starchat/lib/libcudart.so'), PosixPath('/home/ubuntu/miniconda/envs/starchat/lib/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /home/ubuntu/miniconda/envs/starchat/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 9.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/ubuntu/miniconda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards:   0%|                                                                                                                                                 | 0/4 [00:00<?, ?it/s]
Error named symbol not found at line 528 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu

@Jacobsolawetz

Got this issue on an H100 on RunPod.

@HaishuoFang

Same, got this on an H100 with 8-bit. The H100 works with 16-bit.

@jieWANGforwork

Got this error on an H100 using 8-bit Llama. Has anyone made it work on an H100?

@huawei-lin

Got this error on an H100 using 8-bit Llama. Has anyone made it work on an H100?

You can avoid 8-bit; 4-bit and 16-bit are fine.
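A sketch of that workaround (helper name hypothetical), keyed off the compute capability 9.0 that the CUDA SETUP logs in this thread report for the H100:

```python
# Hypothetical helper: pick a bitsandbytes load precision per GPU.
# On Hopper (compute capability 9.x, e.g. H100) the int8 matmul path
# fails with cuBLAS status 15, so fall back to 4-bit (or 16-bit).
def pick_quantization(compute_capability):
    """compute_capability is a (major, minor) tuple, e.g. (9, 0) for an H100."""
    major, _minor = compute_capability
    if major >= 9:
        return "4bit"  # 16-bit also reportedly works
    return "8bit"

print(pick_quantization((9, 0)))  # → 4bit
print(pick_quantization((8, 6)))  # → 8bit
```

In practice the tuple would come from something like torch.cuda.get_device_capability(), and the result would decide which load flags to pass when loading the model.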
