cuBLAS API failed with status 15 - Error #174
Comments
I ran into this issue as well with torch==2.0. When I uninstalled it and reinstalled torch==1.13.1, that seemed to fix the issue.
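A sketch of that downgrade, for reference. The `+cu117` build tag and wheel index URL are assumptions; pick the tag that matches your installed CUDA toolkit.

```shell
# Remove the torch 2.0 build, then pin 1.13.1 built against CUDA 11.7.
pip uninstall -y torch
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
```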
Thanks! This version fixed it.
The error went away for me on GPU.
May I know what CUDA version / NVIDIA driver version you are using, and the versions of your accelerate pip packages (if not the latest)? Thanks!
CUDA 11.7. Also, I used conda to install PyTorch with CUDA.
CUDA 12 is not compatible with PyTorch 2.0; see the Release Compatibility Matrix for PyTorch releases: https://github.com/pytorch/pytorch/blob/master/RELEASE.md#release-compatibility-matrix
Also, Python 3.11 is not compatible either; the max version is 3.10.
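The constraints above can be captured in a small check. The table values here are taken from the comments in this thread (torch 2.0 with CUDA 11.7/11.8, torch 1.13 with CUDA 11.6/11.7, Python at most 3.10), not from an authoritative matrix:

```python
# Compatibility table as reported in this thread (an assumption, not the
# official matrix): torch major.minor -> (supported CUDA versions, max Python minor).
COMPAT = {
    "2.0": ({"11.7", "11.8"}, 10),
    "1.13": ({"11.6", "11.7"}, 10),
}

def is_compatible(torch_ver: str, cuda_ver: str, py_minor: int) -> bool:
    """Return True if the torch build matches the CUDA and Python versions."""
    major_minor = ".".join(torch_ver.split(".")[:2])
    cudas, max_py = COMPAT.get(major_minor, (set(), 0))
    return cuda_ver in cudas and py_minor <= max_py

print(is_compatible("2.0.1", "12.0", 10))   # CUDA 12: not supported per this thread
print(is_compatible("1.13.1", "11.7", 10))  # the combination that worked above
```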
Getting the same issue here trying to run inference on the Google t5-xl model. Error:
I've tried all the fixes proposed here, but no luck. Environment packages:
@mudomau I'm encountering another error now, but the last Dockerfile install uploaded 3 days ago fixed that cuBLAS error for me.
Same problem here.
I am running into the same issue as well on an H100: torch 1.13.1, bitsandbytes==0.38.1, CUDA 11.8, Python 3.10, cuBLAS 11.11.3.6.
The same issue occurs for me when finetuning the 30B and 65B models, even on different clouds. For the 65B model it occurs randomly, with a probability of about 70%; for the 30B model it occurs every time.
@arvindsun Have you fixed this? I'm also running into this issue when using an H100 on Lambda Labs.
Getting the same error on an H100 on Lambda Labs.
Getting the same error on an H100 on Lambda Labs too.
Try running it without 8-bit mode, since you are on an H100.
I tried it. Lambda's H100 instances have CUDA 11.8, but PyTorch 2.0.1 is compiled for 11.7, which is not compatible. The bitsandbytes version also has a problem, and you need to rename the CUDA version you are using. I also tried installing CUDA 12 to use the latest version of torch, but strangely the installation aborts every time, so I gave up on testing it on the H100 after spending 3 hours trying to configure it. I'll try another RunPod instance instead; locally I could successfully train with 3 epochs, but I needed more compute to train with 10, and my RTX 4090 would take weeks.
Facing the same error on a Lambda Labs H100 instance trying to load Falcon-40B in 8-bit. What's the solution?
Export these variables:
Install a compatible CUDA (11.7 has no support for the H100):
Remove the old CUDA:
Install the compatible PyTorch:
If you will use DeepSpeed for CPU offload (it makes training faster), you need to:
Edit these files (using vim, nano, or SFTP), changing the import for inf from
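The actual commands for those steps were lost in this copy of the thread. A hedged sketch of what they likely looked like, assuming CUDA 11.8 installed under `/usr/local/cuda-11.8` and DeepSpeed code that still does `from torch._six import inf` (a known breakage with PyTorch 2.0, where `torch._six` was removed); all paths are assumptions:

```shell
# Point the toolchain at CUDA 11.8 (paths are assumptions; adjust to your install).
export CUDA_HOME=/usr/local/cuda-11.8
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"

# PyTorch 2.0 built against CUDA 11.8, from the standard PyTorch wheel index.
pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118

# DeepSpeed's import of inf: torch._six was removed in PyTorch 2.0, so
# rewrite it to import from torch directly. The file path here is
# hypothetical; apply it to whichever files the traceback points at.
sed -i 's/from torch\._six import inf/from torch import inf/' path/to/deepspeed/runtime/utils.py
```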
Ended up moving back to an A100 😅 |
Has anyone else tried and confirmed the efficacy of @jonataslaw's solution two comments above? Will test myself over the weekend. |
I was able to solve this error with the conda install approach found here: bitsandbytes-foundation/bitsandbytes#85
I met this issue on an H100 GPU, and fixed it by changing
Sadly it gave me the error below:
Got this issue on an H100 on RunPod.
Same, got this on an H100 with 8-bit; the H100 works with 16-bit.
Got this error on an H100 using 8-bit Llama. Has anyone made it work on an H100?
You can avoid using 8-bit; 4-bit and 16-bit are fine.
Hi,
During the finetune.py command launch I'm encountering the error titled above.
I'm using Fedora 36 with CUDA 12, Python 3.10.10; initialization seems to begin like so:
and then later, after loading some files:
Am I using some wrong lib versions?
Thanks for your help