[Bug] Exception: cublasLt ran into an error! during fine-tuning LLM in 8bit mode #538
Comments
I have the same issue - it occurs when running an 8bit model in the following docker container
+1ing this. I notice it with local conda on an H100 at Lambda Labs, although I'm unsure whether this is a bitsandbytes error or something to do with CUDA for the H100s.
+1
Trying to run today on an H100 instance, with a confirmed installation of 0.40.1, which I had seen was supposed to work with this GPU now. So frustrating...
Same error for me
Hello, any news? Same error here. I cannot find anything useful on how to use 8-bit quantization on H100 GPUs.
Hi @TimDettmers, do we have the fix yet?
@basteran Did you find the fix? @TimDettmers Any updates?
Are there any updates here? Am I missing something, or did they just "forget" to support H100 GPUs, and even months later this hasn't been fixed? Has anyone found a workaround? @TimDettmers?
This is actually a more complicated issue. The 8-bit implementation uses cuBLASLt, which uses special formats for 8-bit matrix multiplication. There are special formats for Ampere, Turing, and now Hopper GPUs, and Hopper GPUs do not support the Ampere or Turing formats. This means multiple CUDA kernels and the cuBLASLt integration need to be reimplemented to make 8-bit work on Hopper GPUs. I think for now, the more realistic thing is to throw an error to let the user know that this feature is currently not supported.
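The explicit error the comment above proposes could be gated on the GPU's compute capability. A minimal sketch, with hypothetical helper names (the version thresholds follow the Turing/Ampere/Hopper split described above; the real check inside bitsandbytes may differ):

```python
def supports_cublaslt_int8(compute_capability):
    """Hypothetical helper: return True if the cuBLASLt 8-bit matmul
    formats used by the int8 path exist on this GPU architecture.

    Turing (7.5) and Ampere/Ada (8.x) have those formats; Hopper (9.x)
    does not, so the int8 path would need new kernels there.
    """
    major, minor = compute_capability
    if major == 7 and minor >= 5:   # Turing
        return True
    if major == 8:                  # Ampere / Ada
        return True
    return False                    # Hopper (9.x) and older architectures


def assert_int8_supported(compute_capability):
    # Fail fast with a clear message instead of an opaque
    # "cublasLt ran into an error!" deep inside the matmul.
    if not supports_cublaslt_int8(compute_capability):
        raise RuntimeError(
            "8-bit (LLM.int8) matmul is not supported on this GPU "
            "architecture; use nf4/fp4 4-bit quantization instead."
        )
```

In a real integration, `compute_capability` would come from `torch.cuda.get_device_capability()`, so the check runs once at load time rather than failing mid-forward-pass.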
bitsandbytes did not support Windows before, but this method can support Windows. (yuhuang)
3. J:\StableDiffusion\sdwebui\py310\python.exe -m pip uninstall bitsandbytes-windows
4. J:\StableDiffusion\sdwebui\py310\python.exe -m pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl
Replace the path here (J:\StableDiffusion\sdwebui\py310) with your SD venv directory (the folder containing python.exe).
Or, if you are on a Linux-style system (Ubuntu, macOS, etc.) with CUDA version 11.x: bitsandbytes can support Ubuntu. (yuhuang)
3. J:\StableDiffusion\sdwebui\py310\python.exe -m pip uninstall bitsandbytes-windows
4. J:\StableDiffusion\sdwebui\py310\python.exe -m pip install https://github.com/TimDettmers/bitsandbytes/releases/download/0.41.0/bitsandbytes-0.41.0-py3-none-any.whl
Replace the path here (J:\StableDiffusion\sdwebui\py310) with your SD venv directory (the folder containing python.exe).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
@TimDettmers could you use https://github.com/NVIDIA/TransformerEngine ? At first sight the exposed API seems too high-level for your needs, but their building blocks are tailored for the Hopper (H100) and Ada (RTX 4090) architectures, e.g. https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/gemm/cublaslt_gemm.cu
This error is related to the H100. I tried loading the model on an H100 and got the error; the same 8-bit load worked fine on an A100.
Anyone able to resolve this? |
Is this still not available on H100 GPU instances?
Not yet unfortunately |
do you guys have some solution for this? |
Observing the same issue with H100, too. |
Also with H800. |
Any plan to fix this? |
The same problem comes for H20 |
The same with H800 |
Hi all, I will keep this issue open, but please be aware that, for now, 8-bit is not supported in bitsandbytes on Hopper. It is recommended to use nf4 or fp4 instead.
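Following the maintainer's recommendation, switching from 8-bit loading to 4-bit nf4 on Hopper looks roughly like this sketch using the standard `transformers` `BitsAndBytesConfig` API (the model name is illustrative, taken from the original report; exact argument availability depends on your `transformers`/`bitsandbytes` versions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization as a Hopper-compatible alternative to LLM.int8
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # instead of load_in_8bit=True
    bnb_4bit_quant_type="nf4",              # "nf4" or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
    bnb_4bit_use_double_quant=True,         # optional: quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b",        # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
```

This is the QLoRA-style loading path that several commenters report as working on H100, while `load_in_8bit=True` triggers the cublasLt error.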
Just want to add to this thread: tried on an H100 and it is not working. I really hope the bitsandbytes team can support this feature, given that more and more people are going to switch to newer GPUs.
Same for me. It does not work after changing to bf16, fp16, fp4, or anything else.
Having same issue with H100E |
Same problem |
The same with H800 and H100 |
Still having the same issue |
Still having the same issue on H100 |
Still having same issue on H100 |
Well, just came here to say I also ran into this issue using 8bit and H100. Would be very useful to have this working! |
Hi all! We are currently working on LLM.int8 support for Hopper in PR #1401. I cannot give an accurate ETA for a release at the moment, but it will be supported soon! |
same problem occurred |
Would be very appreciated to have this working on H100. |
Still get the same problem with H100. |
Problem
Hello, I'm getting this weird cublasLt error on a Lambda Labs H100 with CUDA 11.8, PyTorch 2.0.1, and Python 3.10 (Miniconda) while trying to fine-tune a 3B-parameter OpenLLaMA using LoRA with 8-bit loading. This only happens if we turn on 8-bit loading; LoRA alone or 4-bit loading (QLoRA) works.
The same commands did work 2 weeks ago and stopped working a week ago.
I've tried bitsandbytes versions 0.39.0 and 0.39.1, as prior versions don't work with the H100. Building from source gives me a different issue, as mentioned in the Env section.
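For reference, a minimal sketch of the failing setup (the model name and LoRA hyperparameters are illustrative, not taken from the axolotl config): the `load_in_8bit=True` flag is what trips the cublasLt error on the H100, while the same code with 4-bit loading works.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# load_in_8bit=True triggers the cublasLt error on H100;
# switching to 4-bit loading (QLoRA) avoids it.
model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b",   # illustrative 3B OpenLLaMA checkpoint
    load_in_8bit=True,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(              # illustrative LoRA hyperparameters
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The error surfaces during the first forward pass, inside bitsandbytes' int8 matmul, not at load time.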
Expected
No error
Reproduce
Setup Miniconda then follow https://github.com/OpenAccess-AI-Collective/axolotl 's readme on lambdalabs and run the default open llama lora config.
Trace
0.39.0
Env
python -m bitsandbytes
on main branch: I get the same error as in "Error named symbol not found at line 116 in file bitsandbytes/csrc/ops.cu" #382
on 0.39.0
Misc
All related issues:
Also tried installing cudatoolkit via conda.