[Bug] Exception: cublasLt ran into an error! during fine-tuning LLM in 8bit mode #538

Open
NanoCode012 opened this issue Jun 24, 2023 · 40 comments · May be fixed by #1401
Labels
enhancement New feature or request medium priority (will be worked on after all high priority issues)

Comments

@NanoCode012

NanoCode012 commented Jun 24, 2023

Problem

Hello, I'm getting this weird cublasLt error on a Lambda Labs H100 (CUDA 11.8, PyTorch 2.0.1, Python 3.10 Miniconda) while trying to fine-tune a 3B-parameter Open LLaMA model using LoRA with 8-bit loading. It only happens when 8-bit loading is turned on; LoRA alone or 4-bit loading (QLoRA) works.

The same commands worked two weeks ago and stopped working a week ago.

I've tried bitsandbytes 0.39.0 and 0.39.1, since earlier versions don't work with the H100. Building from source gives a different issue, as mentioned in the Env section.
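
For reference, the failing setup boils down to something like the following (placeholder model id; standard transformers/peft calls rather than the exact axolotl code):

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with bitsandbytes 8-bit weights; this flag is what triggers the failure on H100.
model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b",   # placeholder model id
    load_in_8bit=True,
    device_map="auto",
)

# Attach LoRA adapters on top of the quantized model.
model = prepare_model_for_kbit_training(model)  # prepare_model_for_int8_training on older peft versions
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Training then dies in bnb.matmul -> F.igemmlt with "cublasLt ran into an error!"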

Expected

No error

Reproduce

Set up Miniconda, then follow the Lambda Labs instructions in the README at https://github.com/OpenAccess-AI-Collective/axolotl and run the default Open LLaMA LoRA config.

Trace

0.39.0

File "/home/ubuntu/axolotl/scripts/finetune.py", line 352, in <module>
    fire.Fire(train)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)   
  File "/home/ubuntu/axolotl/scripts/finetune.py", line 337, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/trainer.py", line 1531, in train
    return inner_training_loop(
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/trainer.py", line 1795, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/trainer.py", line 2640, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/trainer.py", line 2665, in compute_loss
    outputs = model(**inputs)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs) 
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/accelerate/utils/operations.py", line 553, in forward
    return model_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/accelerate/utils/operations.py", line 541, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/peft/peft_model.py", line 827, in forward
    return self.base_model(
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs) 
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 691, in forward
    outputs = self.model(
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs) 
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 579, in forward
    layer_outputs = decoder_layer(
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs) 
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 293, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs) 
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 195, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs) 
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/peft/tuners/lora.py", line 942, in forward
    result = super().forward(x)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 402, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 562, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 400, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1781, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

Env

python -m bitsandbytes

bin /home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/home/ubuntu/miniconda3/envs/py310/lib/libcudart.so.11.0'), PosixPath('/home/ubuntu/miniconda3/envs/py310/lib/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /home/ubuntu/miniconda3/envs/py310/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 9.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++ ANACONDA CUDA PATHS ++++++++++++++++++++
/home/ubuntu/miniconda3/envs/py310/lib/libcudart.so
/home/ubuntu/miniconda3/envs/py310/lib/stubs/libcuda.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/torch/lib/libc10_cuda.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda110.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda114.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda120.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda111_nocublaslt.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda112_nocublaslt.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda120_nocublaslt.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda114_nocublaslt.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda112.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda111.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda110_nocublaslt.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda115_nocublaslt.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121_nocublaslt.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda115.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so
/home/ubuntu/miniconda3/envs/py310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113.so
/home/ubuntu/miniconda3/envs/py310/nsight-compute/2023.1.1/target/linux-desktop-glibc_2_19_0-ppc64le/libcuda-injection.so
/home/ubuntu/miniconda3/envs/py310/nsight-compute/2023.1.1/target/linux-desktop-glibc_2_11_3-x64/libcuda-injection.so
/home/ubuntu/miniconda3/envs/py310/nsight-compute/2023.1.1/target/linux-desktop-t210-a64/libcuda-injection.so

++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++


+++++++++++++++ WORKING DIRECTORY CUDA PATHS +++++++++++++++


++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++
+++++++++++ /usr/lib/x86_64-linux-gnu CUDA PATHS +++++++++++
/usr/lib/x86_64-linux-gnu/libcudart.so
/usr/lib/x86_64-linux-gnu/stubs/libcuda.so
/usr/lib/x86_64-linux-gnu/libcuda.so

++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
COMPILED_WITH_CUDA = True
COMPUTE_CAPABILITIES_PER_GPU = ['9.0']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Running a quick check that:
    + library is importable
    + CUDA function is callable


WARNING: Please be sure to sanitize sensible info from any such env vars!

SUCCESS!
Installation was successful!
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

Misc

All related issues:

Also tried installing cudatoolkit via conda.
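
As a quick sanity check (not part of the original report), something like this can be run in the same environment to confirm that PyTorch is picking up the expected CUDA runtime and sees the H100 as sm_90:

import torch

# Diagnostic sketch only: confirm runtime/toolkit versions and the GPU's compute capability.
print(torch.__version__)                    # e.g. 2.0.1
print(torch.version.cuda)                   # CUDA version PyTorch was built against, e.g. 11.8
print(torch.cuda.get_device_name(0))        # should report an H100
print(torch.cuda.get_device_capability(0))  # (9, 0) on Hopper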

@jvhoffbauer

I have the same issue; it occurs when running an 8-bit model in the following Docker container:

FROM nvidia/cuda:11.7.0-cudnn8-devel-ubuntu22.04

RUN apt update
RUN apt install git -y 
RUN apt install wget -y 
RUN apt install python3 python3-pip -y



# Install dependencies (one-by-one for better caching)
#RUN pip install --upgrade pip
RUN pip install torch
RUN pip install transformers
RUN pip install datasets
RUN pip install evaluate
RUN pip install xformers
RUN pip install wandb
RUN pip install peft 
RUN pip install trl 
RUN pip install scipy 
RUN pip install accelerate 
RUN pip install scikit-learn
RUN pip install pandas 
RUN pip install bleurt@https://github.com/google-research/bleurt/archive/b610120347ef22b494b6d69b4316e303f5932516.zip#egg=bleurt

RUN git clone https://github.com/EleutherAI/lm-evaluation-harness
RUN pip install -e lm-evaluation-harness

RUN git clone https://github.com/timdettmers/bitsandbytes.git
# CUDA_VERSIONS in {110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 120}
# make argument in {cuda110, cuda11x, cuda12x}
# if you do not know what CUDA you have, try looking at the output of: python -m bitsandbytes
ENV CUDA_VERSION=117
RUN cd bitsandbytes && git checkout ac5550a0238286377ee3f58a85aeba1c40493e17
RUN cd bitsandbytes && make cuda11x
RUN cd bitsandbytes && python3 setup.py install
#RUN pip install bitsandbytes
#RUN python3 check_bnb_install.py

# Init wandb
#COPY ./wandb /wandb
ENV WANDB_CONFIG_DIR=/wandb

ENV HF_DATASETS_CACHE="/hf_cache/datasets"
ENV HUGGINGFACE_HUB_CACHE="/hf_cache/hub"

# Copy the code
COPY . /code

# Set the working directory
WORKDIR /code

# Install a useful helper to check bitsandbytes installation. Only works at runtime.
RUN wget https://gist.githubusercontent.com/TimDettmers/1f5188c6ee6ed69d211b7fe4e381e713/raw/4d17c3d09ccdb57e9ab7eca0171f2ace6e4d2858/check_bnb_install.py

@sumukshashidhar

+1 on this. I see it with a local conda setup on a Lambda Labs H100, although I'm unsure whether this is a bitsandbytes error or something to do with CUDA on the H100.

@pribadihcr

+1

@TimDettmers
Collaborator

This is the same error as #533. The problem was that I forgot to compile CUDA 11.8 for sm_90, which are H100 GPUs. The error message basically says that the code is not compiled for your GPU. I will fix this soon. Please continue the discussion in issue #533 until I have fixed this issue.
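
For anyone unsure whether their GPU falls under sm_90, a quick (unofficial) check is:

import torch

# sm_90 corresponds to compute capability (9, 0), i.e. Hopper (H100/H800).
major, minor = torch.cuda.get_device_capability(0)
print(f"sm_{major}{minor}")  # prints sm_90 on an H100, so a binary built without sm_90 support cannot run its 8-bit kernels here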

@TimDettmers TimDettmers added bug Something isn't working high priority (first issues that will be worked on) labels Jul 14, 2023
@Ar770

Ar770 commented Jul 16, 2023

Trying to run today on an H100 instance, with a confirmed installation of 0.40.1, which I understood was now supposed to work with this GPU,
I still get:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-23-3435b262f1ae> in <module>
----> 1 trainer.train()

~/.local/lib/python3.8/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1643             self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1644         )
-> 1645         return inner_training_loop(
   1646             args=args,
   1647             resume_from_checkpoint=resume_from_checkpoint,

~/.local/lib/python3.8/site-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1936 
   1937                 with self.accelerator.accumulate(model):
-> 1938                     tr_loss_step = self.training_step(model, inputs)
   1939 
   1940                 if (

~/.local/lib/python3.8/site-packages/transformers/trainer.py in training_step(self, model, inputs)
   2757 
   2758         with self.compute_loss_context_manager():
-> 2759             loss = self.compute_loss(model, inputs)
   2760 
   2761         if self.args.n_gpu > 1:

~/.local/lib/python3.8/site-packages/transformers/trainer.py in compute_loss(self, model, inputs, return_outputs)
   2782         else:
   2783             labels = None
-> 2784         outputs = model(**inputs)
   2785         # Save past state if it exists
   2786         # TODO: this needs to be fixed and made cleaner later.

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

~/.local/lib/python3.8/site-packages/accelerate/utils/operations.py in forward(*args, **kwargs)
    579 
    580     def forward(*args, **kwargs):
--> 581         return model_forward(*args, **kwargs)
    582 
    583     # To act like a decorator so that it can be popped when doing `extract_model_from_parallel`

~/.local/lib/python3.8/site-packages/accelerate/utils/operations.py in __call__(self, *args, **kwargs)
    567 
    568     def __call__(self, *args, **kwargs):
--> 569         return convert_to_fp32(self.model_forward(*args, **kwargs))
    570 
    571     def __getstate__(self):

/usr/lib/python3/dist-packages/torch/amp/autocast_mode.py in decorate_autocast(*args, **kwargs)
     12     def decorate_autocast(*args, **kwargs):
     13         with autocast_instance:
---> 14             return func(*args, **kwargs)
     15     decorate_autocast.__script_unsupported = '@autocast() decorator is not supported in script mode'  # type: ignore[attr-defined]
     16     return decorate_autocast

~/.local/lib/python3.8/site-packages/peft/peft_model.py in forward(self, *args, **kwargs)
    413         Forward pass of the model.
    414         """
--> 415         return self.get_base_model()(*args, **kwargs)
    416 
    417     def _get_base_model_class(self, is_prompt_tuning=False):

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

~/.local/lib/python3.8/site-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
    163                 output = old_forward(*args, **kwargs)
    164         else:
--> 165             output = old_forward(*args, **kwargs)
    166         return module._hf_hook.post_forward(module, output)
    167 

~/.local/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py in forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   1417                 )
   1418 
-> 1419         outputs = self.model(
   1420             input_features,
   1421             attention_mask=attention_mask,

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

~/.local/lib/python3.8/site-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
    163                 output = old_forward(*args, **kwargs)
    164         else:
--> 165             output = old_forward(*args, **kwargs)
    166         return module._hf_hook.post_forward(module, output)
    167 

~/.local/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py in forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
   1266             input_features = self._mask_input_features(input_features, attention_mask=attention_mask)
   1267 
-> 1268             encoder_outputs = self.encoder(
   1269                 input_features,
   1270                 head_mask=head_mask,

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

~/.local/lib/python3.8/site-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
    163                 output = old_forward(*args, **kwargs)
    164         else:
--> 165             output = old_forward(*args, **kwargs)
    166         return module._hf_hook.post_forward(module, output)
    167 

~/.local/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py in forward(self, input_features, attention_mask, head_mask, output_attentions, output_hidden_states, return_dict)
    854                         return custom_forward
    855 
--> 856                     layer_outputs = torch.utils.checkpoint.checkpoint(
    857                         create_custom_forward(encoder_layer),
    858                         hidden_states,

/usr/lib/python3/dist-packages/torch/utils/checkpoint.py in checkpoint(function, use_reentrant, *args, **kwargs)
    247 
    248     if use_reentrant:
--> 249         return CheckpointFunction.apply(function, preserve, *args)
    250     else:
    251         return _checkpoint_without_reentrant(

/usr/lib/python3/dist-packages/torch/autograd/function.py in apply(cls, *args, **kwargs)
    504             # See NOTE: [functorch vjp and autograd interaction]
    505             args = _functorch.utils.unwrap_dead_wrappers(args)
--> 506             return super().apply(*args, **kwargs)  # type: ignore[misc]
    507 
    508         if cls.setup_context == _SingleLevelFunction.setup_context:

/usr/lib/python3/dist-packages/torch/utils/checkpoint.py in forward(ctx, run_function, preserve_rng_state, *args)
    105 
    106         with torch.no_grad():
--> 107             outputs = run_function(*args)
    108         return outputs
    109 

~/.local/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py in custom_forward(*inputs)
    850                     def create_custom_forward(module):
    851                         def custom_forward(*inputs):
--> 852                             return module(*inputs, output_attentions)
    853 
    854                         return custom_forward

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

~/.local/lib/python3.8/site-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
    163                 output = old_forward(*args, **kwargs)
    164         else:
--> 165             output = old_forward(*args, **kwargs)
    166         return module._hf_hook.post_forward(module, output)
    167 

~/.local/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py in forward(self, hidden_states, attention_mask, layer_head_mask, output_attentions)
    429         residual = hidden_states
    430         hidden_states = self.self_attn_layer_norm(hidden_states)
--> 431         hidden_states, attn_weights, _ = self.self_attn(
    432             hidden_states=hidden_states,
    433             attention_mask=attention_mask,

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

~/.local/lib/python3.8/site-packages/accelerate/hooks.py in new_forward(*args, **kwargs)
    163                 output = old_forward(*args, **kwargs)
    164         else:
--> 165             output = old_forward(*args, **kwargs)
    166         return module._hf_hook.post_forward(module, output)
    167 

~/.local/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py in forward(self, hidden_states, key_value_states, past_key_value, attention_mask, layer_head_mask, output_attentions)
    288 
    289         # get query proj
--> 290         query_states = self.q_proj(hidden_states) * self.scaling
    291         # get key, value proj
    292         # `past_key_value[0].shape[2] == key_value_states.shape[1]`

/usr/lib/python3/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

~/.local/lib/python3.8/site-packages/peft/tuners/lora.py in forward(self, x)
   1052 
   1053         def forward(self, x: torch.Tensor):
-> 1054             result = super().forward(x)
   1055 
   1056             if self.disable_adapters or self.active_adapter not in self.lora_A.keys():

~/.local/lib/python3.8/site-packages/bitsandbytes/nn/modules.py in forward(self, x)
    412             self.bias.data = self.bias.data.to(x.dtype)
    413 
--> 414         out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
    415 
    416         if not self.state.has_fp16_weights:

~/.local/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py in matmul(A, B, out, state, threshold, bias)
    561     if threshold > 0.0:
    562         state.threshold = threshold
--> 563     return MatMul8bitLt.apply(A, B, out, bias, state)
    564 
    565 

/usr/lib/python3/dist-packages/torch/autograd/function.py in apply(cls, *args, **kwargs)
    504             # See NOTE: [functorch vjp and autograd interaction]
    505             args = _functorch.utils.unwrap_dead_wrappers(args)
--> 506             return super().apply(*args, **kwargs)  # type: ignore[misc]
    507 
    508         if cls.setup_context == _SingleLevelFunction.setup_context:

~/.local/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py in forward(ctx, A, B, out, bias, state)
    399         if using_igemmlt:
    400             C32A, SA = F.transform(CA, "col32")
--> 401             out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
    402             if bias is None or bias.dtype == torch.float16:
    403                 # we apply the fused bias here

~/.local/lib/python3.8/site-packages/bitsandbytes/functional.py in igemmlt(A, B, SA, SB, out, Sout, dtype)
   1790     if has_error == 1:
   1791         print(f'A: {shapeA}, B: {shapeB}, C: {Sout[0]}; (lda, ldb, ldc): {(lda, ldb, ldc)}; (m, n, k): {(m, n, k)}')
-> 1792         raise Exception('cublasLt ran into an error!')
   1793 
   1794     torch.cuda.set_device(prev_device)

Exception: cublasLt ran into an error!

So frustrating...
Please help, and thank you for the great work!

@piperino11

Same error for me

@basteran

Hello,

any news? Same error here; I cannot find anything useful for getting 8-bit quantization to work on the H100 GPUs.

@shashank140195

This is the same error as #533. The problem was that I forgot to compile CUDA 11.8 for sm_90, which are H100 GPUs. The error message basically says that the code is not compiled for your GPU. I will fix this soon. Please continue the discussion in issue #533 until I have fixed this issue.

Hi @TimDettmers, do we have a fix yet?

@shashank140195

Hello,

any news? Same error here; I cannot find anything useful for getting 8-bit quantization to work on the H100 GPUs.

@basteran Did you find the fix? @TimDettmers Any updates?

@mikecipolla

Are there any updates here? Am I missing something, or did they just "forget" to support H100 GPUs, and even months later this hasn't been fixed? Has anyone found a workaround? @TimDettmers?

@TimDettmers
Collaborator

This is actually a more complicated issue. The 8-bit implementation uses cuBLASLt, which uses special data formats for 8-bit matrix multiplication. There are separate formats for Ampere, Turing, and now Hopper GPUs, and Hopper GPUs do not support the Ampere or Turing formats. This means multiple CUDA kernels and the cuBLASLt integration need to be reimplemented to make 8-bit work on Hopper GPUs.

I think for now, the more realistic thing is to throw an error to let the user know that this feature is currently not supported.
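
Roughly, such a guard could look like this (an illustrative sketch only, not the actual bitsandbytes code):

import torch

def assert_int8_matmul_supported(device: int = 0) -> None:
    # Sketch: fail early with a clear message instead of letting cublasLt
    # error out deep inside F.igemmlt().
    major, _ = torch.cuda.get_device_capability(device)
    if major >= 9:  # sm_90 and newer, i.e. Hopper
        raise RuntimeError(
            "8-bit (LLM.int8) matmul is not supported on Hopper GPUs yet; "
            "use 4-bit quantization (nf4/fp4) instead."
        )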

@TimDettmers TimDettmers added enhancement New feature or request bug Something isn't working and removed bug Something isn't working labels Nov 1, 2023
@swumagic

bitsandbytes was not supported on Windows before, but my method can support Windows. (yuhuang)
1. Open the folder J:\StableDiffusion\sdwebui, click the folder's address bar and type CMD, or press WIN+R, type CMD, press Enter, then run cd /d J:\StableDiffusion\sdwebui
2. J:\StableDiffusion\sdwebui\py310\python.exe -m pip uninstall bitsandbytes
3. J:\StableDiffusion\sdwebui\py310\python.exe -m pip uninstall bitsandbytes-windows
4. J:\StableDiffusion\sdwebui\py310\python.exe -m pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.1-py3-none-win_amd64.whl

Replace J:\StableDiffusion\sdwebui\py310 with your SD venv directory (the folder containing python.exe).

@swumagic

Or, if you are on a Linux distribution (Ubuntu, macOS, etc.) with CUDA version 11.x:

bitsandbytes can support Ubuntu. (yuhuang)
1. Open the folder J:\StableDiffusion\sdwebui, click the folder's address bar and type CMD, or press WIN+R, type CMD, press Enter, then run cd /d J:\StableDiffusion\sdwebui
2. J:\StableDiffusion\sdwebui\py310\python.exe -m pip uninstall bitsandbytes
3. J:\StableDiffusion\sdwebui\py310\python.exe -m pip uninstall bitsandbytes-windows
4. J:\StableDiffusion\sdwebui\py310\python.exe -m pip install https://github.com/TimDettmers/bitsandbytes/releases/download/0.41.0/bitsandbytes-0.41.0-py3-none-any.whl

Replace J:\StableDiffusion\sdwebui\py310 with your SD venv directory (the folder containing python.exe).


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@PyroGenesis

Can we please keep this issue (or #383 or #599) open? I still want to see this issue resolved, if possible.

@adrian-branescu

adrian-branescu commented Jan 5, 2024

This is actually a more complicated issue. The 8-bit implementation uses cuBLASLt, which uses special data formats for 8-bit matrix multiplication. There are separate formats for Ampere, Turing, and now Hopper GPUs, and Hopper GPUs do not support the Ampere or Turing formats. This means multiple CUDA kernels and the cuBLASLt integration need to be reimplemented to make 8-bit work on Hopper GPUs.

I think for now, the more realistic thing is to throw an error to let the user know that this feature is currently not supported.

@TimDettmers could you use https://github.com/NVIDIA/TransformerEngine?

At first sight the exposed API seems too high-level for your needs, but their building blocks are tailored for the Hopper (H100) and Ada (RTX 4090) architectures, e.g. https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/gemm/cublaslt_gemm.cu
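
For illustration, a minimal sketch of what calling into TransformerEngine's Hopper-tuned, cuBLASLt-backed GEMMs looks like from Python (FP8 rather than int8, so not a drop-in replacement for LLM.int8; module names as in transformer_engine.pytorch):

import torch
import transformer_engine.pytorch as te

# One FP8-capable linear layer running through TransformerEngine's GEMM path.
layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True):  # uses TE's default FP8 recipe
    y = layer(x)
print(y.shape)  # torch.Size([8, 4096])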

@monk1337

+1 on this. I see it with a local conda setup on a Lambda Labs H100, although I'm unsure whether this is a bitsandbytes error or something to do with CUDA on the H100.

This error is related to the H100: I tried loading the model on an H100 and got the error, while the same load_in_8bit setup works fine on an A100.

@0-hero

0-hero commented Mar 25, 2024

Anyone able to resolve this?

@hayoung-jeremy

Is it still not available on H100 GPU instances?

@0-hero

0-hero commented Mar 28, 2024

Not yet, unfortunately.

@ionutmodo

Do you guys have a solution for this?

@ZhouFang-Intel

Observing the same issue with H100, too.

@FoolPlayer

Also with H800.

@khayamgondal

This is actually a more complicated issue. The 8-bit implementation uses cuBLASLt, which uses special data formats for 8-bit matrix multiplication. There are separate formats for Ampere, Turing, and now Hopper GPUs, and Hopper GPUs do not support the Ampere or Turing formats. This means multiple CUDA kernels and the cuBLASLt integration need to be reimplemented to make 8-bit work on Hopper GPUs.

I think for now, the more realistic thing is to throw an error to let the user know that this feature is currently not supported.

Any plan to fix this?

@suzewei

suzewei commented Jun 18, 2024

The same problem occurs on the H20.

@zhuconv

zhuconv commented Aug 1, 2024

The same with H800

@matthewdouglas matthewdouglas reopened this Aug 1, 2024
@matthewdouglas
Member

Hi all,

I will keep this issue open, but please be aware that, for now, 8-bit is not supported in bitsandbytes on Hopper. It is recommended to use NF4 or FP4 instead.
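
As a concrete sketch of that workaround (placeholder model id; the usual transformers + bitsandbytes 4-bit configuration):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 loading works on Hopper, unlike load_in_8bit.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b",       # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)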

@matthewdouglas matthewdouglas added medium priority (will be worked on after all high priority issues) and removed high priority (first issues that will be worked on) labels Aug 1, 2024
@RaccoonOnion

Just want to add to this thread: tried on an H100 and it's not working. I really hope the bitsandbytes team can support this feature, given that more and more people are going to switch to newer GPUs.

@NuoJohnChen

Same for me. It doesn't work after changing to bf16, fp16, fp4, or anything else.

@surdarla

Having same issue with H100E

@Boltzmachine

Same problem

@crinoiddream

The same with H800 and H100

@zihaohe123

Still having the same issue

@suhyeok-jang

Still having the same issue on H100

@sreemanti-abacusai

Still having same issue on H100

@krjoha

krjoha commented Nov 6, 2024

Well, just came here to say I also ran into this issue using 8bit and H100. Would be very useful to have this working!

@matthewdouglas
Member

Hi all! We are currently working on LLM.int8 support for Hopper in PR #1401. I cannot give an accurate ETA for a release at the moment, but it will be supported soon!

@matthewdouglas matthewdouglas linked a pull request Nov 7, 2024 that will close this issue
@rodaw92

rodaw92 commented Nov 15, 2024

Same problem occurred here.

@yz26cn

yz26cn commented Nov 16, 2024

It would be very much appreciated to have this working on the H100.

@Davido111200

Still get the same problem with H100.
