Error while loading a model on 8bit #21371

Closed
toma-x opened this issue Jan 30, 2023 · 5 comments


toma-x commented Jan 30, 2023

I'm trying to run inference on a model which doesn't fit on my GPU using this code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device_map = {'transformer.wte': 0,
 'transformer.drop': 0,
 'transformer.h.0': 0,
 'transformer.h.1': 0,
 'transformer.h.2': 0,
 'transformer.h.3': 0,
 'transformer.h.4': 0,
 'transformer.h.5': 0,
 'transformer.h.6': 0,
 'transformer.h.7': 0,
 'transformer.h.8': 0,
 'transformer.h.9': 0,
 'transformer.h.10': 0,
 'transformer.h.11': 0,
 'transformer.h.12': 0,
 'transformer.h.13': 0,
 'transformer.h.14': 0,
 'transformer.h.15': 0,
 'transformer.h.16': 0,
 'transformer.h.17': 0,
 'transformer.h.18': 0,
 'transformer.h.19': 0,
 'transformer.h.20': 0,
 'transformer.h.21': 0,
 'transformer.h.22': 0,
 'transformer.h.23': 'cpu',
 'transformer.h.24': 'cpu',
 'transformer.h.25': 'cpu',
 'transformer.h.26': 'cpu',
 'transformer.h.27': 'cpu',
 'transformer.ln_f': 'cpu',
 'lm_head': 'cpu'}
tokenizer = AutoTokenizer.from_pretrained("tomaxe/fr-boris-sharded")
model = AutoModelForCausalLM.from_pretrained(
    "tomaxe/fr-boris-sharded",
    load_in_8bit=True,
    load_in_8bit_skip_modules=['lm_head',
                               'transformer.ln_f',
                               'transformer.h.27',
                               'transformer.h.26',
                               'transformer.h.25',
                               'transformer.h.24',
                               'transformer.h.23'],
    device_map=device_map,
)
input_text = "salut"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids, max_length = 20)
print(tokenizer.decode(outputs[0]))

And I'm running into this error:
@younesbelkada Do you know what I could do? Thanks

Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA SETUP: CUDA runtime path found: /home/thomas/anaconda3/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards:   0%|          | 0/30 [00:00<?, ?it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A: torch.Size([2, 4096]), B: torch.Size([4096, 4096]), C: (2, 4096); (lda, ldb, ldc): (c_int(64), c_int(131072), c_int(64)); (m, n, k): (c_int(2), c_int(4096), c_int(4096))
Traceback (most recent call last):

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/spyder_kernels/py3compat.py", line 356, in compat_exec
    exec(code, globals, locals)

  File "/home/thomas/Downloads/infersharded.py", line 46, in <module>
    outputs = model.generate(input_ids, max_length = 20)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 1391, in generate
    return self.greedy_search(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 2179, in greedy_search
    outputs = self(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/models/gptj/modeling_gptj.py", line 813, in forward
    transformer_outputs = self.transformer(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/models/gptj/modeling_gptj.py", line 668, in forward
    outputs = block(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/models/gptj/modeling_gptj.py", line 302, in forward
    attn_outputs = self.attn(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/models/gptj/modeling_gptj.py", line 203, in forward
    query = self.q_proj(hidden_states)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 254, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 405, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 311, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')

Exception: cublasLt ran into an error!

cuBLAS API failed with status 15
error detected
@younesbelkada (Contributor)

Hi @toma-x,
Thanks for the issue.
What you are currently trying to do (mixing CPU offload and int8 quantization) is not supported yet.
I think this feature should be addressed through `QuantizationConfig` in the coming weeks; I will keep you updated in this issue.


toma-x commented Jan 31, 2023

Glad to know this, looking forward to hearing from you soon about this @younesbelkada.

@younesbelkada (Contributor)

Hi @toma-x,
This is now supported on the main branch of transformers. Can you check this section of the docs? 🙏
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
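
For the model in this issue, that would look roughly like the sketch below. It is untested here and assumes a transformers version where `BitsAndBytesConfig` exposes `llm_int8_enable_fp32_cpu_offload`; it reuses the device_map from the first post, keeping the offloaded modules in fp32 on CPU instead of excluding them from quantization via `load_in_8bit_skip_modules`.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantize the GPU layers to int8 and keep the CPU-offloaded layers in fp32
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Same split as the original device_map: blocks 0-22 on GPU 0, the rest on CPU
device_map = {
    "transformer.wte": 0,
    "transformer.drop": 0,
    **{f"transformer.h.{i}": 0 for i in range(23)},
    **{f"transformer.h.{i}": "cpu" for i in range(23, 28)},
    "transformer.ln_f": "cpu",
    "lm_head": "cpu",
}

tokenizer = AutoTokenizer.from_pretrained("tomaxe/fr-boris-sharded")
model = AutoModelForCausalLM.from_pretrained(
    "tomaxe/fr-boris-sharded",
    device_map=device_map,
    quantization_config=quantization_config,
)

input_ids = tokenizer("salut", return_tensors="pt").input_ids.to(0)
outputs = model.generate(input_ids, max_length=20)
print(tokenizer.decode(outputs[0]))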


toma-x commented Mar 9, 2023

Hi @younesbelkada, thank you for keeping me updated. I will certainly take a look, this is very interesting. Have a great day 😁


github-actions bot commented Apr 2, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
