Error while loading a model on 8bit #21371

Closed
toma-x opened this issue Jan 30, 2023 · 5 comments


toma-x commented Jan 30, 2023

I'm trying to run inference on a model which doesn't fit on my GPU using this code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device_map = {'transformer.wte': 0,
 'transformer.drop': 0,
 'transformer.h.0': 0,
 'transformer.h.1': 0,
 'transformer.h.2': 0,
 'transformer.h.3': 0,
 'transformer.h.4': 0,
 'transformer.h.5': 0,
 'transformer.h.6': 0,
 'transformer.h.7': 0,
 'transformer.h.8': 0,
 'transformer.h.9': 0,
 'transformer.h.10': 0,
 'transformer.h.11': 0,
 'transformer.h.12': 0,
 'transformer.h.13': 0,
 'transformer.h.14': 0,
 'transformer.h.15': 0,
 'transformer.h.16': 0,
 'transformer.h.17': 0,
 'transformer.h.18': 0,
 'transformer.h.19': 0,
 'transformer.h.20': 0,
 'transformer.h.21': 0,
 'transformer.h.22': 0,
 'transformer.h.23': 'cpu',
 'transformer.h.24': 'cpu',
 'transformer.h.25': 'cpu',
 'transformer.h.26': 'cpu',
 'transformer.h.27': 'cpu',
 'transformer.ln_f': 'cpu',
 'lm_head': 'cpu'}
tokenizer = AutoTokenizer.from_pretrained("tomaxe/fr-boris-sharded")
model = AutoModelForCausalLM.from_pretrained(
    "tomaxe/fr-boris-sharded",
    load_in_8bit=True,
    load_in_8bit_skip_modules=['lm_head',
                               'transformer.ln_f',
                               'transformer.h.27',
                               'transformer.h.26',
                               'transformer.h.25',
                               'transformer.h.24',
                               'transformer.h.23'],
    device_map=device_map,
)
input_text = "salut"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids, max_length = 20)
print(tokenizer.decode(outputs[0]))

And I'm running into this error:
@younesbelkada Do you know what I could do? Thanks

Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
CUDA SETUP: CUDA runtime path found: /home/thomas/anaconda3/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading checkpoint shards:   0%|          | 0/30 [00:00<?, ?it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A: torch.Size([2, 4096]), B: torch.Size([4096, 4096]), C: (2, 4096); (lda, ldb, ldc): (c_int(64), c_int(131072), c_int(64)); (m, n, k): (c_int(2), c_int(4096), c_int(4096))
Traceback (most recent call last):

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/spyder_kernels/py3compat.py", line 356, in compat_exec
    exec(code, globals, locals)

  File "/home/thomas/Downloads/infersharded.py", line 46, in <module>
    outputs = model.generate(input_ids, max_length = 20)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 1391, in generate
    return self.greedy_search(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 2179, in greedy_search
    outputs = self(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/models/gptj/modeling_gptj.py", line 813, in forward
    transformer_outputs = self.transformer(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/models/gptj/modeling_gptj.py", line 668, in forward
    outputs = block(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/models/gptj/modeling_gptj.py", line 302, in forward
    attn_outputs = self.attn(

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/transformers/models/gptj/modeling_gptj.py", line 203, in forward
    query = self.q_proj(hidden_states)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 254, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 405, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 311, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)

  File "/home/thomas/anaconda3/lib/python3.9/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')

Exception: cublasLt ran into an error!

cuBLAS API failed with status 15
error detected
@younesbelkada (Contributor)

Hi @toma-x,
Thanks for the issue.
What you are currently trying to do (mixing CPU offload and int8 quantization) is not supported yet.
I think this feature should be addressed through `QuantizationConfig` in the coming weeks; I will keep you updated in this issue.


toma-x commented Jan 31, 2023

Glad to know this, looking forward to hearing from you soon about this @younesbelkada.

@younesbelkada (Contributor)

Hi @toma-x,
This is now supported on the main branch of transformers. Can you check this section of the docs? 🙏
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
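
For the model in this issue, that would look roughly like the sketch below. It is untested here and assumes a transformers version where `BitsAndBytesConfig` exposes `llm_int8_enable_fp32_cpu_offload`; it reuses the device_map from the first post, keeping the offloaded modules in fp32 on CPU instead of excluding them from quantization via `load_in_8bit_skip_modules`.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantize the GPU layers to int8 and keep the CPU-offloaded layers in fp32
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Same split as the original device_map: blocks 0-22 on GPU 0, the rest on CPU
device_map = {
    "transformer.wte": 0,
    "transformer.drop": 0,
    **{f"transformer.h.{i}": 0 for i in range(23)},
    **{f"transformer.h.{i}": "cpu" for i in range(23, 28)},
    "transformer.ln_f": "cpu",
    "lm_head": "cpu",
}

tokenizer = AutoTokenizer.from_pretrained("tomaxe/fr-boris-sharded")
model = AutoModelForCausalLM.from_pretrained(
    "tomaxe/fr-boris-sharded",
    device_map=device_map,
    quantization_config=quantization_config,
)

input_ids = tokenizer("salut", return_tensors="pt").input_ids.to(0)
outputs = model.generate(input_ids, max_length=20)
print(tokenizer.decode(outputs[0]))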


toma-x commented Mar 9, 2023

Hi @younesbelkada, thank you for keeping me updated. I will certainly take a look, this is very interesting. Have a great day 😁


github-actions bot commented Apr 2, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
