Speedup problem with GPTQModel #90
Hi @ChenMnZ, would you mind providing the reproduction scripts for the triton v2 backend? :)
@LeiWang1999 args.model should be the path of a standard GPTQ-packed model, and the code will automatically choose the triton v2 kernel for 2-bit quantization.
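For context, a minimal loading sketch of what that comment describes, assuming GPTQModel's v0.9-era `from_quantized` API; the checkpoint path is a placeholder, not something given in the thread:

```python
# Minimal sketch, assuming GPTQModel's v0.9-era from_quantized API.
# The path is a placeholder standing in for args.model: a standard
# GPTQ-packed checkpoint. Per the comment above, no explicit backend
# flag should be needed; a 2-bit checkpoint picks the triton v2 kernel.
from gptqmodel import GPTQModel

model = GPTQModel.from_quantized(
    "path/to/2bit-gptq-model",  # hypothetical path, stands in for args.model
    device="cuda:0",
)
```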
@ChenMnZ Thanks, that's interesting; I'll take a look.
Hi, @LeiWang1999
Hi @ChenMnZ, can you provide Hugging Face model repos for us to reproduce with?
@LeiWang1999 |
Hi @ChenMnZ, have you hit this error when loading?

```
Traceback (most recent call last):
  File "/root/BitBLAS/debug/gptq.py", line 35, in <module>
    main()
  File "/root/BitBLAS/debug/gptq.py", line 27, in main
    output = model.generate(inputs=input_ids, do_sample=True, top_k=10, max_new_tokens=256)
  File "/opt/conda/lib/python3.10/site-packages/gptqmodel-0.9.3.dev0+cu1201010-py3.10-linux-x86_64.egg/gptqmodel/models/base.py", line 466, in generate
    return self.model.generate(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 3020, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
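For reference, a hedged reconstruction of the `/root/BitBLAS/debug/gptq.py` script implied by the traceback; only the `model.generate(...)` call is taken from the trace, while the model path, prompt, and tokenizer handling are assumptions:

```python
from transformers import AutoTokenizer
from gptqmodel import GPTQModel

def main():
    model_path = "path/to/2bit-gptq-model"  # assumption: not visible in the trace
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = GPTQModel.from_quantized(model_path, device="cuda:0")
    # The prompt is a placeholder; the generate() arguments match the traceback.
    input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.to("cuda:0")
    output = model.generate(inputs=input_ids, do_sample=True, top_k=10, max_new_tokens=256)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

if __name__ == "__main__":
    main()
```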
@LeiWang1999 Replace the […]
@ChenMnZ What GPU(s) are you running the experiments on?
@w32zhong NVIDIA A100 80GB
Hi,
I tested BitBLAS models with the https://github.com/ModelCloud/GPTQModel repo.
I found that the output is correct. However, BitBLAS achieves token generation speeds for the low-bit (2-bit and 4-bit) models that are similar to the FP16 model. Detailed results are as follows:
The corresponding test code is:
Do you know what potential problem could be hindering the speedup? Thank you.
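Since the benchmark code was stripped from the post, here is a hypothetical timing sketch (not the author's test code) of how tokens-per-second comparisons between FP16 and low-bit models are commonly measured; the warm-up call and greedy decoding are choices made here, not from the thread:

```python
import time
import torch

@torch.inference_mode()
def tokens_per_second(model, input_ids, max_new_tokens=256):
    # Warm up once so kernel compilation/caching doesn't skew the timing.
    model.generate(inputs=input_ids, do_sample=False, max_new_tokens=8)
    torch.cuda.synchronize()
    start = time.time()
    output = model.generate(inputs=input_ids, do_sample=False, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    new_tokens = output.shape[-1] - input_ids.shape[-1]
    return new_tokens / (time.time() - start)
```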