Speedup problem with GPTQModel #90

Open
ChenMnZ opened this issue Jul 19, 2024 · 10 comments

Comments

@ChenMnZ

ChenMnZ commented Jul 19, 2024

Hi

I tested BitBLAS models with the https://github.com/ModelCloud/GPTQModel repo.

I found that the output is correct. However, the BitBLAS low-bit (2-bit and 4-bit) models reach roughly the same token generation speed as the FP16 model. Detailed results are as follows:
[screenshot: table of token generation speeds for the FP16 and BitBLAS 2-bit/4-bit models]

The corresponding test code is:

import torch
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig, get_backend

import time

def main():
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default=None, type=str, help="path to the saved quantized model")
    parser.add_argument("--wbits", type=int, default=4, help="quantization bits")
    parser.add_argument("--group_size", type=int, default=128, help="quantization group size")
    parser.add_argument("--test_speed", action="store_true")
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=False, legacy=False)
    model = GPTQModel.from_quantized(args.model, device_map='auto', torch_dtype=torch.float16, backend=get_backend('BITBLAS'))
    model.cuda()
    print(f"memory footprint after loading quantized model: {torch.cuda.max_memory_allocated('cuda') / 1024**3:.2f}GiB")

    if args.test_speed:
        prompt = "Write a poem about large language model:"
        input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
        start_time = time.time()
        output = model.generate(inputs=input_ids, do_sample=True, top_k=10, max_new_tokens=256)
        end_time = time.time()
        # note: len(output[0]) counts the prompt tokens as well as the generated tokens
        speed = len(output[0]) / (end_time - start_time)
        print(tokenizer.decode(output[0]))
        print(f"generation speed: {speed} token/s")


if __name__ == '__main__':
    main()

Do you know what the potential problem is that prevents the speedup? Thank you.
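
As a side note, the timing above also counts the prompt tokens and any one-time kernel compilation in the measured interval. A rough sketch of a more careful measurement (assuming the same model and tokenizer objects as in the script above; benchmark_generation is just an illustrative helper, not part of the repo) is:

import time
import torch

def benchmark_generation(model, tokenizer, prompt, max_new_tokens=256):
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
    # warm-up pass so one-time kernel compilation / tuning is not timed
    model.generate(inputs=input_ids, max_new_tokens=8)
    torch.cuda.synchronize()
    start = time.time()
    output = model.generate(inputs=input_ids, do_sample=True, top_k=10, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    new_tokens = output.shape[1] - input_ids.shape[1]  # count only newly generated tokens
    return new_tokens / elapsed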

@LeiWang1999
Contributor

Hi @ChenMnZ, would you mind providing the reproduction script for the triton v2 backend? :)

@ChenMnZ
Author

ChenMnZ commented Jul 19, 2024

@LeiWang1999
Thanks for the quick reply. To test triton v2, just change the model loading from

GPTQModel.from_quantized(args.model, device_map='auto',torch_dtype=torch.float16,backend=get_backend('BITBLAS'))

to

GPTQModel.from_quantized(args.model, device_map='auto',torch_dtype=torch.float16)

args.model should be the path of a standard GPTQ-packed model. The code will then automatically choose the triton v2 kernel for 2-bit quantization.
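
For reference, a minimal side-by-side sketch of the two loading paths, using the same GPTQModel calls as in this thread (the model path below is a placeholder):

import torch
from gptqmodel import GPTQModel, get_backend

model_path = "/path/to/gptq-packed-model"  # placeholder: a standard GPTQ-packed checkpoint

# BitBLAS backend
model_bitblas = GPTQModel.from_quantized(model_path, device_map='auto', torch_dtype=torch.float16, backend=get_backend('BITBLAS'))

# default backend selection (picks the triton v2 kernel for 2-bit here)
model_triton = GPTQModel.from_quantized(model_path, device_map='auto', torch_dtype=torch.float16)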

@LeiWang1999
Contributor

@ChenMnZ Thanks, that's interesting, I'll take a look.

@ChenMnZ
Author

ChenMnZ commented Aug 31, 2024

Hi @LeiWang1999,
Have you found a solution to this inference speed problem?

@LeiWang1999
Contributor

Hi @ChenMnZ, can you provide Hugging Face model repos for us to reproduce this?

@ChenMnZ
Author

ChenMnZ commented Aug 31, 2024

@LeiWang1999
Contributor

Hi @ChenMnZ, have you encountered this error when running python gptq.py --model ChenMnZ/Llama-3-8b-instruct-EfficientQAT-w4g128-BitBLAS --test_speed?

Traceback (most recent call last):
  File "/root/BitBLAS/debug/gptq.py", line 35, in <module>
    main()
  File "/root/BitBLAS/debug/gptq.py", line 27, in main
    output = model.generate(inputs=input_ids, do_sample=True, top_k=10, max_new_tokens=256)
  File "/opt/conda/lib/python3.10/site-packages/gptqmodel-0.9.3.dev0+cu1201010-py3.10-linux-x86_64.egg/gptqmodel/models/base.py", line 466, in generate
    return self.model.generate(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 3020, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

@ChenMnZ
Author

ChenMnZ commented Aug 31, 2024

@LeiWang1999
Sorry for the confusion.

Replacing model.cuda() with model.model.cuda() in the code solves this problem.
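
For clarity, a sketch of the relevant lines of the script above with this fix applied:

model = GPTQModel.from_quantized(args.model, device_map='auto', torch_dtype=torch.float16, backend=get_backend('BITBLAS'))
# move the wrapped HF model, rather than calling .cuda() on the GPTQModel wrapper
model.model.cuda()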

@w32zhong

@ChenMnZ what GPU(s) are you running for the experiments?

@ChenMnZ
Copy link
Author

ChenMnZ commented Sep 12, 2024

@w32zhong NVIDIA A100 80GB
