Releases: AutoGPTQ/AutoGPTQ

v0.7.1: patch release

01 Mar 13:14

Support loading sharded quantized checkpoints

Sharded checkpoints can now be loaded via the from_quantized method.

  • Support loading sharded quantized checkpoints. by @LaaZa in #425
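
A minimal sketch of loading a sharded checkpoint, assuming a placeholder repository id that hosts a GPTQ model split into several .safetensors shards:

from auto_gptq import AutoGPTQForCausalLM

# "org/some-gptq-model-sharded" is a placeholder: any GPTQ checkpoint split
# into several .safetensors shards (with an index file) loads the same way.
model = AutoGPTQForCausalLM.from_quantized(
    "org/some-gptq-model-sharded",
    device="cuda:0",
)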

Gemma GPTQ quantization

The Gemma model can now be quantized with AutoGPTQ.
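
A minimal quantization sketch, assuming a Gemma checkpoint id and a toy calibration example (both are placeholders, not part of this release):

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "google/gemma-2b"  # placeholder checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_id)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibration data: in practice, use a few hundred representative samples.
examples = [tokenizer("AutoGPTQ is an easy-to-use GPTQ quantization library.", return_tensors="pt")]

model.quantize(examples)
model.save_quantized("gemma-2b-gptq-4bit", use_safetensors=True)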

Other changes and fixes

Full Changelog: v0.7.0...v0.7.1

v0.7.0: Marlin int4*fp16 kernel, AWQ checkpoints loading

16 Feb 13:10

Marlin efficient int4*fp16 kernel on Ampere GPUs, AWQ checkpoints loading

@efrantar, GPTQ author, released Marlin, an optimized int4*fp16 matrix multiplication CUDA kernel for Ampere GPUs with per-group symmetric quantization support (without act-order), which significantly outperforms other existing kernels when batching is used.

This kernel can be used in AutoGPTQ when loading models with the use_marlin=True argument. Using this flag repacks the quantized weights, as the Marlin kernel expects a different layout. The repacked weights are then saved locally to avoid repacking on later loads. Example:

import torch

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-GPTQ", torch_dtype=torch.float16, use_marlin=True, device="cuda:0")

prompt = "Is quantization a good compression technique?"

inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")

res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))

# Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████| 566/566 [00:29<00:00, 19.17it/s]
# 
# <s> Is quantization a good compression technique?
# 
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in audio and image compression, as well as in scientific and engineering applications.

A complete benchmark can be found at: https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark

Visual tables coming soon.

Ability to load AWQ checkpoints in AutoGPTQ

Note: the AWQ checkpoint repacking step is currently slow; a faster implementation is possible.

AWQ's original implementation adopted a serialization format different from the one expected by the current GPTQ kernels (triton, cuda_old, exllama, exllamav2), but the computation happens to be the same. We allow loading AWQ checkpoints in AutoGPTQ to leverage the exllama/exllamav2 kernels, which may be more performant for some problem sizes (see the PR below, notably at sequence_length = 1 and for long sequences).

Example:

import torch

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-AWQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-AWQ", torch_dtype=torch.float16, device="cuda:0")

prompt = "Is quantization a good compression technique?"

inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")

res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))

# Repacking model.layers.9.self_attn.v_proj...: 100%|████████████████████████████████████████████████████████████████████████| 280/280 [05:29<00:00,  1.18s/it]
# 
# <s> Is quantization a good compression technique?
# 
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in digital signal processing and image compression.

Qwen2, LongLLaMA, Deci_lm model support

These models can be quantized with AutoGPTQ.

Other changes and bugfixes

New Contributors

Full Changelog: v0.6.0...v0.7.0

v0.6.0: Mixtral, StableLM, DeciLM, Yi support, Transformers 4.36 compatibility

15 Dec 06:50

What's Changed

New Contributors

Full Changelog: v0.5.1...v0.6.0

v0.5.1: Patch release

09 Nov 14:55

Mainly fixes Windows support.

What's Changed

  • Update README and version following 0.5.0 release by @fxmarty in #397
  • Fix windows support by @fxmarty in #407
  • Fix quantize method with None mask by @fxmarty in #408
  • Improve message about buffer size in exllama v1 backend by @fxmarty in #410
  • Fix windows (no triton) and cpu-only support by @fxmarty in #411
  • Fix workflows to use pip instead of conda by @fxmarty in #419

Full Changelog: v0.5.0...v0.5.1

v0.5.0: Exllama v2 GPTQ kernels, RoCm 5.6/5.7 support, many bugfixes

02 Nov 22:16

Exllama v2 GPTQ kernel support

The more performant GPTQ kernels from @turboderp's exllamav2 library are now available directly in AutoGPTQ, and are the default backend choice.

A comprehensive benchmark is available here.
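
No extra argument is needed to pick them up: from_quantized selects the exllamav2 kernels by default when the checkpoint is compatible. A minimal sketch, assuming the disable_exllamav2 flag to fall back to the exllama v1 kernels:

from auto_gptq import AutoGPTQForCausalLM

# Default: the exllamav2 kernels are used automatically when compatible.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-13B-chat-GPTQ",
    device="cuda:0",
)

# Assumed flag: opt out of the v2 kernels and fall back to exllama v1.
model_v1 = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-13B-chat-GPTQ",
    device="cuda:0",
    disable_exllamav2=True,
)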

CPU inference support

This is experimental.

Loading from safetensors is now the default

  • Allow using a model with basename model, use_safetensors defaults to True by @fxmarty in #383

Falcon, Mistral support

  • Add support for Falcon as part of Transformers 4.33.0, including new Falcon 180B by @TheBloke in #326
  • Add support for Mistral models. by @LaaZa in #362

Other changes and bugfixes

New Contributors

Full Changelog: v0.4.2...v0.5.0

v0.4.2: Patch release

24 Aug 19:05

Major bugfix: exllama backend with arbitrary input length

This patch release includes a major bugfix that lets the exllama backend work with input lengths > 2048, through a reconfigurable buffer size:

from auto_gptq import exllama_set_max_input_length

...
model = exllama_set_max_input_length(model, 4096)

  • Expose a function to update exllama max input length by @fxmarty in #281

Exllama kernels support in Windows wheels

This patch tentatively includes the exllama kernels in the wheels for Windows.

  • Add PyPI build workflow, tentatively fix exllama on windows by @fxmarty in #282

What's Changed

Full Changelog: v0.4.1...v0.4.2

v0.4.1: Patch Fix

13 Aug 09:29

Overview

  • Fix a typo so that not only pytorch==2.0.0 but also pytorch>=2.0.0 can be used for llama fused attention.
  • Patch the exllama QuantLinear to avoid modifying the state dict, making the integration with transformers smoother.

Change Log

What's Changed

  • Patch exllama QuantLinear to avoid modifying the state dict by @fxmarty in #243

Full Changelog: v0.4.0...v0.4.1

v0.4.0

09 Aug 11:10

Overview

  • New platform: support for the ROCm platform (5.4.2 for now, extending to 5.5 and 5.6 as soon as pytorch officially releases 2.1.0).
  • New kernels: support for the exllama q4 kernels, bringing at least a 1.3x inference speedup.
  • New quantization strategy: static_groups=True can be specified at quantization time, which can further improve the quantized model's quality and close the PPL gap against the un-quantized model (see the sketch after this list).
  • New model: Qwen
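
A minimal sketch of the new static_groups option, passed through BaseQuantizeConfig at quantization time (the other values are illustrative, not defaults from this release):

from auto_gptq import BaseQuantizeConfig

# static_groups=True is the new option; per the release notes it can further
# improve the quantized model's quality and close the PPL gap against the
# un-quantized model.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    static_groups=True,
)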

Full Change Log

What's Changed

New Contributors

Full Changelog: v0.3.2...v0.4.0

v0.3.2: Patch Fix

26 Jul 11:25

Overview

  • Fix a CUDA kernel bug that prevented desc_act and group_size from being used together
  • Improve the user experience of manual installation
  • Improve the user experience of loading quantized models
  • Add perplexity_utils.py to gracefully calculate PPL, so that the result can be compared fairly with other libraries
  • Remove the save_dir argument from from_quantized; only the model_name_or_path argument is now supported in this method

Full Change Log

What's Changed

  • Fix cuda bug by @qwopqwop200 in #202
  • Fix revision and other huggingface_hub kwargs in .from_quantized() by @TheBloke in #205
  • Change the install script so it attempts to build the CUDA extension in all cases by @TheBloke in #206
  • Add a central version number by @TheBloke in #207
  • Add Safetensors metadata saving, with some values saved to each .safetensor file by @TheBloke in #208
  • [FEATURE] Implement perplexity metric to compare against llama.cpp by @casperbh96 in #166
  • Fix error raised when CUDA kernels are not installed by @PanQiWei in #209
  • Fix build on non-CUDA machines after #206 by @casperbh96 in #212

New Contributors

  • @casperbh96 made their first contribution in #166

Full Changelog: v0.3.0...v0.3.2

v0.3.0

16 Jul 08:11

Overview

  • CUDA kernels improvement: support models whose hidden_size is only divisible by 32/64 instead of 256.
  • PEFT integration: support training and inference using LoRA, AdaLoRA, AdaptionPrompt, etc.
  • New models: BaiChuan, InternLM.
  • Other updates: see 'Full Change Log' below for details.

Full Change Log

What's Changed

New Contributors

Full Changelog: v0.2.1...v0.3.0