Releases: AutoGPTQ/AutoGPTQ
v0.7.1: patch release
Support loading sharded quantized checkpoints
Sharded checkpoints can now be loaded in the from_quantized method.
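A minimal sketch of loading such a checkpoint (the repository name below is a placeholder for any GPTQ repository whose quantized weights are split across several safetensors shards):
import torch
from auto_gptq import AutoGPTQForCausalLM
# "org/sharded-gptq-model" is a hypothetical sharded GPTQ checkpoint.
model = AutoGPTQForCausalLM.from_quantized("org/sharded-gptq-model", torch_dtype=torch.float16, device="cuda:0")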
Gemma GPTQ quantization
Gemma models can be quantized with AutoGPTQ.
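A minimal quantization sketch (the model id, calibration text, and output directory are illustrative placeholders; a real calibration set should contain several hundred representative samples):
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
model_id = "google/gemma-2b"  # placeholder Gemma checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
examples = [tokenizer("AutoGPTQ is an easy-to-use GPTQ quantization library.")]  # toy calibration data
model.quantize(examples)
model.save_quantized("gemma-2b-gptq-4bit")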
Other changes and fixes
- Add back missing import by @fxmarty in #553
- Fix bias materialization for Marlin by @fxmarty in #554
- Fix shape check marlin by @fxmarty in #557
- Explicitely check compute capability in marlin's QLinear by @fxmarty in #567
- Compatibility with latest transformers by @fxmarty in #573
Full Changelog: v0.7.0...v0.7.1
v0.7.0: Marlin int4*fp16 kernel, AWQ checkpoints loading
Marlin efficient int4*fp16 kernel on Ampere GPUs, AWQ checkpoints loading
@efrantar, GPTQ author, released Marlin, an optimized CUDA kernel for Ampere GPUs for int4*fp16 matrix multiplication, with per-group symmetric quantization support (without act-order), which significantly outperforms other existing kernels when using batching.
This kernel can be used in AutoGPTQ by loading models with the use_marlin=True argument. Using this flag repacks the quantized weights, since the Marlin kernel expects a different layout; the repacked weights are then saved locally to avoid repacking on subsequent loads. Example:
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-GPTQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-GPTQ", torch_dtype=torch.float16, use_marlin=True, device="cuda:0")
prompt = "Is quantization a good compression technique?"
inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")
res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))
# Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████| 566/566 [00:29<00:00, 19.17it/s]
#
# <s> Is quantization a good compression technique?
#
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in audio and image compression, as well as in scientific and engineering applications.
A complete benchmark can be found at: https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark
Visual tables coming soon.
- add marlin kernel by @qwopqwop200 in #514
- updated marlin serialization by @rib-2 in #522
- Marlin repacking CUDA kernel by @fxmarty in #539
- Marlin kernel can be built against any compute capability by @fxmarty in #540
Ability to load AWQ checkpoints in AutoGPTQ
Note: The AWQ checkpoint repacking step is currently slow; a faster implementation is possible.
AWQ's original implementation adopted a serialization format different from the one expected by the current GPTQ kernels (triton, cuda_old, exllama, exllamav2), but the computation happens to be the same. AutoGPTQ can now load AWQ checkpoints in order to leverage the exllama/exllamav2 kernels, which may be more performant for some problem sizes (see the PR below, notably at sequence_length = 1 and for long sequences).
Example:
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-AWQ")
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-AWQ", torch_dtype=torch.float16, device="cuda:0")
prompt = "Is quantization a good compression technique?"
inp = tokenizer(prompt, return_tensors="pt").to("cuda:0")
res = model.generate(**inp, max_new_tokens=200)
print(tokenizer.decode(res[0]))
# Repacking model.layers.9.self_attn.v_proj...: 100%|████████████████████████████████████████████████████████████████████████| 280/280 [05:29<00:00, 1.18s/it]
#
# <s> Is quantization a good compression technique?
#
# Quantization is a lossy compression technique that reduces the precision of a signal or image by representing it with fewer bits. It is commonly used in digital signal processing and image compression.
Qwen2, LongLLaMA, Deci_lm models support
These models can be quantized with AutoGPTQ.
- Add qwen2 by @JustinLin610 in #519
- Change deci_lm model type to deci by @LaaZa in #491
- Support for LongLLaMA models. by @LaaZa in #442
Other changes and bugfixes
- Update version & install instructions by @fxmarty in #485
- fix the support of Qwen by @hzhwcmhf in #495
- rocm6.0 compatible exllama by @seungrokj in #515
- Untie weights for safetensors serialization by @fxmarty in #536
- marlin update version 0.1.1 and fix marlin bug by @qwopqwop200 in #524
- Use ruff for linting by @fxmarty in #537
- Fix wheels build for torch==2.2.0 by @fxmarty in #541
- Fix repo owners in workflows by @fxmarty in #542
- Disable peft compatibility by @fxmarty in #543
- Improve README by @fxmarty in #544
- Add ROCm dockerfile by @fxmarty in #545
- Make all tests pass by @fxmarty in #546
- Fix cuda wheel build workflows by @fxmarty in #547
- Use bash in workflows by @fxmarty in #548
- Dissociate Windows & Linux CUDA build by @fxmarty in #549
- Add more guards on compute capability in Marlin kernel by @fxmarty in #550
New Contributors
- @hzhwcmhf made their first contribution in #495
- @rib-2 made their first contribution in #522
- @seungrokj made their first contribution in #515
Full Changelog: v0.6.0...v0.7.0
v0.6.0: Mixtral, StableLM, DeciLM, Yi support, Transformers 4.36 compatibility
What's Changed
- Precise PyTorch version by @fxmarty in #421
- Fix triton unexpected keyword by @LaaZa in #423
- Add support for Yi models. by @LaaZa in #413
- Add support for Xverse models. by @LaaZa in #417
- Allow fp32 input to GPTQ linear by @fxmarty in #437
- Fix typos in tests by @fxmarty in #438
- Update _base.py - Remote (.bin) model load fix by @Shades-en in #465
- make build successful on Jetson device(L4T) by @mikeshi80 in #470
- Add option to disable qigen at build by @fxmarty in #471
- Stop trying to convert a list to int in setup.py when trying to retrieve cores_info by @wemoveon2 in #474
- Only make_quant on inside_layer_modules. by @LaaZa in #479
- Add support for DeciLM models. by @LaaZa in #481
- Support for StableLM Epoch models. by @LaaZa in #444
- Add support for Mixtral models. by @LaaZa in #480
- Fix compatibility with transformers 4.36 by @fxmarty in #483
New Contributors
- @Shades-en made their first contribution in #465
- @mikeshi80 made their first contribution in #470
- @wemoveon2 made their first contribution in #474
Full Changelog: v0.5.1...v0.6.0
v0.5.1: Patch release
Mainly fixes Windows support.
What's Changed
- Update README and version following 0.5.0 release by @fxmarty in #397
- Fix windows support by @fxmarty in #407
- Fix quantize method with None mask by @fxmarty in #408
- Improve message about buffer size in exllama v1 backend by @fxmarty in #410
- Fix windows (no triton) and cpu-only support by @fxmarty in #411
- Fix workflows to use pip instead of conda by @fxmarty in #419
Full Changelog: v0.5.0...v0.5.1
v0.5.0: Exllama v2 GPTQ kernels, RoCm 5.6/5.7 support, many bugfixes
Exllama v2 GPTQ kernel support
The more performant GPTQ kernels from @turboderp's exllamav2 library are now available directly in AutoGPTQ, and are the default backend choice.
A comprehensive benchmark is available here.
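A minimal sketch of opting out of the new default (assuming the disable_exllamav2 flag of from_quantized; the checkpoint is illustrative):
from auto_gptq import AutoGPTQForCausalLM
# exllama v2 is the default backend; disabling it falls back to the exllama v1 kernel.
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-13B-chat-GPTQ", device="cuda:0", disable_exllamav2=True)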
CPU inference support
This is experimental.
- Add AutoGPTQ's cpu kernel. by @qwopqwop200 in #245
Loading from safetensors is now the default
Falcon, Mistral support
- Add support for Falcon as part of Transformers 4.33.0, including new Falcon 180B by @TheBloke in #326
- Add support for Mistral models. by @LaaZa in #362
Other changes and bugfixes
- Fix setuptools classifier by @fxmarty in #285
- Update install instructions by @fxmarty in #286
- Install skip qigen(windows) by @qwopqwop200 in #309
- fix model type changed after calling .to() method by @PanQiWei in #310
- Update qwen.py for Qwen-VL by @JustinLin610 in #303
- fix typo in max_input_length by @SunMarc in #311
- Use adapter_name for get_gptq_peft_model with train_mode=True by @alex4321 in #347
- Ignore unknown parameters in quantize_config.json by @z80maniac in #335
- fix bug(breaking change) remove (zeors -= 1) by @qwopqwop200 in #325
- Revert "fix bug(breaking change) remove (zeors -= 1)" by @PanQiWei in #354
- import exllama QuantLinear instead of exllamav2's in pack_model by @PanQiWei in #355
- Modify qlinear_cuda for tracing the GPTQ model by @vivekkhandelwal1 in #367
- Fix QiGen kernel generation by @fxmarty in #379
- Improve RoCm support by @fxmarty in #382
- PEFT initialization fix by @alex4321 in #361
- Pin to accelerate>=0.22 by @fxmarty in #384
- Fix overflow in exllama with act-order by @fxmarty in #386
- Default to exllama kernel when exllama v2 is disabled by @fxmarty in #387
- Error out on exllama_set_max_input_length call without exllama backend by @fxmarty in #389
- Add fix for CPU Inference by @vivekkhandelwal1 in #385
- Fix dtype issues and add relevant tests by @fxmarty in #393
- Patch accelerate to use correct dtype by @fxmarty in #394
- Fixed missing cstdint include by @kodai2199 in #388
- Update RoCm workflow to build for RoCm 5.7 by @fxmarty in #395
- Fix Windows build by @fxmarty in #396
New Contributors
- @JustinLin610 made their first contribution in #303
- @SunMarc made their first contribution in #311
- @alex4321 made their first contribution in #347
- @vivekkhandelwal1 made their first contribution in #367
- @kodai2199 made their first contribution in #388
Full Changelog: v0.4.2...v0.5.0
v0.4.2: Patch release
Major bugfix: exllama backend with arbitrary input length
This patch release includes a major bugfix that allows the exllama backend to work with input lengths > 2048, through a reconfigurable buffer size:
from auto_gptq import exllama_set_max_input_length
...
model = exllama_set_max_input_length(model, 4096)
Exllama kernels support in Windows wheels
This patch tentatively includes the exllama kernels in the wheels for Windows.
What's Changed
- Build wheels on ubuntu 20.04 by @fxmarty in #272
- Free disk space for rocm build by @fxmarty in #273
- Use focal for RoCm build by @fxmarty in #274
- Use conda incubator for rocm build by @fxmarty in #276
- Update install instructions by @fxmarty in #275
- Use --extra-index-url to resolve dependencies by @fxmarty in #277
- Fix python version for rocm build by @fxmarty in #278
- Fix powershell in workflow by @fxmarty in #284
Full Changelog: v0.4.1...v0.4.2
v0.4.1: Patch Fix
Overview
- Fix typo so that not only pytorch==2.0.0 but also pytorch>=2.0.0 can be used for llama fused attention.
- Patch exllama QuantLinear to avoid modifying the state dict, making the integration with transformers smoother.
Change Log
What's Changed
Full Changelog: v0.4.0...v0.4.1
v0.4.0
Overview
- New platform: support the ROCm platform (5.4.2 for now; support will extend to 5.5 and 5.6 as soon as PyTorch officially releases 2.1.0).
- New kernels: support exllama q4 kernels, giving at least a 1.3x inference speedup.
- New quantization strategy: support specifying static_groups=True at quantization time, which can further improve the quantized model's performance and close the PPL gap against the un-quantized model (see the sketch after this list).
- New model: Qwen.
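A minimal sketch of enabling the new strategy (assuming static_groups is a BaseQuantizeConfig field, as added in this release; the model id is illustrative):
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
# static_groups computes per-group quantization parameters up front, which pairs with act-order (desc_act=True).
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True, static_groups=True)
model = AutoGPTQForCausalLM.from_pretrained("facebook/opt-125m", quantize_config)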
Full Change Log
What's Changed
- Add RoCm support by @fxmarty in #214
- Fix revision used to load the quantization config by @fxmarty in #220
- [General Quant Linear] Register quant params of general quant linear for friendly post process. by @LeiWang1999 in #226
- Add exllama q4 kernel by @fxmarty in #219
- Suppprt static groups and fix bug by @qwopqwop200 in #236
- support qwen by @qwopqwop200 in #240
New Contributors
- @fxmarty made their first contribution in #214
- @LeiWang1999 made their first contribution in #226
Full Changelog: v0.3.2...v0.4.0
v0.3.2: Patch Fix
Overview
- Fix a CUDA kernel bug that prevented desc_act and group_size from being used together
- Improve the user experience of manual installation
- Improve the user experience of loading quantized models
- Add perplexity_utils.py to gracefully calculate PPL, so that results can be compared fairly with other libraries (see the sketch after this list)
- Remove the save_dir argument from from_quantized; only the model_name_or_path argument is now supported in this method
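A minimal sketch of the intended usage (the Perplexity class name and calculate_perplexity method are assumptions based on perplexity_utils.py; check the module for the exact interface):
from auto_gptq import AutoGPTQForCausalLM
from auto_gptq.utils.perplexity_utils import Perplexity
from transformers import AutoTokenizer
model_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder quantized checkpoint
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)
ppl = Perplexity(model, tokenizer)  # assumed to default to a wikitext test split
print(ppl.calculate_perplexity())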
Full Change Log
What's Changed
- Fix cuda bug by @qwopqwop200 in #202
- Fix revision and other huggingface_hub kwargs in .from_quantized() by @TheBloke in #205
- Change the install script so it attempts to build the CUDA extension in all cases by @TheBloke in #206
- Add a central version number by @TheBloke in #207
- Add Safetensors metadata saving, with some values saved to each .safetensor file by @TheBloke in #208
- [FEATURE] Implement perplexity metric to compare against llama.cpp by @casperbh96 in #166
- Fix error raised when CUDA kernels are not installed by @PanQiWei in #209
- Fix build on non-CUDA machines after #206 by @casperbh96 in #212
New Contributors
- @casperbh96 made their first contribution in #166
Full Changelog: v0.3.0...v0.3.2
v0.3.0
Overview
- CUDA kernels improvement: support models whose hidden_size is only divisible by 32/64 rather than 256.
- Peft integration: support training and inference using LoRA, AdaLoRA, AdaptionPrompt, etc. (see the sketch after this list).
- New models: BaiChuan, InternLM.
- Other updates: see 'Full Change Log' below for details.
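A minimal LoRA sketch (assuming the get_gptq_peft_model and GPTQLoraConfig helpers from auto_gptq.utils.peft_utils; the checkpoint and hyperparameters are illustrative):
from auto_gptq import AutoGPTQForCausalLM
from auto_gptq.utils.peft_utils import GPTQLoraConfig, get_gptq_peft_model
model = AutoGPTQForCausalLM.from_quantized("TheBloke/Llama-2-7B-GPTQ", device="cuda:0")  # placeholder checkpoint
peft_config = GPTQLoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
# train_mode=True prepares the quantized model for LoRA fine-tuning rather than inference-only use.
model = get_gptq_peft_model(model, peft_config=peft_config, train_mode=True)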
Full Change Log
What's Changed
- Pytorch qlinear by @qwopqwop200 in #116
- Specify UTF-8 encoding for README.md in setup.py by @EliEron in #132
- Support cuda 64dim by @qwopqwop200 in #126
- Support 32dim by @qwopqwop200 in #125
- Peft integration by @PanQiWei in #102
- Support setting inject_fused_attention and inject_fused_mlp to False by @TheBloke in #134
- Add transpose operator when replace Conv1d with qlinear_cuda_old by @geekinglcq in #140
- Add support for BaiChuan model by @LaaZa in #164
- Fix error message by @AngainorDev in #141
- Add support for InternLM by @cczhong11 in #189
- Fix stale documentation by @MarisaKirisame in #158
New Contributors
- @EliEron made their first contribution in #132
- @geekinglcq made their first contribution in #140
- @AngainorDev made their first contribution in #141
- @cczhong11 made their first contribution in #189
- @MarisaKirisame made their first contribution in #158
Full Changelog: v0.2.1...v0.3.0