Support transformers-like API for WOQ quantization #1987

Merged 88 commits on Sep 13, 2024
Changes from 78 commits
Commits (88)
48b0d23
pipeline pass
Kaihui-intel Aug 22, 2024
a496483
update import path
Kaihui-intel Aug 26, 2024
23e7428
add examples
Kaihui-intel Aug 26, 2024
40df805
add ut
Kaihui-intel Aug 26, 2024
8cc0ba5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 26, 2024
4224610
add gpu example
Kaihui-intel Aug 26, 2024
4e70c9f
Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne…
Kaihui-intel Aug 26, 2024
0cc3c5d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 26, 2024
005ab85
update utility
Kaihui-intel Aug 26, 2024
2429a8a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 26, 2024
2d5b152
update torch fp8 mapping
Kaihui-intel Aug 27, 2024
1b4bae3
Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne…
Kaihui-intel Aug 27, 2024
30959fd
update float8_e4m3fnuz
Kaihui-intel Aug 27, 2024
d53ebc8
use_ipex=False
Kaihui-intel Aug 27, 2024
1f04ee2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 27, 2024
913b953
update float8_e4m3fnuz
Kaihui-intel Aug 27, 2024
b81b9fc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 27, 2024
f7d7003
Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne…
Kaihui-intel Aug 27, 2024
af3698e
reset fp8 mapping
Kaihui-intel Aug 27, 2024
ae2d42a
add evaluation
Kaihui-intel Aug 28, 2024
036d438
update evaluation
Kaihui-intel Aug 28, 2024
7c6b646
remove use_neural_speed
Kaihui-intel Aug 28, 2024
94c177f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 28, 2024
50aa98c
update evaluation
Kaihui-intel Aug 28, 2024
fd63d0f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 28, 2024
ed862d6
Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne…
Kaihui-intel Aug 28, 2024
6ed6656
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 28, 2024
cbef232
remove use_neural_speed from eval
Kaihui-intel Aug 28, 2024
3186f25
update copyright
Kaihui-intel Sep 3, 2024
6b1412e
follow master
Kaihui-intel Sep 5, 2024
6f6da99
rebase master
Kaihui-intel Sep 5, 2024
5c592f9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 5, 2024
3cdff01
update models
Kaihui-intel Sep 5, 2024
ac13d1a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 5, 2024
4dee955
update models
Kaihui-intel Sep 5, 2024
a9d42c2
Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne…
Kaihui-intel Sep 5, 2024
53658bf
remove inc sq
Kaihui-intel Sep 5, 2024
22ec60e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 5, 2024
5b9b1cc
remove qat/static/dynamic
Kaihui-intel Sep 5, 2024
7c5e6cc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 5, 2024
fcc315e
update ut
Kaihui-intel Sep 6, 2024
37659c2
support layer wise
Kaihui-intel Sep 6, 2024
5933e98
update model_path
Kaihui-intel Sep 6, 2024
5aeedb5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 6, 2024
62b18a3
Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne…
Kaihui-intel Sep 6, 2024
6cbb8ab
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 6, 2024
d902909
rebase
Kaihui-intel Sep 6, 2024
32fff22
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 6, 2024
c89b0c0
fix xpu
Kaihui-intel Sep 12, 2024
4ef08f7
remove use_ipex
Kaihui-intel Sep 12, 2024
e81ce43
add quant_lm_head
Kaihui-intel Sep 12, 2024
7bdd4f9
remove unused code
Kaihui-intel Sep 12, 2024
2b16b7a
clean utility.py
Kaihui-intel Sep 12, 2024
984f1eb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 12, 2024
19d1e41
remove neural_speed from lm eval
Kaihui-intel Sep 12, 2024
80b51e8
rm import inc version
Kaihui-intel Sep 12, 2024
1d5a735
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 12, 2024
6850d6e
fix import
Kaihui-intel Sep 12, 2024
4c9ef3c
fix weiight dtype
Kaihui-intel Sep 12, 2024
fc0e930
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 12, 2024
ca170d0
Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne…
Kaihui-intel Sep 12, 2024
f6f9a1c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 12, 2024
aa6798e
Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne…
Kaihui-intel Sep 12, 2024
8fbdfaa
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 12, 2024
3766706
update import set_module
Kaihui-intel Sep 12, 2024
240d3c6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 12, 2024
ee5b931
update requirements
Kaihui-intel Sep 12, 2024
6f16f8d
remove itrex
Kaihui-intel Sep 12, 2024
c254db1
Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne…
Kaihui-intel Sep 12, 2024
ab72f1c
update optimized xpu model list
Kaihui-intel Sep 12, 2024
21173dc
update README
Kaihui-intel Sep 12, 2024
943dca7
remove config & add save/load ut
Kaihui-intel Sep 12, 2024
38017f4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 12, 2024
eb3112c
improve basic ut coverage
Kaihui-intel Sep 13, 2024
fec937c
Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne…
Kaihui-intel Sep 13, 2024
a1d3eda
skip lm eval json pre-commit
Kaihui-intel Sep 13, 2024
b1467a6
fix redefine torch
Kaihui-intel Sep 13, 2024
9a79844
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 13, 2024
14732cc
clean code
Kaihui-intel Sep 13, 2024
10c8c2b
revert is_xpu_available
Kaihui-intel Sep 13, 2024
758a364
remove absorb_to_dict
Kaihui-intel Sep 13, 2024
89eb5af
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 13, 2024
32cccb6
Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne…
Kaihui-intel Sep 13, 2024
45627dd
add absorb & update xpu transformers==4.38.1
Kaihui-intel Sep 13, 2024
6ae0389
update transformers version to README
Kaihui-intel Sep 13, 2024
9fa6262
remove modelhub
Kaihui-intel Sep 13, 2024
8f2d381
Update neural_compressor/transformers/models/modeling_auto.py
changwangss Sep 13, 2024
610c558
bits int = 4
Kaihui-intel Sep 13, 2024
2 changes: 2 additions & 0 deletions .azure-pipelines/ut-basic.yml
@@ -19,6 +19,8 @@ pr:
- neural_compressor/torch
- neural_compressor/tensorflow
- neural_compressor/onnxrt
- neural_compressor/transformers
- neural_compressor/evaluation
- .azure-pipelines/scripts/ut/3x

pool: ICX-16C
3 changes: 2 additions & 1 deletion .pre-commit-config.yaml
@@ -129,7 +129,8 @@ repos:
examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/ptq_static/prompt.json|
examples/notebook/dynas/ResNet50_Quantiation_Search_Supernet_NAS.ipynb|
examples/notebook/dynas/Transformer_LT_Supernet_NAS.ipynb|
neural_compressor/torch/algorithms/fp8_quant/internal/diffusion_evaluation/SR_evaluation/imagenet1000_clsidx_to_labels.txt
neural_compressor/torch/algorithms/fp8_quant/internal/diffusion_evaluation/SR_evaluation/imagenet1000_clsidx_to_labels.txt|
neural_compressor/evaluation/hf_eval/datasets/cnn_validation.json
)$

- repo: https://github.com/astral-sh/ruff-pre-commit
@@ -0,0 +1,168 @@
# Step-by-Step
We provide a Transformers-like API for model compression using `WeightOnlyQuant` with the `Rtn/Awq/Teq/GPTQ/AutoRound` algorithms; in addition, Intel Extension for PyTorch (IPEX) can be used to accelerate the model.
We provide the inference benchmarking script `run_generation.py` for large language models; the default search algorithm is beam search with `num_beams = 4`. [Here](./llm_quantization_recipes.md) are some validated models with well-optimized accuracy and performance; more models are in progress.
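
For orientation, the shell flows below wrap a Python API. The following is a minimal sketch of a quantize/save/load round trip with that API, assuming `neural_compressor.transformers` exposes a Transformers-like `AutoModelForCausalLM` and per-algorithm configs such as `RtnConfig` (names inferred from this PR's file layout, e.g. `neural_compressor/transformers/models/modeling_auto.py`):

```python
# A minimal sketch, assuming neural_compressor.transformers exposes a
# Transformers-like AutoModelForCausalLM plus per-algorithm configs (e.g. RtnConfig).
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Quantize with RTN (4-bit weight-only) at load time.
woq_config = RtnConfig(bits=4)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)

# Save the quantized model, then reload it later without re-quantizing.
model.save_pretrained("./saved_results")
model = AutoModelForCausalLM.from_pretrained("./saved_results")

# Generate as with any Transformers model (beam search, num_beams = 4).
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```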

# Quantization for CPU device

## Prerequisite
### Create Environment
Python 3.9 or higher is required due to a limitation of the [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master). The dependent packages are listed in the requirements file; we recommend creating the environment as follows.

```bash
pip install -r requirements_cpu_woq.txt
```


### Run
#### Performance
```shell
# fp32
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--batch_size 1 \
--benchmark

# quantize the model and run the benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--woq \
--woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "Awq", "Teq", "GPTQ", "AutoRound" are also provided.
--output_dir <WOQ_MODEL_SAVE_PATH> \ # Default is "./saved_results"
--batch_size 1 \
--benchmark

# load WOQ quantized model and do benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
--model <WOQ_MODEL_SAVE_PATH> \
--benchmark

# load WOQ model from Huggingface and do benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--benchmark

```
#### Accuracy
The accuracy validation is based on [lm_evaluation_harness](https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.3/lm_eval/__main__.py); a programmatic sketch follows the commands below.
```shell
# fp32
python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space.
--device cpu \
--batch_size 56

# quant and do accuracy.
python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--woq \
--woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "Awq", "Teq", "GPTQ", "AutoRound" are also provided.
--output_dir <WOQ_MODEL_SAVE_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space.
--batch_size 56

# load a WOQ model quantized by ITREX and evaluate accuracy.
python run_generate_cpu_woq.py \
--model <WOQ_MODEL_SAVE_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space.
--batch_size 56

# load a WOQ model quantized by ITREX and evaluate accuracy with Neural Speed.
# only models quantized with the "Awq", "GPTQ", or "AutoRound" algorithms are supported.
python run_generate_cpu_woq.py \
--model <WOQ_MODEL_SAVE_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space.
--device cpu \
--batch_size 56


# load a WOQ model from Hugging Face and evaluate accuracy.
python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space.
--device cpu \
--batch_size 56

# load a WOQ model from Hugging Face and evaluate accuracy with Neural Speed.
python run_generate_cpu_woq.py \
--model <MODEL_NAME_OR_PATH> \
--accuracy \
--tasks lambada_openai,piqa,hellaswag \ # notice: no space.
--device cpu \
--batch_size 56

```
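
Accuracy can also be measured programmatically through the evaluation package this PR adds (`neural_compressor/evaluation`). The sketch below is an assumption based on that layout; the `LMEvalParser`/`evaluate` names and their arguments are inferred, not confirmed by this diff:

```python
# A minimal sketch, assuming neural_compressor.evaluation wraps
# lm-evaluation-harness behind an LMEvalParser/evaluate interface
# (names inferred from this PR's file layout, not confirmed here).
from transformers import AutoTokenizer
from neural_compressor.evaluation.lm_eval import LMEvalParser, evaluate
from neural_compressor.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./saved_results")  # WOQ model saved earlier
tokenizer = AutoTokenizer.from_pretrained("./saved_results")

eval_args = LMEvalParser(
    model="hf",                             # Hugging Face model backend
    user_model=model,                       # pass the in-memory quantized model
    tokenizer=tokenizer,
    tasks="lambada_openai,piqa,hellaswag",  # comma-separated; notice: no space
    device="cpu",
    batch_size=56,
)
results = evaluate(eval_args)
```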

# Quantization for GPU device
>**Note**:
> 1. The default search algorithm is beam search with num_beams = 1.
> 2. [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/v2.1.10%2Bxpu/docs/tutorials/llm/llm_optimize_transformers.md) supports optimized inference for the model types "gptj", "mistral", "qwen", and "llama" to achieve high performance and accuracy; accurate inference is still ensured for other model types.
> 3. We provide the compression technology `WeightOnlyQuant` with the `Rtn/GPTQ/AutoRound` algorithms, and `load_in_4bit` and `load_in_8bit` also work on the Intel GPU device (see the sketch after these notes).
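
As a rough illustration of note 3, the following assumes the same Transformers-like API accepts `load_in_4bit` and places the model on the `xpu` device (both inferred, not confirmed by this diff):

```python
# A minimal sketch, assuming load_in_4bit and an "xpu" device_map
# are supported by the Transformers-like API on Intel GPU.
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit weight-only quantization, placed on the Intel GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="xpu",
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to("xpu")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```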

## Prerequisite
### Dependencies
Intel-extension-for-pytorch dependencies are included in the oneAPI package, so oneAPI must be installed before intel-extension-for-pytorch. Please refer to the [Installation Guide](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.10%2Bxpu) to install oneAPI into the "/opt/intel" folder.

### Create Environment
PyTorch and Intel-extension-for-pytorch versions greater than 2.1 are required for Intel GPU, and Python 3.9 or higher is required due to a limitation of the [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master). The dependent packages are listed in requirements_GPU.txt; we recommend creating the environment as follows. For now, Intel-extension-for-pytorch must be installed from source; weight-only quantization will be added in its next release.

>**Note**: please install transformers==4.40.2.

```bash
pip install -r requirements_GPU.txt
pip install transformers==4.40.2
source /opt/intel/oneapi/setvars.sh
git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu
cd ipex-gpu
git submodule update --init --recursive
export USE_AOT_DEVLIST='pvc,ats-m150'
export BUILD_WITH_CPU=OFF

python setup.py install
```

## Run
The following commands show how to use it.

### 1. Performance
```bash
# fp16
python run_generation_gpu_woq.py \
--model EleutherAI/gpt-j-6b \
--benchmark

# weightonlyquant
python run_generation_gpu_woq.py \
--model EleutherAI/gpt-j-6b \
--woq \
--woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "GPTQ", "AutoRound" are also provided.
--benchmark
```
> Note: If device memory is insufficient, quantize and save the model first, then rerun the example loading the saved model as shown below. If device memory is sufficient, skip the two-step flow below and run quantization and inference in a single command.
```bash
# First step: Quantize and save model
python run_generation_gpu_woq.py \
--model EleutherAI/gpt-j-6b \
--woq \ # the default quantization algorithm is Rtn
--woq_algo <ALGORITHM_NAME> \ # Default is "Rtn"; "GPTQ", "AutoRound" are also provided.
--output_dir "saved_dir"

# Second step: Load model and inference
python run_generation_gpu_woq.py \
--model "saved_dir" \
--benchmark
```

### 2. Accuracy
```bash
# evaluate a model quantized by following the steps above
python run_generation_gpu_woq.py \
--model "saved_dir" \
--accuracy \
--tasks "lambada_openai"
```