forked from vllm-project/vllm
Support loading checkpoints quantized using Autofp8 #286
Merged (+64 −23)

Conversation
@Yantom1 I see you merged it to the extensions repo, so this PR can be closed, yes?
No, it is part of this change. In the extension repo I changed only the ops.py file.
Resolved review threads (outdated):
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py (3 threads)
...model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py (1 thread)
Co-authored-by: Konrad Zawora <[email protected]>
michalkuligowski approved these changes on Sep 25, 2024
kzawora-intel added a commit referencing this pull request on Oct 4, 2024: "This reverts commit 29fb5ed."
Support loading FP8 checkpoints from https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127
Skip CUDA checks
Use scaled_fp8_quant instead of _scaled_mm
Fix weights and weight_scale for the Gaudi2 float8_e4m3fn range (see the sketch after this list)
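
For context, here is a minimal sketch of the range fix, assuming PyTorch's OCP float8_e4m3fn format (max 448.0) and a reduced E4M3 range on Gaudi2 (max 240.0, an assumption about the hardware limit); the helper name `rescale_for_gaudi2` and the constants are illustrative, not this PR's actual code:

```python
import torch

# OCP float8_e4m3fn covers roughly +/-448, but Gaudi2's E4M3 hardware
# covers roughly +/-240 (assumed limit), so AutoFP8 checkpoints
# quantized against the full OCP range must be rescaled before use.
OCP_FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0
GAUDI2_FP8_MAX = 240.0                              # assumed HW limit


def rescale_for_gaudi2(weight: torch.Tensor,
                       weight_scale: torch.Tensor):
    """Hypothetical helper: shrink fp8 weights into the Gaudi2 range
    and fold the inverse correction into weight_scale, so that
    weight * weight_scale dequantizes to the same values."""
    ratio = GAUDI2_FP8_MAX / OCP_FP8_MAX  # ~0.536
    shrunk = (weight.to(torch.float32) * ratio).to(torch.float8_e4m3fn)
    grown_scale = weight_scale.to(torch.float32) / ratio
    return shrunk, grown_scale


# Example with a per-tensor scale such as AutoFP8 checkpoints store.
w = torch.randn(16, 16).to(torch.float8_e4m3fn)
s = torch.tensor(0.02)
w_hpu, s_hpu = rescale_for_gaudi2(w, s)
```

For the matmul itself, the description swaps torch._scaled_mm (a CUDA-oriented PyTorch op) for vLLM's scaled_fp8_quant op, which, per the comment above, the extension repo can override through its ops.py.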