From 46d9192659f1c0dcf488e2e69f0f7dd7bd0b2f2e Mon Sep 17 00:00:00 2001
From: xinhe
Date: Thu, 15 Aug 2024 09:57:22 +0800
Subject: [PATCH] update readme for fp8 (#1979)

Signed-off-by: xinhe3
---
 README.md                           |  4 ++--
 docs/{ => source}/3x/PT_FP8Quant.md |  2 +-
 docs/source/3x/PyTorch.md           | 13 +++++++++----
 3 files changed, 12 insertions(+), 7 deletions(-)
 rename docs/{ => source}/3x/PT_FP8Quant.md (97%)

diff --git a/README.md b/README.md
index f4694e991e9..fa82961dd75 100644
--- a/README.md
+++ b/README.md
@@ -71,7 +71,7 @@ pip install "neural-compressor>=2.3" "transformers>=4.34.0" torch torchvision
 ```
 After successfully installing these packages, try your first quantization program.

-### [FP8 Quantization](./examples/3.x_api/pytorch/cv/fp8_quant/)
+### [FP8 Quantization](./docs/source/3x/PT_FP8Quant.md)
 Following example code demonstrates FP8 Quantization, it is supported by Intel Gaudi2 AI Accelerator.
 To try on Intel Gaudi2, docker image with Gaudi Software Stack is recommended, please refer to following script for environment setup. More details can be found in [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).
@@ -147,7 +147,7 @@ Intel Neural Compressor will convert the model format from auto-gptq to hpu form
 Weight-Only Quantization
-FP8 Quantization
+FP8 Quantization
 MX Quantization
 Mixed Precision
diff --git a/docs/3x/PT_FP8Quant.md b/docs/source/3x/PT_FP8Quant.md
similarity index 97%
rename from docs/3x/PT_FP8Quant.md
rename to docs/source/3x/PT_FP8Quant.md
index a0ed3352e8e..06fd37b367f 100644
--- a/docs/3x/PT_FP8Quant.md
+++ b/docs/source/3x/PT_FP8Quant.md
@@ -108,6 +108,6 @@ model = convert(model)
 | Task | Example |
 |----------------------|---------|
 | Computer Vision (CV) | [Link](../../examples/3.x_api/pytorch/cv/fp8_quant/) |
-| Large Language Model (LLM) | [Link](https://github.com/HabanaAI/optimum-habana-fork/tree/habana-main/examples/text-generation#running-with-fp8) |
+| Large Language Model (LLM) | [Link](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8) |

 > Note: For LLM, Optimum-habana provides higher performance based on modified modeling files, so here the Link of LLM goes to Optimum-habana, which utilize Intel Neural Compressor for FP8 quantization internally.
diff --git a/docs/source/3x/PyTorch.md b/docs/source/3x/PyTorch.md
index a3004f6bcfb..2c2111d4d69 100644
--- a/docs/source/3x/PyTorch.md
+++ b/docs/source/3x/PyTorch.md
@@ -176,16 +176,21 @@ def load(output_dir="./saved_results", model=None):
 link
-Static Quantization
-Post-traning Static Quantization
-intel-extension-for-pytorch
+Static Quantization
+Post-traning Static Quantization
+intel-extension-for-pytorch (INT8)
 ✔
 link
-TorchDynamo
+TorchDynamo (INT8)
 link
+
+Intel Gaudi AI accelerator (FP8)
+✔
+link
+
 Dynamic Quantization
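
Reviewer note: the PT_FP8Quant.md hunk above is anchored on the context line `model = convert(model)`, the last step of the FP8 workflow that the relocated doc describes. For context, a minimal sketch of that prepare/calibrate/convert flow follows. It assumes Intel Neural Compressor 3.x on an Intel Gaudi2 host with the Habana software stack installed; the `FP8Config(fp8_config="E4M3")` setting, the toy model, and the calibration stand-in are illustrative and not part of this patch.

```python
# Minimal sketch of the FP8 prepare/convert flow (not part of this patch).
# Assumes Intel Neural Compressor 3.x on an Intel Gaudi2 host with the Habana
# software stack; the config values and calibration data are illustrative.
import torch
from neural_compressor.torch.quantization import FP8Config, convert, prepare

# Any torch.nn.Module works here; a tiny MLP keeps the sketch self-contained.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
)

# E4M3 is a commonly used FP8 format on Gaudi2 (assumption, adjust as needed).
config = FP8Config(fp8_config="E4M3")

# Insert measurement hooks, calibrate on representative inputs, then convert.
model = prepare(model, config)
with torch.no_grad():
    model(torch.randn(8, 16))  # stand-in for a real calibration loop
model = convert(model)
```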