update (#761)
* update

* fixes
mikekgfb authored May 13, 2024
1 parent acf8773 commit 262d5de
Showing 1 changed file with 10 additions and 10 deletions.
20 changes: 10 additions & 10 deletions docs/quantization.md
@@ -13,12 +13,12 @@ While quantization can potentially degrade the model's performance, the methods

## Supported Quantization Schemes
### Weight Quantization
-| compression | FP Precision | bitwidth | group size | dynamic activation quantization | Eager | AOTI | ExecuTorch |
+| compression | bitwidth | group size | dynamic activation quantization | Eager | AOTI | ExecuTorch |
 |--|--|--|--|--|--|--|--|
-| linear (asymmetric) | fp32, fp16, bf16 | [8, 4]* | [32, 64, 128, 256]** | ||| 🚧 |
-| linear with GPTQ*** (asymmetric) | | |[32, 64, 128, 256]** | ||||
-| linear with HQQ*** (asymmetric) | | |[32, 64, 128, 256]** | ||||
-| linear with dynamic activations (symmetric) | fp32^ | | [32, 64, 128, 256]* | a8w4dq | 🚧 |🚧 ||
+| linear (asymmetric) | [8, 4]* | [32, 64, 128, 256]** | ||| 🚧 |
+| linear with GPTQ*** (asymmetric) | |[32, 64, 128, 256]** | ||||
+| linear with HQQ*** (asymmetric) | |[32, 64, 128, 256]** | ||||
+| linear with dynamic activations (symmetric) | | [32, 64, 128, 256]* | a8w4dq | 🚧 |🚧 ||
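The linear schemes in the table quantize weights group-wise: each group of `group size` consecutive weights gets its own scale and, for the asymmetric schemes, its own zero point. A minimal NumPy sketch of asymmetric group-wise quantization (illustrative only; this is not torchchat's actual kernel code):

```python
import numpy as np

def quantize_groupwise_asymmetric(w, bitwidth=4, group_size=32):
    """Asymmetric group-wise quantization: per-group scale and zero point."""
    qmin, qmax = 0, (1 << bitwidth) - 1
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / (qmax - qmin)
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant groups
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((2, 64)).astype(np.float32)
q, scale, zp = quantize_groupwise_asymmetric(w.reshape(-1), bitwidth=4, group_size=32)
w_hat = dequantize(q, scale, zp).reshape(w.shape)
```

Smaller group sizes give each group a tighter range (lower quantization error) at the cost of storing more scales and zero points.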

### Embedding Quantization

@@ -28,11 +28,9 @@ on-device use cases.

| compression | weight quantization (bitwidth)| weight quantization (group size) | dynamic activation quantization | Eager | AOTI | ExecuTorch |
|--|--|--|--|--|--|--|--|
-| embedding (symmetric) | [8, 4]* | [32, 64, 128, 256]** | ||||
+| embedding (symmetric) | [8, 4]* | [32, 64, 128, 256]+ | ||||


-^ a8w4dq quantization scheme requires model to be converted to fp32,
-due to lack of support for fp16 and bf16 in the kernels provided with
-ExecuTorch.

* These are the only valid bitwidth options.

@@ -55,6 +53,8 @@ on-device use cases.
(As a result, there's presently no upstream path for changes and/or
improvements to HQQ.)

++ Should support non-power-of-2 group sizes as well.

## Quantization Profiles

Torchchat quantization supports profiles with multiple settings such
@@ -135,7 +135,7 @@ python3 generate.py llama3 --dso-path llama3.so --prompt "Hello my name is"
```
### ExecuTorch
```
-python3 torchchat.py export llama3 --dtype fp32 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path llama3.pte
+python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path llama3.pte
python3 generate.py llama3 --pte-path llama3.pte --prompt "Hello my name is"
```
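The `--quantize` argument takes a JSON object keyed by scheme name, with per-scheme settings. A small Python sketch that assembles and sanity-checks such a config string before shelling out (the `make_quantize_arg` helper is hypothetical, not part of torchchat; the valid bitwidth and group-size sets are taken from the tables above):

```python
import json

# Option sets from the quantization tables (illustrative).
VALID_BITWIDTHS = {8, 4}
VALID_GROUP_SIZES = {32, 64, 128, 256}

def make_quantize_arg(embedding_bitwidth=4, embedding_groupsize=32,
                      a8w4dq_groupsize=256):
    """Build the JSON string passed to --quantize (hypothetical helper)."""
    assert embedding_bitwidth in VALID_BITWIDTHS
    assert embedding_groupsize in VALID_GROUP_SIZES
    assert a8w4dq_groupsize in VALID_GROUP_SIZES
    cfg = {
        "embedding": {"bitwidth": embedding_bitwidth,
                      "groupsize": embedding_groupsize},
        "linear:a8w4dq": {"groupsize": a8w4dq_groupsize},
    }
    return json.dumps(cfg)

arg = make_quantize_arg()
```

The resulting string is what appears single-quoted after `--quantize` in the export command above.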
