Additional instruction on using hqq quantization. (#535)
We now have basic support for hqq in torchchat. Adding some additional documentation on using it.

#337

Co-authored-by: Michael Gschwind <[email protected]>
zhxchen17 and mikekgfb authored May 12, 2024
1 parent baea3de commit 8d28624
Showing 1 changed file with 11 additions and 2 deletions.
13 changes: 11 additions & 2 deletions docs/quantization.md
@@ -26,9 +26,9 @@ Due to the larger vocabulary size of llama3, we also recommend
quantizing the embeddings to further reduce the model size for
on-device usecases.

| compression | FP Precision | weight quantization (bitwidth)| weight quantization (group size) | dynamic activation quantization | Eager | AOTI | ExecuTorch |
| compression | weight quantization (bitwidth)| weight quantization (group size) | dynamic activation quantization | Eager | AOTI | ExecuTorch |
|--|--|--|--|--|--|--|--|
| embedding (symmetric) | fp32, fp16, bf16 | [8, 4]* | [ any > 1 ] | ||||
| embedding (symmetric) | [8, 4]* | [32, 64, 128, 256]** | ||||

^ The a8w4dq quantization scheme requires the model to be converted to fp32,
due to lack of support for fp16 and bf16 in the kernels provided with
Expand All @@ -46,6 +46,14 @@ on-device usecases.
algorithms to address accuracy loss when using lower bit
quantization. Because HQQ relies on data/calibration-free
quantization, it tends to take less time to quantize a model.
HQQ is currently enabled with the axis=1 configuration.

Presently, torchchat includes a subset of the HQQ distribution in
the hqq subdirectory, but HQQ is not installed by default with torchchat
due to dependency incompatibilities between torchchat and the hqq
project. We may integrate hqq via requirements.txt in the future.
(As a result, there is currently no upstream path for changes and/or
improvements to HQQ.)
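
To make this concrete, here is a minimal, hedged sketch of what enabling HQQ through torchchat's quantization options might look like. The `linear:hqq` scheme key, its parameters, the model alias `llama3`, and the exact CLI flags are assumptions for illustration only; check the quantization documentation and help output of your torchchat checkout for the names it actually accepts.

```bash
# Illustrative sketch only: the "linear:hqq" scheme key, its parameters, and the
# model alias are assumptions, not a confirmed torchchat interface. Verify the
# supported scheme names before relying on this.
python3 torchchat.py generate llama3 \
  --quantize '{"linear:hqq": {"groupsize": 32}}' \
  --prompt "Once upon a time,"
```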

## Quantization Profiles

Expand All @@ -72,6 +80,7 @@ data types. The default data type for models is "fast16". The
floating point data type available on the selected device. ("Best"
tangibly representing a combination of speed and accuracy.)
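
As a small, hedged illustration of selecting one of these data types, the command below assumes torchchat exposes the choice through a `--dtype` flag that accepts the `fast16` alias mentioned above; the flag name and the model alias are assumptions.

```bash
# Hedged sketch: assumes a --dtype flag that accepts the "fast16" alias described
# above; confirm the flag name and accepted values against torchchat's help output.
python3 torchchat.py generate llama3 --dtype fast16 --prompt "Once upon a time,"
```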


## Quantization API

Quantization options are passed in json format either as a config file
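
A minimal, hedged sketch of the config-file form follows; the `embedding` key and its `bitwidth`/`groupsize` parameters mirror the embedding row of the table above, but the exact key names and the way the file is passed to `--quantize` are assumptions to be checked against the rest of this document.

```bash
# Hedged sketch: writes a JSON quantization config and passes it as a file.
# The "embedding" key and its parameters are taken from the table above; treat
# the precise spelling and CLI shape as assumptions.
cat > quant_config.json <<'EOF'
{"embedding": {"bitwidth": 4, "groupsize": 32}}
EOF
python3 torchchat.py generate llama3 --quantize quant_config.json --prompt "Once upon a time,"
```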
