Additional instruction on using hqq quantization. (#535)
We now have basic support for hqq in torchchat. Adding some additional documentation on using it.

#337

Co-authored-by: Michael Gschwind <[email protected]>
zhxchen17 and mikekgfb authored May 12, 2024
1 parent baea3de commit 8d28624
Showing 1 changed file with 11 additions and 2 deletions.
13 changes: 11 additions & 2 deletions docs/quantization.md
@@ -26,9 +26,9 @@ Due to the larger vocabulary size of llama3, we also recommend
quantizing the embeddings to further reduce the model size for
on-device usecases.

| compression | FP Precision | weight quantization (bitwidth)| weight quantization (group size) | dynamic activation quantization | Eager | AOTI | ExecuTorch |
| compression | weight quantization (bitwidth)| weight quantization (group size) | dynamic activation quantization | Eager | AOTI | ExecuTorch |
|--|--|--|--|--|--|--|--|
| embedding (symmetric) | fp32, fp16, bf16 | [8, 4]* | [ any > 1 ] | ||||
| embedding (symmetric) | [8, 4]* | [32, 64, 128, 256]** | ||||

^ The a8w4dq quantization scheme requires the model to be converted to fp32,
due to lack of support for fp16 and bf16 in the kernels provided with
Expand All @@ -46,6 +46,14 @@ on-device usecases.
algorithms to address accuracy loss when using lower bit
quantization. Because HQQ relies on data/calibration-free
quantization, it tends to take less time to quantize a model.
HQQ is currently enabled with the axis=1 configuration.

Presently, torchchat includes a subset of the HQQ distribution in
the hqq subdirectory, but HQQ is not installed by default with torchchat
due to dependency incompatibilities between torchchat and the hqq
project. We may integrate hqq via requirements.txt in the future.
(As a result, there is currently no upstream path for changes and/or
improvements to HQQ.)
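
To make this concrete, here is a minimal, hedged sketch of what enabling HQQ through torchchat's quantization options might look like. The `linear:hqq` scheme key, its parameters, the model alias `llama3`, and the exact CLI flags are assumptions for illustration only; check the quantization documentation and help output of your torchchat checkout for the names it actually accepts.

```bash
# Illustrative sketch only: the "linear:hqq" scheme key, its parameters, and the
# model alias are assumptions, not a confirmed torchchat interface. Verify the
# supported scheme names before relying on this.
python3 torchchat.py generate llama3 \
  --quantize '{"linear:hqq": {"groupsize": 32}}' \
  --prompt "Once upon a time,"
```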

## Quantization Profiles

Expand All @@ -72,6 +80,7 @@ data types. The default data type for models is "fast16". The
floating point data type available on the selected device. ("Best"
tangibly representing a combination of speed and accuracy.)
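
As a small, hedged illustration of selecting one of these data types, the command below assumes torchchat exposes the choice through a `--dtype` flag that accepts the `fast16` alias mentioned above; the flag name and the model alias are assumptions.

```bash
# Hedged sketch: assumes a --dtype flag that accepts the "fast16" alias described
# above; confirm the flag name and accepted values against torchchat's help output.
python3 torchchat.py generate llama3 --dtype fast16 --prompt "Once upon a time,"
```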


## Quantization API

Quantization options are passed in json format either as a config file
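
A minimal, hedged sketch of the config-file form follows; the `embedding` key and its `bitwidth`/`groupsize` parameters mirror the embedding row of the table above, but the exact key names and the way the file is passed to `--quantize` are assumptions to be checked against the rest of this document.

```bash
# Hedged sketch: writes a JSON quantization config and passes it as a file.
# The "embedding" key and its parameters are taken from the table above; treat
# the precise spelling and CLI shape as assumptions.
cat > quant_config.json <<'EOF'
{"embedding": {"bitwidth": 4, "groupsize": 32}}
EOF
python3 torchchat.py generate llama3 --quantize quant_config.json --prompt "Once upon a time,"
```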
