- Introduction
- Supported Framework
- Get Started with FP8 Quantization
3.1. Old API Configuration
3.2. New API Configuration
3.3. Automatic Tuning Strategy
3.4. Global Environment Variables
- Examples
Floating point 8 (FP8) is a promising data type for low-precision quantization. In Intel Neural Compressor, emulated FP8 quantization is supported in the fp8_adaptor branch. By specifying a precision (fp8_e5m2, fp8_e4m3, or fp8_e3m4), users can validate the accuracy of the quantized FP8 model.
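To give a feel for how the three precisions trade range against resolution, here is a minimal sketch that rounds a Python float to the nearest value of a generic small floating-point format. It assumes an IEEE-style layout (bias of 2^(e-1) - 1, subnormals, top exponent code reserved, round-to-nearest-even, saturation at the largest finite value). The function name `round_to_minifloat` is hypothetical, and the FP8 Emulation Toolkit may use different conventions for bias, special values, or rounding.

```python
import math


def round_to_minifloat(x: float, exp_bits: int, man_bits: int) -> float:
    """Round x to the nearest value of an IEEE-style minifloat (assumed layout)."""
    if x == 0.0 or math.isnan(x):
        return x

    bias = 2 ** (exp_bits - 1) - 1
    max_exp = bias        # largest unbiased exponent of a normal number
    min_exp = 1 - bias    # smallest unbiased exponent of a normal number
    max_value = (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp

    sign = math.copysign(1.0, x)
    mag = abs(x)
    if mag > max_value:   # also covers infinity: saturate instead of overflowing
        return sign * max_value

    # frexp gives mag = m * 2**e with m in [0.5, 1), so the leading bit is at e - 1.
    _, e = math.frexp(mag)
    exp = max(e - 1, min_exp)            # values below min_exp share the subnormal step
    step = 2.0 ** (exp - man_bits)       # spacing between neighbours in this binade
    rounded = round(mag / step) * step   # Python's round() is round-half-to-even
    return sign * min(rounded, max_value)


# (exponent bits, mantissa bits) for the three precisions named above
FORMATS = {"fp8_e5m2": (5, 2), "fp8_e4m3": (4, 3), "fp8_e3m4": (3, 4)}

for name, (e_bits, m_bits) in FORMATS.items():
    print(name, round_to_minifloat(1.1, e_bits, m_bits), round_to_minifloat(1e6, e_bits, m_bits))
```

Under these assumptions, more mantissa bits give finer spacing near 1.0, while more exponent bits give a larger saturation value, which is the trade-off the `precision` setting selects.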
Framework | Emulated FP8 Quantization |
---|---|
PyTorch | ✔ |
ONNX | ✔ |
Note: the FP8 Emulation Toolkit must be installed.
### install mpemu
git clone https://github.com/IntelLabs/FP8-Emulation-Toolkit.git
cd FP8-Emulation-Toolkit
python setup.py install
### install neural compressor
git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
git checkout fp8_adaptor
python setup.py install
Compared with INT8 quantization, only one parameter is added: precision (fp8_e5m2, fp8_e4m3, or fp8_e3m4).
Also, for models with BatchNorm, it is recommended to calibrate the BatchNorm statistics in train mode with the FP8 data type before quantization.
model:
  name: xxx
  framework: pytorch
quantization:
  approach: post_training_static_quant # no need for fp8_e5m2
  precision: fp8_e4m3 # allowed precisions are fp8_e5m2, fp8_e4m3, fp8_e3m4
  calibration:
    batchnorm_sampling_size: 3000 # only needed for models w/ BatchNorm
    sampling_size: 300
tuning:
  accuracy_criterion:
    relative: 0.01
  exit_strategy:
    timeout: 0
  random_seed: 9527
from neural_compressor.config import PostTrainingQuantConfig

quant_conf = PostTrainingQuantConfig(
    precision="fp8_e5m2",
    calibration_sampling_size=[300],
    batchnorm_calibration_sampling_size=[3000],  # only needed for models w/ BatchNorm
)
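The BatchNorm sampling sizes in both configurations relate to the recommendation to recalibrate BatchNorm statistics in train mode. As a rough illustration of what such a recalibration pass involves in plain PyTorch, here is a minimal sketch. The helper name `recalibrate_batchnorm` is hypothetical, the dataloader is assumed to yield `(inputs, labels)` pairs, and the FP8 emulation of the forward pass that the toolkit applies is not shown, so this is not the library's implementation.

```python
import torch


def recalibrate_batchnorm(model, dataloader, num_samples=3000):
    """Re-estimate BatchNorm running statistics from calibration data (sketch)."""
    # Start from fresh running_mean / running_var in every BatchNorm layer.
    for module in model.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            module.reset_running_stats()

    model.train()  # running statistics are only updated in train mode
    seen = 0
    with torch.no_grad():  # weights stay untouched; only BN buffers change
        for inputs, _ in dataloader:
            model(inputs)
            seen += inputs.shape[0]
            if seen >= num_samples:
                break

    model.eval()  # switch back before calibration and quantization
    return model
```

Running the forward passes under `torch.no_grad()` keeps the weights fixed while still letting the BatchNorm buffers update, which is all a statistics-only calibration needs.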
Unlike the INT8 basic strategy, the FP8 auto-tuning strategy tunes per operation type. It first quantizes all op types aggressively. If the accuracy requirement is not met, the strategy then tries quantizing one op type at a time and accumulates the op types together. At the end, the user gets the following information.
[INFO] Suggested op types with KL algorithm are: ['Matmul', 'LayerNorm', 'Linear']
[INFO] Suggested FP8 op types are: ['Matmul', 'Embedding', 'LayerNorm', 'Linear']; Accuracy is 0.5560059529291749
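The per-op-type search described above can be sketched as follows. This is one possible reading of the strategy, not the library's implementation: the function `tune_fp8_op_types` and the `evaluate` callback (which is assumed to quantize the given op types and return an accuracy) are hypothetical names, and the acceptance rule is assumed to be the relative accuracy criterion from the configuration.

```python
from typing import Callable, List, Sequence, Tuple


def tune_fp8_op_types(
    op_types: Sequence[str],
    evaluate: Callable[[List[str]], float],
    baseline: float,
    relative_tolerance: float = 0.01,
) -> Tuple[List[str], float]:
    """Return the op types to quantize and the accuracy reached (sketch)."""

    def meets_goal(accuracy: float) -> bool:
        return accuracy >= baseline * (1.0 - relative_tolerance)

    # Step 1: aggressively quantize every op type at once.
    all_types = list(op_types)
    accuracy = evaluate(all_types)
    if meets_goal(accuracy):
        return all_types, accuracy

    # Step 2: add op types one at a time, keeping those that stay within the goal.
    accepted: List[str] = []
    accepted_accuracy = baseline  # nothing quantized yet
    for op_type in all_types:
        candidate = accepted + [op_type]
        accuracy = evaluate(candidate)
        if meets_goal(accuracy):
            accepted, accepted_accuracy = candidate, accuracy
    return accepted, accepted_accuracy


# Toy accuracy model: each quantized op type costs a fixed amount of accuracy.
COST = {"Matmul": 0.001, "Embedding": 0.05, "LayerNorm": 0.001, "Linear": 0.001}
chosen, acc = tune_fp8_op_types(
    list(COST), lambda ops: 0.9 - sum(COST[op] for op in ops), baseline=0.9
)
print(chosen)
```

In the toy run, the op type with the large accuracy cost is the only one left out, which mirrors how the suggested op type list in the log can be a subset of all candidates.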
To make customization easier, several global environment variables are available.
Environment Variable | Usage | Supported Values |
---|---|---|
FP8_OP_TYPE_LIST | Module types to which emulated FP8 quantization is applied | 'linear', 'conv2d', 'bmm', 'amm', 'mm', 'add', 'mul', 'div', 'embedding', 'embeddingbag', 'layernorm' |
DISABLE_FIRST_CONV | Whether to quantize the first convolution layer | True/False |
DISABLE_LAST_LINEAR | Whether to quantize the last linear layer | True/False |
MIX_PRECISION | Whether to allow mixed precision and automatically select the data type | True/False |
E4M3_SCALE | Whether to fix the scale to 1, which means casting FP32 directly to fp8_e4m3 | 1/- |
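These variables can be set in the shell or from Python before quantization starts. A minimal sketch of the Python route is shown below. The exact string format expected for each value is not stated in this document, so the list literal used for `FP8_OP_TYPE_LIST` and the `"True"` strings are assumptions to be checked against the fp8_adaptor branch.

```python
import os

# Assumed value formats; only the variable names and supported values come from the table.
fp8_env = {
    "FP8_OP_TYPE_LIST": "['linear', 'conv2d', 'bmm']",  # assumed: Python list literal
    "DISABLE_FIRST_CONV": "True",
    "DISABLE_LAST_LINEAR": "True",
    "MIX_PRECISION": "True",
}

# Set the variables before building the quantization config or calling fit().
os.environ.update(fp8_env)

for name in fp8_env:
    print(f"{name}={os.environ[name]}")
```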
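from neural_compressor.experimental import Quantization

# Old API: settings are read from the YAML configuration file
quantizer = Quantization("fake.yaml")
quantizer.model = model
quantizer.calib_dataloader = self.cv_dataloader
q_model = quantizer.fit()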
or
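from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig

# New API: settings are passed as a config object
quant_conf = PostTrainingQuantConfig(
    precision="fp8_e5m2",
    calibration_sampling_size=[300],
    batchnorm_calibration_sampling_size=[3000],
)
q_model = quantization.fit(
    model,
    quant_conf,
    eval_func=eval_func,
    calib_dataloader=self.cv_dataloader,
)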
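The new API example passes an `eval_func` that is not defined in this document. A minimal sketch of one for a PyTorch classifier is given below, assuming the function receives the model and returns a single accuracy value where higher is better. The factory name `make_eval_func` is hypothetical, and the dataloader is assumed to yield `(inputs, labels)` pairs with class-index labels.

```python
import torch


def make_eval_func(dataloader):
    """Build an eval_func(model) -> top-1 accuracy over the given dataloader (sketch)."""

    def eval_func(model):
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for inputs, labels in dataloader:
                predictions = model(inputs).argmax(dim=1)  # logits -> class index
                correct += (predictions == labels).sum().item()
                total += labels.numel()
        return correct / max(total, 1)  # avoid division by zero on empty data

    return eval_func
```

With this helper, `eval_func = make_eval_func(val_dataloader)` would give a callable of the shape the example expects, where `val_dataloader` stands for a validation dataloader of your own.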