Emulated FP8 Quantization

  1. Introduction
  2. Supported Framework
  3. Get Started with FP8 Quantization
    3.1. Old API Configuration
    3.2. New API Configuration
    3.3. Automatic Tuning Strategy
    3.4. Global Environment Variables
  4. Examples

Introduction

8-bit floating point (FP8) is a promising data type for low-precision quantization. Intel Neural Compressor supports emulated FP8 quantization in the fp8_adaptor branch. By specifying the precision (fp8_e5m2, fp8_e4m3, or fp8_e3m4), users can validate the accuracy of the quantized FP8 model.
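
The precision names encode the bit layout: fp8_eXmY has 1 sign bit, X exponent bits, and Y mantissa bits. As a rough illustration of what emulation means numerically, the sketch below rounds a Python float to the nearest value on such a grid. It is not the FP8 Emulation Toolkit's implementation. The function name `fp8_round` is hypothetical, and the code assumes an IEEE-style layout (top exponent code reserved, subnormals supported, saturation instead of overflow to infinity), so the largest representable value may differ from what a given FP8 specification defines.

```python
import math


def fp8_round(x: float, exp_bits: int, man_bits: int) -> float:
    """Round x to the nearest value on an IEEE-style (exp_bits, man_bits) grid.

    Hypothetical helper for illustration only; saturates at the largest
    finite value instead of producing infinity.
    """
    if x == 0.0 or math.isnan(x):
        return x
    bias = 2 ** (exp_bits - 1) - 1
    min_exp = 1 - bias              # smallest normal exponent
    max_exp = bias                  # ASSUMPTION: top exponent code is reserved
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    if mag >= max_val:
        return sign * max_val       # saturate
    # floor(log2(mag)); frexp returns mag = m * 2**e with 0.5 <= m < 1
    exp = math.frexp(mag)[1] - 1
    exp = max(exp, min_exp)         # below min_exp the grid is subnormal
    step = 2.0 ** (exp - man_bits)  # spacing between neighbours in this binade
    # Python's round() rounds halves to even, like round-to-nearest-even
    return sign * round(mag / step) * step


for name, (e, m) in {"fp8_e5m2": (5, 2), "fp8_e4m3": (4, 3), "fp8_e3m4": (3, 4)}.items():
    print(name, fp8_round(0.3, e, m))
# fp8_e5m2 0.3125
# fp8_e4m3 0.3125
# fp8_e3m4 0.296875
```

More mantissa bits give a finer grid (fp8_e3m4 lands closer to 0.3), while more exponent bits give a wider range, which is the trade-off the three precisions let you explore.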

Supported Framework

Framework Emulated FP8 Quantization
PyTorch
ONNX

Note: The FP8 Emulation Toolkit must be installed.

### install mpemu
git clone https://github.com/IntelLabs/FP8-Emulation-Toolkit.git
cd FP8-Emulation-Toolkit  
python setup.py install

### install neural compressor
git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
git checkout fp8_adaptor
python setup.py install 

Get Started with FP8 Quantization

Compared with INT8 quantization, only one parameter is added: precision (fp8_e5m2, fp8_e4m3, or fp8_e3m4).

For models with BatchNorm, it is also recommended to calibrate the BatchNorm statistics in train mode with the FP8 data type before quantization.

Old API Configuration for Intel Neural Compressor 1.x

model:
    name: xxx
    framework: pytorch

quantization:
    approach: post_training_static_quant    # not needed for fp8_e5m2
    precision: fp8_e4m3    # allowed values: fp8_e5m2, fp8_e4m3, fp8_e3m4
    calibration:
        batchnorm_sampling_size: 3000    # only needed for models w/ BatchNorm
        sampling_size: 300

tuning:
    accuracy_criterion:
        relative:  0.01
    exit_strategy:
        timeout: 0
    random_seed: 9527

New API Configuration for Intel Neural Compressor 2.0

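from neural_compressor.config import PostTrainingQuantConfig

quant_conf = PostTrainingQuantConfig(
    precision="fp8_e5m2",    # or "fp8_e4m3", "fp8_e3m4"
    calibration_sampling_size=[300],
    batchnorm_calibration_sampling_size=[3000],    # only needed for models w/ BatchNorm
)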

Automatic Tuning Strategy

Unlike the INT8 basic strategy, the FP8 auto-tuning strategy tunes per operation type. It first aggressively quantizes all op types. If the accuracy requirement is not met, the strategy tries quantizing one op type at a time and accumulates the op types together. At the end, the log reports the suggested op types, for example:

[INFO] Suggested op types with KL algorithm are: ['Matmul', 'LayerNorm', 'Linear']
[INFO] Suggested FP8 op types are: ['Matmul', 'Embedding', 'LayerNorm', 'Linear']; Accuracy is 0.5560059529291749
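
The search described above can be sketched in a few lines. This is only one possible reading of the strategy, not the code in the fp8_adaptor branch: the function `tune_fp8_op_types`, the `evaluate` callback, the toy accuracy penalties, and the extra 'Softmax' op type are all hypothetical and exist only to make the example runnable.

```python
from typing import Callable, Iterable, List, Set


def tune_fp8_op_types(
    op_types: Iterable[str],
    evaluate: Callable[[Set[str]], float],
    baseline: float,
    relative_tolerance: float = 0.01,
) -> List[str]:
    """Return the op types to quantize while staying within the tolerance.

    Hypothetical sketch: `evaluate` takes the set of quantized op types
    and returns the resulting accuracy.
    """
    op_types = list(op_types)
    target = baseline * (1.0 - relative_tolerance)

    # Step 1: aggressively quantize every op type at once.
    if evaluate(set(op_types)) >= target:
        return op_types

    # Step 2: add one op type at a time and keep it only if the
    # accumulated set still meets the accuracy target.
    accepted: List[str] = []
    for op in op_types:
        if evaluate(set(accepted) | {op}) >= target:
            accepted.append(op)
    return accepted


def toy_evaluate(quantized: Set[str]) -> float:
    # Made-up accuracy drops per op type, for demonstration only.
    penalty = {"Matmul": 0.001, "Embedding": 0.001, "LayerNorm": 0.001,
               "Linear": 0.001, "Softmax": 0.05}
    return 0.56 - sum(penalty[op] for op in quantized)


suggested = tune_fp8_op_types(
    ["Matmul", "Embedding", "LayerNorm", "Linear", "Softmax"],
    toy_evaluate,
    baseline=0.56,
)
print(suggested)  # ['Matmul', 'Embedding', 'LayerNorm', 'Linear']
```

In this toy run the all-op-types attempt misses the 1% relative target, so the loop falls back to accumulating op types and leaves out the one that costs too much accuracy.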

Global Environment Variables

To make customization easier, several global environment variables are available.

| Environment Variable | Usage | Supported Values |
|---|---|---|
| FP8_OP_TYPE_LIST | Specifies which module types are covered by emulated FP8 quantization | 'linear', 'conv2d', 'bmm', 'amm', 'mm', 'add', 'mul', 'div', 'embedding', 'embeddingbag', 'layernorm' |
| DISABLE_FIRST_CONV | Whether to quantize the first convolution layer | True/False |
| DISABLE_LAST_LINEAR | Whether to quantize the last linear layer | True/False |
| MIX_PRECISION | Whether to allow mixed precision and automatically select the data type | True/False |
| E4M3_SCALE | Whether to fix the scale to 1, which means casting FP32 directly to fp8_e4m3 | 1/- |
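
As a minimal sketch, these variables can be set from Python before quantization starts. The helper `build_fp8_env` below is hypothetical, and the string encodings it produces are assumptions (a Python list literal for FP8_OP_TYPE_LIST, "True"/"False" for the flags); check how the fp8_adaptor branch parses each variable before relying on them.

```python
import os
from typing import Dict, Iterable, Optional

# Op types listed as supported values for FP8_OP_TYPE_LIST.
SUPPORTED_OP_TYPES = (
    "linear", "conv2d", "bmm", "amm", "mm", "add", "mul", "div",
    "embedding", "embeddingbag", "layernorm",
)


def build_fp8_env(
    op_types: Optional[Iterable[str]] = None,
    disable_first_conv: Optional[bool] = None,
    disable_last_linear: Optional[bool] = None,
    mix_precision: Optional[bool] = None,
    e4m3_scale_fixed: bool = False,
) -> Dict[str, str]:
    """Build environment-variable settings; options left as None are omitted."""
    env: Dict[str, str] = {}
    if op_types is not None:
        op_types = list(op_types)
        unknown = [t for t in op_types if t not in SUPPORTED_OP_TYPES]
        if unknown:
            raise ValueError(f"unsupported op types: {unknown}")
        # ASSUMPTION: the list is passed as a Python list-literal string.
        env["FP8_OP_TYPE_LIST"] = str(op_types)
    flags = {
        "DISABLE_FIRST_CONV": disable_first_conv,
        "DISABLE_LAST_LINEAR": disable_last_linear,
        "MIX_PRECISION": mix_precision,
    }
    for name, value in flags.items():
        if value is not None:
            env[name] = str(value)  # ASSUMPTION: "True" / "False" strings
    if e4m3_scale_fixed:
        env["E4M3_SCALE"] = "1"
    return env


env = build_fp8_env(op_types=["linear", "conv2d"],
                    disable_first_conv=True, mix_precision=False)
os.environ.update(env)  # apply before running quantization
```

Validating the op types up front catches typos early, since a misspelled name in an environment variable would otherwise be easy to miss.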

Examples

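from neural_compressor.experimental import Quantization

# Old API (1.x): the YAML file carries the FP8 settings.
quantizer = Quantization("fake.yaml")
quantizer.model = model
quantizer.calib_dataloader = self.cv_dataloader
q_model = quantizer.fit()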

or

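from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig

# New API (2.0): the FP8 settings are passed through the config object.
quant_conf = PostTrainingQuantConfig(
    precision="fp8_e5m2",
    calibration_sampling_size=[300],
    batchnorm_calibration_sampling_size=[3000],
)
q_model = quantization.fit(
    model,
    quant_conf,
    eval_func=eval_func,
    calib_dataloader=self.cv_dataloader,
)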