Latency 20x with quant_mode = true #21

LiamPKU · 2022-03-08T09:14:37Z

In the Hugging Face config, I set quant_mode = True.
The weight_integer buffer remains 0, and the result is wrong.
Moreover, inference latency in integer mode is 20 times that of float mode.
Could you please explain why this happens?

huu4ontocord · 2022-03-22T01:17:28Z

Hi,

Similar to this, I also found it is MUCH slower with quant_mode = True. Here's a notebook with a slightly modified version of the HF code that allows switching quant_mode dynamically. You can see the timing difference.

https://colab.research.google.com/drive/1DkYFGc18oPvAn5nyGEL1aIFHmD_aNlXW
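
For anyone who wants to check the ratio outside the notebook, here is a minimal sketch of a timing harness. It is not taken from the notebook: `median_latency` and `slowdown` are hypothetical helper names, and the harness only assumes you can wrap each forward pass in a zero-argument callable.

```python
import statistics
import time


def median_latency(fn, warmup=3, repeats=10):
    """Return the median wall-clock time of fn() in seconds.

    fn is any zero-argument callable (hypothetical: e.g. a closure that
    runs one forward pass of a model on a fixed batch).
    """
    # Warm-up calls are not timed, so one-off setup cost is excluded.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def slowdown(baseline_fn, candidate_fn, warmup=3, repeats=10):
    """Ratio of candidate latency to baseline latency (> 1 means slower)."""
    baseline = median_latency(baseline_fn, warmup=warmup, repeats=repeats)
    candidate = median_latency(candidate_fn, warmup=warmup, repeats=repeats)
    return candidate / baseline
```

To use it, pass one closure that calls the model loaded with quant_mode = False and one that calls the model loaded with quant_mode = True, both on the same input. On GPU, each closure should call torch.cuda.synchronize() after the forward pass, otherwise the clock only measures the kernel launch.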

LiamPKU added the question Further information is requested label Mar 8, 2022

This was referenced Mar 22, 2022

ibert seems to be quite slow in quant_mode = True huggingface/transformers#16319

Closed

IBert problems of quant_model=true #18

Open
