Hi,
I followed the instructions here to compile the llama model into a .vmfb.
I specified the quantization as 4 bits and the precision as f16, and the generated MLIR looks like this:
It seems the int4 weights are dequantized to f16 and the computation (matmul) is done in f16.
Does the quantization support quantizing the f16 activations to q4/q8 and computing in q4/q8, like what llama.cpp does on CPU (the E approach in this article)?
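To make the question concrete, here is a rough NumPy sketch of the two strategies. The function names, shapes, and single per-tensor scale are just for illustration (llama.cpp's Q4/Q8 formats actually use per-block scales), and none of this is the actual IREE/SHARK code:

```python
import numpy as np

def weight_only_dequant_matmul(x_f16, w_q4, w_scale):
    # Dequantize the unsigned int4 codes (0..15, zero point 8) to f16,
    # then run the matmul entirely in f16 -- what the generated MLIR appears to do.
    w_f16 = (w_q4.astype(np.float16) - np.float16(8)) * w_scale
    return x_f16 @ w_f16

def dynamic_quant_matmul(x_f16, w_q4, w_scale):
    # llama.cpp-style alternative: quantize the f16 activations to int8 on the fly,
    # do the dot products in integer arithmetic (int32 accumulator), then rescale.
    x_scale = float(np.abs(x_f16).max()) / 127.0
    x_q8 = np.clip(np.round(x_f16 / x_scale), -127, 127).astype(np.int8)
    acc = x_q8.astype(np.int32) @ (w_q4.astype(np.int32) - 8)
    return (acc * (x_scale * float(w_scale))).astype(np.float16)

# Toy example: a 4x8 f16 activation tile against an 8x4 int4 weight tile.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float16)      # f16 activations
w = rng.integers(0, 16, size=(8, 4), dtype=np.uint8)    # int4 weight codes
s = np.float16(0.05)                                    # per-tensor weight scale

print(weight_only_dequant_matmul(x, w, s))
print(dynamic_quant_matmul(x, w, s))
```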
Thanks.