
Does SHARK LLM support q4/q8 matrix multiplication? #713

Open
rednoah91 opened this issue Jun 3, 2024 · 1 comment

rednoah91 commented Jun 3, 2024

Hi,
I followed the instructions here to compile the llama model into a .vmfb.
I specified the quantization as 4 bits and the precision as f16, and I got MLIR like this:

```mlir
%15 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%7, %8, %9 : tensor<2048x44x128xi4>, tensor<2048x44xf16>, tensor<2048x44xf16>) outs(%14 : tensor<2048x44x128xf16>) {
        ^bb0(%in: i4 loc("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17194:10), %in_0: f16 loc("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17194:19), %in_1: f16 loc("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17194:33), %out: f16 loc("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17194:47)):
          %19 = arith.extui %in : i4 to i32 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17195:15 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
          %20 = arith.uitofp %19 : i32 to f16 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17196:15 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
          %21 = arith.subf %20, %in_1 : f16 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17197:15 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
          %22 = arith.mulf %21, %in_0 : f16 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17198:15 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
          linalg.yield %22 : f16 loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17199:7 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
        } -> tensor<2048x44x128xf16> loc(callsite("./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":17193:12 at "./TinyLlama_1.1B_Chat_v1.0_f16_int4.mlir":16506:3))
```
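
If I read it correctly, that block is just per-group weight dequantization. A minimal NumPy sketch of the same arithmetic (the variable names are mine, not from the IR):

```python
import numpy as np

# Rough NumPy equivalent of the linalg.generic above: every i4 weight is
# widened (arith.extui), converted to f16 (arith.uitofp), shifted by a
# per-group zero point (arith.subf) and scaled (arith.mulf) -- i.e. the
# weights are fully dequantized to f16 before any matmul happens.
w_i4 = np.random.randint(0, 16, size=(2048, 44, 128)).astype(np.uint8)  # stands in for tensor<2048x44x128xi4> (%7)
scale = np.random.rand(2048, 44).astype(np.float16)                     # tensor<2048x44xf16> (%8, %in_0)
zero_point = np.random.rand(2048, 44).astype(np.float16)                # tensor<2048x44xf16> (%9, %in_1)

w_f16 = (w_i4.astype(np.float16) - zero_point[:, :, None]) * scale[:, :, None]
# -> tensor<2048x44x128xf16>: the matmul that consumes this runs in f16
```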

It looks like the int4 weights are dequantized to f16 and the computation (the matmul) runs entirely in f16.
Does the quantization support quantizing the f16 activations to q4/q8 and computing in q4/q8, like what llama.cpp does on CPU (the E approach in this article)?
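
To make the question concrete, here is a rough sketch of the path I mean (plain NumPy, not SHARK API; the helper name and scale choices are only illustrative): quantize the activations to int8, keep the weights in int4, accumulate the dot product in integers, and apply the scales only at the end, similar to llama.cpp's Q8_0 x Q4_0 kernels.

```python
import numpy as np

def quantize_act_q8(x_f16):
    # symmetric per-row int8 quantization of the activations (illustrative)
    scale = np.abs(x_f16).max(axis=-1, keepdims=True).astype(np.float32) / 127.0
    q = np.clip(np.round(x_f16.astype(np.float32) / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.rand(4, 128).astype(np.float16)                     # f16 activations
w_i4 = np.random.randint(-8, 8, size=(128, 64)).astype(np.int8)   # int4 weights held in int8
w_scale = np.float32(0.05)                                        # illustrative per-tensor weight scale

x_q, x_scale = quantize_act_q8(x)
acc = x_q.astype(np.int32) @ w_i4.astype(np.int32)   # integer matmul, int32 accumulation
y = (acc.astype(np.float32) * x_scale) * w_scale     # rescale back to float only at the end
```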

Thanks.

vivekkhandelwal1 (Contributor) commented

Hi @monorimet @AmosLewis @zjgarvey, do you have any info about this query?
