In the mlc-llm v0.18.dev0 release, TVM (relax repo) commit 50d1c97dc98 leads to extremely slow prefill speed for q4f16_1 Llama-2-7B-chat-hf on an Android 8gen3 device.
Expected behavior
q4f16_1 Llama-2-7B-chat-hf prefill speed was close to 10 tok/s before this commit.
Actual behavior
q4f16_1 Llama-2-7B-chat-hf prefill speed is now only ~0.3 tok/s.
Environment
android 8gen3 device
mlc-llm v0.18.dev0
tvm relax submodule as pinned in the mlc-llm repo at that time
Steps to reproduce
Compile and bundle the q4f16_1 Llama-2-7B-chat-hf model, then launch it on an Android 8gen3 device.
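A minimal sketch of the compile step, assuming the mlc_llm CLI from the v0.18.dev0 wheel is on PATH and the Hugging Face weights are available locally; the model path, output directory, and library name below are placeholders, and flags may differ slightly between mlc-llm versions.

```python
# Repro sketch (assumptions: `mlc_llm` CLI on PATH, Llama-2-7b-chat-hf weights
# downloaded locally; all paths below are placeholders).
import subprocess

MODEL = "./dist/models/Llama-2-7b-chat-hf"   # local HF checkout (placeholder path)
QUANT = "q4f16_1"                            # the quantization that regresses
OUT = f"./dist/Llama-2-7b-chat-hf-{QUANT}-MLC"

# 1. Quantize the weights.
subprocess.run(["mlc_llm", "convert_weight", MODEL,
                "--quantization", QUANT, "-o", OUT], check=True)

# 2. Generate the chat config for the llama-2 conversation template.
subprocess.run(["mlc_llm", "gen_config", MODEL,
                "--quantization", QUANT, "--conv-template", "llama-2",
                "-o", OUT], check=True)

# 3. Compile the model library for Android (OpenCL on the 8gen3 GPU).
subprocess.run(["mlc_llm", "compile", f"{OUT}/mlc-chat-config.json",
                "--device", "android",
                "-o", f"{OUT}/Llama-2-7b-chat-hf-{QUANT}-android.tar"], check=True)

# The quantized weights and the .tar library are then bundled into the MLCChat
# Android app (e.g. via `mlc_llm package` in the android project) and launched
# on the 8gen3 device to measure prefill speed.
```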
By the way, when I revert the block_size_x/y and unroll values, as well as the related parts in sch_outer_reduction, from (32, 8, 4) back to (8, 16, 64), both q4f16_0 and q4f16_1 prefill speeds return to normal.
This may need a fix, since most MLC-converted models are released in the q4f16_1 format.
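For reference, a purely illustrative sketch of the values involved in that workaround; the actual constants live in the TVM (relax) dlight GEMV scheduling code around sch_outer_reduction, and the variable names below are placeholders rather than the real identifiers.

```python
# Illustration only: placeholder names for the scheduling hyperparameters the
# issue refers to; the real code is in the tvm relax submodule's
# sch_outer_reduction and related scheduling logic.

# Values introduced by commit 50d1c97dc98 (prefill drops to ~0.3 tok/s on 8gen3):
block_size_x, block_size_y, unroll = 32, 8, 4

# Reverted values that restore normal prefill speed (~10 tok/s) for both
# q4f16_0 and q4f16_1:
block_size_x, block_size_y, unroll = 8, 16, 64
```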
Triage
vert:android
Hi @xuguodong1999, thank you for reporting this. Our experience is that on Android devices, using the q4f16_0 quantization can bring better performance than q4f16_1. Would you mind also trying that out?
q4f16_0 prefill is indeed faster than q4f16_1 on Android, regardless of the block_size_x/y and unroll hyperparameters. Thank you for your reply.