In the mlc-llm v0.18.dev0 release, TVM (relax repo) commit 50d1c97dc98 leads to extremely slow prefill speed for q4f16_1 Llama-2-7B-chat-hf on an Android 8gen3 device.
Expected behavior
q4f16_1 Llama-2-7B-chat-hf prefill speed was close to 10 tok/s before this commit.
Actual behavior
q4f16_1 Llama-2-7B-chat-hf prefill speed is now only ~0.3 tok/s.
Environment
android 8gen3 device
mlc-llm v0.18.dev0
tvm relax submodule as pinned in the mlc-llm repo at that time
Steps to reproduce
Compile and bundle the q4f16_1 Llama-2-7B-chat-hf model, then launch it on an Android 8gen3 device.
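A minimal sketch of the compile step, assuming the mlc_llm CLI from the v0.18.dev0 wheel is on PATH and the Hugging Face weights are available locally; the model path, output directory, and library name below are placeholders, and flags may differ slightly between mlc-llm versions.

```python
# Repro sketch (assumptions: `mlc_llm` CLI on PATH, Llama-2-7b-chat-hf weights
# downloaded locally; all paths below are placeholders).
import subprocess

MODEL = "./dist/models/Llama-2-7b-chat-hf"   # local HF checkout (placeholder path)
QUANT = "q4f16_1"                            # the quantization that regresses
OUT = f"./dist/Llama-2-7b-chat-hf-{QUANT}-MLC"

# 1. Quantize the weights.
subprocess.run(["mlc_llm", "convert_weight", MODEL,
                "--quantization", QUANT, "-o", OUT], check=True)

# 2. Generate the chat config for the llama-2 conversation template.
subprocess.run(["mlc_llm", "gen_config", MODEL,
                "--quantization", QUANT, "--conv-template", "llama-2",
                "-o", OUT], check=True)

# 3. Compile the model library for Android (OpenCL on the 8gen3 GPU).
subprocess.run(["mlc_llm", "compile", f"{OUT}/mlc-chat-config.json",
                "--device", "android",
                "-o", f"{OUT}/Llama-2-7b-chat-hf-{QUANT}-android.tar"], check=True)

# The quantized weights and the .tar library are then bundled into the MLCChat
# Android app (e.g. via `mlc_llm package` in the android project) and launched
# on the 8gen3 device to measure prefill speed.
```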
By the way, when I revert the block_size_x/y and unroll values, as well as the related parts in sch_outer_reduction, from (32, 8, 4) back to (8, 16, 64), both q4f16_0 and q4f16_1 prefill speeds return to normal.
This may need a fix, since most MLC-converted models are released in the q4f16_1 format.
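For reference, a purely illustrative sketch of the values involved in that workaround; the actual constants live in the TVM (relax) dlight GEMV scheduling code around sch_outer_reduction, and the variable names below are placeholders rather than the real identifiers.

```python
# Illustration only: placeholder names for the scheduling hyperparameters the
# issue refers to; the real code is in the tvm relax submodule's
# sch_outer_reduction and related scheduling logic.

# Values introduced by commit 50d1c97dc98 (prefill drops to ~0.3 tok/s on 8gen3):
block_size_x, block_size_y, unroll = 32, 8, 4

# Reverted values that restore normal prefill speed (~10 tok/s) for both
# q4f16_0 and q4f16_1:
block_size_x, block_size_y, unroll = 8, 16, 64
```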
Triage
vert:android
Hi @xuguodong1999, thank you for reporting this. Our experience is that on Android devices, using the q4f16_0 quantization can bring better performance than q4f16_1. Would you mind also trying that out?
q4f16_0 prefill is indeed faster than q4f16_1 on Android, regardless of the block_size_x/y and unroll hyperparameters. Thank you for your reply.