Fix starcoder first token perf #10612

hkvision · 2024-04-01T11:52:31Z

Fix https://github.com/analytics-zoo/nano/issues/1203

      (c_fc): LowBitLinear(in_features=6144, out_features=24576, bias=True)
      (c_proj): LowBitLinear(in_features=24576, out_features=6144, bias=True)

The starcoder MLP layers also have large in/out features: 24576*6144 > 32000*4096, so each one is larger than llama's lm_head. As a result, empty cache is executed within each transformer block, and the first token latency is hugely impacted.
This PR adds a condition on bias to exclude these layers. lm_head does not seem to have a bias, so it is still covered.
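
To illustrate the idea, here is a minimal sketch of the size-plus-bias condition described above. It is not the code changed in this PR: `LayerShape`, `should_empty_cache`, and the `LARGE_LAYER_ELEMENTS` threshold are hypothetical names and values chosen for illustration, and the real check in the codebase may be structured differently.

```python
from dataclasses import dataclass

# Hypothetical threshold: the llama lm_head size quoted in the description.
LARGE_LAYER_ELEMENTS = 32000 * 4096


@dataclass
class LayerShape:
    """Illustrative stand-in for a linear layer's shape (not a real library class)."""
    name: str
    in_features: int
    out_features: int
    has_bias: bool


def should_empty_cache(layer: LayerShape) -> bool:
    """Return True only for large layers without a bias (e.g. an lm_head)."""
    is_large = layer.in_features * layer.out_features >= LARGE_LAYER_ELEMENTS
    # The bias check excludes the starcoder MLP layers, which are large but have bias=True.
    return is_large and not layer.has_bias


layers = [
    LayerShape("c_fc", 6144, 24576, True),
    LayerShape("c_proj", 24576, 6144, True),
    LayerShape("lm_head", 4096, 32000, False),
]
flagged = [layer.name for layer in layers if should_empty_cache(layer)]
print(flagged)
```

With a size-only check, both MLP layers would be flagged in every transformer block; with the bias condition, only the bias-free lm_head-like layer is.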

lalalapotter

LGTM.

add bias check

b9f322f

hkvision requested a review from lalalapotter April 1, 2024 11:52

update

b375d8f

lalalapotter approved these changes Apr 2, 2024


hkvision merged commit 0a95c55 into intel-analytics:main Apr 2, 2024
17 checks passed

hkvision deleted the starcoder branch April 2, 2024 01:21
