Fix starcoder first token perf #10612

Merged 2 commits on Apr 2, 2024

@hkvision hkvision commented Apr 1, 2024

Fix https://github.com/analytics-zoo/nano/issues/1203

      (c_fc): LowBitLinear(in_features=6144, out_features=24576, bias=True)
      (c_proj): LowBitLinear(in_features=24576, out_features=6144, bias=True)

The starcoder MLP layers also have large in/out features: 24576*6144 > 32000*4096, so they are larger than llama's lm_head. As a result, empty cache is executed within each transformer block, and the first token latency is hugely impacted.
This PR adds a condition on bias to exclude these layers. lm_head does not seem to have a bias, so it should still be covered.
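
For illustration, the size comparison and the bias condition can be sketched roughly as below. This is only a sketch under assumptions, not the code in this PR: the names `LM_HEAD_SIZE`, `is_large_layer` and `should_empty_cache` are hypothetical, and the threshold is assumed to be the size of llama's lm_head.

```python
# Hypothetical sketch of the condition described above; names and the
# threshold are assumptions, not the actual implementation in this PR.

# Assumed threshold: llama's lm_head (vocab size 32000 x hidden size 4096).
LM_HEAD_SIZE = 32000 * 4096


def is_large_layer(in_features: int, out_features: int) -> bool:
    """Return True if the weight is at least as large as llama's lm_head."""
    return in_features * out_features >= LM_HEAD_SIZE


def should_empty_cache(in_features: int, out_features: int, has_bias: bool) -> bool:
    """Empty the cache only for large layers without a bias.

    The starcoder MLP layers (c_fc, c_proj) are large but have bias=True,
    so the extra bias check excludes them.
    """
    return is_large_layer(in_features, out_features) and not has_bias


# starcoder c_fc: large, but has a bias, so it is excluded.
print(should_empty_cache(6144, 24576, has_bias=True))
# llama lm_head: large and without a bias, so it still qualifies.
print(should_empty_cache(4096, 32000, has_bias=False))
```

With a size-only check, both c_fc and c_proj would qualify in every transformer block, which matches the first token slowdown described above.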

@hkvision hkvision requested a review from lalalapotter April 1, 2024 11:52

@lalalapotter lalalapotter left a comment

LGTM.

@hkvision hkvision merged commit 0a95c55 into intel-analytics:main Apr 2, 2024
17 checks passed
@hkvision hkvision deleted the starcoder branch April 2, 2024 01:21