Support qwen2-7b with fused decoderlayer optimization on NPU #11912
Conversation
elif model.config.model_type == "qwen2":
    # for qwen2-1.5B and qwen2-7B
How about Qwen2-0.5B or 72B? Need to add a check.
Sure, have added the related check :)
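For reference, a minimal sketch of what such a size check could look like, assuming a dispatch on config.hidden_size (the helper name is hypothetical and the actual check added in this PR may differ; the hidden sizes come from the public Qwen2 configs):

def check_qwen2_supported(config):
    # Hypothetical helper, not the actual code added in this PR: only the
    # Qwen2 sizes validated with the fused-decoderlayer NPU path are allowed.
    # hidden_size 1536 -> Qwen2-1.5B, 3584 -> Qwen2-7B
    supported_hidden_sizes = {1536, 3584}
    if config.model_type == "qwen2" and config.hidden_size not in supported_hidden_sizes:
        raise ValueError(
            f"Qwen2 with hidden_size={config.hidden_size} (e.g. 0.5B or 72B) is "
            "not yet supported by the fused decoderlayer NPU optimization."
        )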
python/llm/src/ipex_llm/transformers/npu_models/mp_models_base.py
@@ -383,7 +383,7 @@ def update_cache(self, past_key_value, indexes):
         self.load_cache_async()

     def load_cache_async(self):
-        self.load_wt_fn(len(self.input_ops), self._mm, self.kv_cache_c_handle)
+        self.load_wt_fn(len(self.input_ops), self._mm, self.kv_cache_c_handle, verify_size=True)
Not sure how much runtime overhead verify_size adds; if it is large, maybe we can disable it at runtime and only use it when debugging?
According to my experiment, the overhead seems quite small. BTW, I have removed it :)
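For reference, if the verification were ever needed again, a minimal sketch of the debug-only gating suggested above might look like this (the environment variable name and wiring are assumptions, not part of this PR):

import os

# Assumption: IPEX_LLM_NPU_VERIFY_SIZE is a hypothetical debug switch, not an
# existing ipex-llm option; verification stays off on the hot path by default.
_VERIFY_KV_CACHE_SIZE = os.environ.get("IPEX_LLM_NPU_VERIFY_SIZE", "0") == "1"

def load_cache_async(self):
    # Drop-in variant of the method shown in the diff above: only pay the
    # verification cost when the debug switch is explicitly enabled.
    self.load_wt_fn(len(self.input_ops), self._mm, self.kv_cache_c_handle,
                    verify_size=_VERIFY_KV_CACHE_SIZE)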
Description
Add qwen2-7B NPU support with fused decoderlayer and multi-process optimization.
Details: https://github.com/analytics-zoo/nano/issues/1576#issuecomment-2314969138
Different from other model support, we use QuantizedLinear instead of FusedQwenLowBitDecoderlayer during the prefill process.
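To make that split concrete, a rough sketch follows. QuantizedLinear and FusedQwenLowBitDecoderlayer are the class names from the description above; the dispatch function, the replace_linear_with_quantized helper, and the constructor signature are assumptions for illustration, not the real ipex-llm API:

def convert_qwen2_for_npu(model, stage):
    # Hypothetical dispatch: prefill keeps per-layer low-bit linears, while
    # decode swaps whole decoder layers for the fused NPU implementation.
    if stage == "prefill":
        # Prefill: keep the standard decoder layers but replace each nn.Linear
        # with a low-bit QuantizedLinear (replace_linear_with_quantized is an
        # assumed placeholder).
        replace_linear_with_quantized(model)
    else:
        # Decode: run each decoder layer as one fused NPU graph (constructor
        # signature assumed for illustration only).
        layers = model.model.layers
        for idx, layer in enumerate(layers):
            layers[idx] = FusedQwenLowBitDecoderlayer(layer)
    return model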
2. User API changes
N/A
3. Summary of the change
4. How to test?