
Support qwen2-7b with fused decoderlayer optimization on NPU #11912

Merged: 12 commits into intel-analytics:main on Aug 29, 2024

Conversation

@plusbang (Contributor) commented Aug 23, 2024

Description

Add qwen2-7B NPU support with fused decoder layer and multi-process optimization.
Details: https://github.com/analytics-zoo/nano/issues/1576#issuecomment-2314969138

Unlike other supported models, we use QuantizedLinear instead of FusedQwenLowBitDecoderlayer during the prefill process.
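The prefill/decode split described above can be sketched as below. This is an illustrative toy under assumed names (ToyDecoderLayer, run_layer, and both forward methods are hypothetical), not the actual ipex-llm implementation:

```python
class ToyDecoderLayer:
    """Toy stand-in for a decoder layer with two execution paths.

    In the PR, prefill runs per-linear quantized ops while decoding runs a
    single fused NPU decoder-layer kernel; here both paths are simulated
    with trivial arithmetic so the routing logic is visible.
    """

    def quantized_forward(self, hidden_states):
        # Prefill path: ordinary (quantized) linear layers, one op at a time
        return [h * 2 for h in hidden_states]

    def fused_forward(self, hidden_states):
        # Decode path: single fused decoder-layer kernel
        return [h + 1 for h in hidden_states]


def run_layer(layer, hidden_states, is_prefill):
    """Route prefill through quantized linears, decode through the fused layer."""
    if is_prefill:
        return layer.quantized_forward(hidden_states)
    return layer.fused_forward(hidden_states)
```

Prefill processes the whole prompt once, so the per-op path is acceptable there, while the fused kernel pays off on the many single-token decode steps.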

2. User API changes

N/A

3. Summary of the change

4. How to test?

  • Unit test: please manually trigger the PR validation here by inputting the PR number (e.g., 1234), and paste your action link here once it has finished successfully.
  • Application test

Comment on lines 60 to 111
elif model.config.model_type == "qwen2":
# for qwen2-1.5B and qwen2-7B
@jason-dai (Contributor) commented Aug 26, 2024

How about Qwen2-0.5B or 72B? Need to add check

@plusbang (Author) replied:

> How about Qwen2-0.5B or 72B? Need to add check

Sure, I have added the related check : )
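A check of the kind the reviewer asked for could look like this sketch. The hidden sizes come from the public Hugging Face Qwen2 configs (1536 for qwen2-1.5B, 3584 for qwen2-7B); the function name and the gating-by-hidden_size strategy are assumptions for illustration, not the PR's actual code:

```python
# Hidden sizes from the public Hugging Face Qwen2 configs:
# qwen2-0.5B = 896, qwen2-1.5B = 1536, qwen2-7B = 3584, qwen2-72B = 8192.
SUPPORTED_QWEN2_HIDDEN_SIZES = {1536, 3584}  # qwen2-1.5B and qwen2-7B only


def check_qwen2_supported(config):
    """Reject qwen2 variants (e.g. 0.5B, 72B) the fused NPU path cannot handle."""
    if getattr(config, "model_type", None) != "qwen2":
        return False  # not a qwen2 model; caller takes a different branch
    if config.hidden_size not in SUPPORTED_QWEN2_HIDDEN_SIZES:
        raise ValueError(
            f"qwen2 with hidden_size={config.hidden_size} is not supported "
            "on NPU; only qwen2-1.5B and qwen2-7B are"
        )
    return True
```

Keying off `config.hidden_size` distinguishes the variants without depending on the checkpoint name, since all sizes share `model_type == "qwen2"`.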

@plusbang marked this pull request as ready for review on Aug 28, 2024
@plusbang changed the title from "[WIP] Support qwen2-7b with fused decoderlayer optimization on NPU" to "Support qwen2-7b with fused decoderlayer optimization on NPU" on Aug 28, 2024
@@ -383,7 +383,7 @@ def update_cache(self, past_key_value, indexes):
         self.load_cache_async()

     def load_cache_async(self):
-        self.load_wt_fn(len(self.input_ops), self._mm, self.kv_cache_c_handle)
+        self.load_wt_fn(len(self.input_ops), self._mm, self.kv_cache_c_handle, verify_size=True)
A reviewer (Contributor) commented:

Not sure how much runtime overhead of verify_size, if it is large, maybe we can disable it at runtime and only use it when debugging?

@plusbang (Author) replied:

> Not sure how much runtime overhead of verify_size, if it is large, maybe we can disable it at runtime and only use it when debugging?

According to my experiment, the overhead seems quite small. BTW, I have removed it : )
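The reviewer's debug-only idea could be realized by gating the verification behind an environment variable, roughly as below. The variable name IPEX_LLM_NPU_DEBUG and the wrapper function are hypothetical, not what the PR ships (the PR simply removed the flag):

```python
import os


def load_cache(load_fn, *args):
    """Call load_fn, adding verify_size=True only when debug mode is on.

    The IPEX_LLM_NPU_DEBUG variable name is an assumption; the point is
    that the (possibly costly) size verification runs only while debugging,
    leaving the normal runtime path untouched.
    """
    if os.environ.get("IPEX_LLM_NPU_DEBUG", "0") == "1":
        return load_fn(*args, verify_size=True)
    return load_fn(*args)
```

Reading the environment at call time (rather than once at import) keeps the flag toggleable from tests and interactive sessions.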

@plusbang merged commit 71f03dc into intel-analytics:main on Aug 29, 2024
1 check passed