Support qwen2-7b with fused decoderlayer optimization on NPU #11912
Conversation
elif model.config.model_type == "qwen2":
    # for qwen2-1.5B and qwen2-7B
How about Qwen2-0.5B or 72B? Need to add a check.
Sure, have added the related check :)
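For reference, a minimal sketch of what such a size check could look like, assuming a dispatch on config.hidden_size (the helper name is hypothetical and the actual check added in this PR may differ; the hidden sizes come from the public Qwen2 configs):

def check_qwen2_supported(config):
    # Hypothetical helper, not the actual code added in this PR: only the
    # Qwen2 sizes validated with the fused-decoderlayer NPU path are allowed.
    # hidden_size 1536 -> Qwen2-1.5B, 3584 -> Qwen2-7B
    supported_hidden_sizes = {1536, 3584}
    if config.model_type == "qwen2" and config.hidden_size not in supported_hidden_sizes:
        raise ValueError(
            f"Qwen2 with hidden_size={config.hidden_size} (e.g. 0.5B or 72B) is "
            "not yet supported by the fused decoderlayer NPU optimization."
        )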
python/llm/src/ipex_llm/transformers/npu_models/mp_models_base.py
@@ -383,7 +383,7 @@ def update_cache(self, past_key_value, indexes):
         self.load_cache_async()

     def load_cache_async(self):
-        self.load_wt_fn(len(self.input_ops), self._mm, self.kv_cache_c_handle)
+        self.load_wt_fn(len(self.input_ops), self._mm, self.kv_cache_c_handle, verify_size=True)
Not sure how much runtime overhead verify_size adds; if it is large, maybe we can disable it at runtime and only use it when debugging?
According to my experiment, the overhead seems quite small. BTW, I have removed it :)
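For reference, if the verification were ever needed again, a minimal sketch of the debug-only gating suggested above might look like this (the environment variable name and wiring are assumptions, not part of this PR):

import os

# Assumption: IPEX_LLM_NPU_VERIFY_SIZE is a hypothetical debug switch, not an
# existing ipex-llm option; verification stays off on the hot path by default.
_VERIFY_KV_CACHE_SIZE = os.environ.get("IPEX_LLM_NPU_VERIFY_SIZE", "0") == "1"

def load_cache_async(self):
    # Drop-in variant of the method shown in the diff above: only pay the
    # verification cost when the debug switch is explicitly enabled.
    self.load_wt_fn(len(self.input_ops), self._mm, self.kv_cache_c_handle,
                    verify_size=_VERIFY_KV_CACHE_SIZE)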
Description
Add qwen2-7B NPU support with fused decoderlayer and multi-process optimization.
Details: https://github.com/analytics-zoo/nano/issues/1576#issuecomment-2314969138
Different from other model support, we use QuantizedLinear instead of FusedQwenLowBitDecoderlayer during the prefill process.
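To make that split concrete, a rough sketch follows. QuantizedLinear and FusedQwenLowBitDecoderlayer are the class names from the description above; the dispatch function, the replace_linear_with_quantized helper, and the constructor signature are assumptions for illustration, not the real ipex-llm API:

def convert_qwen2_for_npu(model, stage):
    # Hypothetical dispatch: prefill keeps per-layer low-bit linears, while
    # decode swaps whole decoder layers for the fused NPU implementation.
    if stage == "prefill":
        # Prefill: keep the standard decoder layers but replace each nn.Linear
        # with a low-bit QuantizedLinear (replace_linear_with_quantized is an
        # assumed placeholder).
        replace_linear_with_quantized(model)
    else:
        # Decode: run each decoder layer as one fused NPU graph (constructor
        # signature assumed for illustration only).
        layers = model.model.layers
        for idx, layer in enumerate(layers):
            layers[idx] = FusedQwenLowBitDecoderlayer(layer)
    return model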
2. User API changes
N/A
3. Summary of the change
4. How to test?