[NPU] dump prefill IR for further C++ solution #12402

rnwang04 · 2024-11-14T10:42:18Z

Description

1. Why the change?

https://github.com/analytics-zoo/nano/issues/1716#issue-2628191642
To support a pure C++ NPU solution, we need to provide a "compile" tool that lets users save all needed files (IR / bin / blob).

2. User API changes

Added two params:

compile_full_model: if set to True, the prefill-related IR and bin files are saved as well; defaults to False
save_directory: directory in which all needed files (IR / bin / blob) are saved; defaults to None
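As a rough illustration of how the two new params might be combined on the caller side, here is a minimal sketch. The helper name `build_npu_dump_kwargs` is hypothetical, and the check that `compile_full_model=True` needs a `save_directory` is an assumption made for this example, not something this PR states.

```python
from typing import Any, Dict, Optional


def build_npu_dump_kwargs(compile_full_model: bool = False,
                          save_directory: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical helper: collect the two new params as extra keyword arguments.

    The defaults mirror the ones listed above (False / None).
    """
    if compile_full_model and save_directory is None:
        # Assumption for this sketch: dumping prefill files without a
        # target directory is treated as a caller error.
        raise ValueError("compile_full_model=True requires save_directory")

    kwargs: Dict[str, Any] = {}
    if save_directory is not None:
        kwargs["save_directory"] = save_directory
    if compile_full_model:
        kwargs["compile_full_model"] = True
    return kwargs
```

The returned dict could then be unpacked into the `from_pretrained(...)` call, so the Python-only path passes no extra arguments at all.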
If we only want to run inference on the Python side, the usage is unchanged:

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             pipeline=True,
                                             load_in_low_bit=args.load_in_low_bit,
                                             max_context_len=args.max_context_len,
                                             max_prompt_len=args.max_prompt_len,
                                             quantization_group_size=args.quantization_group_size,
                                             torch_dtype=torch.float16,
                                             attn_implementation="eager",
                                             transpose_value_cache=not args.disable_transpose_value_cache,
                                             mixed_precision=True,
                                             trust_remote_code=True)

If we want to dump the files for later C++ inference:

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             pipeline=True,
                                             load_in_low_bit=args.load_in_low_bit,
                                             max_context_len=args.max_context_len,
                                             max_prompt_len=args.max_prompt_len,
                                             quantization_group_size=args.quantization_group_size,
                                             torch_dtype=torch.float16,
                                             attn_implementation="eager",
                                             transpose_value_cache=not args.disable_transpose_value_cache,
                                             mixed_precision=True,
                                             trust_remote_code=True,
                                             compile_full_model=True,
                                             save_directory=save_dir)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.save_pretrained(save_dir)
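After a dump like the one above, it can be handy to check what actually landed in `save_dir`. The following is a minimal sketch using only the standard library; `summarize_dumped_files` is a hypothetical helper, and it makes no assumption about the file names or extensions the dump produces.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def summarize_dumped_files(save_dir: str) -> Dict[str, List[str]]:
    """Hypothetical helper: group files under save_dir by lowercase extension."""
    root = Path(save_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{save_dir} is not a directory")

    groups: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Files without a suffix are grouped under "(none)".
            key = path.suffix.lower() or "(none)"
            groups[key].append(path.relative_to(root).as_posix())
    return dict(groups)


def print_summary(save_dir: str) -> None:
    """Print one line per extension with the number of files found."""
    for ext, files in sorted(summarize_dumped_files(save_dir).items()):
        print(f"{ext}: {len(files)} file(s)")
```

Calling `print_summary(save_dir)` right after `tokenizer.save_pretrained(save_dir)` gives a quick overview of the directory before it is handed to the C++ side.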

3. Summary of the change

Added two params, compile_full_model and save_directory, to dump all files needed for C++ inference
Sorted out which files are necessary
Refactored the code

4. Verify correctness

Qwen2.5 7B pipeline python CW
Qwen2.5 7B pipeline c++ CW
Qwen2 1.5B pipeline python CW
Qwen2 1.5B pipeline c++ CW

Only Qwen2 is updated for now; this can be extended to other models later.

rnwang04 · 2024-11-18T11:57:50Z

C++ output verification can be found here: https://github.com/intel-analytics/llm.cpp/pull/655

rnwang04 marked this pull request as draft November 14, 2024 10:42

rnwang04 marked this pull request as ready for review November 18, 2024 10:12

rnwang04 force-pushed the dump_prefill_ir branch from 2111c59 to af61841 Compare November 18, 2024 10:42

rnwang04 added 7 commits November 18, 2024 18:43

save prefill ir

1e2259c

fix

5fa5432

shorten convert time

72b61a8

fix

5be4913

fix

f1a991a

fix

af61841

fix

59aac71

rnwang04 changed the title ~~[NPU] dump prefill IR~~ [NPU] dump prefill IR for further C++ solution Nov 18, 2024

fix style

b2aff13

rnwang04 requested a review from jason-dai November 18, 2024 11:22

rnwang04 requested review from hkvision and MeouSker77 November 19, 2024 02:42

dump config.json

2eeee2f

hkvision reviewed Nov 19, 2024

python/llm/src/ipex_llm/transformers/npu_pipeline_model/common.py (outdated, resolved)

rnwang04 added 2 commits November 19, 2024 17:02

meet review

fe92684

small fix

07a27bc

MeouSker77 approved these changes Nov 20, 2024

rnwang04 merged commit 54c62fe into intel-analytics:main Nov 20, 2024
1 check passed

rnwang04 deleted the dump_prefill_ir branch November 20, 2024 07:20
