[NPU] Add Optimized Support for Llama3.2-1B/3B on NPU #12339

Merged (10 commits, Nov 6, 2024)

Changes from 7 commits
19 changes: 19 additions & 0 deletions python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
@@ -7,6 +7,8 @@ In this directory, you will find examples on how to directly run HuggingFace `transformers` models
|------------|----------------------------------------------------------------|
| Llama2 | [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) |
| Llama3 | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
| Llama3.2-1B | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) |
| Llama3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) |
Contributor

Maybe we could merge these into one line?

| Chatglm3 | [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) |
| Chatglm2 | [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) |
| Qwen2 | [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct), [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) |
@@ -33,6 +35,9 @@ conda activate llm

:: install ipex-llm with 'npu' option
pip install --pre --upgrade ipex-llm[npu]

:: [optional] for Llama-3.2-1B-Instruct & Llama-3.2-3B-Instruct
pip install transformers==4.45.0 accelerate==0.33.0
```
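
As a quick follow-up (not part of the PR), a tiny optional sanity check can confirm that the Llama-3.2 prerequisites in use match the pins above; the expected version strings are taken directly from those pins:

```python
# Optional sanity check for the Llama-3.2 prerequisites pinned above.
import accelerate
import transformers

print("transformers:", transformers.__version__)  # expected: 4.45.0
print("accelerate:", accelerate.__version__)      # expected: 0.33.0
```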

## 2. Runtime Configurations
@@ -82,6 +87,8 @@ done
The examples below show how to run the **_optimized HuggingFace model implementations_** on Intel NPU, including
- [Llama2-7B](./llama.py)
- [Llama3-8B](./llama.py)
- [Llama3.2-1B](./llama.py)
- [Llama3.2-3B](./llama.py)
- [Qwen2-1.5B](./qwen.py)
- [Qwen2.5-7B](./qwen.py)
- [MiniCPM-1B](./minicpm.py)
@@ -106,6 +113,12 @@ python llama.py
:: to run Meta-Llama-3-8B-Instruct (LNL driver version: 32.0.101.2715)
python llama.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct

:: to run Llama-3.2-1B-Instruct
python llama.py --repo-id-or-model-path meta-llama/Llama-3.2-1B-Instruct

:: to run Llama-3.2-3B-Instruct
python llama.py --repo-id-or-model-path meta-llama/Llama-3.2-3B-Instruct

:: to run Qwen2-1.5B-Instruct (LNL driver version: 32.0.101.2715)
python qwen.py

@@ -145,6 +158,12 @@ python llama.py --disable-transpose-value-cache
:: to run Meta-Llama-3-8B-Instruct (LNL driver version: 32.0.101.2715)
python llama.py --repo-id-or-model-path meta-llama/Meta-Llama-3-8B-Instruct --disable-transpose-value-cache

:: to run Llama-3.2-1B-Instruct
python llama.py --repo-id-or-model-path meta-llama/Llama-3.2-1B-Instruct --disable-transpose-value-cache

:: to run Llama-3.2-3B-Instruct
python llama.py --repo-id-or-model-path meta-llama/Llama-3.2-3B-Instruct --disable-transpose-value-cache

:: to run Qwen2-1.5B-Instruct (LNL driver version: 32.0.101.2715)
python qwen.py --disable-transpose-value-cache

2 changes: 2 additions & 0 deletions python/llm/example/NPU/HF-Transformers-AutoModels/README.md
@@ -10,6 +10,8 @@ This folder contains examples of running IPEX-LLM on Intel NPU:
|------------|----------------------------------------------------------------|
| Llama2 | [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) |
| Llama3 | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
| Llama3.2-1B | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) |
| Llama3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) |
| Chatglm3 | [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) |
| Chatglm2 | [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) |
| Qwen2 | [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct), [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) |
15 changes: 11 additions & 4 deletions python/llm/src/ipex_llm/transformers/npu_models/convert_mp.py
@@ -173,7 +173,8 @@ def convert_llama(
    intra_pp=None,
    transpose_value_cache=True,
):
    from ipex_llm.transformers.npu_models.llama_mp import gen_llama_fused_model_forward
    from ipex_llm.transformers.npu_models.llama_mp import gen_llama_fused_model_forward, \
        gen_llama_32_fused_model_forward
    from ipex_llm.transformers.npu_models.llama_mp import DecodeRunner, PrefillRunner
    from transformers.models.llama.modeling_llama import LlamaModel
@@ -193,9 +194,15 @@ def convert_llama(
        max_prompt_len=max_prompt_len,
        transpose_value_cache=transpose_value_cache,
    )
    llama_model_forward = gen_llama_fused_model_forward(
        prefill_runner=prefill_runner, decode_runner=decode_runner
    )
    if model.config.num_hidden_layers == 28 or model.config.num_hidden_layers == 16:
        # Llama-3.2-3B (28 layers) & Llama-3.2-1B (16 layers)
        llama_model_forward = gen_llama_32_fused_model_forward(
Contributor

I feel we could also add a transformers version check here.
            prefill_runner=prefill_runner, decode_runner=decode_runner
        )
    else:
        llama_model_forward = gen_llama_fused_model_forward(
            prefill_runner=prefill_runner, decode_runner=decode_runner
        )
    convert_forward(model, LlamaModel, llama_model_forward)
    from transformers.models.llama.modeling_llama import LlamaForCausalLM
    from ipex_llm.transformers.npu_models.llama_mp import llama2_casullm_forward
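Regarding the review comment above about a transformers version check: a minimal sketch of what such a check might look like is below. It is a hypothetical helper, not code from this PR; the 4.45.0 floor is an assumption taken from the pip pin in the README above, and the layer-count remark mirrors the dispatch in `convert_llama`.

```python
# Hypothetical sketch of the suggested transformers version check; not part of this PR.
from packaging import version

import transformers


def check_transformers_for_llama_3_2(min_version: str = "4.45.0") -> None:
    # The README above pins transformers==4.45.0 for Llama-3.2-1B/3B, so refuse
    # to take the Llama-3.2 fused path (28- or 16-layer models) on older versions.
    installed = version.parse(transformers.__version__)
    if installed < version.parse(min_version):
        raise RuntimeError(
            f"Llama-3.2 on NPU expects transformers>={min_version}, "
            f"but {transformers.__version__} is installed; "
            f"try: pip install transformers=={min_version}"
        )
```

Such a check could be called just before `gen_llama_32_fused_model_forward` is selected in `convert_llama`; the repository's own error helpers could of course replace the plain `RuntimeError`.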
5 changes: 4 additions & 1 deletion python/llm/src/ipex_llm/transformers/npu_models/kv.py
@@ -145,7 +145,7 @@ class DynamicFusedNormalCache(DynamicCache):
    # Experimental support for fused decoderlayer implementation on NPU
    # Currently only for llama2

    def __init__(self) -> None:
    def __init__(self, num_hidden_layers: Optional[int] = None) -> None:
        self.key_cache: Dict[int, torch.Tensor] = {}
        self.value_cache: Dict[int, torch.Tensor] = {}
        self.min_layer_idx = sys.maxsize
@@ -158,6 +158,9 @@ def update(
        cache_kwargs: Optional[Dict[str, Any]]=None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:

        if key_states == []:
            return key_states, value_states

        batch_size, num_heads, seq_len, head_dim = key_states.shape

        max_seq_length = cache_kwargs["max_seq_len"] if "max_seq_len" in cache_kwargs else None
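For readers skimming the kv.py hunks: the two changes are an optional `num_hidden_layers` constructor argument and an early return when there is nothing to cache. A self-contained toy sketch of that behavior follows; `ToyFusedCache` and its shapes are illustrative only, not the real `DynamicFusedNormalCache`, and it uses an explicit `isinstance` check where the PR compares against `[]` directly.

```python
# Toy illustration of the kv.py changes; not the real DynamicFusedNormalCache.
from typing import Any, Dict, Optional, Tuple

import torch


class ToyFusedCache:
    def __init__(self, num_hidden_layers: Optional[int] = None) -> None:
        # Optional argument, so existing zero-argument call sites keep working.
        self.num_hidden_layers = num_hidden_layers
        self.key_cache: Dict[int, torch.Tensor] = {}
        self.value_cache: Dict[int, torch.Tensor] = {}

    def update(
        self,
        key_states,
        value_states,
        layer_idx: int,
        cache_kwargs: Optional[Dict[str, Any]] = None,
    ) -> Tuple[Any, Any]:
        # Early return when the caller has nothing to cache for this layer
        # (the PR writes `key_states == []`; the isinstance form keeps this toy
        # guard from ever comparing a tensor against a list).
        if isinstance(key_states, list) and not key_states:
            return key_states, value_states
        batch_size, num_heads, seq_len, head_dim = key_states.shape
        self.key_cache[layer_idx] = key_states
        self.value_cache[layer_idx] = value_states
        return self.key_cache[layer_idx], self.value_cache[layer_idx]


cache = ToyFusedCache(num_hidden_layers=16)  # 16 hidden layers, as in Llama-3.2-1B
k = v = torch.zeros(1, 8, 4, 64)             # (batch, num_heads, seq_len, head_dim)
cache.update(k, v, layer_idx=0)              # normal update: tensors are stored
cache.update([], [], layer_idx=1)            # no-op thanks to the early return
```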