Add experimental support of fused decoder layer for llama2 #11768
Conversation
Force-pushed from df4be6f to 630dd23.
I think the changes in pipeline_parallel.py, if cleaned up a bit, should not affect GPU code. Maybe we should try to merge them later.
pip install --pre --upgrade ipex-llm[all]
pip install --pre --upgrade bigdl-core-npu
pip install transformers==4.40
I am experimenting on transformers==4.39.3; have we verified 4.40?
Yes, I have also verified transformers==4.40.

Have added optimize_llm_post to wrap the lm_head related logic, etc. (a rough sketch of this idea is below). And considering we may need an additional process to handle prefill, maybe we could keep NPU and GPU PP separate temporarily.

Merging it first; please let me know if you have any suggestions or comments.
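For illustration, here is a minimal sketch of what wrapping lm_head in a post-conversion step could look like. This is an assumption-based example, not the actual optimize_llm_post implementation in ipex-llm; WrappedLMHead and its behavior are placeholders.

```python
# Illustrative sketch only -- NOT the actual ipex-llm optimize_llm_post implementation.
# It shows the general idea of wrapping lm_head in a post-conversion step so that
# NPU-specific handling can be attached later; all names here are placeholders.
import torch


class WrappedLMHead(torch.nn.Module):
    """Placeholder wrapper around the original lm_head linear layer."""

    def __init__(self, lm_head: torch.nn.Linear):
        super().__init__()
        self.lm_head = lm_head

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The real hook could dispatch to an NPU-friendly kernel here.
        return self.lm_head(hidden_states)


def optimize_llm_post(model: torch.nn.Module) -> torch.nn.Module:
    """Wrap lm_head-related logic after the rest of the model has been converted."""
    if isinstance(getattr(model, "lm_head", None), torch.nn.Linear):
        model.lm_head = WrappedLMHead(model.lm_head)
    return model
```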
Description
Refactor of #11716. This experimental support is currently only verified on llama2-7b.

To avoid affecting GPU support and NPU support for other models (llama3, qwen2, etc.), DynamicFusedNormalCache, llama_fused_model_forward and pipeline_parallel are added under transformers/npu_models/; the related logic is used only when pipeline_parallel_stages > 1 is specified (a rough sketch of this gating follows below).

Added an experimental example and readme.
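To make the gating concrete, the following is a hypothetical sketch of how the fused llama decoder path might be enabled only when pipeline_parallel_stages > 1. The helper names (llama_fused_decoder_forward, convert_llama_for_npu_pp) are placeholders and do not reflect the actual code added in this PR.

```python
# Rough sketch of the gating described above -- not the actual ipex-llm code.
# `llama_fused_decoder_forward` stands in for the fused decoder-layer forward added
# under transformers/npu_models/; its real signature will differ.
import types

import torch
from transformers.models.llama.modeling_llama import LlamaDecoderLayer


def llama_fused_decoder_forward(self, hidden_states, *args, **kwargs):
    # Placeholder body: the real fused forward would run the whole decoder layer
    # (attention + MLP) as one fused op on the NPU and use DynamicFusedNormalCache.
    raise NotImplementedError("illustrative placeholder only")


def convert_llama_for_npu_pp(model: torch.nn.Module, pipeline_parallel_stages: int):
    """Patch the llama decoder layers only when pipeline parallelism is requested."""
    if pipeline_parallel_stages <= 1:
        return model  # GPU path and other models stay on the existing code
    for module in model.modules():
        if isinstance(module, LlamaDecoderLayer):
            module.forward = types.MethodType(llama_fused_decoder_forward, module)
    return model
```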
4. How to test?
Manually triggered the PR validation (e.g., 1234); action link: https://github.com/intel-analytics/ipex-llm-workflow/actions/runs/10362835964