
Add vllm awq loading logic #11987

Merged

Conversation

ACupofAir
Contributor

Description

  • We use an environment variable to get the group size, because the quantization config is not accessible within ipex-llm (see the sketch after this list).
  • This PR applies main's Add vllm awq loading logic #11950 to the ipex-llm-mainline branch.
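
A minimal sketch of the environment-variable approach described above. The variable name IPEX_LLM_AWQ_GROUP_SIZE and the default of 128 are assumptions for illustration only; the actual name and default used by this PR may differ.

```python
import os

def get_awq_group_size(default: int = 128) -> int:
    """Read the AWQ group size from an environment variable.

    ipex-llm cannot see the model's quantization config at this point, so the
    group size is passed in from the environment instead. The variable name
    and default below are assumptions for illustration, not the values used
    in the actual PR.
    """
    value = os.environ.get("IPEX_LLM_AWQ_GROUP_SIZE")
    return default if value is None else int(value)
```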

@ACupofAir
Contributor Author

Tested on docker image vllm-ipex-054:0903. There are no exceptions in the output.
Results:

  1. chatglm3-6b, 1 card:
    (screenshot)

  2. llama2-13b, 2 cards:
    (screenshot)

@glorysdj requested a review from gc-fu on September 4, 2024 01:01
@gc-fu
Contributor

gc-fu commented Sep 4, 2024


The result of chatglm3-6b seems weird to me; it might be caused by not setting export BIGDL_LLM_SDP_IGNORE_MASK=0.

Please set this environment variable and test again. Also, please add one AWQ test result to the thread. For instance: https://huggingface.co/TheBloke/Llama-2-7B-Chat-AWQ

@ACupofAir
Contributor Author


All results are reasonable now.

  1. Verification for the AWQ model (llama2-7b-chat-awq):
    (screenshot)
  2. Result for chatglm3-6b after export BIGDL_LLM_SDP_IGNORE_MASK=0:
    (screenshot)

Attention: to run the AWQ model, you need to apply this PR (analytics-zoo/vllm#29) for vLLM 0.5.4.
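
For reference, a minimal offline-inference sketch of the kind of AWQ verification described in this thread, assuming vLLM's standard Python API. The tests above were actually run inside the vllm-ipex-054:0903 docker image, so the exact entrypoint in the ipex-llm build may differ.

```python
import os

# Set before engine construction, as suggested earlier in the thread.
os.environ["BIGDL_LLM_SDP_IGNORE_MASK"] = "0"

from vllm import LLM, SamplingParams

# Load the AWQ checkpoint mentioned above; quantization="awq" asks vLLM to
# use its AWQ weight-loading path.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

outputs = llm.generate(
    ["What is AI?"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```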

@gc-fu (Contributor) left a comment


LGTM

@gc-fu merged commit 56b8514 into intel-analytics:ipex-vllm-mainline on Sep 6, 2024
gc-fu pushed a commit that referenced this pull request Sep 10, 2024
* [ADD] Add vllm awq loading logic

* [FIX] fix the module.linear_method path

* [FIX] fix quant_config path error
gc-fu added a commit that referenced this pull request Sep 10, 2024
* Enable single card sync engine

* enable ipex-llm optimizations for vllm

* enable optimizations for lm_head

* Fix chatglm multi-reference problem

* Remove duplicate layer

* LLM: Update vLLM to v0.5.4 (#11746)

* Enable single card sync engine

* enable ipex-llm optimizations for vllm

* enable optimizations for lm_head

* Fix chatglm multi-reference problem

* update 0.5.4 api_server

* add dockerfile

* fix

* fix

* refine

* fix

---------

Co-authored-by: gc-fu <[email protected]>

* Add vllm-0.5.4 Dockerfile (#11838)

* Update BIGDL_LLM_SDP_IGNORE_MASK in start-vllm-service.sh (#11957)

* Fix vLLM not convert issues (#11817) (#11918)

* Fix not convert issues

* refine

Co-authored-by: Guancheng Fu <[email protected]>

* Fix glm4-9b-chat nan error on vllm 0.5.4 (#11969)

* init

* update mlp forward

* fix minicpm error in vllm 0.5.4

* fix dependabot alerts (#12008)

* Update 0.5.4 dockerfile (#12021)

* Add vllm awq loading logic (#11987)

* [ADD] Add vllm awq loading logic

* [FIX] fix the module.linear_method path

* [FIX] fix quant_config path error

* Enable Qwen padding mlp to 256 to support batch_forward (#12030)

* Enable padding mlp

* padding to 256

* update style

* Install 27191 runtime in 0.5.4 docker image (#12040)

* fix rebase error

* fix rebase error

* vLLM: format for 0.5.4 rebase (#12043)

* format

* Update model_convert.py

* Fix serving docker related modifications (#12046)

* Fix undesired modifications (#12048)

* fix

* Refine offline_inference arguments

---------

Co-authored-by: Xiangyu Tian <[email protected]>
Co-authored-by: Jun Wang <[email protected]>
Co-authored-by: Wang, Jian4 <[email protected]>
Co-authored-by: liu-shaojun <[email protected]>
Co-authored-by: Shaojun Liu <[email protected]>