Merge branch 'intel-analytics:main' into main
notsyncing authored Nov 16, 2024
2 parents 9604698 + 3d5fbf2 commit 1b71f5f
Showing 281 changed files with 23,090 additions and 75,454 deletions.
2 changes: 2 additions & 0 deletions .github/actions/llm/download-llm-binary/action.yml
@@ -27,6 +27,7 @@ runs:
  mv windows-avx2/* python/llm/llm-binary/
  mv windows-avx-vnni/* python/llm/llm-binary/
  mv windows-avx/* python/llm/llm-binary/
  mv windows-npu-level0/* python/llm/llm-binary/
fi
rm -rf linux-avx2 || true
rm -rf linux-avx512 || true
@@ -36,3 +37,4 @@ runs:
rm -rf windows-avx2 || true
rm -rf windows-avx-vnni || true
rm -rf windows-avx || true
rm -rf windows-npu-level0 || true
58 changes: 58 additions & 0 deletions .github/workflows/llm-binary-build.yml
@@ -443,6 +443,64 @@ jobs:
          path: |
            release
  check-windows-npu-level0-artifact:
    if: ${{contains(inputs.platform, 'Windows')}}
    runs-on: [Shire]
    outputs:
      if-exists: ${{steps.check_artifact.outputs.exists}}
    steps:
      - name: Check if built
        id: check_artifact
        uses: xSAVIKx/artifact-exists-action@v0
        with:
          name: windows-npu-level0

  windows-build-npu-level0:
    runs-on: [self-hosted, Windows, npu-level0]
    needs: check-windows-npu-level0-artifact
    if: needs.check-windows-npu-level0-artifact.outputs.if-exists == 'false'
    steps:
      - name: Set access token
        run: |
          echo "github_access_token=$env:GITHUB_ACCESS_TOKEN" >> $env:GITHUB_ENV
          echo "github_access_token=$env:GITHUB_ACCESS_TOKEN"
      - uses: actions/checkout@f43a0e5ff2bd294095638e18286ca9a3d1956744 # actions/checkout@v3
        with:
          repository: "intel-analytics/llm.cpp"
          ref: ${{ inputs.llmcpp-ref }}
          token: ${{ env.github_access_token }}
          submodules: "recursive"
      - name: Add msbuild to PATH
        uses: microsoft/[email protected]
        with:
          msbuild-architecture: x64
      - name: Add cmake to PATH
        uses: ilammy/msvc-dev-cmd@v1
      - name: Build binary
        shell: cmd
        run: |
          call "C:\Program Files (x86)\Intel\openvino_2024.4.0\setupvars.bat"
          cd bigdl-core-npu-level0
          sed -i "/FetchContent_MakeAvailable(intel_npu_acceleration_library)/s/^/#/" CMakeLists.txt
          mkdir build
          cd build
          cmake ..
          cmake --build . --config Release -t pipeline
      - name: Move release binary
        shell: powershell
        run: |
          cd bigdl-core-npu-level0
          if (Test-Path ./release) { rm -r -fo release }
          mkdir release
          mv build/Release/pipeline.dll release/pipeline.dll
      - name: Archive build files
        uses: actions/upload-artifact@v3
        with:
          name: windows-npu-level0
          path: |
            bigdl-core-npu-level0/release
  # to make llm-binary-build optionally skippable
  dummy-step:
    if: ${{ inputs.platform == 'Dummy' }}
982 changes: 866 additions & 116 deletions .github/workflows/llm_performance_tests.yml

Large diffs are not rendered by default.

17 changes: 10 additions & 7 deletions README.md
@@ -8,11 +8,11 @@
<b>< English</b> | <a href='./README.zh-CN.md'>中文</a> >
</p>

**`IPEX-LLM`** is a PyTorch library for running **LLM** on Intel CPU and GPU *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)* with very low latency[^1].
**`IPEX-LLM`** is an LLM acceleration library for Intel ***CPU***, ***GPU*** *(e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max)* and ***NPU***[^1].
> [!NOTE]
> - *It is built on top of the excellent work of **`llama.cpp`**, **`transformers`**, **`bitsandbytes`**, **`vLLM`**, **`qlora`**, **`AutoGPTQ`**, **`AutoAWQ`**, etc.*
> - *It provides seamless integration with [llama.cpp](docs/mddocs/Quickstart/llama_cpp_quickstart.md), [Ollama](docs/mddocs/Quickstart/ollama_quickstart.md), [Text-Generation-WebUI](docs/mddocs/Quickstart/webui_quickstart.md), [HuggingFace transformers](python/llm/example/GPU/HuggingFace), [LangChain](python/llm/example/GPU/LangChain), [LlamaIndex](python/llm/example/GPU/LlamaIndex), [DeepSpeed-AutoTP](python/llm/example/GPU/Deepspeed-AutoTP), [vLLM](docs/mddocs/Quickstart/vLLM_quickstart.md), [FastChat](docs/mddocs/Quickstart/fastchat_quickstart.md), [Axolotl](docs/mddocs/Quickstart/axolotl_quickstart.md), [HuggingFace PEFT](python/llm/example/GPU/LLM-Finetuning), [HuggingFace TRL](python/llm/example/GPU/LLM-Finetuning/DPO), [AutoGen](python/llm/example/CPU/Applications/autogen), [ModelScope](python/llm/example/GPU/ModelScope-Models), etc.*
> - ***50+ models** have been optimized/verified on `ipex-llm` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list [here](#verified-models).*
> - *It provides seamless integration with [llama.cpp](docs/mddocs/Quickstart/llama_cpp_quickstart.md), [Ollama](docs/mddocs/Quickstart/ollama_quickstart.md), [HuggingFace transformers](python/llm/example/GPU/HuggingFace), [LangChain](python/llm/example/GPU/LangChain), [LlamaIndex](python/llm/example/GPU/LlamaIndex), [vLLM](docs/mddocs/Quickstart/vLLM_quickstart.md), [Text-Generation-WebUI](docs/mddocs/Quickstart/webui_quickstart.md), [DeepSpeed-AutoTP](python/llm/example/GPU/Deepspeed-AutoTP), [FastChat](docs/mddocs/Quickstart/fastchat_quickstart.md), [Axolotl](docs/mddocs/Quickstart/axolotl_quickstart.md), [HuggingFace PEFT](python/llm/example/GPU/LLM-Finetuning), [HuggingFace TRL](python/llm/example/GPU/LLM-Finetuning/DPO), [AutoGen](python/llm/example/CPU/Applications/autogen), [ModelScope](python/llm/example/GPU/ModelScope-Models), etc.*
> - ***70+ models** have been optimized/verified on `ipex-llm` (e.g., Llama, Phi, Mistral, Mixtral, Whisper, Qwen, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art **LLM optimizations**, **XPU acceleration** and **low-bit (FP8/FP6/FP4/INT4) support**; see the complete list [here](#verified-models).*
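
As a rough, illustrative sketch of the Python interface and low-bit support described above (not part of this commit), the snippet below loads a HuggingFace model with `ipex-llm` INT4 optimization and runs it on an Intel GPU; the model id, prompt, and generation settings are placeholders, and the exact API should be checked against the `ipex-llm` documentation.

```python
# Minimal sketch, assuming ipex-llm is installed with XPU support and an Intel GPU
# is available; the model id and prompt are placeholders, not taken from this commit.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement loader

model_id = "meta-llama/Llama-2-7b-chat-hf"  # any verified model from the table below

# load_in_4bit=True applies the INT4 low-bit optimization while loading
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.half().to("xpu")  # move the optimized model to the Intel GPU (XPU)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
with torch.inference_mode():
    input_ids = tokenizer.encode("What is IPEX-LLM?", return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```
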
## Latest Update 🔥
- [2024/07] We added support for running Microsoft's **GraphRAG** using local LLM on Intel GPU; see the quickstart guide [here](docs/mddocs/Quickstart/graphrag_quickstart.md).
@@ -177,17 +177,17 @@ Please see the **Perplexity** result below (tested on Wikitext dataset using the
## `ipex-llm` Quickstart

### Docker
- [GPU Inference in C++](docs/mddocs/DockerGuides/docker_cpp_xpu_quickstart.md): running `llama.cpp`, `ollama`, `OpenWebUI`, etc., with `ipex-llm` on Intel GPU
- [GPU Inference in C++](docs/mddocs/DockerGuides/docker_cpp_xpu_quickstart.md): running `llama.cpp`, `ollama`, etc., with `ipex-llm` on Intel GPU
- [GPU Inference in Python](docs/mddocs/DockerGuides/docker_pytorch_inference_gpu.md): running HuggingFace `transformers`, `LangChain`, `LlamaIndex`, `ModelScope`, etc. with `ipex-llm` on Intel GPU
- [vLLM on GPU](docs/mddocs/DockerGuides/vllm_docker_quickstart.md): running `vLLM` serving with `ipex-llm` on Intel GPU
- [vLLM on CPU](docs/mddocs/DockerGuides/vllm_cpu_docker_quickstart.md): running `vLLM` serving with `ipex-llm` on Intel CPU
- [FastChat on GPU](docs/mddocs/DockerGuides/fastchat_docker_quickstart.md): running `FastChat` serving with `ipex-llm` on Intel GPU
- [VSCode on GPU](docs/mddocs/DockerGuides/docker_run_pytorch_inference_in_vscode.md): running and developing `ipex-llm` applications in Python using VSCode on Intel GPU

### Use
- [llama.cpp](docs/mddocs/Quickstart/llama_cpp_quickstart.md): running **llama.cpp** (*using C++ interface of `ipex-llm` as an accelerated backend for `llama.cpp`*) on Intel GPU
- [Ollama](docs/mddocs/Quickstart/ollama_quickstart.md): running **ollama** (*using C++ interface of `ipex-llm` as an accelerated backend for `ollama`*) on Intel GPU
- [Llama 3 with `llama.cpp` and `ollama`](docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md): running **Llama 3** on Intel GPU using `llama.cpp` and `ollama` with `ipex-llm`
- [llama.cpp](docs/mddocs/Quickstart/llama_cpp_quickstart.md): running **llama.cpp** (*using C++ interface of `ipex-llm`*) on Intel GPU
- [Ollama](docs/mddocs/Quickstart/ollama_quickstart.md): running **ollama** (*using C++ interface of `ipex-llm`*) on Intel GPU
- [PyTorch/HuggingFace](docs/mddocs/Quickstart/install_windows_gpu.md): running **PyTorch**, **HuggingFace**, **LangChain**, **LlamaIndex**, etc. (*using Python interface of `ipex-llm`*) on Intel GPU for [Windows](docs/mddocs/Quickstart/install_windows_gpu.md) and [Linux](docs/mddocs/Quickstart/install_linux_gpu.md) *(see the sketch after this list)*
- [vLLM](docs/mddocs/Quickstart/vLLM_quickstart.md): running `ipex-llm` in **vLLM** on both Intel [GPU](docs/mddocs/DockerGuides/vllm_docker_quickstart.md) and [CPU](docs/mddocs/DockerGuides/vllm_cpu_docker_quickstart.md)
- [FastChat](docs/mddocs/Quickstart/fastchat_quickstart.md): running `ipex-llm` in **FastChat** serving on both Intel GPU and CPU
- [Serving on multiple Intel GPUs](docs/mddocs/Quickstart/deepspeed_autotp_fastapi_quickstart.md): running `ipex-llm` **serving on multiple Intel GPUs** by leveraging DeepSpeed AutoTP and FastAPI
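
For the "PyTorch/HuggingFace" entry above, a similarly hedged sketch: `ipex-llm` also provides an `optimize_model` helper that can be applied to an already-loaded PyTorch model. The model id below is illustrative only, and the helper's options should be verified against the current `ipex-llm` release.

```python
# Hedged sketch: applying ipex-llm optimization to an existing PyTorch/HuggingFace
# model; the model id is illustrative and the options should be checked in the docs.
from transformers import AutoModelForCausalLM
from ipex_llm import optimize_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct",
                                             torch_dtype="auto",
                                             trust_remote_code=True)
model = optimize_model(model)   # defaults to low-bit (INT4) quantization
model = model.half().to("xpu")  # optional: run on an Intel GPU instead of the CPU
```
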
@@ -258,6 +258,8 @@ Over 50 models have been optimized/verified on `ipex-llm`, including *LLaMA/LLaM
| LLaMA 2 | [link1](python/llm/example/CPU/Native-Models), [link2](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2) | [link](python/llm/example/GPU/HuggingFace/LLM/llama2) |
| LLaMA 3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama3) | [link](python/llm/example/GPU/HuggingFace/LLM/llama3) |
| LLaMA 3.1 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama3.1) | [link](python/llm/example/GPU/HuggingFace/LLM/llama3.1) |
| LLaMA 3.2 | | [link](python/llm/example/GPU/HuggingFace/LLM/llama3.2) |
| LLaMA 3.2-Vision | | [link](python/llm/example/GPU/PyTorch-Models/Model/llama3.2-vision/) |
| ChatGLM | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm) | |
| ChatGLM2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm2) | [link](python/llm/example/GPU/HuggingFace/LLM/chatglm2) |
| ChatGLM3 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/chatglm3) | [link](python/llm/example/GPU/HuggingFace/LLM/chatglm3) |
@@ -282,6 +284,7 @@ Over 50 models have been optimized/verified on `ipex-llm`, including *LLaMA/LLaM
| Qwen2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen2) | [link](python/llm/example/GPU/HuggingFace/LLM/qwen2) |
| Qwen2.5 | | [link](python/llm/example/GPU/HuggingFace/LLM/qwen2.5) |
| Qwen-VL | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl) | [link](python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl) |
| Qwen2-VL | | [link](python/llm/example/GPU/PyTorch-Models/Model/qwen2-vl) |
| Qwen2-Audio | | [link](python/llm/example/GPU/HuggingFace/Multimodal/qwen2-audio) |
| Aquila | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila) | [link](python/llm/example/GPU/HuggingFace/LLM/aquila) |
| Aquila2 | [link](python/llm/example/CPU/HF-Transformers-AutoModels/Model/aquila2) | [link](python/llm/example/GPU/HuggingFace/LLM/aquila2) |