LLM: update example layout (#9046)
plusbang authored Oct 9, 2023
1 parent 44db766 commit 02bf757
Showing 118 changed files with 204 additions and 185 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -12,8 +12,8 @@
> *It is built on top of the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [ggml](https://github.com/ggerganov/ggml), [gptq](https://github.com/IST-DASLab/gptq), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [qlora](https://github.com/artidoro/qlora), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), [gptq_for_llama](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [redpajama.cpp](https://github.com/togethercomputer/redpajama.cpp), [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp), [bloomz.cpp](https://github.com/NouamaneTazi/bloomz.cpp/), etc.*
### Latest update
- **[New]** `bigdl-llm` now supports QLoRA finetuning on Intel GPU; see the example [here](python/llm/example/gpu/qlora_finetuning).
- `bigdl-llm` now supports Intel GPU (including Arc, Flex and MAX); see the latest GPU examples [here](python/llm/example/gpu).
- **[New]** `bigdl-llm` now supports QLoRA finetuning on Intel GPU; see the example [here](python/llm/example/GPU/QLoRA-FineTuning).
- `bigdl-llm` now supports Intel GPU (including Arc, Flex and MAX); see the latest GPU examples [here](python/llm/example/GPU).
- `bigdl-llm` tutorial is released [here](https://github.com/intel-analytics/bigdl-llm-tutorial).
- Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLaMA2, ChatGLM/ChatGLM2, MPT, Falcon, Dolly, StarCoder, Whisper, InternLM, QWen, Baichuan, Aquila, MOSS,* and more; see the complete list [here](python/llm/README.md#verified-models).

@@ -76,7 +76,7 @@ input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```
*See the complete examples [here](python/llm/example/transformers/transformers_int4/).*
*See the complete examples [here](python/llm/example/CPU/HF-Transformers-AutoModels/Model).*
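For readers skimming the diff, a self-contained sketch of the CPU INT4 flow that the truncated snippet above belongs to may help (a minimal, hedged sketch only; the model path and generation arguments are placeholders, so defer to the linked examples for the authoritative version):

```python
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = '/path/to/model/'  # placeholder: any supported Hugging Face checkpoint
# load_in_4bit=True applies BigDL-LLM INT4 optimizations while loading
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=32)
output = tokenizer.batch_decode(output_ids)
print(output[0])
```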

#### GPU INT4
##### Install
@@ -105,7 +105,7 @@ input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids.cpu())
```
*See the complete examples [here](python/llm/example/gpu/).*
*See the complete examples [here](python/llm/example/GPU).*
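A corresponding hedged sketch for the GPU INT4 flow (assuming a working `xpu` setup via `intel_extension_for_pytorch`; paths and generation arguments are again placeholders):

```python
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = '/path/to/model/'  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
model = model.to('xpu')  # move the INT4-optimized model to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt").to('xpu')
output_ids = model.generate(input_ids, max_new_tokens=32)
output = tokenizer.batch_decode(output_ids.cpu())  # move results back to CPU before decoding
print(output[0])
```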

#### More Low-Bit Support
##### Save and load
@@ -115,15 +115,15 @@ After the model is optimized using `bigdl-llm`, you may save and load the model
model.save_low_bit(model_path)
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
*See the complete example [here](python/llm/example/transformers/transformers_low_bit/).*
*See the complete example [here](python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load).*

##### Additional data types

In addition to INT4, you may apply other low-bit optimizations (such as *INT8*, *INT5*, *NF4*, etc.) as follows:
```python
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")
```
*See the complete example [here](python/llm/example/transformers/transformers_low_bit/).*
*See the complete example [here](python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types).*


***For more details, please refer to the `bigdl-llm` [Document](https://test-bigdl-llm.readthedocs.io/en/main/doc/LLM/index.html), [Readme](python/llm), [Tutorial](https://github.com/intel-analytics/bigdl-llm-tutorial) and [API Doc](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/index.html).***
49 changes: 25 additions & 24 deletions python/llm/README.md
@@ -40,23 +40,24 @@ Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLa

| Model | Example |
|-----------|----------------------------------------------------------|
| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/vicuna) |
| LLaMA 2 | [link](example/transformers/transformers_int4/llama2) |
| MPT | [link](example/transformers/transformers_int4/mpt) |
| Falcon | [link](example/transformers/transformers_int4/falcon) |
| ChatGLM | [link](example/transformers/transformers_int4/chatglm) |
| ChatGLM2 | [link](example/transformers/transformers_int4/chatglm2) |
| Qwen | [link](example/transformers/transformers_int4/qwen) |
| MOSS | [link](example/transformers/transformers_int4/moss) |
| Baichuan | [link](example/transformers/transformers_int4/baichuan) |
| Baichuan2 | [link](example/transformers/transformers_int4/baichuan2) |
| Dolly-v1 | [link](example/transformers/transformers_int4/dolly_v1) |
| Dolly-v2 | [link](example/transformers/transformers_int4/dolly_v2) |
| RedPajama | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/redpajama) |
| Phoenix | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/phoenix) |
| StarCoder | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/starcoder) |
| InternLM | [link](example/transformers/transformers_int4/internlm) |
| Whisper | [link](example/transformers/transformers_int4/whisper) |
| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](example/CPU/Native-Models), [link2](example/CPU/HF-Transformers-AutoModels/Model/vicuna) |
| LLaMA 2 | [link](example/CPU/HF-Transformers-AutoModels/Model/llama2) |
| MPT | [link](example/CPU/HF-Transformers-AutoModels/Model/mpt) |
| Falcon | [link](example/CPU/HF-Transformers-AutoModels/Model/falcon) |
| ChatGLM | [link](example/CPU/HF-Transformers-AutoModels/Model/chatglm) |
| ChatGLM2 | [link](example/CPU/HF-Transformers-AutoModels/Model/chatglm2) |
| Qwen | [link](example/CPU/HF-Transformers-AutoModels/Model/qwen) |
| MOSS | [link](example/CPU/HF-Transformers-AutoModels/Model/moss) |
| Baichuan | [link](example/CPU/HF-Transformers-AutoModels/Model/baichuan) |
| Baichuan2 | [link](example/CPU/HF-Transformers-AutoModels/Model/baichuan2) |
| Dolly-v1 | [link](example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) |
| Dolly-v2 | [link](example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) |
| RedPajama | [link1](example/CPU/Native-Models), [link2](example/CPU/HF-Transformers-AutoModels/Model/redpajama) |
| Phoenix | [link1](example/CPU/Native-Models), [link2](example/CPU/HF-Transformers-AutoModels/Model/phoenix) |
| StarCoder | [link1](example/CPU/Native-Models), [link2](example/CPU/HF-Transformers-AutoModels/Model/starcoder) |
| InternLM | [link](example/CPU/HF-Transformers-AutoModels/Model/internlm) |
| Whisper | [link](example/CPU/HF-Transformers-AutoModels/Model/whisper) |
| Aquila | [link](example/CPU/HF-Transformers-AutoModels/Model/aquila) |

</details>

@@ -119,7 +120,7 @@ output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```

See the complete examples [here](example/transformers/transformers_int4/).
See the complete examples [here](example/CPU/HF-Transformers-AutoModels/Model/).

###### GPU INT4
You may apply INT4 optimizations to any Hugging Face *Transformers* model on Intel GPU as follows.
@@ -138,7 +139,7 @@ input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids.cpu())
```
See the complete examples [here](example/gpu/).
See the complete examples [here](example/GPU).

###### More Low-Bit Support
- Save and load
@@ -148,7 +149,7 @@ See the complete examples [here](example/gpu/).
model.save_low_bit(model_path)
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
*See the complete example [here](example/transformers/transformers_low_bit/).*
*See the complete example [here](example/CPU/HF-Transformers-AutoModels/Save-Load).*

- Additional data types

@@ -157,7 +158,7 @@ See the complete examples [here](example/gpu/).
```python
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")
```
*See the complete example [here](example/transformers/transformers_low_bit/).*
*See the complete example [here](example/CPU/HF-Transformers-AutoModels/More-Data-Types).*

##### 2. Native INT4 model

@@ -182,7 +183,7 @@ output_ids = llm.generate(input_ids, ...)
output = llm.batch_decode(output_ids)
```

See the complete example [here](example/transformers/native_int4/native_int4_pipeline.py).
See the complete example [here](example/CPU/Native-Models/native_int4_pipeline.py).
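For orientation, the elided conversion and loading steps above roughly follow this pattern (a hedged sketch; the `llm_convert` helper, the `native=True` loading path, and all paths and arguments here are assumptions based on the linked example, so treat the example script as authoritative):

```python
from bigdl.llm import llm_convert
from bigdl.llm.transformers import LlamaForCausalLM

# Convert a Hugging Face checkpoint into a native INT4 binary (paths are placeholders)
bigdl_llm_path = llm_convert(model='/path/to/model/',
                             outfile='/path/to/output/',
                             outtype='int4', model_family='llama')

# Load and run the converted model through the native (cpp) implementation
llm = LlamaForCausalLM.from_pretrained(bigdl_llm_path, native=True)
input_ids = llm.tokenize("Once upon a time, there existed a little girl who liked to have adventures.")
output_ids = llm.generate(input_ids, max_new_tokens=32)
output = llm.batch_decode(output_ids)
print(output)
```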

##### 3. LangChain API
You may run the models using the LangChain API in `bigdl-llm`.
@@ -202,7 +203,7 @@ You may run the models using the LangChain API in `bigdl-llm`.
doc_chain = load_qa_chain(bigdl_llm, ...)
output = doc_chain.run(...)
```
See the examples [here](example/langchain/transformers_int4).
See the examples [here](example/CPU/LangChain/transformers_int4).
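For context, a minimal, hedged sketch of the transformers-INT4 LangChain path end to end (assuming `TransformersLLM` in `bigdl.llm.langchain.llms` exposes a `from_model_id` constructor as in the linked examples; the names, arguments and toy document here are illustrative assumptions):

```python
from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document

from bigdl.llm.langchain.llms import TransformersLLM

# Wrap an INT4-optimized Hugging Face model as a LangChain LLM (path is a placeholder)
bigdl_llm = TransformersLLM.from_model_id(
    model_id='/path/to/model/',
    model_kwargs={"temperature": 0, "max_length": 256},
)

# Plug the wrapped LLM into a standard LangChain question-answering chain
doc_chain = load_qa_chain(bigdl_llm, chain_type="stuff")
docs = [Document(page_content="bigdl-llm runs large language models with low-bit optimizations.")]
output = doc_chain.run(input_documents=docs, question="What does bigdl-llm do?")
print(output)
```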

- **Using native INT4 model**

@@ -224,7 +225,7 @@ You may run the models using the LangChain API in `bigdl-llm`.
doc_chain.run(...)
```

See the examples [here](example/langchain/native_int4).
See the examples [here](example/CPU/LangChain/native_int4).

##### 4. CLI Tool
>**Note**: Currently `bigdl-llm` CLI supports *LLaMA* (e.g., *vicuna*), *GPT-NeoX* (e.g., *redpajama*), *BLOOM* (e.g., *phoenix*) and *GPT2* (e.g., *starcoder*) model architectures; for other models, you may use the Hugging Face `transformers` or LangChain APIs.
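For reference, CLI usage typically looks like the following (a hedged sketch of the `llm-cli`/`llm-chat` tools, where `-x` selects the model family, `-m` the converted model path, `-t` the thread count, and `-p` the prompt; these flags are assumptions, so check `llm-cli -h` for the authoritative options):

```bash
# Text completion with a converted native INT4 model (paths and flags are illustrative)
llm-cli -t 16 -x llama -m "/path/to/converted/model.bin" -p 'Once upon a time,'

# Interactive chat mode
llm-chat -x llama -m "/path/to/converted/model.bin"
```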
@@ -21,6 +21,7 @@ You can use BigDL-LLM to run any Huggingface Transformer models with INT4 optimi
| InternLM | [link](internlm) |
| Whisper | [link](whisper) |
| Qwen | [link](qwen) |
| Aquila | [link](aquila) |

## Recommended Requirements
To run the examples, we recommend using Intel® Xeon® processors (server), or >= 12th Gen Intel® Core™ processors (client).
7 changes: 7 additions & 0 deletions python/llm/example/CPU/HF-Transformers-AutoModels/README.md
@@ -0,0 +1,7 @@
# Running Hugging Face Transformers model using BigDL-LLM on Intel CPU

This folder contains examples of running any Hugging Face Transformers model on BigDL-LLM (using the standard AutoModel APIs):

- [Model](Model): examples of running Hugging Face Transformers models (e.g., LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using INT4 optimizations
- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (NF4/INT5/INT8, etc.)
- [Save-Load](Save-Load): examples of saving and loading low-bit models
@@ -0,0 +1,43 @@
# BigDL-LLM Transformers Low-Bit Inference Pipeline for Large Language Model

In this example, we show a pipeline to apply BigDL-LLM low-bit optimizations (including INT8/INT5/INT4) to any Hugging Face Transformers model, and then run inference on the optimized low-bit model.

## Prepare Environment
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm

pip install --pre --upgrade bigdl-llm[all]
```

## Run Example
```bash
python ./transformers_low_bit_pipeline.py --repo-id-or-model-path decapoda-research/llama-7b-hf --low-bit sym_int5 --save-path ./llama-7b-sym_int5
```
Argument info:
- `--repo-id-or-model-path`: str value, the Hugging Face repo id of the large language model to be downloaded, or the path to a Hugging Face checkpoint folder; the default value is 'decapoda-research/llama-7b-hf'.
- `--low-bit`: str value; options are sym_int4, asym_int4, sym_int5, asym_int5 or sym_int8 (sym_int4 means symmetric int4, asym_int4 means asymmetric int4, etc.). The corresponding low-bit optimization will be applied to the model.
- `--save-path`: str value, the path to save the low-bit model, which can then be loaded directly.
- `--load-path`: optional str value, the path to load a previously saved low-bit model.


## Sample Output for Inference
### 'decapoda-research/llama-7b-hf' Model
```log
Prompt: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun
Output: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. She wanted to be a princess, and she wanted to be a pirate. She wanted to be a superhero, and she wanted to be
Model and tokenizer are saved to ./llama-7b-sym_int5
```

### Load low-bit model
Command to run:
```bash
python ./transformers_low_bit_pipeline.py --load-path ./llama-7b-sym_int5
```
Output log:
```log
Prompt: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun
Output: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. She wanted to be a princess, and she wanted to be a pirate. She wanted to be a superhero, and she wanted to be
```

@@ -0,0 +1,56 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import argparse

from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer, TextGenerationPipeline

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Transformer save_load example')
    parser.add_argument('--repo-id-or-model-path', type=str, default="decapoda-research/llama-7b-hf",
                        help='The huggingface repo id for the large language model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--low-bit', type=str, default="sym_int4",
                        choices=['sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8'],
                        help='The quantization type the model will convert to.')
    parser.add_argument('--save-path', type=str, default=None,
                        help='The path to save the low-bit model.')
    parser.add_argument('--load-path', type=str, default=None,
                        help='The path to load the low-bit model.')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    low_bit = args.low_bit
    load_path = args.load_path
    if load_path:
        # Load a previously saved low-bit model and its tokenizer directly
        model = AutoModelForCausalLM.load_low_bit(load_path)
        tokenizer = LlamaTokenizer.from_pretrained(load_path)
    else:
        # load_in_low_bit in bigdl.llm.transformers will convert
        # the relevant layers in the model into the corresponding int X format
        model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit=low_bit, trust_remote_code=True)
        tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Run a short generation to verify the (low-bit) model
    pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, max_new_tokens=32)
    input_str = "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
    output = pipeline(input_str)[0]["generated_text"]
    print(f"Prompt: {input_str}")
    print(f"Output: {output}")

    save_path = args.save_path
    if save_path:
        # Save the optimized model and tokenizer so they can be reloaded via --load-path
        model.save_low_bit(save_path)
        tokenizer.save_pretrained(save_path)
        print(f"Model and tokenizer are saved to {save_path}")
File renamed without changes.
File renamed without changes.
Empty file.
7 changes: 7 additions & 0 deletions python/llm/example/CPU/PyTorch-Models/README.md
@@ -0,0 +1,7 @@
# Running PyTorch model using BigDL-LLM on Intel CPU

This folder contains examples of running any PyTorch model on BigDL-LLM (with "one-line code change"):

- [Model](Model): examples of running PyTorch models (e.g., OpenAI Whisper, LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using INT4 optimizations
- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (NF4/INT5/INT8, etc.)
- [Save-Load](Save-Load): examples of saving and loading low-bit models
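As a rough illustration of the "one-line code change" these examples refer to, the pattern is to load a model with the standard Hugging Face API and then pass it through `optimize_model` (a minimal sketch, assuming the `bigdl.llm.optimize_model` helper used in the linked examples; the model id and generation arguments are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from bigdl.llm import optimize_model

model_path = 'meta-llama/Llama-2-7b-chat-hf'  # placeholder model id/path
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = optimize_model(model)  # the "one-line code change": apply low-bit optimizations

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(output_ids)[0])
```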
Empty file.
18 changes: 18 additions & 0 deletions python/llm/example/CPU/README.md
@@ -0,0 +1,18 @@
# BigDL-LLM Examples on Intel CPU

This folder contains examples of running BigDL-LLM on Intel CPU:

- [HF-Transformers-AutoModels](HF-Transformers-AutoModels): running any Hugging Face Transformers model on BigDL-LLM (using the standard AutoModel APIs)
- [PyTorch-Models](PyTorch-Models): running any PyTorch model on BigDL-LLM (with "one-line code change")
- [Native-Models](Native-Models): converting & running LLMs in the `llama`/`chatglm`/`bloom`/`gptneox`/`starcoder` model families using the native (cpp) implementation
- [LangChain](LangChain): running LangChain applications on BigDL-LLM

## System Support
**Hardware**:
- Intel® Core™ processors
- Intel® Xeon® processors

**Operating System**:
- Ubuntu 20.04 or later
- CentOS 7 or later
- Windows 10/11, with or without WSL
@@ -21,6 +21,7 @@ You can use BigDL-LLM to run almost every Huggingface Transformer models with IN

- Intel Arc™ A-Series Graphics
- Intel Data Center GPU Flex Series
- Intel Data Center GPU Max Series

## Recommended Requirements
To apply Intel GPU acceleration, there are several steps for tool installation and environment preparation.
Empty file.
7 changes: 7 additions & 0 deletions python/llm/example/GPU/HF-Transformers-AutoModels/README.md
@@ -0,0 +1,7 @@
# Running Hugging Face Transformers model using BigDL-LLM on Intel GPU

This folder contains examples of running any Hugging Face Transformers model on BigDL-LLM (using the standard AutoModel APIs):

- [Model](Model): examples of running Hugging Face Transformers models (e.g., LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using INT4 optimizations
- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (NF4/INT5/INT8, etc.)
- [Save-Load](Save-Load): examples of saving and loading low-bit models
Empty file.
Empty file.
Empty file.
7 changes: 7 additions & 0 deletions python/llm/example/GPU/PyTorch-Models/README.md
@@ -0,0 +1,7 @@
# Running PyTorch model using BigDL-LLM on Intel GPU

This folder contains examples of running any PyTorch model on BigDL-LLM (with "one-line code change"):

- [Model](Model): examples of running PyTorch models (e.g., OpenAI Whisper, LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using INT4 optimizations
- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (NF4/INT5/INT8, etc.)
- [Save-Load](Save-Load): examples of saving and loading low-bit models
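The flow mirrors the CPU PyTorch examples, with the optimized model and its inputs moved to the Intel GPU (a hedged sketch assuming `bigdl.llm.optimize_model` and an `xpu` device provided by `intel_extension_for_pytorch`; the model id is a placeholder):

```python
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from transformers import AutoModelForCausalLM, AutoTokenizer
from bigdl.llm import optimize_model

model_path = 'meta-llama/Llama-2-7b-chat-hf'  # placeholder model id/path
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = optimize_model(model)  # one-line low-bit optimization
model = model.to('xpu')        # run on the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt").to('xpu')
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(output_ids.cpu())[0])
```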
Empty file.
26 changes: 26 additions & 0 deletions python/llm/example/GPU/README.md
@@ -0,0 +1,26 @@
# BigDL-LLM Examples on Intel GPU

This folder contains examples of running BigDL-LLM on Intel GPU:

- [HF-Transformers-AutoModels](HF-Transformers-AutoModels): running any Hugging Face Transformers model on BigDL-LLM (using the standard AutoModel APIs)
- [PyTorch-Models](PyTorch-Models): running any PyTorch model on BigDL-LLM (with "one-line code change")
- [QLoRA-FineTuning](QLoRA-FineTuning): running QLoRA finetuning on BigDL-LLM


## System Support
**Hardware**:
- Intel Arc™ A-Series Graphics
- Intel Data Center GPU Flex Series
- Intel Data Center GPU Max Series

**Operating System**:
- Ubuntu 20.04 or later (Ubuntu 22.04 is preferred)

## Requirements
To apply Intel GPU acceleration, there are several steps for tool installation and environment preparation.

Step 1, please refer to our [driver installation](https://dgpu-docs.intel.com/driver/installation.html) for general purpose GPU capabilities.
> **Note**: IPEX 2.0.110+xpu requires Intel GPU Driver version [Stable 647.21](https://dgpu-docs.intel.com/releases/stable_647_21_20230714.html).
Step 2, you also need to download and install the [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html). oneMKL and the DPC++ compiler are needed; the other components are optional.
> **Note**: IPEX 2.0.110+xpu requires Intel® oneAPI Base Toolkit version >= 2023.2.0.
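Putting the two steps together, environment setup usually ends up looking like this (a hedged sketch; the oneAPI install location and the `bigdl-llm[xpu]` wheel index below are assumptions based on typical setups, so follow the linked installation guides for the authoritative commands):

```bash
# Activate the oneAPI environment (default install location assumed)
source /opt/intel/oneapi/setvars.sh

# Create a conda environment and install bigdl-llm with Intel GPU (xpu) support
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```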
