LLM: update example layout (#9046)
plusbang authored Oct 9, 2023
1 parent 44db766 commit 02bf757
Showing 118 changed files with 204 additions and 185 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -12,8 +12,8 @@
> *It is built on top of the excellent work of [llama.cpp](https://github.com/ggerganov/llama.cpp), [ggml](https://github.com/ggerganov/ggml), [gptq](https://github.com/IST-DASLab/gptq), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [qlora](https://github.com/artidoro/qlora), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), [gptq_for_llama](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [redpajama.cpp](https://github.com/togethercomputer/redpajama.cpp), [gptneox.cpp](https://github.com/byroneverson/gptneox.cpp), [bloomz.cpp](https://github.com/NouamaneTazi/bloomz.cpp/), etc.*
### Latest update
- **[New]** `bigdl-llm` now supports QLoRA finetuning on Intel GPU; see the example [here](python/llm/example/gpu/qlora_finetuning).
- `bigdl-llm` now supports Intel GPU (including Arc, Flex and MAX); see the latest GPU examples [here](python/llm/example/gpu).
- **[New]** `bigdl-llm` now supports QLoRA finetuning on Intel GPU; see the example [here](python/llm/example/GPU/QLoRA-FineTuning).
- `bigdl-llm` now supports Intel GPU (including Arc, Flex and MAX); see the latest GPU examples [here](python/llm/example/GPU).
- `bigdl-llm` tutorial is released [here](https://github.com/intel-analytics/bigdl-llm-tutorial).
- Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLaMA2, ChatGLM/ChatGLM2, MPT, Falcon, Dolly, StarCoder, Whisper, InternLM, QWen, Baichuan, Aquila, MOSS,* and more; see the complete list [here](python/llm/README.md#verified-models).

@@ -76,7 +76,7 @@ input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```
*See the complete examples [here](python/llm/example/transformers/transformers_int4/).*
*See the complete examples [here](python/llm/example/CPU/HF-Transformers-AutoModels/Model).*
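For readers skimming the diff, a self-contained sketch of the CPU INT4 flow that the truncated snippet above belongs to may help (a minimal, hedged sketch only; the model path and generation arguments are placeholders, so defer to the linked examples for the authoritative version):

```python
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = '/path/to/model/'  # placeholder: any supported Hugging Face checkpoint
# load_in_4bit=True applies BigDL-LLM INT4 optimizations while loading
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=32)
output = tokenizer.batch_decode(output_ids)
print(output[0])
```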

#### GPU INT4
##### Install
@@ -105,7 +105,7 @@ input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids.cpu())
```
*See the complete examples [here](python/llm/example/gpu/).*
*See the complete examples [here](python/llm/example/GPU).*
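A corresponding hedged sketch for the GPU INT4 flow (assuming a working `xpu` setup via `intel_extension_for_pytorch`; paths and generation arguments are again placeholders):

```python
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = '/path/to/model/'  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)
model = model.to('xpu')  # move the INT4-optimized model to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt").to('xpu')
output_ids = model.generate(input_ids, max_new_tokens=32)
output = tokenizer.batch_decode(output_ids.cpu())  # move results back to CPU before decoding
print(output[0])
```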

#### More Low-Bit Support
##### Save and load
@@ -115,15 +115,15 @@ After the model is optimized using `bigdl-llm`, you may save and load the model
model.save_low_bit(model_path)
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
*See the complete example [here](python/llm/example/transformers/transformers_low_bit/).*
*See the complete example [here](python/llm/example/CPU/HF-Transformers-AutoModels/Save-Load).*

##### Additional data types

In addition to INT4, you may apply other low-bit optimizations (such as *INT8*, *INT5*, *NF4*, etc.) as follows:
```python
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")
```
*See the complete example [here](python/llm/example/transformers/transformers_low_bit/).*
*See the complete example [here](python/llm/example/CPU/HF-Transformers-AutoModels/More-Data-Types).*


***For more details, please refer to the `bigdl-llm` [Document](https://test-bigdl-llm.readthedocs.io/en/main/doc/LLM/index.html), [Readme](python/llm), [Tutorial](https://github.com/intel-analytics/bigdl-llm-tutorial) and [API Doc](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/index.html).***
49 changes: 25 additions & 24 deletions python/llm/README.md
@@ -40,23 +40,24 @@ Over 20 models have been optimized/verified on `bigdl-llm`, including *LLaMA/LLa

| Model | Example |
|-----------|----------------------------------------------------------|
| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/vicuna) |
| LLaMA 2 | [link](example/transformers/transformers_int4/llama2) |
| MPT | [link](example/transformers/transformers_int4/mpt) |
| Falcon | [link](example/transformers/transformers_int4/falcon) |
| ChatGLM | [link](example/transformers/transformers_int4/chatglm) |
| ChatGLM2 | [link](example/transformers/transformers_int4/chatglm2) |
| Qwen | [link](example/transformers/transformers_int4/qwen) |
| MOSS | [link](example/transformers/transformers_int4/moss) |
| Baichuan | [link](example/transformers/transformers_int4/baichuan) |
| Baichuan2 | [link](example/transformers/transformers_int4/baichuan2) |
| Dolly-v1 | [link](example/transformers/transformers_int4/dolly_v1) |
| Dolly-v2 | [link](example/transformers/transformers_int4/dolly_v2) |
| RedPajama | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/redpajama) |
| Phoenix | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/phoenix) |
| StarCoder | [link1](example/transformers/native_int4), [link2](example/transformers/transformers_int4/starcoder) |
| InternLM | [link](example/transformers/transformers_int4/internlm) |
| Whisper | [link](example/transformers/transformers_int4/whisper) |
| LLaMA *(such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)* | [link1](example/CPU/Native-Models), [link2](example/CPU/HF-Transformers-AutoModels/Model/vicuna) |
| LLaMA 2 | [link](example/CPU/HF-Transformers-AutoModels/Model/llama2) |
| MPT | [link](example/CPU/HF-Transformers-AutoModels/Model/mpt) |
| Falcon | [link](example/CPU/HF-Transformers-AutoModels/Model/falcon) |
| ChatGLM | [link](example/CPU/HF-Transformers-AutoModels/Model/chatglm) |
| ChatGLM2 | [link](example/CPU/HF-Transformers-AutoModels/Model/chatglm2) |
| Qwen | [link](example/CPU/HF-Transformers-AutoModels/Model/qwen) |
| MOSS | [link](example/CPU/HF-Transformers-AutoModels/Model/moss) |
| Baichuan | [link](example/CPU/HF-Transformers-AutoModels/Model/baichuan) |
| Baichuan2 | [link](example/CPU/HF-Transformers-AutoModels/Model/baichuan2) |
| Dolly-v1 | [link](example/CPU/HF-Transformers-AutoModels/Model/dolly_v1) |
| Dolly-v2 | [link](example/CPU/HF-Transformers-AutoModels/Model/dolly_v2) |
| RedPajama | [link1](example/CPU/Native-Models), [link2](example/CPU/HF-Transformers-AutoModels/Model/redpajama) |
| Phoenix | [link1](example/CPU/Native-Models), [link2](example/CPU/HF-Transformers-AutoModels/Model/phoenix) |
| StarCoder | [link1](example/CPU/Native-Models), [link2](example/CPU/HF-Transformers-AutoModels/Model/starcoder) |
| InternLM | [link](example/CPU/HF-Transformers-AutoModels/Model/internlm) |
| Whisper | [link](example/CPU/HF-Transformers-AutoModels/Model/whisper) |
| Aquila | [link](example/CPU/HF-Transformers-AutoModels/Model/aquila) |

</details>

@@ -119,7 +120,7 @@ output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
```

See the complete examples [here](example/transformers/transformers_int4/).
See the complete examples [here](example/CPU/HF-Transformers-AutoModels/Model/).

###### GPU INT4
You may apply INT4 optimizations to any Hugging Face *Transformers* model on Intel GPU as follows.
@@ -138,7 +139,7 @@ input_ids = tokenizer.encode(input_str, ...).to('xpu')
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids.cpu())
```
See the complete examples [here](example/gpu/).
See the complete examples [here](example/GPU).

###### More Low-Bit Support
- Save and load
@@ -148,7 +149,7 @@ See the complete examples [here](example/gpu/).
model.save_low_bit(model_path)
new_model = AutoModelForCausalLM.load_low_bit(model_path)
```
*See the complete example [here](example/transformers/transformers_low_bit/).*
*See the complete example [here](example/CPU/HF-Transformers-AutoModels/Save-Load).*

- Additional data types

@@ -157,7 +158,7 @@ See the complete examples [here](example/gpu/).
```python
model = AutoModelForCausalLM.from_pretrained('/path/to/model/', load_in_low_bit="sym_int8")
```
*See the complete example [here](example/transformers/transformers_low_bit/).*
*See the complete example [here](example/CPU/HF-Transformers-AutoModels/More-Data-Types).*

##### 2. Native INT4 model

@@ -182,7 +183,7 @@ output_ids = llm.generate(input_ids, ...)
output = llm.batch_decode(output_ids)
```

See the complete example [here](example/transformers/native_int4/native_int4_pipeline.py).
See the complete example [here](example/CPU/Native-Models/native_int4_pipeline.py).
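For orientation, the elided conversion and loading steps above roughly follow this pattern (a hedged sketch; the `llm_convert` helper, the `native=True` loading path, and all paths and arguments here are assumptions based on the linked example, so treat the example script as authoritative):

```python
from bigdl.llm import llm_convert
from bigdl.llm.transformers import LlamaForCausalLM

# Convert a Hugging Face checkpoint into a native INT4 binary (paths are placeholders)
bigdl_llm_path = llm_convert(model='/path/to/model/',
                             outfile='/path/to/output/',
                             outtype='int4', model_family='llama')

# Load and run the converted model through the native (cpp) implementation
llm = LlamaForCausalLM.from_pretrained(bigdl_llm_path, native=True)
input_ids = llm.tokenize("Once upon a time, there existed a little girl who liked to have adventures.")
output_ids = llm.generate(input_ids, max_new_tokens=32)
output = llm.batch_decode(output_ids)
print(output)
```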

##### 3. LangChain API
You may run the models using the LangChain API in `bigdl-llm`.
@@ -202,7 +203,7 @@ You may run the models using the LangChain API in `bigdl-llm`.
doc_chain = load_qa_chain(bigdl_llm, ...)
output = doc_chain.run(...)
```
See the examples [here](example/langchain/transformers_int4).
See the examples [here](example/CPU/LangChain/transformers_int4).
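For context, a minimal, hedged sketch of the transformers-INT4 LangChain path end to end (assuming `TransformersLLM` in `bigdl.llm.langchain.llms` exposes a `from_model_id` constructor as in the linked examples; the names, arguments and toy document here are illustrative assumptions):

```python
from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document

from bigdl.llm.langchain.llms import TransformersLLM

# Wrap an INT4-optimized Hugging Face model as a LangChain LLM (path is a placeholder)
bigdl_llm = TransformersLLM.from_model_id(
    model_id='/path/to/model/',
    model_kwargs={"temperature": 0, "max_length": 256},
)

# Plug the wrapped LLM into a standard LangChain question-answering chain
doc_chain = load_qa_chain(bigdl_llm, chain_type="stuff")
docs = [Document(page_content="bigdl-llm runs large language models with low-bit optimizations.")]
output = doc_chain.run(input_documents=docs, question="What does bigdl-llm do?")
print(output)
```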

- **Using native INT4 model**

@@ -224,7 +225,7 @@ You may run the models using the LangChain API in `bigdl-llm`.
doc_chain.run(...)
```

See the examples [here](example/langchain/native_int4).
See the examples [here](example/CPU/LangChain/native_int4).

##### 4. CLI Tool
>**Note**: Currently `bigdl-llm` CLI supports *LLaMA* (e.g., *vicuna*), *GPT-NeoX* (e.g., *redpajama*), *BLOOM* (e.g., *phoenix*) and *GPT2* (e.g., *starcoder*) model architectures; for other models, you may use the Hugging Face `transformers` or LangChain APIs.
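For reference, CLI usage typically looks like the following (a hedged sketch of the `llm-cli`/`llm-chat` tools, where `-x` selects the model family, `-m` the converted model path, `-t` the thread count, and `-p` the prompt; these flags are assumptions, so check `llm-cli -h` for the authoritative options):

```bash
# Text completion with a converted native INT4 model (paths and flags are illustrative)
llm-cli -t 16 -x llama -m "/path/to/converted/model.bin" -p 'Once upon a time,'

# Interactive chat mode
llm-chat -x llama -m "/path/to/converted/model.bin"
```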
@@ -21,6 +21,7 @@ You can use BigDL-LLM to run any Huggingface Transformer models with INT4 optimi
| InternLM | [link](internlm) |
| Whisper | [link](whisper) |
| Qwen | [link](qwen) |
| Aquila | [link](aquila) |

## Recommended Requirements
To run the examples, we recommend using Intel® Xeon® processors (server), or >= 12th Gen Intel® Core™ processors (client).
7 changes: 7 additions & 0 deletions python/llm/example/CPU/HF-Transformers-AutoModels/README.md
@@ -0,0 +1,7 @@
# Running Hugging Face Transformers model using BigDL-LLM on Intel CPU

This folder contains examples of running any Hugging Face Transformers model on BigDL-LLM (using the standard AutoModel APIs):

- [Model](Model): examples of running Hugging Face Transformers models (e.g., LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using INT4 optimizations
- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (NF4/INT5/INT8, etc.)
- [Save-Load](Save-Load): examples of saving and loading low-bit models
@@ -0,0 +1,43 @@
# BigDL-LLM Transformers Low-Bit Inference Pipeline for Large Language Model

In this example, we show a pipeline to apply BigDL-LLM low-bit optimizations (including INT8/INT5/INT4) to any Hugging Face Transformers model, and then run inference on the optimized low-bit model.

## Prepare Environment
We suggest using conda to manage the environment:
```bash
conda create -n llm python=3.9
conda activate llm

pip install --pre --upgrade bigdl-llm[all]
```

## Run Example
```bash
python ./transformers_low_bit_pipeline.py --repo-id-or-model-path decapoda-research/llama-7b-hf --low-bit sym_int5 --save-path ./llama-7b-sym_int5
```
Argument info:
- `--repo-id-or-model-path`: str value, the Hugging Face repo id of the large language model to be downloaded, or the path to a Hugging Face checkpoint folder; the default value is 'decapoda-research/llama-7b-hf'.
- `--low-bit`: str value; options are sym_int4, asym_int4, sym_int5, asym_int5 or sym_int8 (sym_int4 means symmetric int4, asym_int4 means asymmetric int4, etc.). The corresponding low-bit optimization will be applied to the model.
- `--save-path`: str value, the path to save the low-bit model, which can then be loaded directly.
- `--load-path`: optional str value, the path to load a previously saved low-bit model.


## Sample Output for Inference
### 'decapoda-research/llama-7b-hf' Model
```log
Prompt: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun
Output: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. She wanted to be a princess, and she wanted to be a pirate. She wanted to be a superhero, and she wanted to be
Model and tokenizer are saved to ./llama-7b-sym_int5
```

### Load low-bit model
Command to run:
```bash
python ./transformers_low_bit_pipeline.py --load-path ./llama-7b-sym_int5
```
Output log:
```log
Prompt: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun
Output: Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun. She wanted to be a princess, and she wanted to be a pirate. She wanted to be a superhero, and she wanted to be
```

@@ -0,0 +1,56 @@
#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import argparse

from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import LlamaTokenizer, TextGenerationPipeline

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Transformer save_load example')
    parser.add_argument('--repo-id-or-model-path', type=str, default="decapoda-research/llama-7b-hf",
                        help='The huggingface repo id for the large language model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--low-bit', type=str, default="sym_int4",
                        choices=['sym_int4', 'asym_int4', 'sym_int5', 'asym_int5', 'sym_int8'],
                        help='The quantization type the model will convert to.')
    parser.add_argument('--save-path', type=str, default=None,
                        help='The path to save the low-bit model.')
    parser.add_argument('--load-path', type=str, default=None,
                        help='The path to load the low-bit model.')
    args = parser.parse_args()
    model_path = args.repo_id_or_model_path
    low_bit = args.low_bit
    load_path = args.load_path
    if load_path:
        # Load a previously saved low-bit model and its tokenizer directly
        model = AutoModelForCausalLM.load_low_bit(load_path)
        tokenizer = LlamaTokenizer.from_pretrained(load_path)
    else:
        # load_in_low_bit in bigdl.llm.transformers will convert
        # the relevant layers in the model into the corresponding int X format
        model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit=low_bit, trust_remote_code=True)
        tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Run a short generation to verify the (low-bit) model
    pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, max_new_tokens=32)
    input_str = "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
    output = pipeline(input_str)[0]["generated_text"]
    print(f"Prompt: {input_str}")
    print(f"Output: {output}")

    save_path = args.save_path
    if save_path:
        # Save the optimized model and tokenizer so they can be reloaded via --load-path
        model.save_low_bit(save_path)
        tokenizer.save_pretrained(save_path)
        print(f"Model and tokenizer are saved to {save_path}")
File renamed without changes.
File renamed without changes.
Empty file.
7 changes: 7 additions & 0 deletions python/llm/example/CPU/PyTorch-Models/README.md
@@ -0,0 +1,7 @@
# Running PyTorch model using BigDL-LLM on Intel CPU

This folder contains examples of running any PyTorch model on BigDL-LLM (with "one-line code change"):

- [Model](Model): examples of running PyTorch models (e.g., OpenAI Whisper, LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using INT4 optimizations
- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (NF4/INT5/INT8, etc.)
- [Save-Load](Save-Load): examples of saving and loading low-bit models
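As a rough illustration of the "one-line code change" these examples refer to, the pattern is to load a model with the standard Hugging Face API and then pass it through `optimize_model` (a minimal sketch, assuming the `bigdl.llm.optimize_model` helper used in the linked examples; the model id and generation arguments are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from bigdl.llm import optimize_model

model_path = 'meta-llama/Llama-2-7b-chat-hf'  # placeholder model id/path
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = optimize_model(model)  # the "one-line code change": apply low-bit optimizations

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(output_ids)[0])
```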
Empty file.
18 changes: 18 additions & 0 deletions python/llm/example/CPU/README.md
@@ -0,0 +1,18 @@
# BigDL-LLM Examples on Intel CPU

This folder contains examples of running BigDL-LLM on Intel CPU:

- [HF-Transformers-AutoModels](HF-Transformers-AutoModels): running any Hugging Face Transformers model on BigDL-LLM (using the standard AutoModel APIs)
- [PyTorch-Models](PyTorch-Models): running any PyTorch model on BigDL-LLM (with "one-line code change")
- [Native-Models](Native-Models): converting & running LLMs in the `llama`/`chatglm`/`bloom`/`gptneox`/`starcoder` model families using the native (cpp) implementation
- [LangChain](LangChain): running LangChain applications on BigDL-LLM

## System Support
**Hardware**:
- Intel® Core™ processors
- Intel® Xeon® processors

**Operating System**:
- Ubuntu 20.04 or later
- CentOS 7 or later
- Windows 10/11, with or without WSL
@@ -21,6 +21,7 @@ You can use BigDL-LLM to run almost every Huggingface Transformer models with IN

- Intel Arc™ A-Series Graphics
- Intel Data Center GPU Flex Series
- Intel Data Center GPU Max Series

## Recommended Requirements
To apply Intel GPU acceleration, there are several steps for tool installation and environment preparation.
Empty file.
7 changes: 7 additions & 0 deletions python/llm/example/GPU/HF-Transformers-AutoModels/README.md
@@ -0,0 +1,7 @@
# Running Hugging Face Transformers model using BigDL-LLM on Intel GPU

This folder contains examples of running any Hugging Face Transformers model on BigDL-LLM (using the standard AutoModel APIs):

- [Model](Model): examples of running Hugging Face Transformers models (e.g., LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using INT4 optimizations
- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (NF4/INT5/INT8, etc.)
- [Save-Load](Save-Load): examples of saving and loading low-bit models
Empty file.
Empty file.
Empty file.
7 changes: 7 additions & 0 deletions python/llm/example/GPU/PyTorch-Models/README.md
@@ -0,0 +1,7 @@
# Running PyTorch model using BigDL-LLM on Intel GPU

This folder contains examples of running any PyTorch model on BigDL-LLM (with "one-line code change"):

- [Model](Model): examples of running PyTorch models (e.g., OpenAI Whisper, LLaMA2, ChatGLM2, Falcon, MPT, Baichuan2, etc.) using INT4 optimizations
- [More-Data-Types](More-Data-Types): examples of applying other low bit optimizations (NF4/INT5/INT8, etc.)
- [Save-Load](Save-Load): examples of saving and loading low-bit models
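The flow mirrors the CPU PyTorch examples, with the optimized model and its inputs moved to the Intel GPU (a hedged sketch assuming `bigdl.llm.optimize_model` and an `xpu` device provided by `intel_extension_for_pytorch`; the model id is a placeholder):

```python
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from transformers import AutoModelForCausalLM, AutoTokenizer
from bigdl.llm import optimize_model

model_path = 'meta-llama/Llama-2-7b-chat-hf'  # placeholder model id/path
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = optimize_model(model)  # one-line low-bit optimization
model = model.to('xpu')        # run on the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt").to('xpu')
output_ids = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.batch_decode(output_ids.cpu())[0])
```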
Empty file.
26 changes: 26 additions & 0 deletions python/llm/example/GPU/README.md
@@ -0,0 +1,26 @@
# BigDL-LLM Examples on Intel GPU

This folder contains examples of running BigDL-LLM on Intel GPU:

- [HF-Transformers-AutoModels](HF-Transformers-AutoModels): running any Hugging Face Transformers model on BigDL-LLM (using the standard AutoModel APIs)
- [PyTorch-Models](PyTorch-Models): running any PyTorch model on BigDL-LLM (with "one-line code change")
- [QLoRA-FineTuning](QLoRA-FineTuning): running QLoRA finetuning on BigDL-LLM


## System Support
**Hardware**:
- Intel Arc™ A-Series Graphics
- Intel Data Center GPU Flex Series
- Intel Data Center GPU Max Series

**Operating System**:
- Ubuntu 20.04 or later (Ubuntu 22.04 is preferred)

## Requirements
To apply Intel GPU acceleration, there are several steps for tool installation and environment preparation.

Step 1, please refer to our [driver installation](https://dgpu-docs.intel.com/driver/installation.html) for general purpose GPU capabilities.
> **Note**: IPEX 2.0.110+xpu requires Intel GPU Driver version [Stable 647.21](https://dgpu-docs.intel.com/releases/stable_647_21_20230714.html).
Step 2, you also need to download and install the [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html). oneMKL and the DPC++ compiler are needed; the other components are optional.
> **Note**: IPEX 2.0.110+xpu requires Intel® oneAPI Base Toolkit version >= 2023.2.0.
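Putting the two steps together, environment setup usually ends up looking like this (a hedged sketch; the oneAPI install location and the `bigdl-llm[xpu]` wheel index below are assumptions based on typical setups, so follow the linked installation guides for the authoritative commands):

```bash
# Activate the oneAPI environment (default install location assumed)
source /opt/intel/oneapi/setvars.sh

# Create a conda environment and install bigdl-llm with Intel GPU (xpu) support
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
```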
