Commit

Merge branch 'main' into release/2.2
Jintao-Huang committed Jul 20, 2024
2 parents 99cb37f + 53c14d2 commit cb33ecd
Showing 123 changed files with 4,540 additions and 1,799 deletions.
339 changes: 174 additions & 165 deletions README.md

Large diffs are not rendered by default.

288 changes: 146 additions & 142 deletions README_CN.md

Large diffs are not rendered by default.

Binary file added asset/discord_qr.jpg
30 changes: 30 additions & 0 deletions docs/source/.readthedocs.yaml
@@ -0,0 +1,30 @@
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the OS, Python version and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.12"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
  configuration: docs/source/conf.py

# Optionally build your docs in additional formats such as PDF and ePub
# formats:
#    - pdf
#    - epub

# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
python:
  install:
    - requirements: requirements/docs.txt
    - requirements: requirements/framework.txt
    - requirements: requirements/llm.txt
4 changes: 3 additions & 1 deletion docs/source/GetStarted/界面训练推理.md
@@ -8,7 +8,7 @@ swift web-ui

This starts UI-based training and inference.

The web-ui takes no command-line arguments; everything configurable is in the interface. However, a few environment variables are available:
The web-ui's behavior can be controlled through environment variables or command-line arguments. The environment variables are as follows:

> WEBUI_SHARE=1/0, default 0. Controls whether gradio runs in share mode.
>
@@ -19,3 +19,5 @@ The web-ui takes no command-line arguments; everything configurable is in the interface.
> WEBUI_PORT: the port number of the web-ui.
>
> USE_INFERENCE=1/0, default 0. Controls whether the gradio inference page loads the model directly for inference, or deploys it first (USE_INFERENCE=0).
If you use command-line arguments instead, see [Command-line arguments](../LLM/命令行参数.md#web-ui-参数).
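
As a minimal illustration, the following sketch launches the web-ui with these environment variables (the specific values are examples only):

```bash
# Share the gradio app publicly on port 7860; the inference page
# loads the model directly rather than deploying it (USE_INFERENCE=1).
WEBUI_SHARE=1 WEBUI_PORT=7860 USE_INFERENCE=1 swift web-ui
```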
2 changes: 1 addition & 1 deletion docs/source/LLM/Agent微调最佳实践.md
@@ -165,7 +165,7 @@ Final Answer: If you want a phone that excels at photography, I recommend
| ms-bench | 60000 (sampled) |
| self-recognition | 3000 (repeated sampling) |

We also support using your own Agent dataset. The dataset format must meet the requirements of [Custom datasets](https://github.com/modelscope/swift/blob/main/docs/source/LLM/%E8%87%AA%E5%AE%9A%E4%B9%89%E4%B8%8E%E6%8B%93%E5%B1%95.md#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86). More specifically, the Agent's response/system should follow the Action/Action Input/Observation format described above.
We also support using your own Agent dataset. The dataset format must meet the requirements of [Custom datasets](%E8%87%AA%E5%AE%9A%E4%B9%89%E4%B8%8E%E6%8B%93%E5%B1%95.md#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86). More specifically, the Agent's response/system should follow the Action/Action Input/Observation format described above.
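
For illustration, here is a hypothetical jsonl row in that style (the field names follow the custom-dataset format; the tool name and values are invented):

```json
{"system": "Answer the following questions as best you can. You have access to the following APIs: get_weather ...", "query": "What is the weather in Hangzhou today?", "response": "Thought: I need to call the weather API.\nAction: get_weather\nAction Input: {\"location\": \"Hangzhou\"}\nObservation: ..."}
```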

We have added **MLP** and **Embedder** to lora_target_modules. You can apply LoRA to all linear layers (including qkvo as well as mlp and embedder) by specifying `--lora_target_modules ALL`; this **usually works best**.
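
A minimal sketch of such an invocation (the model and dataset choices are illustrative assumptions, not taken from this document):

```bash
# Attach LoRA to all linear layers: qkvo as well as MLP and embedder.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen-7b-chat \
    --dataset ms-bench \
    --lora_target_modules ALL
```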

4 changes: 3 additions & 1 deletion docs/source/LLM/LLM微调文档.md
@@ -37,7 +37,7 @@ pip install -r requirements/llm.txt -U
```

## Fine-tuning
If you want to fine-tune and run inference through a UI, see the [UI training and inference documentation](https://github.com/modelscope/swift/blob/main/docs/source/GetStarted/%E7%95%8C%E9%9D%A2%E8%AE%AD%E7%BB%83%E6%8E%A8%E7%90%86.md).
If you want to fine-tune and run inference through a UI, see the [UI training and inference documentation](../GetStarted/%E7%95%8C%E9%9D%A2%E8%AE%AD%E7%BB%83%E6%8E%A8%E7%90%86.md).

### Using Python
```python
@@ -100,6 +100,7 @@ swift sft \
    --output_dir output \

# Multi-node, multi-GPU
# If multiple machines share a disk, additionally specify `--save_on_each_node false` in each machine's shell script.
# node0
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NNODES=2 \
@@ -246,6 +247,7 @@ print(f'history: {history}')

Evaluate using a **dataset**:
```bash
# To run inference on all dataset samples, additionally specify `--show_dataset_sample -1`
# Direct inference
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' \
2 changes: 1 addition & 1 deletion docs/source/LLM/LLM量化文档.md
@@ -305,7 +305,7 @@ CUDA_VISIBLE_DEVICES=0 swift sft \
```

**Note**
- hqq supports more custom parameters, such as specifying different quantization configurations for different layers; for details see [Command-line arguments](https://github.com/modelscope/swift/blob/main/docs/source/LLM/命令行参数.md).
- hqq supports more custom parameters, such as specifying different quantization configurations for different layers; for details see [Command-line arguments](命令行参数.md).
- eetq quantization is 8-bit, so there is no need to specify quantization_bit. bf16 is currently not supported; dtype must be set to fp16.
- eetq QLoRA is currently rather slow; hqq is recommended instead. See this [issue](https://github.com/NetEase-FuXi/EETQ/issues/17).
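
A minimal sketch of an eetq run consistent with these notes (the model choice and the `--quant_method` flag are assumptions, not taken from this diff):

```bash
# eetq is fixed at 8-bit, so --quantization_bit is not needed;
# bf16 is unsupported, so dtype must be fp16.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen-7b-chat \
    --quant_method eetq \
    --dtype fp16
```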

146 changes: 146 additions & 0 deletions docs/source/LLM/LmDeploy推理加速与部署.md
@@ -0,0 +1,146 @@
# LmDeploy Inference Acceleration and Deployment

## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference Acceleration](#inference-acceleration)
- [Deployment](#deployment)
- [Multimodal](#multimodal)

## Environment Setup
GPU devices: A10, 3090, V100, or A100 all work.
```bash
# Set the global pip mirror (to speed up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

pip install lmdeploy
```

## Inference Acceleration

### Using Python

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_lmdeploy_engine, get_default_template_type,
    get_template, inference_lmdeploy, inference_stream_lmdeploy
)

model_type = ModelType.qwen_7b_chat
lmdeploy_engine = get_lmdeploy_engine(model_type)
template_type = get_default_template_type(model_type)
template = get_template(template_type, lmdeploy_engine.hf_tokenizer)
# An interface similar to `transformers.GenerationConfig`
lmdeploy_engine.generation_config.max_new_tokens = 256
generation_info = {}

request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
resp_list = inference_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)

# stream
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好吃的', 'history': history1}]
gen = inference_stream_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
query = request_list[0]['query']
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

history = resp_list[0]['history']
print(f'history: {history}')
print(generation_info)
"""
query: 你好!
response: 你好!有什么我能帮助你的吗?
query: 浙江的省会在哪?
response: 浙江省会是杭州市。
{'num_prompt_tokens': 46, 'num_generated_tokens': 13, 'num_samples': 2, 'runtime': 0.2037766759749502, 'samples/s': 9.81466593480922, 'tokens/s': 63.79532857625993}
query: 这有什么好吃的
response: 杭州有许多美食,比如西湖醋鱼、东坡肉、龙井虾仁、油炸臭豆腐等,都是当地非常有名的传统名菜。此外,当地的点心也非常有特色,比如桂花糕、马蹄酥、绿豆糕等。
history: [['浙江的省会在哪?', '浙江省会是杭州市。'], ['这有什么好吃的', '杭州有许多美食,比如西湖醋鱼、东坡肉、龙井虾仁、油炸臭豆腐等,都是当地非常有名的传统名菜。此外,当地的点心也非常有特色,比如桂花糕、马蹄酥、绿豆糕等。']]
{'num_prompt_tokens': 44, 'num_generated_tokens': 53, 'num_samples': 1, 'runtime': 0.6306625790311955, 'samples/s': 1.5856339558566632, 'tokens/s': 84.03859966040315}
"""
```

**TP (tensor parallelism):**

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

from swift.llm import (
    ModelType, get_lmdeploy_engine, get_default_template_type,
    get_template, inference_lmdeploy, inference_stream_lmdeploy
)

model_type = ModelType.qwen_7b_chat
lmdeploy_engine = get_lmdeploy_engine(model_type, tp=2)
template_type = get_default_template_type(model_type)
template = get_template(template_type, lmdeploy_engine.hf_tokenizer)
# An interface similar to `transformers.GenerationConfig`
lmdeploy_engine.generation_config.max_new_tokens = 256
generation_info = {}

request_list = [{'query': '你好!'}, {'query': '浙江的省会在哪?'}]
resp_list = inference_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)

# stream
history1 = resp_list[1]['history']
request_list = [{'query': '这有什么好吃的', 'history': history1}]
gen = inference_stream_lmdeploy(lmdeploy_engine, template, request_list, generation_info=generation_info)
query = request_list[0]['query']
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]
    print(delta, end='', flush=True)
    print_idx = len(response)
print()

history = resp_list[0]['history']
print(f'history: {history}')
print(generation_info)
"""
query: 你好!
response: 你好!有什么我能帮助你的吗?
query: 浙江的省会在哪?
response: 浙江省会是杭州市。
{'num_prompt_tokens': 46, 'num_generated_tokens': 13, 'num_samples': 2, 'runtime': 0.2080078640137799, 'samples/s': 9.61502109298861, 'tokens/s': 62.497637104425955}
query: 这有什么好吃的
response: 杭州有许多美食,比如西湖醋鱼、东坡肉、龙井虾仁、油焖笋等等。杭州的特色小吃也很有风味,比如桂花糕、叫花鸡、油爆虾等。此外,杭州还有许多美味的甜品,如月饼、麻薯、绿豆糕等。
history: [['浙江的省会在哪?', '浙江省会是杭州市。'], ['这有什么好吃的', '杭州有许多美食,比如西湖醋鱼、东坡肉、龙井虾仁、油焖笋等等。杭州的特色小吃也很有风味,比如桂花糕、叫花鸡、油爆虾等。此外,杭州还有许多美味的甜品,如月饼、麻薯、绿豆糕等。']]
{'num_prompt_tokens': 44, 'num_generated_tokens': 64, 'num_samples': 1, 'runtime': 0.5715192809584551, 'samples/s': 1.7497222461558426, 'tokens/s': 111.98222375397393}
"""
```


### Using the CLI
Coming soon...

## Deployment
Coming soon...

## Multimodal
Coming soon...
2 changes: 0 additions & 2 deletions docs/source/LLM/Qwen1.5全流程最佳实践.md
@@ -128,7 +128,6 @@ gen = inference_stream_vllm(llm_engine, template, request_list)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    request = request_list[0]
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]
@@ -346,7 +345,6 @@ gen = inference_stream_vllm(llm_engine, template, request_list)
print_idx = 0
print(f'query: {query}\nresponse: ', end='')
for resp_list in gen:
    request = request_list[0]
    resp = resp_list[0]
    response = resp['response']
    delta = response[print_idx:]