update multi-modal docs (#1212)

modelscope · Jun 24, 2024 · 04a98d4
1 parent 38e4d96

Showing 21 changed files with 397 additions and 216 deletions.
diff --git a/docs/source/Multi-Modal/cogvlm2最佳实践.md b/docs/source/Multi-Modal/cogvlm2最佳实践.md
@@ -32,6 +32,11 @@ CUDA_VISIBLE_DEVICES=0 swift infer --model_type cogvlm2-19b-chat
 输出: (支持传入本地路径或URL)
 ```python
 """
+<<< 你好
+Input a media path or URL <<<
+你好！我是一个人工智能助手，随时准备回答你的问题。有什么我可以帮助你的吗？
+--------------------------------------------------
+<<< clear
 <<< 描述这张图片
 Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
 这是一张特写照片，展示了一只灰色和白色相间的猫。这只猫的眼睛是灰色的，鼻子是粉色的，嘴巴微微张开。它的毛发看起来柔软而蓬松，背景模糊，突出了猫的面部特征。
@@ -68,6 +73,23 @@ Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.co
 但心中的美好永远留存。
 这段旅程，
 让他们更加珍惜生命中的每一刻。
+--------------------------------------------------
+<<< clear
+<<< 对图片进行OCR
+Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png
+图中内容是：
+
+简介
+
+SWIFT支持250+LLM和35+ MLLM(多模态大模型)的训练、推理、评测和部署。开发者可以直接将我们的框架应用到自己的Research和生产环境中，实现模型训练评测到应用的完整链路。我们除支持了PEFT提供的轻量训练方案外，也提供了一个完整的Adapters库以支持最新的训练技术，如NEFTune、LoRA+、LLaMA-PRO等，这个适配器库可以脱离训练脚本直接使用在自己的自定流程中。
+
+为方便不熟悉深度学习的用户使用，我们提供了一个Gradio的web-ui用于控制训练和推理，并提供了配套的深度学习课程和最佳实践供新手入门。
+
+此外，我们也在拓展其他模态的能力，目前我们支持了AnimateDiff的全参数训练和LoRA训练。
+
+SWIFT具有丰富的文档体系，如有使用问题请查看这里。
+
+可以在Huggingface space和ModelScope创空间中体验SWIFTweb-ui功能了。
 """
 ```
 
@@ -89,6 +111,10 @@ poem:
 
 <img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
 
+ocr:
+
+<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png" width="250" style="display: inline-block;">
+
 **单样本推理**
 
 ```python

diff --git a/docs/source/Multi-Modal/cogvlm最佳实践.md b/docs/source/Multi-Modal/cogvlm最佳实践.md
@@ -48,6 +48,11 @@ In a world where night and day intertwine,
 A boat floats gently, reflecting the moon's shine.
 Fireflies dance, their glow a mesmerizing trance,
 As the boat sails through a tranquil, enchanted expanse.
+--------------------------------------------------
+<<< clear
+<<< Perform OCR on the image.
+Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png
+The image contains textual content that describes the capabilities and features of the SWIFT framework. It mentions support for training, inference, and deployment of 250+ LLMs and 35+ MLMs, and how developers can apply this framework to their research and production environments. It also mentions lightweight training solutions provided by PEFT and an adapter library to support the latest training techniques. Additionally, the text highlights that SWIFT offers capabilities for other modalities and supports full-parameter training and LLaMA training for AnimateDiff. There's also a mention of rich documentation available on Huggingface space and ModelScope studio.
 """
 ```
 
@@ -69,6 +74,10 @@ poem:
 
 <img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
 
+ocr_en:
+
+<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png" width="250" style="display: inline-block;">
+
 **单样本推理**
 
 ```python

diff --git a/docs/source/Multi-Modal/deepseek-vl最佳实践.md b/docs/source/Multi-Modal/deepseek-vl最佳实践.md
@@ -66,6 +66,14 @@ CUDA_VISIBLE_DEVICES=0 swift infer --model_type deepseek-vl-1_3b-chat
 舟儿前行不自知。
 夜深人静思绪远，
 孤舟独行心悠然。
+--------------------------------------------------
+<<< clear
+<<< <img>https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png</img>对图片进行OCR
+The image contains Chinese text and appears to be a screenshot of a document or webpage. The text is divided into several paragraphs, and there are several instances of URLs and Chinese characters. The text is not entirely clear due to the resolution, but some of the visible words and phrases include "SWIFT", "250+", "LLM35+", "MLM", "PEFT", "adapters", "GPT", "XNLI", "Tune", "LORA", "LAMA-PRO", "Gradio", "web.ui", "AnimateDiff", "HuggingFace", "space", "ModelScope", and "SWIFT web".
+
+The text seems to be discussing topics related to machine learning, specifically mentioning models like SWIFT, GPT, and LAMA-PRO, as well as tools and frameworks like HuggingFace and ModelScope. The URLs suggest that the text might be referencing online resources or repositories related to these topics.
+
+The text is not fully legible due to the low resolution and the angle at which the image was taken, which makes it difficult to provide a precise transcription. However, the presence of technical terms and URLs indicates that the content is likely from a technical or academic context, possibly a research paper, a technical report, or an article discussing advancements in machine learning and related technologies.
 """
 ```
 
@@ -87,6 +95,9 @@ poem:
 
 <img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
 
+ocr:
+
+<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png" width="250" style="display: inline-block;">
 
 **单样本推理**
 

diff --git a/docs/source/Multi-Modal/glm4v最佳实践.md b/docs/source/Multi-Modal/glm4v最佳实践.md
@@ -30,6 +30,11 @@ CUDA_VISIBLE_DEVICES=0 swift infer --model_type glm4v-9b-chat
 输出: (支持传入本地路径或URL)
 ```python
 """
+<<< 你好
+Input a media path or URL <<<
+你好👋！很高兴见到你，欢迎问我任何问题。
+--------------------------------------------------
+<<< clear
 <<< 描述这张图片
 Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
 这是一张特写照片，展示了一只毛茸茸的小猫。小猫的眼睛大而圆，呈深蓝色，眼珠呈金黄色，非常明亮。它的鼻子短而小巧，是粉色的。小猫的嘴巴紧闭，胡须细长。它的耳朵竖立着，耳朵内侧是白色的，外侧是棕色的。小猫的毛发看起来柔软而浓密，主要是白色和棕色相间的条纹图案。背景模糊不清，但似乎是一个室内环境。
@@ -54,6 +59,23 @@ Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.co
 人在画中寻诗意，
 
 心随景迁忘忧愁。
+--------------------------------------------------
+<<< clear
+<<< 对图片进行OCR
+Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png
+图片中的OCR结果如下：
+
+简介
+
+SWIFT支持250+LLM和35+MLLM（多模态大模型）的训练、推理、评测和部署。开发者可以直接将我们的框架应用到自己的Research和生产环境中，实现模型训练评测到应用的完整链路。我们除支持了PEFT提供的轻量训练方案外，也提供了一个完整的Adapters库以支持最新的训练技术，如NEFTune、LoRA+、LLaMA-PRO等，这个适配器库可以脱离训练脚本直接使用在自己的自定流程中。
+
+为方便不熟悉深度学习的用户使用，我们提供了一个Gradio的web-ui用于控制训练和推理，并提供了配套的深度学习课程和最佳实践供新入门。
+
+此外，我们也在拓展其他模态的能力，目前我们支持了AnimateDiff的全参数训练和LoRA训练。
+
+SWIFT具有丰富的文档体系，如有使用问题请请查看这里。
+
+可以在Huggingface space和ModelScope创空间中体验SWIFT web-ui功能了。
 """
 ```
 
@@ -75,6 +97,10 @@ poem:
 
 <img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
 
+ocr:
+
+<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png" width="250" style="display: inline-block;">
+
 **单样本推理**
 
 ```python

diff --git a/docs/source/Multi-Modal/index.md b/docs/source/Multi-Modal/index.md
@@ -13,13 +13,13 @@
 5. [Phi3-Vision最佳实践](phi3-vision最佳实践.md)
 
 
-一轮对话只能包含一张图片:
+一轮对话只能包含一张图片（可能可以不含图片）:
 1. [Llava最佳实践](llava最佳实践.md)
 2. [Yi-VL最佳实践.md](yi-vl最佳实践.md)
 3. [mPLUG-Owl2最佳实践](mplug-owl2最佳实践.md)
 
 
-整个对话围绕一张图片:
+整个对话围绕一张图片（可能可以不含图片）:
 1. [CogVLM最佳实践](cogvlm最佳实践.md), [CogVLM2最佳实践](cogvlm2最佳实践.md), [glm4v最佳实践](glm4v最佳实践.md)
 2. [MiniCPM-V最佳实践](minicpm-v最佳实践.md), [MiniCPM-V-2最佳实践](minicpm-v-2最佳实践.md), [MiniCPM-V-2.5最佳实践](minicpm-v-2.5最佳实践.md)
 3. [InternVL-Chat-V1.5最佳实践](internvl最佳实践.md)
diff --git a/docs/source/Multi-Modal/internlm-xcomposer2最佳实践.md b/docs/source/Multi-Modal/internlm-xcomposer2最佳实践.md
@@ -26,26 +26,27 @@ CUDA_VISIBLE_DEVICES=0 swift infer --model_type internlm-xcomposer2-7b-chat
 ```python
 """
 <<< 你是谁？
- 我是你的助手，一个基于语言的人工智能模型，可以回答你的问题。
+我是浦语·灵笔，一个由上海人工智能实验室开发的语言模型。我能理解并流畅地使用英语和中文与你对话。
 --------------------------------------------------
 <<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img><img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png</img>这两张图片有什么区别
- 这两张图片是不同的, 第一张是羊的图片, 第二张是猫的图片
+这两张图片没有直接的关联，它们分别展示了两个不同的场景。第一幅图是一张卡通画，描绘了一群羊在草地上，背景是蓝天和山脉。第二幅图则是一张猫的照片，猫正看着镜头，背景模糊不清。
 --------------------------------------------------
+<<< clear
 <<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png</img>图中有几只羊
- 图中有4只羊
+图中有4只羊
 --------------------------------------------------
 <<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png</img>计算结果是多少
- 计算结果是1452+45304=46756
+1452 + 45304 = 46756
 --------------------------------------------------
 <<< <img>http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png</img>根据图片中的内容写首诗
- 湖面波光粼粼，小舟独自飘荡。
-船上点灯，照亮夜色，
-星星点点，倒映水中。
-
-远处山峦，云雾缭绕，
-天空繁星，闪烁不停。
-湖面如镜，倒影清晰，
-小舟穿行，如诗如画。
+夜色苍茫月影斜，
+湖面平静如明镜。
+小舟轻荡波光里，
+灯火微摇映水乡。
+--------------------------------------------------
+<<< clear
+<<< <img>https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png</img>对图片进行OCR
+很抱歉，我无法对您提供的图片进行OCR。如果您需要文本识别服务，您可以上传图片到其他支持OCR服务的平台，或者您可以尝试使用一些在线OCR工具。
 """
 ```
 
@@ -67,6 +68,9 @@ poem:
 
 <img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
 
+ocr:
+
+<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png" width="250" style="display: inline-block;">
 
 **单样本推理**
 

diff --git a/docs/source/Multi-Modal/internvl最佳实践.md b/docs/source/Multi-Modal/internvl最佳实践.md
@@ -42,43 +42,58 @@ CUDA_VISIBLE_DEVICES=0,1 swift infer --model_type internvl-chat-v1_5 --dtype bf1
 输出: (支持传入本地路径或URL)
 ```python
 """
-<<< Describe this image.
-Input a media path or URL <<<  http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
-This is a high-resolution image of a kitten. The kitten has striking blue eyes and a fluffy white and grey coat. The fur pattern suggests that it may be a Maine Coon or a similar breed. The kitten's ears are perked up, and it has a curious and innocent expression. The background is blurred, which brings the focus to the kitten's face.
+<<< 你是谁
+Input a media path or URL <<<
+我是一个人工智能助手，旨在通过自然语言处理和机器学习技术来帮助用户解决问题和完成任务。
 --------------------------------------------------
 <<< clear
-<<< How many sheep are in the picture?
+<<< 描述这张图片
+Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
+这张图片是一只小猫咪的特写照片。这只小猫咪有着蓝灰色的眼睛和白色的毛发，上面有灰色和黑色的条纹。它的耳朵是尖的，眼睛睁得大大的，看起来非常可爱和好奇。背景是模糊的，无法分辨具体的环境，但看起来像是在室内，有柔和的光线。
+--------------------------------------------------
+<<< clear
+<<< 图中有几只羊
 Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png
-There are four sheep in the picture.
+图中有四只羊。
 --------------------------------------------------
 <<< clear
-<<< What is the calculation result?
+<<< 计算结果是多少?
 Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
-The calculation result is 59,856.
+将两个数相加，得到：
+1452 + 45304 = 46766
+因此，1452 + 45304 = 46766。
 --------------------------------------------------
 <<< clear
-<<< Write a poem based on the content of the picture.
+<<< 根据图片中的内容写首诗
 Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png
-Token indices sequence length is longer than the specified maximum sequence length for this model (5142 > 4096). Running this sequence through the model will result in indexing errors
-In the still of the night,
-A lone boat sails on the light.
-The stars above, a twinkling sight,
-Reflecting in the water's might.
-
-The trees stand tall, a silent guard,
-Their leaves rustling in the yard.
-The boatman's lantern, a beacon bright,
-Guiding him through the night.
-
-The river flows, a gentle stream,
-Carrying the boatman's dream.
-His journey long, his heart serene,
-In the beauty of the scene.
-
-The stars above, a guiding light,
-Leading him through the night.
-The boatman's journey, a tale to tell,
-Of courage, hope, and love as well.
+夜色笼罩水面，
+小舟轻摇入画帘。
+星辉闪烁如珠串，
+月色朦胧似轻烟。
+
+树影婆娑映水面，
+静谧宁和心自安。
+夜深人静思无限，
+唯有舟影伴我眠。
+--------------------------------------------------
+<<< clear
+<<< 对图片进行OCR
+Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png
+图中所有文字：
+简介
+SWIFT支持250＋LLM和35＋MLLM（多模态大模型）的训练、推
+理、评测和部署。开发者可以直接将我们的框架应用到自己的Research和
+生产环境中，实现模型训练评测到应用的完整链路。我们除支持
+PEFT提供的轻量训练方案外，也提供了一个完整的Adapters库以支持
+最新的训练技术，如NEFTune、LoRA+、LLaMA-PRO等，这个适配
+器库可以脱离训练脚本直接使用在自已的自定义流程中。
+为了方便不熟悉深度学习的用户使用，我们提供了一个Gradio的web-ui
+于控制训练和推理，并提供了配套的深度学习课程和最佳实践供新手入
+此外，我们也正在拓展其他模态的能力，目前我们支持了AnimateDiff的全参
+数训练和LoRA训练。
+SWIFT具有丰富的文档体系，如有使用问题请查看这里：
+可以在Huggingface space和ModelScope创空间中体验SWIFT web-
+ui功能了。
 """
 ```
 
@@ -100,6 +115,10 @@ poem:
 
 <img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
 
+ocr:
+
+<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png" width="250" style="display: inline-block;">
+
 **单样本推理**
 
 ```python

diff --git a/docs/source/Multi-Modal/llava最佳实践.md b/docs/source/Multi-Modal/llava最佳实践.md
@@ -42,6 +42,10 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 swift infer --model_type llava1_6-yi-34b-instruct
 输出: (支持传入本地路径或URL)
 ```python
 """
+<<< who are you
+Input a media path or URL <<<
+I am a language model, specifically a transformer model, trained to generate text based on the input it receives. I do not have personal experiences or emotions, and I do not have a physical form. I exist purely as a software program that can process and generate text.
+--------------------------------------------------
 <<< Describe this image.
 Input a media path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
 The image shows a close-up of a kitten with a soft, blurred background that suggests a natural, outdoor setting. The kitten has a mix of white and gray fur with darker stripes, typical of a tabby pattern. Its eyes are wide open, with a striking blue color that contrasts with the kitten's fur. The kitten's nose is small and pink, and its whiskers are long and white, adding to the kitten's cute and innocent appearance. The lighting in the image is soft and diffused, creating a gentle and warm atmosphere. The focus is sharp on the kitten's face, while the rest of the image is slightly out of focus, which draws attention to the kitten's features.
@@ -85,6 +89,20 @@ The boat, a symbol of solitude,
 In the vast expanse of the universe's beauty,
 A lone journey, a solitary quest,
 In the quiet of the night, it finds its rest.
+--------------------------------------------------
+<<< Perform OCR on the image.
+Input a media path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png
+The text in the image is as follows:
+
+INTRODUCTION
+
+SWIFT supports training, inference, evaluation and deployment of 250+ LLMs (multimodal large models). Developers can directly apply our framework to their own research and production environments to realize the complete workflow from model training and evaluation to application. In addition, SWIFT provides a complete Adapters library to support the latest training techniques such as NLP, Vision, etc. This adapter library can be used directly in your own custom workflow without our training scripts.
+
+To facilitate use by users unfamiliar with deep learning, we provide a Grado web-ui for controlling training and inference, as well as accompanying deep learning courses and best practices for beginners.
+
+SWIFT has rich documentation for users, please check here.
+
+SWIFT is web-ui available both on Huggingface space and ModelScope studio, please feel free to try!
 """
 ```
 
@@ -106,6 +124,10 @@ poem:
 
 <img src="http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/poem.png" width="250" style="display: inline-block;">
 
+ocr_en:
+
+<img src="https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr_en.png" width="250" style="display: inline-block;">
+
 **单样本推理**
 
 ```python