From dc4aea677ab0520d91ff4979e80340cb5a090095 Mon Sep 17 00:00:00 2001
From: Jiangtao Hu
Date: Wed, 9 Oct 2024 16:59:42 +0800
Subject: [PATCH] [Doc] Fix VLM prompt placeholder sample bug (#9170)

---
 docs/source/models/vlm.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/source/models/vlm.rst b/docs/source/models/vlm.rst
index 45316fd34a5d2..b2262de238660 100644
--- a/docs/source/models/vlm.rst
+++ b/docs/source/models/vlm.rst
@@ -25,7 +25,7 @@ The :class:`~vllm.LLM` class can be instantiated in much the same way as languag
 To pass an image to the model, note the following in :class:`vllm.inputs.PromptType`:
 
 * ``prompt``: The prompt should follow the format that is documented on HuggingFace.
-* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`. 
+* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
 
 .. code-block:: python
 
     # Load the image using PIL.Image
     image = PIL.Image.open(...)
-    
+
     # Single prompt inference
     outputs = llm.generate({
         "prompt": prompt,
@@ -68,7 +68,7 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptT
         "prompt": prompt,
         "multi_modal_data": mm_data,
     })
-    
+
     for o in outputs:
         generated_text = o.outputs[0].text
         print(generated_text)
@@ -116,7 +116,7 @@ Instead of passing in a single image, you can pass in a list of images.
 .. code-block:: python
 
     # Refer to the HuggingFace repo for the correct format to use
-    prompt = "<|user|>\n\n\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
+    prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
 
     # Load the images using PIL.Image
     image1 = PIL.Image.open(...)
@@ -135,11 +135,11 @@ Instead of passing in a single image, you can pass in a list of images.
 A code example can be found in `examples/offline_inference_vision_language_multi_image.py `_.
 
-Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL `_ as it supports videos: 
+Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL `_ as it supports videos:
 
 .. code-block:: python
 
-    # Specify the maximum number of frames per video to be 4. This can be changed. 
+    # Specify the maximum number of frames per video to be 4. This can be changed.
     llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})
 
     # Create the request payload.
@@ -157,7 +157,7 @@ Multi-image input can be extended to perform video captioning. We show this with
     # Perform inference and log output.
     outputs = llm.chat([message])
-    
+
     for o in outputs:
         generated_text = o.outputs[0].text
         print(generated_text)
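
The snippet below is not part of the patch; it is a minimal sketch showing how the corrected multi-image prompt from the documentation change above is used end to end. The model name (``microsoft/Phi-3.5-vision-instruct``), the ``limit_mm_per_prompt`` value, and the image file names are illustrative assumptions, not taken from the diff.

.. code-block:: python

    # Minimal sketch (assumptions noted above): multi-image inference with the
    # corrected <|image_N|> placeholders from the fixed sample prompt.
    from PIL import Image
    from vllm import LLM

    # Assumed model; any vLLM-supported VLM that uses <|image_N|> placeholders
    # follows the same pattern.
    llm = LLM(
        model="microsoft/Phi-3.5-vision-instruct",
        trust_remote_code=True,
        limit_mm_per_prompt={"image": 2},  # allow two images per prompt
    )

    # One <|image_N|> placeholder per image, matching the fixed sample prompt.
    prompt = ("<|user|>\n<|image_1|>\n<|image_2|>\n"
              "What is the content of each image?<|end|>\n<|assistant|>\n")

    # Assumed local image files.
    image1 = Image.open("image_1.jpg")
    image2 = Image.open("image_2.jpg")

    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": [image1, image2]},
    })

    for o in outputs:
        print(o.outputs[0].text)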