
[Frontend] Chat-based Embeddings API #9759

Merged: 46 commits, Nov 1, 2024
Commits (46)
1b91750
Initial implementation
DarkLight1337 Oct 28, 2024
61e0fcf
Update docs
DarkLight1337 Oct 28, 2024
c62be47
Cleanup
DarkLight1337 Oct 28, 2024
cc999b1
Consolidate and make code consistent
DarkLight1337 Oct 28, 2024
9ed87c1
Remove useless statement
DarkLight1337 Oct 28, 2024
efa7c6f
Rename back
DarkLight1337 Oct 28, 2024
ab9297e
Factor out common code
DarkLight1337 Oct 28, 2024
5a4f271
Reinstate truncate_prompt_tokens check
DarkLight1337 Oct 29, 2024
4a969b4
Rename
DarkLight1337 Oct 29, 2024
279b9ce
Fix
DarkLight1337 Oct 29, 2024
7de803f
Remove unused code
DarkLight1337 Oct 29, 2024
c1ef363
Migrate tokenization API
DarkLight1337 Oct 29, 2024
a10fa85
Some fixes
DarkLight1337 Oct 29, 2024
89e0710
format
DarkLight1337 Oct 29, 2024
81b94de
remove unused imports
DarkLight1337 Oct 29, 2024
a79d3b2
Migrate chat and completion APIs
DarkLight1337 Oct 29, 2024
8b950dd
Factor out trace headers code
DarkLight1337 Oct 29, 2024
2c91855
Merge branch 'main' into chat-embeddings-api
DarkLight1337 Oct 29, 2024
f5e72ff
Clean
DarkLight1337 Oct 29, 2024
9cd1ac3
More precise error handling
DarkLight1337 Oct 29, 2024
d775150
Add and update tests
DarkLight1337 Oct 29, 2024
f2b5846
Cleanup
DarkLight1337 Oct 29, 2024
4a25806
Fix tests
DarkLight1337 Oct 29, 2024
bbcfc6a
Update docs
DarkLight1337 Oct 29, 2024
b6820b7
Add docs
DarkLight1337 Oct 29, 2024
fed887a
Fix doc failure
DarkLight1337 Oct 29, 2024
1774b27
Mock out starlette
DarkLight1337 Oct 29, 2024
c94aa93
Try fix docs
DarkLight1337 Oct 29, 2024
e2ecbcd
Cleanup docs
DarkLight1337 Oct 29, 2024
fbbd8b1
Fix newlines
DarkLight1337 Oct 29, 2024
50ad3aa
Reword
DarkLight1337 Oct 29, 2024
9c1df21
Fix
DarkLight1337 Oct 29, 2024
8049030
Update
DarkLight1337 Oct 29, 2024
a387845
Update
DarkLight1337 Oct 29, 2024
d80ec7e
Update
DarkLight1337 Oct 29, 2024
ea5fd96
format
DarkLight1337 Oct 29, 2024
b05ede6
Convert to tip
DarkLight1337 Oct 29, 2024
dba9806
newline
DarkLight1337 Oct 29, 2024
557c9ef
Fix missing client
DarkLight1337 Oct 30, 2024
8c8ee96
Merge branch 'main' into chat-embeddings-api
DarkLight1337 Oct 31, 2024
c3ba030
Merge branch 'main' into chat-embeddings-api
DarkLight1337 Oct 31, 2024
46f316f
Optionally initialize request handlers
DarkLight1337 Nov 1, 2024
1179f66
Update tip
DarkLight1337 Nov 1, 2024
eb4b235
Update tests
DarkLight1337 Nov 1, 2024
bf46a16
format
DarkLight1337 Nov 1, 2024
7f188f9
Rename
DarkLight1337 Nov 1, 2024
2 changes: 2 additions & 0 deletions docs/requirements-docs.txt
Original file line number Diff line number Diff line change
Expand Up @@ -13,5 +13,7 @@ torch
py-cpuinfo
transformers
mistral_common >= 1.3.4
aiohttp
starlette
openai # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
partial-json-parser # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
2 changes: 1 addition & 1 deletion docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -96,7 +96,6 @@ def setup(app):

# Mock out external dependencies here, otherwise the autodoc pages may be blank.
autodoc_mock_imports = [
"aiohttp",
"compressed_tensors",
"cpuinfo",
"cv2",
Expand Down Expand Up @@ -143,6 +142,7 @@ def add_line(self, line: str, source: str, *lineno: int) -> None:
"python": ("https://docs.python.org/3", None),
"typing_extensions":
("https://typing-extensions.readthedocs.io/en/latest", None),
"aiohttp": ("https://docs.aiohttp.org/en/stable", None),
"pillow": ("https://pillow.readthedocs.io/en/stable", None),
"numpy": ("https://numpy.org/doc/stable", None),
"torch": ("https://pytorch.org/docs/stable", None),
Expand Down
5 changes: 5 additions & 0 deletions docs/source/dev/pooling_params.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
Pooling Parameters
==================

.. autoclass:: vllm.PoolingParams
:members:
8 changes: 4 additions & 4 deletions docs/source/getting_started/quickstart.rst
Original file line number Diff line number Diff line change
Expand Up @@ -138,10 +138,10 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
A more detailed client example can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`__.

OpenAI Chat API with vLLM
~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenAI Chat Completions API with vLLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

vLLM is designed to also support the OpenAI Chat API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
vLLM is designed to also support the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.

You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to interact with the model:

Expand All @@ -157,7 +157,7 @@ You can use the `create chat completion <https://platform.openai.com/docs/api-re
$ ]
$ }'
Alternatively, you can use the `openai` python package:
Alternatively, you can use the ``openai`` python package:

.. code-block:: python
Expand Down
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -134,6 +134,7 @@ Documentation
:caption: Developer Documentation

dev/sampling_params
dev/pooling_params
dev/offline_inference/offline_index
dev/engine/engine_index
dev/kernel/paged_attention
Expand Down
55 changes: 52 additions & 3 deletions docs/source/models/vlm.rst
Original file line number Diff line number Diff line change
Expand Up @@ -185,7 +185,7 @@ Below is an example on how to launch the same ``microsoft/Phi-3.5-vision-instruc
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt image=2

.. important::
Since OpenAI Vision API is based on `Chat Completions <https://platform.openai.com/docs/api-reference/chat>`_ API,
Since OpenAI Vision API is based on `Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`_,
a chat template is **required** to launch the API server.

Although Phi-3.5-Vision comes with a chat template, for other models you may have to provide one if the model's tokenizer does not come with it.
Expand Down Expand Up @@ -243,6 +243,10 @@ To consume the server, you can use the OpenAI client like in the example below:

A full code example can be found in `examples/openai_api_client_for_multimodal.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_api_client_for_multimodal.py>`_.

.. tip::
There is no need to place image placeholders in the text content of the API request; they are already represented by the image content.
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
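As a sketch of what such an interleaved ``content`` list might look like (the URLs are placeholders, not part of the PR):

```python
# Text and image parts may be interleaved freely; the server inserts the
# model's own image placeholder tokens, so none need to be written in the text.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What changed between this photo:"},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/before.jpg"}},
        {"type": "text", "text": "and this one?"},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/after.jpg"}},
    ],
}]
```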

.. note::

By default, the timeout for fetching images through http url is ``5`` seconds. You can override this by setting the environment variable:
Expand All @@ -251,5 +255,50 @@ A full code example can be found in `examples/openai_api_client_for_multimodal.p

$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

.. note::
There is no need to format the prompt in the API request since it will be handled by the server.
Chat Embeddings API
^^^^^^^^^^^^^^^^^^^

vLLM's Chat Embeddings API is a superset of OpenAI's `Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`_,
where a list of ``messages`` can be passed instead of batched ``inputs``. This enables multi-modal inputs to be passed to embedding models.

.. tip::
The schema of ``messages`` is exactly the same as in Chat Completions API.

In this example, we will serve the ``TIGER-Lab/VLM2Vec-Full`` model.

.. code-block:: bash

vllm serve TIGER-Lab/VLM2Vec-Full --task embedding \
--trust-remote-code --max-model-len 4096

.. important::

Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass ``--task embedding``
to run this model in embedding mode instead of text generation mode.

Since this schema is not defined by the OpenAI client, we post a request to the server using the lower-level ``requests`` library:
Contributor: Just leaving this as a thought here: should we perhaps have a fork of the openai client that supports our extensions explicitly?

Member Author: This sounds good, but not sure whether we have bandwidth to maintain it 😅

Member Author: I suggest opening an issue for this.

.. code-block:: python

import requests

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

response = requests.post(
"http://localhost:8000/v1/embeddings",
json={
"model": "TIGER-Lab/VLM2Vec-Full",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": image_url}},
{"type": "text", "text": "Represent the given image."},
],
}],
"encoding_format": "float",
},
)
response.raise_for_status()

embedding_json = response.json()
print("Embedding output:", embedding_json["data"][0]["embedding"])
55 changes: 44 additions & 11 deletions docs/source/serving/openai_compatible_server.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,13 +26,26 @@ print(completion.choices[0].message)
```

## API Reference
Please see the [OpenAI API Reference](https://platform.openai.com/docs/api-reference) for more information on the API. We support all parameters except:
- Chat: `tools`, and `tool_choice`.
- Completions: `suffix`.

vLLM also provides experimental support for OpenAI Vision API compatible inference. See more details in [Using VLMs](../models/vlm.rst).
We currently support the following OpenAI APIs:

- [Completions API](https://platform.openai.com/docs/api-reference/completions)
- *Note: `suffix` parameter is not supported.*
- [Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
- [Vision](https://platform.openai.com/docs/guides/vision)-related parameters are supported; see [Using VLMs](../models/vlm.rst).
- *Note: `image_url.detail` parameter is not supported.*
- We also support `audio_url` content type for audio files.
- Refer to [vllm.entrypoints.chat_utils](https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/chat_utils.py) for the exact schema.
- *TODO: Support `input_audio` content type as defined [here](https://github.com/openai/openai-python/blob/v1.52.2/src/openai/types/chat/chat_completion_content_part_input_audio_param.py).*
- *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings)
- Instead of `inputs`, you can pass in a list of `messages` (same schema as Chat Completions API),
which will be treated as a single prompt to the model according to its chat template.
- This enables multi-modal inputs to be passed to embedding models; see [Using VLMs](../models/vlm.rst).
- *Note: You should run `vllm serve` with `--task embedding` to ensure that the model is being run in embedding mode.*
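A minimal sketch of the chat-style Embeddings request body described above (the model name and image URL are illustrative):

```python
# Chat-style Embeddings request: `messages` replaces the usual `input`,
# and its schema matches the Chat Completions API.
payload = {
    "model": "TIGER-Lab/VLM2Vec-Full",  # serve with --task embedding
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "Represent the given image."},
        ],
    }],
    "encoding_format": "float",
}

# POST this JSON body to http://<host>:8000/v1/embeddings,
# e.g. requests.post(url, json=payload).
```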

## Extra Parameters

vLLM supports a set of parameters that are not part of the OpenAI API.
In order to use them, you can pass them as extra parameters in the OpenAI client,
or merge them directly into the JSON payload if you are calling the HTTP API directly.
Expand All @@ -49,7 +62,26 @@ completion = client.chat.completions.create(
)
```
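For the raw-HTTP route mentioned above, the same extra parameters sit at the top level of the JSON body, alongside the standard OpenAI fields. A sketch (the model name is illustrative; ``guided_choice`` is one of vLLM's documented extra parameters):

```python
# Extra vLLM parameters are merged into the request body itself,
# mirroring extra_body in the Python client.
payload = {
    "model": "NousResearch/Meta-Llama-3-8B-Instruct",
    "messages": [{
        "role": "user",
        "content": "Classify this sentiment: vLLM is wonderful!",
    }],
    "guided_choice": ["positive", "negative"],  # vLLM-specific extra parameter
}

# POST to the server, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
```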

### Extra Parameters for Chat API
### Extra Parameters for Completions API

The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-sampling-params
:end-before: end-completion-sampling-params
```

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-extra-params
:end-before: end-completion-extra-params
```

### Extra Parameters for Chat Completions API

The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
Expand All @@ -66,21 +98,22 @@ The following extra parameters are supported:
:end-before: end-chat-completion-extra-params
```

### Extra Parameters for Completions API
The following [sampling parameters (click through to see documentation)](../dev/sampling_params.rst) are supported.
### Extra Parameters for Embeddings API

The following [pooling parameters (click through to see documentation)](../dev/pooling_params.rst) are supported.

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-sampling-params
:end-before: end-completion-sampling-params
:start-after: begin-embedding-pooling-params
:end-before: end-embedding-pooling-params
```

The following extra parameters are supported:

```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-extra-params
:end-before: end-completion-extra-params
:start-after: begin-embedding-extra-params
:end-before: end-embedding-extra-params
```

## Chat Template
Expand Down
13 changes: 4 additions & 9 deletions tests/entrypoints/openai/test_basic.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
from http import HTTPStatus
from typing import List

import openai
import pytest
import pytest_asyncio
import requests
Expand Down Expand Up @@ -83,10 +82,8 @@ async def client(server):
indirect=True,
)
@pytest.mark.asyncio
async def test_show_version(client: openai.AsyncOpenAI):
base_url = str(client.base_url)[:-3].strip("/")

response = requests.get(base_url + "/version")
async def test_show_version(server: RemoteOpenAIServer):
response = requests.get(server.url_for("version"))
response.raise_for_status()

assert response.json() == {"version": VLLM_VERSION}
Expand All @@ -102,9 +99,7 @@ async def test_show_version(client: openai.AsyncOpenAI):
indirect=True,
)
@pytest.mark.asyncio
async def test_check_health(client: openai.AsyncOpenAI):
base_url = str(client.base_url)[:-3].strip("/")

response = requests.get(base_url + "/health")
async def test_check_health(server: RemoteOpenAIServer):
response = requests.get(server.url_for("health"))

assert response.status_code == HTTPStatus.OK