
Support chat template and echo for chat API #1756

Merged
merged 10 commits into from
Dec 1, 2023

Conversation

Tostino
Contributor

@Tostino Tostino commented Nov 22, 2023

This pull request introduces the chat template feature to vLLM, using the template stored in the tokenizer to improve compatibility with the OpenAI Chat API.
https://huggingface.co/blog/chat-templates

This only affects the OpenAI API chat/completions endpoint; the regular completions endpoint does not use this feature.
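For context, with this PR the OpenAI-compatible server can be launched along these lines (the model name and template path are illustrative; the two flags are the ones introduced below):

```shell
# Launch the OpenAI-compatible server with a custom chat template and response role.
# Model name and template path are examples, not requirements.
python -m vllm.entrypoints.openai.api_server \
    --model Tostino/Inkbot-13B-8k-0.2 \
    --chat-template ./template_inkbot.jinja \
    --response-role assistant
```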

There has already been a ton of discussion under the previous PR (#1493), but I accidentally messed things up by replacing the branch, so we are trying this again...

  1. Addition of the --chat-template command-line argument to specify a chat template file or single-line template for the model.
  2. Implementation of the --response-role command-line argument for defining the role name in chat responses when add_generation_prompt is set to true.
  3. Updates to chat API request handling to correctly finish partial responses and to echo the input portions of messages (request.add_generation_prompt and request.echo).
  4. Addition of new chat template examples (template_chatml.jinja, template_alpaca.jinja, and template_inkbot.jinja) showing the multiple ways they can be specified.
  5. More robust error handling, and fixes so the responses actually match the OpenAI API format.
  6. Update quickstart.rst to show the new features.

@aarnphm
Contributor

aarnphm commented Nov 22, 2023

@simon-mo what's your opinion on just supporting chat templates through an envvar? By doing this, there would be no --prompt-template or additional file logic in argparse. We can redirect users to the HF chat templates docs for this.

vllm should be able to document how people can pass in the jinja templates through an envvar, so that vllm won't handle any parsing, and users are responsible for this?

I think this is powerful in that users have full control of the chat templates. By default, if none is provided, then fall back to huggingface behaviour?

I think we can provide a default chat template if needed

edit: This is probably good for serverless as well.
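A minimal sketch of what this proposal would look like (the variable name VLLM_CHAT_TEMPLATE and the inline template are hypothetical illustrations, not an actual vLLM setting):

```python
import os
from jinja2 import Template

# Hypothetical envvar name -- illustrating the proposal, not an actual vLLM setting.
os.environ.setdefault(
    "VLLM_CHAT_TEMPLATE",
    "{% for m in messages %}{{ m['role'] }}: {{ m['content'] }}\n{% endfor %}",
)

# The server would read the template once at startup and render it per request,
# leaving all template authoring (and parsing mistakes) to the user.
template = Template(os.environ["VLLM_CHAT_TEMPLATE"])
prompt = template.render(messages=[{"role": "user", "content": "Hello!"}])
print(prompt)  # user: Hello!
```

Under this scheme vLLM does no parsing of its own; a malformed template simply raises a jinja2 error at startup.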

@casper-hansen
Contributor

casper-hansen commented Nov 23, 2023

Can we also make sure there is a way to disable chat templates such that users provide messages that are preformatted?

@Tostino
Contributor Author

Tostino commented Nov 23, 2023

Can we also make sure there is a way to disable chat templates such that users provide messages that are preformatted?

As said before, you use the regular completion endpoint if you want to provide a pre-formatted string to the model.

Explain how we would provide the preformatted text using the chat/completion endpoint which requires a list of messages.
Here is an example call:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "Tostino/Inkbot-13B-8k-0.2",
    "stop": ["\n<#user#>", "\n<#bot#>"],
    "stream": false,
    "add_generation_prompt": true,
    "echo": true,
    "n": 1,
    "messages": [
	{"role": "meta-current_date", "content": "2023-10-20"},
	{"role": "meta-task_name", "content": "kg"},
	{"role": "system", "content": "You are a helpful assistant."},
	{"role": "user", "content": "Hello!"},
	{"role": "assistant", "content": "Hello, how are you?"},
	{"role": "user", "content": "Great, thank you! Now I would like some help planning my weekend in Asheville, I would like some hike suggestions."},
	{"role": "assistant", "content": "Asheville is a fantastic destination for hiking enthusiasts! Here are a few suggestions:\n\n1. Black Balsam Knob Trail: This 4.4-mile round trip trail offers stunning views of the surrounding mountains.\n2. Max Patch Loop: This 2.4-mile loop is perfect for families or beginners, featuring wildflowers and mountain views.\n3. Shut-ins Trail: The 2.8-mile hike takes you through a scenic gorge with cascading waterfalls.\n4. Graveyard Fields: For a longer hike, this 8-mile loop takes you through meadows, streams, and a beautiful forest.\n5. Linville Gorge: If you are up for a more challenging hike, the Linville Gorge has various trails with breathtaking views."},
	{"role": "user", "content": "Extract key concepts and relationships from the conversation to form a knowledge graph."}	
	]
  }'

@casper-hansen
Contributor

I see where you are coming from - perhaps it should be mentioned in the PR that this only applies to certain endpoints.

@aarnphm
Contributor

aarnphm commented Nov 24, 2023

Can we also make sure there is a way to disable chat templates such that users provide messages that are preformatted?

As said before, you use the regular completion endpoint if you want to provide a pre-formatted string to the model.

Explain how we would provide the preformatted text using the chat/completion endpoint, which requires a list of messages.

I think chat templates should only apply for /v1/chat/completions. Users can change the chat templates if needed. /v1/completions should still stay the way they were before, which means that users should have full responsibility for formatting their prompt.

@intrafindBreno

This pull request is a great contribution to the vllm project. I hope it gets merged soon!

@dongxiaolong

Hi @Tostino ,

Thank you so much for your PR! Your contribution has successfully addressed key issues in model inference, enabling the implementation of function calls using models like OpenHermes. Here is an example I created that demonstrates the effective application of OpenHermes: OpenHermes Functions with VLLM. Your work has truly unlocked the potential of OpenHermes. Adding another practical example to your contribution would greatly help others understand the significant impact of this work.

@tjtanaa
Contributor

tjtanaa commented Nov 27, 2023

Huggingface now supports chat_template through their tokenizer class. I think if the community uses this feature, it will make chat template tracking easier.

To that end, we have set up a Hugging Face Space to encourage developers and the community to use the Hugging Face tokenizer's chat_template. We have provided many of the well-known prompt templates, with many more to come within the next few weeks.

The UI also allows users to download a chat template as jinja2, which I think would be beneficial for this PR's feature as well.

@Tostino
Contributor Author

Tostino commented Nov 27, 2023

@tjtanaa That is a neat tool, glad there is more work going into this in the same direction. Being able to download the template to a file easily would be beneficial for users.
One thing to note is that it incorrectly parses my template for Tostino/Inkbot-13B-8k-0.2 when I tried it. The newlines are ignored.
[screenshot: the Inkbot template parsed with its newlines dropped]

@dongxiaolong Glad you found it useful! Very cool example to see working.

@tjtanaa
Contributor

tjtanaa commented Nov 28, 2023


Thank you for taking the time to try it out.
I am following the huggingface example, which stores the chat template as one-liner jinja2 code. So if you want a newline, you should add + '\n' explicitly, e.g.

airoboros_v2.jinja2

{% if not add_generation_prompt is defined %}
{% set add_generation_prompt = false %}
{% endif %}
{% for message in messages %}
{% if not loop.first %}
{% endif %}
{% if message['role'] == 'system' %}
{{ message['content'] + '\n' }}
{% elif message['role'] == 'user' %}
{{ 'USER: ' + message['content'] + '\n' }}
{% elif message['role'] == 'assistant' %}
{{ 'ASSISTANT: ' + message['content'] + '</s>' }}
{% endif %}
{% endfor %}
{% if add_generation_prompt %}
{{ 'ASSISTANT:' }}
{% endif %}
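To illustrate the one-liner style, here is a minimal sketch rendering a trimmed-down version of the template above (not the exact airoboros file) with jinja2, where every newline is written as an explicit '\n' escape inside the template string:

```python
from jinja2 import Template

# One-liner chat template: newlines are added explicitly via '\n' inside jinja
# string literals, so nothing depends on the template file's own line breaks.
one_liner = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}{{ 'USER: ' + message['content'] + '\\n' }}"
    "{% elif message['role'] == 'assistant' %}{{ 'ASSISTANT: ' + message['content'] }}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ 'ASSISTANT:' }}{% endif %}"
)

rendered = Template(one_liner).render(
    messages=[{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
)
# prints:
# USER: Hello
# ASSISTANT:
print(rendered)
```

jinja2 unescape-decodes string literals, so the two-character sequence \n inside '...' becomes a real newline in the rendered prompt even though the template source stays on one line.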

@Tostino
Contributor Author

Tostino commented Nov 28, 2023

@tjtanaa where can we discuss that issue other than on this PR?

My template is valid jinja, and it does have new lines embedded in the single-line version of it. There was an extra step to correctly load the single line jinja in this PR to deal with new lines properly, I'm guessing you just need to do the same.

Edit: I'll open a discussion on the HF repo.
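The extra loading step mentioned here can be sketched like this (assumed logic for illustration, not the exact code in the PR): a --chat-template value may be a file path or an inline single-line template whose literal "\n" escapes must become real newlines.

```python
from pathlib import Path

def load_chat_template(value: str) -> str:
    """Resolve a --chat-template argument that may be either a file path
    or an inline single-line template string (hypothetical helper)."""
    path = Path(value)
    if path.is_file():
        return path.read_text()
    # Inline form: turn literal "\n" escape sequences into real newlines,
    # so single-line templates can still embed line breaks.
    return value.replace("\\n", "\n")

print(load_chat_template("USER: {{ text }}\\nASSISTANT:"))
```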

@Tostino
Contributor Author

Tostino commented Nov 29, 2023

@aarnphm / @simon-mo / @WoosukKwon (or anyone else appropriate)

Is there anything that needs to be addressed prior to merging this PR at this point? There is nothing else I am aware of, as it's been pretty thoroughly tested.

                        help="The file path to the chat template, "
                        "or the template in single-line form "
                        "for the specified model")
    parser.add_argument("--response-role",
Contributor

I think this largely depends on the templates itself right? By default I don't think this is needed.

Contributor Author

@Tostino Tostino Nov 29, 2023

The response-role? Some models may not use user/assistant as the role names; template creators are free to choose. I defaulted to the same behavior as the OpenAI API, but made it compatible with the flexibility provided by the chat_template feature.

Contributor

@aarnphm aarnphm Nov 29, 2023

Yeah, but we already have that in message['role'], right? (It usually alternates between user and assistant, so in the phi-1.5 case, for example, it would be BOB and SANDRA.) Not sure why we need this.

Contributor Author

@Tostino Tostino Nov 29, 2023

Yes, it usually alternates between user/assistant...but given the following request, how would we know the role to respond with?

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "Tostino/Inkbot-13B-8k-0.2",
    "stream": false,
    "n": 1,
    "messages": [
	{"role": "meta-current_date", "content": "2023-10-20"},
	{"role": "meta-task_name", "content": "general"},
	{"role": "system", "content": "You are a helpful assistant."},
	{"role": "user", "content": "What is the capital of the USA?"}	
	]
  }'

Previously, it was hard-coded as assistant. Now it is configurable with that cli argument. Ideally, this would be additional metadata about how to use the template that would be stored somewhere in the tokenizer...but HF didn't think that far ahead.

Contributor

Users should be responsible for doing few-shot prompting, right?

Contributor Author

@Tostino Tostino Nov 29, 2023

No, that is totally unnecessary for a lot of models and just eats up context space. Not to mention it locks us into two-role conversations. What if there is a user_context role that is used for any input from a file the user wants to interact with, appended after their text input? That is a real use case I have been using this implementation for.

Comment on lines +227 to +230
if request.add_generation_prompt:
    return response_role
else:
    return request.messages[-1]["role"]
Contributor

Like here should it just be request.messages[-1]['role']?

Contributor Author

No, that doesn't work.

@Yard1
Collaborator

Yard1 commented Nov 30, 2023

@Tostino Would it be possible to add some simple unit tests for this?

@Tostino
Contributor Author

Tostino commented Nov 30, 2023

@Yard1 Sure, are there any existing tests for the server that I can add to?

@Yard1
Collaborator

Yard1 commented Nov 30, 2023

I don't think there are any for the OpenAI server specifically.

@Tostino
Contributor Author

Tostino commented Nov 30, 2023

@Yard1 So, to properly test this, it looks like I need to refactor a whole lot more code... (I could be mistaken...my day job is not Python, so I've never used any of the testing libraries before and am learning on the fly.)

Are you sure you want me to do that? I am not doing any more work that will be thrown away...I've spent far too much time on this already.

And it now looks like I'll have to spend more time rebasing, because there are conflicts again.

Tostino and others added 4 commits November 30, 2023 12:18
1. Addition of the `--chat-template` command-line argument to specify a chat template file or single-line template for the model.
2. Implementation of the `--response-role` command-line argument for defining the role name in chat responses when `add_generation_prompt` is set to true.
3. Introduction of the `--disable-endpoints` argument to allow disabling of specific server endpoints.
4. Update to the chat API request handling to support handling finishing a partial response correctly, and echoing input portions of messages (request.add_generation_prompt, and request.echo).
5. Addition of new chat templates in JSON and Jinja formats (`template_chatml.json`, `template_alpaca.jinja`, and `template_inkbot.jinja`) showing the multiple ways they can be specified.
6. More robust error handling, and fix the responses to actually match the OpenAI API format.
7. Update quickstart.rst to show the new features.
…nd simplify the template loading code to remove support for json based templates.
@simon-mo simon-mo changed the title Add Chat Template Support to vLLM (take #2) Support chat template and echo for chat API Nov 30, 2023
@simon-mo simon-mo self-assigned this Nov 30, 2023
@simon-mo simon-mo mentioned this pull request Nov 30, 2023
Fixed issues with chatml template not actually supporting the add_generation_prompt feature. This was just a copy/paste from a random model.
@Tostino
Contributor Author

Tostino commented Nov 30, 2023

Thank you very much @simon-mo. I just added a handful of tests, and fixed the chatml template after tests identified some issues. Should be ready for you now.

  • add_generation_prompt: is it really a request parameter, or is it bound by the model itself? I have the feeling that most use cases will always set it to one value that doesn't change across requests.

Agree here that most use cases will have a single value...but think of a chat UI that has two buttons (or hotkeys), one that sends the current message, and the other that has the model auto-complete the message the user is typing as they are typing. You would use different values for add_generation_prompt for each of those actions.
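As a sketch of that UI scenario (hypothetical client-side code; only the field names match this PR's request schema):

```python
# Two hypothetical chat-UI actions built on the same conversation state.
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I sort a list in Pyth"},
]

# "Send" button: add_generation_prompt=True appends the generation prompt,
# so the model replies as the configured response role (e.g. assistant).
send_request = {"messages": history, "add_generation_prompt": True}

# "Autocomplete" hotkey: no generation prompt, so the model continues the
# user's partial message; echo=True returns the input text with the completion.
autocomplete_request = {
    "messages": history,
    "add_generation_prompt": False,
    "echo": True,
}
```

The two requests share the same message list; only the add_generation_prompt/echo flags differ per action.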

@simon-mo
Collaborator

simon-mo commented Dec 1, 2023

Oh, one more piece of future work could be loading the chat template from an HTTP URL. Let's see if that becomes a common request and decide whether it should be added.

@simon-mo simon-mo merged commit 66785cc into vllm-project:main Dec 1, 2023
2 checks passed
@flexchar

flexchar commented Dec 2, 2023

Hey all, I've been following the new and old threads for weeks, if not months, now. I wanted to say that this is greatly awaited and appreciated work. Thank you @simon-mo and @Tostino for getting this into the framework.

Do we have an expected date to pin the new version for vLLM? :)

@Tostino
Contributor Author

Tostino commented Dec 2, 2023

@flexchar #1856

Should be pretty soon.

@flexchar

flexchar commented Dec 8, 2023

I was waiting for this PR for so long, but I found out that making a request to /chat/completions vs /completions, where I manually prepare the ChatML string, ends up taking 2x as long. My input is 1000 tokens and output 100 tokens. @Tostino any ideas why that could be?

@Tostino
Contributor Author

Tostino commented Dec 8, 2023

I was waiting for this PR for so long, but I found out that making a request to /chat/completions vs /completions, where I manually prepare the ChatML string, ends up taking 2x as long. My input is 1000 tokens and output 100 tokens. @Tostino any ideas why that could be?

@flexchar Not off the top of my head; I've not had any noticeable slowdown between chat/completions and completions. I do notice a slowdown when using streaming (~10-20%), but nothing like a 2x slowdown.

I'm getting ~300-500 tokens/sec generated throughput using 1x 3090 and a llama2 13b AWQ model using the chat/completions endpoint.

Do you have an example you can share to trigger it?

Edit: I tried it myself... could not reproduce. They both generate in roughly the same amount of time on my machine.

chat/completions:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "Tostino/Inkbot-13B-8k-0.2",
    "stop": ["\n<#user#>", "\n<#bot#>"],
    "stream": false,
    "add_generation_prompt": true,
    "echo": false,
    "n": 1,
    "temperature": 0,
    "messages": [
	{"role": "meta-current_date", "content": "2023-10-20"},
	{"role": "meta-task_name", "content": "general"},
	{"role": "system", "content": "You are a helpful assistant. Please give long and detailed answers. You are an extremely competent electrical engineer."},
	{"role": "user", "content": "Hello!"},
	{"role": "assistant", "content": "Hello, how are you?"},
	{"role": "user", "content": "Please help me understand impedance matching in the context of a DC circuit."}
	]
  }'

completions:

curl http://0.0.0.0:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "Tostino/Inkbot-13B-8k-0.2",
    "echo": false,
    "max_tokens": 1024,
    "stream": false,
    "temperature": 0,
    "prompt": "<#meta#>\n- Date: 2023-10-20\n- Task: general\n<#system#>\nYou are a helpful assistant. Please give long and detailed answers. You are an extremely competent electrical engineer.\n<#chat#>\n<#user#>\nHello!\n<#bot#>\nHello, how are you?\n<#user#>\nPlease help me understand impedance matching in the context of a DC circuit.\n<#bot#>\n"
  }'

@flexchar

Thank you for trying. I don't have an easily reproducible example, but the next time I work on that part I will certainly make one. I appreciate you testing though :)

@PeterXiaTian

Why does vllm/entrypoints/api_server.py still not have this parameter?

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
@PhilipMay

How do I use the chat template with the "offline inference" with vllm.LLM?

This PR only enables this for the REST API as far as I can see. @Tostino

@Tostino
Contributor Author

Tostino commented Apr 25, 2024

Sorry, on mobile right now so going from memory. I believe that there wasn't a "local" equivalent of the chat/completions API when I implemented this. So it was implemented for the REST endpoint, because that's all it could work with unless I did a bunch of extra work to also add an equivalent local version of the chat completions API.

@PeterXiaTian

PeterXiaTian commented Apr 26, 2024 via email
