
Support chat template and echo for chat API #1756

Merged
merged 10 commits into from
Dec 1, 2023

Conversation

Tostino
Contributor

@Tostino Tostino commented Nov 22, 2023

This pull request introduces the chat template feature to vLLM, using the template stored in the tokenizer to improve compatibility with the OpenAI Chat API.
https://huggingface.co/blog/chat-templates

This only affects the OpenAI API chat/completions endpoint; the regular completions endpoint does not use this feature.
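For context, with this PR the OpenAI-compatible server can be launched along these lines (the model name and template path are illustrative; the two flags are the ones introduced below):

```shell
# Launch the OpenAI-compatible server with a custom chat template and response role.
# Model name and template path are examples, not requirements.
python -m vllm.entrypoints.openai.api_server \
    --model Tostino/Inkbot-13B-8k-0.2 \
    --chat-template ./template_inkbot.jinja \
    --response-role assistant
```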

There has already been a ton of discussion under the previous PR (#1493), but I accidentally messed things up by replacing the branch, so we are trying this again...

  1. Addition of the --chat-template command-line argument to specify a chat template file or single-line template for the model.
  2. Implementation of the --response-role command-line argument for defining the role name in chat responses when add_generation_prompt is set to true.
  3. Updates to chat API request handling to correctly finish partial responses and to echo the input portions of messages (request.add_generation_prompt and request.echo).
  4. Addition of new chat template examples (template_chatml.jinja, template_alpaca.jinja, and template_inkbot.jinja) showing the multiple ways they can be specified.
  5. More robust error handling, and fixes so the responses actually match the OpenAI API format.
  6. Update quickstart.rst to show the new features.

@aarnphm
Contributor

aarnphm commented Nov 22, 2023

@simon-mo what's your opinion on just supporting chat templates through an envvar? By doing this, there would be no --prompt-template or additional file logic in argparse. We can redirect users to the HF chat templates docs for this.

vllm should be able to document how people can pass in the jinja templates through an envvar, so that vllm won't handle any parsing, and users are responsible for this?

I think this is powerful in that users have full control of the chat templates. By default, if none is provided, then fall back to huggingface behaviour?

I think we can provide a default chat template if needed

edit: This is probably good for serverless as well.
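A minimal sketch of what this proposal would look like (the variable name VLLM_CHAT_TEMPLATE and the inline template are hypothetical illustrations, not an actual vLLM setting):

```python
import os
from jinja2 import Template

# Hypothetical envvar name -- illustrating the proposal, not an actual vLLM setting.
os.environ.setdefault(
    "VLLM_CHAT_TEMPLATE",
    "{% for m in messages %}{{ m['role'] }}: {{ m['content'] }}\n{% endfor %}",
)

# The server would read the template once at startup and render it per request,
# leaving all template authoring (and parsing mistakes) to the user.
template = Template(os.environ["VLLM_CHAT_TEMPLATE"])
prompt = template.render(messages=[{"role": "user", "content": "Hello!"}])
print(prompt)  # user: Hello!
```

Under this scheme vLLM does no parsing of its own; a malformed template simply raises a jinja2 error at startup.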

@casper-hansen
Contributor

casper-hansen commented Nov 23, 2023

Can we also make sure there is a way to disable chat templates such that users provide messages that are preformatted?

@Tostino
Contributor Author

Tostino commented Nov 23, 2023

Can we also make sure there is a way to disable chat templates such that users provide messages that are preformatted?

As said before, you use the regular completion endpoint if you want to provide a pre-formatted string to the model.

Explain how we would provide the preformatted text using the chat/completion endpoint which requires a list of messages.
Here is an example call:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "Tostino/Inkbot-13B-8k-0.2",
    "stop": ["\n<#user#>", "\n<#bot#>"],
    "stream": false,
    "add_generation_prompt": true,
    "echo": true,
    "n": 1,
    "messages": [
	{"role": "meta-current_date", "content": "2023-10-20"},
	{"role": "meta-task_name", "content": "kg"},
	{"role": "system", "content": "You are a helpful assistant."},
	{"role": "user", "content": "Hello!"},
	{"role": "assistant", "content": "Hello, how are you?"},
	{"role": "user", "content": "Great, thank you! Now I would like some help planning my weekend in Asheville, I would like some hike suggestions."},
	{"role": "assistant", "content": "Asheville is a fantastic destination for hiking enthusiasts! Here are a few suggestions:\n\n1. Black Balsam Knob Trail: This 4.4-mile round trip trail offers stunning views of the surrounding mountains.\n2. Max Patch Loop: This 2.4-mile loop is perfect for families or beginners, featuring wildflowers and mountain views.\n3. Shut-ins Trail: The 2.8-mile hike takes you through a scenic gorge with cascading waterfalls.\n4. Graveyard Fields: For a longer hike, this 8-mile loop takes you through meadows, streams, and a beautiful forest.\n5. Linville Gorge: If you are up for a more challenging hike, the Linville Gorge has various trails with breathtaking views."},
	{"role": "user", "content": "Extract key concepts and relationships from the conversation to form a knowledge graph."}	
	]
  }'

@casper-hansen
Contributor

I see where you are coming from - perhaps it should be mentioned in the PR that this only applies to certain endpoints.

@aarnphm
Contributor

aarnphm commented Nov 24, 2023

Can we also make sure there is a way to disable chat templates such that users provide messages that are preformatted?

As said before, you use the regular completion endpoint if you want to provide a pre-formatted string to the model.

Explain how we would provide the preformatted text using the chat/completion endpoint, which requires a list of messages.

I think chat templates should only apply for /v1/chat/completions. Users can change the chat templates if needed. /v1/completions should still stay the way they were before, which means that users should have full responsibility for formatting their prompt.

@intrafindBreno

This pull request is a great contribution to the vllm project. I hope it gets merged soon!

@dongxiaolong

Hi @Tostino ,

Thank you so much for your PR! Your contribution has successfully addressed key issues in model inference, enabling the implementation of function calls using models like OpenHermes. Here is an example I created that demonstrates the effective application of OpenHermes: OpenHermes Functions with VLLM. Your work has truly unlocked the potential of OpenHermes. Adding another practical example to your contribution would greatly help others understand the significant impact of this work.

@tjtanaa
Contributor

tjtanaa commented Nov 27, 2023

Huggingface now supports chat_template through their tokenizer class. I think if the community uses this feature, it will make chat template tracking easier.

To that end, we have set up a Hugging Face Space to encourage developers and the community to use the Hugging Face tokenizer's chat_template. We have provided many of the well-known prompt templates, with many more to come within the next few weeks.

The UI also allows users to download a chat template as jinja2, which I think would be beneficial for this PR's feature as well.

@Tostino
Contributor Author

Tostino commented Nov 27, 2023

@tjtanaa That is a neat tool, glad there is more work going into this in the same direction. Being able to download the template to a file easily would be beneficial for users.
One thing to note is that it incorrectly parses my template for Tostino/Inkbot-13B-8k-0.2 when I tried it. The newlines are ignored.
[screenshot: the Inkbot template parsed with its newlines dropped]

@dongxiaolong Glad you found it useful! Very cool example to see working.

@tjtanaa
Contributor

tjtanaa commented Nov 28, 2023


Thank you for taking the time to try it out.
I am following the huggingface example, which stores the chat template as one-liner jinja2 code. So if you want a newline, you should add + '\n' explicitly, e.g.

airoboros_v2.jinja2

{% if not add_generation_prompt is defined %}
{% set add_generation_prompt = false %}
{% endif %}
{% for message in messages %}
{% if not loop.first %}
{% endif %}
{% if message['role'] == 'system' %}
{{ message['content'] + '\n' }}
{% elif message['role'] == 'user' %}
{{ 'USER: ' + message['content'] + '\n' }}
{% elif message['role'] == 'assistant' %}
{{ 'ASSISTANT: ' + message['content'] + '</s>' }}
{% endif %}
{% endfor %}
{% if add_generation_prompt %}
{{ 'ASSISTANT:' }}
{% endif %}
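To illustrate the one-liner style, here is a minimal sketch rendering a trimmed-down version of the template above (not the exact airoboros file) with jinja2, where every newline is written as an explicit '\n' escape inside the template string:

```python
from jinja2 import Template

# One-liner chat template: newlines are added explicitly via '\n' inside jinja
# string literals, so nothing depends on the template file's own line breaks.
one_liner = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}{{ 'USER: ' + message['content'] + '\\n' }}"
    "{% elif message['role'] == 'assistant' %}{{ 'ASSISTANT: ' + message['content'] }}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ 'ASSISTANT:' }}{% endif %}"
)

rendered = Template(one_liner).render(
    messages=[{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
)
# prints:
# USER: Hello
# ASSISTANT:
print(rendered)
```

jinja2 unescape-decodes string literals, so the two-character sequence \n inside '...' becomes a real newline in the rendered prompt even though the template source stays on one line.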

@Tostino
Contributor Author

Tostino commented Nov 28, 2023

@tjtanaa where can we discuss that issue other than on this PR?

My template is valid jinja, and it does have new lines embedded in the single-line version of it. There was an extra step to correctly load the single line jinja in this PR to deal with new lines properly, I'm guessing you just need to do the same.

Edit: I'll open a discussion on the HF repo.
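The extra loading step mentioned here can be sketched like this (assumed logic for illustration, not the exact code in the PR): a --chat-template value may be a file path or an inline single-line template whose literal "\n" escapes must become real newlines.

```python
from pathlib import Path

def load_chat_template(value: str) -> str:
    """Resolve a --chat-template argument that may be either a file path
    or an inline single-line template string (hypothetical helper)."""
    path = Path(value)
    if path.is_file():
        return path.read_text()
    # Inline form: turn literal "\n" escape sequences into real newlines,
    # so single-line templates can still embed line breaks.
    return value.replace("\\n", "\n")

print(load_chat_template("USER: {{ text }}\\nASSISTANT:"))
```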

@Tostino
Contributor Author

Tostino commented Nov 29, 2023

@aarnphm / @simon-mo / @WoosukKwon (or anyone else appropriate)

Is there anything that needs to be addressed prior to merging this PR at this point? There is nothing else I am aware of, as it's been pretty thoroughly tested.

                        help="The file path to the chat template, "
                        "or the template in single-line form "
                        "for the specified model")
    parser.add_argument("--response-role",
Contributor

I think this largely depends on the templates itself right? By default I don't think this is needed.

Contributor Author

@Tostino Tostino Nov 29, 2023

The response-role? Some models may not use user/assistant as the role names; template creators are free to choose. I defaulted to the same behavior as the OpenAI API, but made it compatible with the flexibility provided by the chat_template feature.

Contributor

@aarnphm aarnphm Nov 29, 2023

Yeah, but we already have that in message['role'], right? (It usually alternates between user and assistant, so in the phi-1.5 case, for example, it would be BOB and SANDRA.) Not sure why we need this.

Contributor Author

@Tostino Tostino Nov 29, 2023

Yes, it usually alternates between user/assistant...but given the following request, how would we know the role to respond with?

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "Tostino/Inkbot-13B-8k-0.2",
    "stream": false,
    "n": 1,
    "messages": [
	{"role": "meta-current_date", "content": "2023-10-20"},
	{"role": "meta-task_name", "content": "general"},
	{"role": "system", "content": "You are a helpful assistant."},
	{"role": "user", "content": "What is the capital of the USA?"}	
	]
  }'

Previously, it was hard-coded as assistant. Now it is configurable with that cli argument. Ideally, this would be additional metadata about how to use the template that would be stored somewhere in the tokenizer...but HF didn't think that far ahead.

Contributor

Users should be responsible for doing few-shot prompting, right?

Contributor Author

@Tostino Tostino Nov 29, 2023

No, that is totally unnecessary for a lot of models and just eats up context space. Not to mention it locks us into two-role conversations. What if there is a user_context role that is used for any input from a file the user wants to interact with, appended after their text input? That is a real use case I have been using this implementation for.

Comment on lines +227 to +230
if request.add_generation_prompt:
    return response_role
else:
    return request.messages[-1]["role"]
Contributor

Like here should it just be request.messages[-1]['role']?

Contributor Author

No, that doesn't work.

@Yard1
Collaborator

Yard1 commented Nov 30, 2023

@Tostino Would it be possible to add some simple unit tests for this?

@Tostino
Contributor Author

Tostino commented Nov 30, 2023

@Yard1 Sure, are there any existing tests for the server that I can add to?

@Yard1
Collaborator

Yard1 commented Nov 30, 2023

I don't think there are any for the OpenAI server specifically.

@Tostino
Contributor Author

Tostino commented Nov 30, 2023

@Yard1 So, to properly test this, it looks like I need to refactor a whole lot more code... (I could be mistaken...my day job is not Python, so I've never used any of the testing libraries before and am learning on the fly.)

Are you sure you want me to do that? I am not doing any more work that will be thrown away...I've spent far too much time on this already.

And it now looks like I'll have to spend more time rebasing, because there are conflicts again.

Tostino and others added 4 commits November 30, 2023 12:18
1. Addition of the `--chat-template` command-line argument to specify a chat template file or single-line template for the model.
2. Implementation of the `--response-role` command-line argument for defining the role name in chat responses when `add_generation_prompt` is set to true.
3. Introduction of the `--disable-endpoints` argument to allow disabling of specific server endpoints.
4. Update to the chat API request handling to support handling finishing a partial response correctly, and echoing input portions of messages (request.add_generation_prompt, and request.echo).
5. Addition of new chat templates in JSON and Jinja formats (`template_chatml.json`, `template_alpaca.jinja`, and `template_inkbot.jinja`) showing the multiple ways they can be specified.
6. More robust error handling, and fix the responses to actually match the OpenAI API format.
7. Update quickstart.rst to show the new features.
…nd simplify the template loading code to remove support for json based templates.
@simon-mo simon-mo changed the title Add Chat Template Support to vLLM (take #2) Support chat template and echo for chat API Nov 30, 2023
@simon-mo simon-mo self-assigned this Nov 30, 2023
@simon-mo simon-mo mentioned this pull request Nov 30, 2023
Fixed issues with chatml template not actually supporting the add_generation_prompt feature. This was just a copy/paste from a random model.
@Tostino
Contributor Author

Tostino commented Nov 30, 2023

Thank you very much @simon-mo. I just added a handful of tests, and fixed the chatml template after tests identified some issues. Should be ready for you now.

  • add_generation_prompt: is it really a request parameter, or is it bound by the model itself? I have the feeling that most use cases will always set it to one value that doesn't change across requests.

Agree here that most use cases will have a single value...but think of a chat UI that has two buttons (or hotkeys), one that sends the current message, and the other that has the model auto-complete the message the user is typing as they are typing. You would use different values for add_generation_prompt for each of those actions.
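As a sketch of that UI scenario (hypothetical client-side code; only the field names match this PR's request schema):

```python
# Two hypothetical chat-UI actions built on the same conversation state.
history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I sort a list in Pyth"},
]

# "Send" button: add_generation_prompt=True appends the generation prompt,
# so the model replies as the configured response role (e.g. assistant).
send_request = {"messages": history, "add_generation_prompt": True}

# "Autocomplete" hotkey: no generation prompt, so the model continues the
# user's partial message; echo=True returns the input text with the completion.
autocomplete_request = {
    "messages": history,
    "add_generation_prompt": False,
    "echo": True,
}
```

The two requests share the same message list; only the add_generation_prompt/echo flags differ per action.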

@simon-mo
Collaborator

simon-mo commented Dec 1, 2023

Oh, one more piece of future work could be loading the chat template from an HTTP URL. Let's see if that becomes a common request and decide whether it should be added.

@simon-mo simon-mo merged commit 66785cc into vllm-project:main Dec 1, 2023
2 checks passed
@flexchar

flexchar commented Dec 2, 2023

Hey all, I've been following the new and old threads for weeks, if not months, now. I wanted to say that this is greatly awaited and appreciated work. Thank you @simon-mo and @Tostino for getting this into the framework.

Do we have an expected date to pin the new version for vLLM? :)

@Tostino
Contributor Author

Tostino commented Dec 2, 2023

@flexchar #1856

Should be pretty soon.

@flexchar

flexchar commented Dec 8, 2023

I was waiting for this PR for so long, but I found out that making a request to /chat/completions vs /completions, where I manually prepare the ChatML string, ends up taking 2x as long. My input is 1000 tokens and output 100 tokens. @Tostino any ideas why that could be?

@Tostino
Contributor Author

Tostino commented Dec 8, 2023

I was waiting for this PR for so long, but I found out that making a request to /chat/completions vs /completions, where I manually prepare the ChatML string, ends up taking 2x as long. My input is 1000 tokens and output 100 tokens. @Tostino any ideas why that could be?

@flexchar Not off the top of my head; I've not had any noticeable slowdown between chat/completions and completions. I do notice a slowdown when using streaming (~10-20%), but nothing like a 2x slowdown.

I'm getting ~300-500 tokens/sec generated throughput using 1x 3090 and a llama2 13b AWQ model using the chat/completions endpoint.

Do you have an example you can share to trigger it?

Edit: I tried it myself... could not reproduce. They both generate in roughly the same amount of time on my machine.

chat/completions:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "Tostino/Inkbot-13B-8k-0.2",
    "stop": ["\n<#user#>", "\n<#bot#>"],
    "stream": false,
    "add_generation_prompt": true,
    "echo": false,
    "n": 1,
    "temperature": 0,
    "messages": [
	{"role": "meta-current_date", "content": "2023-10-20"},
	{"role": "meta-task_name", "content": "general"},
	{"role": "system", "content": "You are a helpful assistant. Please give long and detailed answers. You are an extremely competent electrical engineer."},
	{"role": "user", "content": "Hello!"},
	{"role": "assistant", "content": "Hello, how are you?"},
	{"role": "user", "content": "Please help me understand impedance matching in the context of a DC circuit."}
	]
  }'

completions:

curl http://0.0.0.0:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "model": "Tostino/Inkbot-13B-8k-0.2",
    "echo": false,
    "max_tokens": 1024,
    "stream": false,
    "temperature": 0,
    "prompt": "<#meta#>\n- Date: 2023-10-20\n- Task: general\n<#system#>\nYou are a helpful assistant. Please give long and detailed answers. You are an extremely competent electrical engineer.\n<#chat#>\n<#user#>\nHello!\n<#bot#>\nHello, how are you?\n<#user#>\nPlease help me understand impedance matching in the context of a DC circuit.\n<#bot#>\n"
  }'

@flexchar

Thank you for trying. I don't have an easily reproducible example, but the next time I work on that part I will certainly make one. I appreciate you testing though :)

@PeterXiaTian

Why does vllm/entrypoints/api_server.py still not have this parameter?

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
@PhilipMay

How do I use the chat template with the "offline inference" with vllm.LLM?

This PR only enables this for the REST API as far as I can see. @Tostino

@Tostino
Contributor Author

Tostino commented Apr 25, 2024

Sorry, on mobile right now so going from memory. I believe that there wasn't a "local" equivalent of the chat/completions API when I implemented this. So it was implemented for the REST endpoint, because that's all it could work with unless I did a bunch of extra work to also add an equivalent local version of the chat completions API.

@PeterXiaTian

PeterXiaTian commented Apr 26, 2024 via email
