Support chat template and echo for chat API #1756
Conversation
@simon-mo what's your opinion on just supporting chat templates through an env var? vLLM would document how people can pass in the Jinja template through an env var, so that vLLM won't handle any parsing and users are responsible for it. I think this is powerful in that users have full control of the chat template. By default, if none is provided, fall back to the Hugging Face behaviour? We can provide a default chat template if needed. Edit: this is probably good for serverless as well.
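The env-var idea above can be sketched as follows. Note this is a hypothetical illustration of the proposal, not vLLM's actual behavior: the variable name `VLLM_CHAT_TEMPLATE` and the fallback template are both assumptions.

```python
import os
from jinja2 import Template

# Hypothetical fallback used when the env var is unset (not a real vLLM default).
DEFAULT_TEMPLATE = (
    "{% for m in messages %}{{ m['role'] }}: {{ m['content'] }}\n{% endfor %}"
)

def load_chat_template() -> Template:
    # vLLM would do no parsing of its own: it just renders whatever
    # Jinja source the user supplies via the (hypothetical) env var.
    return Template(os.environ.get("VLLM_CHAT_TEMPLATE", DEFAULT_TEMPLATE))

prompt = load_chat_template().render(
    messages=[{"role": "user", "content": "Hello"}]
)
```

With the default template above, `prompt` renders to `"user: Hello\n"`; a serverless deployment could set the env var per model without shipping template files.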
Can we also make sure there is a way to disable chat templates, so that users can provide messages that are preformatted?
As said before, you use the regular `completions` endpoint for that. Explain how we would provide the preformatted text using the `completions` endpoint?
I see where you are coming from - perhaps it should be mentioned in the PR that this only applies to certain endpoints.
I think chat templates should only apply for `chat/completions`.
This pull request is a great contribution to the vLLM project. I hope it gets merged soon!
Hi @Tostino, thank you so much for your PR! Your contribution has successfully addressed key issues in model inference, enabling the implementation of function calls using models like OpenHermes. Here is an example I created that demonstrates the effective application of OpenHermes: OpenHermes Functions with vLLM. Your work has truly unlocked the potential of OpenHermes. Adding another practical example to your contribution would greatly help others understand the significant impact of this work.
Huggingface now supports `chat_template` in the tokenizer. To that end, we have set up a Hugging Face Space that lets developers and the community experiment with chat templates, to encourage the use of the Hugging Face tokenizer's `apply_chat_template`. The UI also allows users to download a chat template as Jinja2, which I think would be beneficial for this PR feature as well.
@tjtanaa That is a neat tool, glad there is more work going into this in the same direction. Being able to download the template to a file easily would be beneficial for users. @dongxiaolong Glad you found it useful! Very cool example to see working.
Thank you for taking the time to try it out. airoboros_v2.jinja2
@tjtanaa where can we discuss that issue other than on this PR? My template is valid Jinja, and it does have newlines embedded in the single-line version of it. There was an extra step in this PR to correctly load the single-line Jinja and deal with newlines properly; I'm guessing you just need to do the same. Edit: I'll open a discussion on the HF repo.
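The newline handling mentioned above can be illustrated like this. This is a sketch of one way to do it, not the PR's exact code: accept either a path to a template file or the template itself inline, where the inline form carries literal `\n` escape sequences.

```python
import codecs
import os

def load_chat_template(arg: str) -> str:
    # If the argument is an existing file, read the template from it.
    if os.path.exists(arg):
        with open(arg) as f:
            return f.read()
    # Otherwise treat it as an inline single-line template: decode the
    # two-character "\n" escapes back into real newlines before Jinja sees it.
    return codecs.decode(arg, "unicode_escape")

# A single-line template as it would arrive on the command line:
inline = r"{% for m in messages %}{{ m['content'] }}\n{% endfor %}"
source = load_chat_template(inline)
```

After loading, `source` contains a real newline where the CLI string had the escape sequence, so the rendered prompt matches the file-based version of the same template.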
@aarnphm / @simon-mo / @WoosukKwon (or anyone else appropriate) Is there anything that needs to be addressed prior to merging this PR at this point? There is nothing else I am aware of, as it's been pretty thoroughly tested.
help="The file path to the chat template, "
     "or the template in single-line form "
     "for the specified model")
parser.add_argument("--response-role", |
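The diff hunk above is truncated, so here is a self-contained sketch of how these two flags might be declared. The types and defaults are inferred from the discussion (the thread says `--response-role` previously defaulted to hard-coded `assistant`), not copied from the PR.

```python
import argparse

parser = argparse.ArgumentParser(
    description="OpenAI-compatible server arguments (illustrative sketch)")
parser.add_argument("--chat-template",
                    type=str,
                    default=None,  # assumed: fall back to the tokenizer's template
                    help="The file path to the chat template, "
                         "or the template in single-line form "
                         "for the specified model")
parser.add_argument("--response-role",
                    type=str,
                    default="assistant",  # matches the OpenAI API behavior
                    help="The role name to return if "
                         "request.add_generation_prompt is true")

# Example invocation with a custom role name:
args = parser.parse_args(["--response-role", "sandra"])
```

Here `args.response_role` is `"sandra"` and `args.chat_template` stays `None`, deferring to the model's built-in template.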
I think this largely depends on the template itself, right? By default I don't think this is needed.
The response-role? Some models may not use user/assistant as the role names; the template creator is free to choose them. I defaulted to the same behavior as the OpenAI API, but made it so there is compatibility with the flexibility provided by the chat_template feature.
Yeah, but we already have that in message['role'], right? (It usually alternates between user and assistant, so for example in the phi-1.5 case it would be BOB and SANDRA.) Not sure why we need this.
Yes it usually alternates between user/assistant...but given the following request, how would we know the role to respond with?
curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer " \
-d '{
"model": "Tostino/Inkbot-13B-8k-0.2",
"stream": false,
"n": 1,
"messages": [
{"role": "meta-current_date", "content": "2023-10-20"},
{"role": "meta-task_name", "content": "general"},
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of the USA?"}
]
}'
Previously, it was hard-coded as assistant. Now it is configurable with that CLI argument. Ideally, this would be additional metadata about how to use the template, stored somewhere in the tokenizer... but HF didn't think that far ahead.
Users should be responsible for doing few-shot prompting, right?
No, that is totally unnecessary for a lot of models and just eats up context space. Not to mention that locks us into two-role conversations. What if there is a user_context role that is used for any input from a file the user wants to interact with, and that is appended after their text input? That is a real use case I have been using this implementation for.
if request.add_generation_prompt:
    return response_role
else:
    return request.messages[-1]["role"]
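The hunk above reads as the following self-contained sketch; the `request` object is stubbed with `SimpleNamespace` for illustration, since the PR's actual request class isn't shown here.

```python
from types import SimpleNamespace

def get_chat_response_role(request, response_role: str) -> str:
    # When the template appends a generation prompt, the server chooses the
    # role: the --response-role value, defaulting to "assistant".
    if request.add_generation_prompt:
        return response_role
    # Otherwise the model is continuing a partial message, so the response
    # keeps the role of the last message in the conversation.
    return request.messages[-1]["role"]

# Stubbed request continuing a partially typed user message:
req = SimpleNamespace(
    add_generation_prompt=False,
    messages=[{"role": "user", "content": "What is the capital of"}],
)
role = get_chat_response_role(req, "assistant")  # → "user"
```

This is what makes non-standard roles like `user_context` work: the server never assumes the conversation is strictly user/assistant.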
Like, here should it just be request.messages[-1]['role']?
No, that doesn't work.
@Tostino Would it be possible to add some simple unit tests for this?
@Yard1 Sure, are there any existing tests for the server that I can add to?
I don't think there are any for the OpenAI server specifically.
@Yard1 So, to properly test this, it looks like I need to refactor a whole lot more code... (I could be mistaken... my day job is not Python, so I've never used any of the testing libraries before and am learning on the fly.) Are you sure you want me to do that? I am not doing any more work that will be thrown away... I've spent far too much time on this already. And it now looks like I'll have to spend more time rebasing, because there are conflicts again.
1. Addition of the `--chat-template` command-line argument to specify a chat template file or single-line template for the model.
2. Implementation of the `--response-role` command-line argument for defining the role name in chat responses when `add_generation_prompt` is set to true.
3. Introduction of the `--disable-endpoints` argument to allow disabling of specific server endpoints.
4. Update to the chat API request handling to correctly support finishing a partial response, and echoing the input portions of messages (`request.add_generation_prompt` and `request.echo`).
5. Addition of new chat templates in JSON and Jinja formats (`template_chatml.json`, `template_alpaca.jinja`, and `template_inkbot.jinja`) showing the multiple ways they can be specified.
6. More robust error handling, and fixed responses to actually match the OpenAI API format.
7. Update to quickstart.rst to show the new features.
…nd simplify the template loading code to remove support for json based templates.
Co-authored-by: Aaron Pham <[email protected]>
echo for chat API
Fixed issues with chatml template not actually supporting the add_generation_prompt feature. This was just a copy/paste from a random model.
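A ChatML-style template that actually honors `add_generation_prompt` looks roughly like this (a sketch of the idea, not the PR's exact `template_chatml.jinja`): when the flag is set, the template emits an opening assistant turn for the model to complete.

```python
from jinja2 import Template

# ChatML-style template: each message is wrapped in <|im_start|>/<|im_end|>,
# and add_generation_prompt appends an open assistant turn at the end.
CHATML = (
    "{% for m in messages %}"
    "<|im_start|>{{ m['role'] }}\n{{ m['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

prompt = Template(CHATML).render(
    messages=[{"role": "user", "content": "Hi"}],
    add_generation_prompt=True,
)
```

With the flag true, the prompt ends in `<|im_start|>assistant\n`, cueing the model to generate a fresh assistant turn; with it false, the prompt ends after the last message so the model continues it instead.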
Thank you very much @simon-mo. I just added a handful of tests, and fixed the chatml template after tests identified some issues. Should be ready for you now.
Agree here that most use cases will have a single value... but think of a chat UI that has two buttons (or hotkeys): one that sends the current message, and another that has the model auto-complete the message the user is typing as they are typing. You would use different values of add_generation_prompt for each of those actions.
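Those two UI actions map onto two request payloads against the same endpoint. A sketch, with the model name borrowed from the earlier curl example:

```python
# "Send": ask the model to reply as a fresh assistant turn.
send_payload = {
    "model": "Tostino/Inkbot-13B-8k-0.2",
    "messages": [{"role": "user", "content": "What is the capital of the USA?"}],
    "add_generation_prompt": True,
}

# "Autocomplete": continue the user's own partially typed message,
# echoing what they already wrote back in the response.
autocomplete_payload = {
    "model": "Tostino/Inkbot-13B-8k-0.2",
    "messages": [{"role": "user", "content": "What is the capital of"}],
    "add_generation_prompt": False,
    "echo": True,
}
```

Either dict would be POSTed as JSON to `/v1/chat/completions`; only the two flags differ between the actions.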
Oh, one more piece of future work could be loading the chat template from an HTTP URL. Let's see if that becomes a common request and decide whether it should be added.
I was waiting for this PR for so long, but I found out that making a request to `chat/completions` is noticeably slower than to `completions`.
@flexchar Not off the top of my head; I've not had any noticeable slowdown between the two endpoints. I'm getting ~300-500 tokens/sec generated throughput using 1x 3090 and a Llama 2 13B AWQ model with the `chat/completions` endpoint. Do you have an example you can share to trigger it? Edit: I tried it myself... could not reproduce. They both generate in roughly the same amount of time on my machine.
Thank you for trying. I don't have an easily reproducible example, but the next time I work on that part I will certainly make one. I appreciate you testing it though :)
Why does vllm/entrypoints/api_server.py still not have this parameter?
How do I use the chat template with "offline inference" via the `LLM` class? This PR only enables it for the REST API as far as I can see. @Tostino
Sorry, on mobile right now so going from memory. I believe that there wasn't a "local" equivalent of the chat/completions API when I implemented this. So it was implemented for the REST endpoint, because that's all it could work with, unless I did a bunch of extra work to also add an equivalent local version of the chat completions API.
Well, I see, thanks.
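Given the explanation above, the offline workaround is to render the chat template yourself and hand the flat prompt string to `LLM.generate()`. This is a sketch under assumptions: the template below is a stand-in for illustration, not one of the PR's bundled templates, and the generation step is shown commented out since it needs model weights.

```python
from jinja2 import Template

# Stand-in chat template (illustrative only).
TEMPLATE = (
    "{% for m in messages %}### {{ m['role'] }}:\n{{ m['content'] }}\n{% endfor %}"
)

# Render the conversation into a single prompt string by hand,
# since the LLM class doesn't apply chat templates for you here.
prompt = Template(TEMPLATE).render(
    messages=[{"role": "user", "content": "Hi"}]
)

# Then pass the flat string to offline inference:
# from vllm import LLM
# llm = LLM(model="Tostino/Inkbot-13B-8k-0.2")
# outputs = llm.generate([prompt])
```

The template rendering step is exactly what the REST server does on your behalf for `chat/completions`.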
This pull request introduces the chat template feature to vLLM, utilizing the template stored in the tokenizer, enhancing its compatibility with the OpenAI Chat API.
https://huggingface.co/blog/chat-templates
This only affects the OpenAI API `chat/completions` endpoint; the regular `completions` endpoint does not utilize this feature. There has already been a ton of discussion under this PR: #1493, but I accidentally messed things up by replacing the branch, so we are trying this again...
1. Addition of the `--chat-template` command-line argument to specify a chat template file or single-line template for the model.
2. Implementation of the `--response-role` command-line argument for defining the role name in chat responses when `add_generation_prompt` is set to true.
3. Addition of new chat templates (`template_chatml.jinja`, `template_alpaca.jinja`, and `template_inkbot.jinja`) showing the multiple ways they can be specified.