[BugFix][Frontend] Use correct, shared tokenizer in OpenAI server #3512
Conversation
Test failures look unrelated (network blips).
Could we add a test? We can mock some stuff - just to make sure that if we go through the OpenAI server with different lora requests, they are tokenized correctly.
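A minimal sketch of such a test, using only unittest.mock; the accessor names (get_tokenizer_group, get_lora_tokenizer) and the tokenize_prompt helper are illustrative stand-ins for the real front-end code path, not the PR's actual test:

```python
from unittest.mock import MagicMock

def test_lora_requests_use_lora_tokenizer():
    base_tokenizer = MagicMock(name="base_tokenizer")
    lora_tokenizer = MagicMock(name="lora_tokenizer")

    # Mocked tokenizer group: resolves a LoRA-specific tokenizer when a
    # lora_request is present, otherwise the base tokenizer.
    tokenizer_group = MagicMock()
    tokenizer_group.get_lora_tokenizer.side_effect = (
        lambda lora_request: lora_tokenizer if lora_request else base_tokenizer
    )

    engine = MagicMock()
    engine.get_tokenizer_group.return_value = tokenizer_group

    # Stand-in for the OpenAI front-end path that tokenizes a prompt.
    def tokenize_prompt(engine, prompt, lora_request=None):
        tok = engine.get_tokenizer_group().get_lora_tokenizer(lora_request)
        return tok.encode(prompt)

    lora_request = MagicMock(name="lora_request")
    tokenize_prompt(engine, "hello", lora_request=lora_request)

    # The LoRA tokenizer must be used, not the shared base tokenizer.
    lora_tokenizer.encode.assert_called_once_with("hello")
    base_tokenizer.encode.assert_not_called()
```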
Force-pushed from 0aa9277 to 06188e7.
The front-end server code currently doesn't use lora-specific tokenizers. It also won't make use of the recently introduced parallel async tokenization if enabled.
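A minimal sketch of the direction the fix takes: look the tokenizer up per request from the engine's tokenizer group instead of caching one shared base tokenizer at server startup. The get_lora_tokenizer accessor and the helper name are assumptions; the exact API may differ between vLLM versions.

```python
# Illustrative helper (not the PR's actual code): resolve the tokenizer for
# each request so a LoRA adapter that ships custom added tokens is encoded
# with its own tokenizer rather than the base model's.
def get_request_tokenizer(engine, lora_request=None):
    tokenizer_group = engine.get_tokenizer_group()
    # get_lora_tokenizer(None) is assumed to fall back to the base tokenizer.
    return tokenizer_group.get_lora_tokenizer(lora_request)
```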
Force-pushed from 06188e7 to 1db1b92.
Can the same tokenizer be used to apply the chat template as well?
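A sketch of how that could look, reusing the illustrative get_request_tokenizer helper from the sketch above; apply_chat_template is the Hugging Face tokenizer method for rendering chat prompts, and whether the front end calls it this way is an assumption.

```python
def render_chat_prompt(engine, messages, lora_request=None):
    # The same per-request tokenizer renders the chat template, so any
    # adapter-specific special tokens end up in the prompt text.
    tokenizer = get_request_tokenizer(engine, lora_request)
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```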
else:
    return self.engine.get_tokenizer_group()
nit: the `else:` is unnecessary; just return directly:
return self.engine.get_tokenizer_group()
Currently the LoRA tokenizers aren't used in the OpenAI APIs, meaning the behaviour won't be correct if adapters with custom added tokens are used. This PR includes changes to address that. It mostly replaces vllm-project#3512. More work is needed to address remaining inconsistencies in tokenization behaviour between the OpenAI front-end and standalone LLMEngine/AsyncLLMEngine use, including:
- Standalone cases don't honor the truncation and add_special_tokens request parameters (see the sketch after this list)
- OpenAI API cases don't make use of TokenizerGroups for possible parallelization of tokenization

as well as some other inefficiencies; these are to be addressed in follow-on PRs.
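A sketch of what honoring truncation and add_special_tokens per request could look like. The parameter names are assumptions modelled on the OpenAI-server request fields, and the encode call follows the Hugging Face tokenizer API; this is not the code from either PR.

```python
def encode_prompt(tokenizer, prompt, truncate_prompt_tokens=None,
                  add_special_tokens=True):
    # Honor the per-request options that the comment above notes are
    # currently applied inconsistently between the front end and the engines.
    if truncate_prompt_tokens is not None:
        return tokenizer.encode(
            prompt,
            add_special_tokens=add_special_tokens,
            truncation=True,
            max_length=truncate_prompt_tokens,
        )
    return tokenizer.encode(prompt, add_special_tokens=add_special_tokens)
```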
Closing as superseded by #6227.