
[BugFix][Frontend] Use correct, shared tokenizer in OpenAI server #3512

Closed. njhill wants to merge 1 commit (from the openai-tokenizer branch).
Conversation

njhill (Member) commented on Mar 19, 2024:

The front-end server code currently doesn't use LoRA-specific tokenizers.

It also won't make use of the recently introduced parallel async tokenization if enabled.
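To make the intent concrete, here is a minimal sketch (not vLLM's actual code) of the idea: the serving layer resolves a per-request tokenizer from a shared tokenizer group instead of holding its own copy. The `TokenizerGroup` and `LoRARequest` shapes below are illustrative assumptions based on this PR's context, not the project's real classes.

```python
# Illustrative sketch only: resolve a per-request (possibly LoRA-specific)
# tokenizer from a shared tokenizer group, then encode the prompt with it.
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoRARequest:
    # Hypothetical stand-in for a request that carries a LoRA adapter.
    lora_name: str
    lora_int_id: int


class TokenizerGroup:
    """Holds the base tokenizer plus any LoRA-specific tokenizers."""

    def __init__(self, base_tokenizer, lora_tokenizers=None):
        self._base = base_tokenizer
        self._lora = lora_tokenizers or {}

    def get_lora_tokenizer(self, lora_request: Optional[LoRARequest]):
        if lora_request is None:
            return self._base
        # Fall back to the base tokenizer if the adapter has no custom one.
        return self._lora.get(lora_request.lora_int_id, self._base)


def tokenize_prompt(tokenizer_group: TokenizerGroup,
                    prompt: str,
                    lora_request: Optional[LoRARequest] = None):
    """Pick the right tokenizer for this request, then encode."""
    tokenizer = tokenizer_group.get_lora_tokenizer(lora_request)
    return tokenizer.encode(prompt)
```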

njhill (Member, Author) commented on Mar 20, 2024:

Test failures look unrelated (network blips).

Yard1 (Collaborator) left a comment:


Could we add a test? We can mock some things, just to make sure that if we go through the OpenAI server with different LoRA requests, they are tokenized correctly.
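A hedged sketch of the kind of test being asked for: mock the tokenizer group and check that requests carrying different LoRA adapters are routed to their own tokenizer. All names here are hypothetical stand-ins, not vLLM's actual test fixtures.

```python
# Illustrative pytest-style test: requests with a LoRA adapter should be
# tokenized by the adapter's tokenizer, plain requests by the base tokenizer.
from unittest.mock import MagicMock


def test_lora_requests_use_their_own_tokenizer():
    base_tokenizer = MagicMock(name="base_tokenizer")
    lora_tokenizer = MagicMock(name="lora_tokenizer")

    tokenizer_group = MagicMock()
    tokenizer_group.get_lora_tokenizer.side_effect = (
        lambda lora_request: lora_tokenizer if lora_request else base_tokenizer)

    # A request without an adapter should hit the base tokenizer...
    assert tokenizer_group.get_lora_tokenizer(None) is base_tokenizer
    # ...while a request carrying a LoRA adapter gets the adapter's tokenizer.
    fake_lora_request = MagicMock(lora_int_id=1)
    assert tokenizer_group.get_lora_tokenizer(fake_lora_request) is lora_tokenizer
```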

njhill force-pushed the openai-tokenizer branch 2 times, most recently from 0aa9277 to 06188e7 on March 29, 2024 17:30.
DarkLight1337 (Member) commented:

Can the same tokenizer be used to apply the chat template as well?
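For context on this question, a small sketch of how the same resolved tokenizer could also render the chat template. It assumes a Hugging Face tokenizer with a chat template configured; `resolve_tokenizer` is a hypothetical helper, not an existing vLLM API.

```python
# Illustrative only: once the per-request tokenizer is resolved, the same
# object could also build the chat prompt via its chat template.
def build_prompt(resolve_tokenizer, messages, lora_request=None):
    tokenizer = resolve_tokenizer(lora_request)
    # transformers' apply_chat_template turns a list of chat messages into a
    # single prompt string using the tokenizer's own template.
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
```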

Comment on lines +385 to +386:

```python
else:
    return self.engine.get_tokenizer_group()
```

A Contributor left a review comment:

nit: the `else:` wrapper is unnecessary; the return can simply be de-indented.

Suggested change:

```python
return self.engine.get_tokenizer_group()
```

njhill added a commit to njhill/vllm that referenced this pull request Jul 8, 2024
Currently the LoRA tokenizers aren't used in the OpenAI APIs, meaning the behaviour won't be correct if adapters are used that have custom added tokens. This PR includes changes to address that. It mostly replaces vllm-project#3512.

More work is needed to address remaining inconsistencies in tokenization behaviour between the OpenAI front-end and standalone LLMEngine/AsyncLLMEngine use, including:
- Standalone cases don't honor truncation and add_special_tokens request parameters (see the sketch after this note)
- OpenAI API cases don't make use of TokenizerGroups for possible parallelization of tokenization

As well as some other inefficiencies.

But these are to be addressed in follow-on PRs.
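As an illustration of the first inconsistency listed above, a hedged sketch of what honoring the request's truncation and add_special_tokens parameters could look like. It assumes a Hugging Face tokenizer; the function and parameter names are hypothetical, not vLLM's actual interface.

```python
# Illustrative only: pass the request's truncation and add_special_tokens
# options through to the tokenizer instead of ignoring them.
def encode_request(tokenizer, prompt: str,
                   truncate_prompt_tokens=None,
                   add_special_tokens: bool = True):
    if truncate_prompt_tokens is not None:
        encoded = tokenizer(prompt,
                            truncation=True,
                            max_length=truncate_prompt_tokens,
                            add_special_tokens=add_special_tokens)
    else:
        encoded = tokenizer(prompt, add_special_tokens=add_special_tokens)
    return encoded["input_ids"]
```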
DarkLight1337 (Member) commented:

Closing as superseded by #6227.
