
[Core] [Frontend] Make detokenization optional #3749

Merged

Conversation

mgerstgrasser
Contributor

FIX #3635

This PR makes detokenization optional during generation. The changes here are the minimal ones required to enable this: We add an optional detokenize argument to SamplingParams which defaults to True. In LLMEngine we then skip detokenization if detokenize is set to False for the current request. In that case, CompletionOutput.text is returned empty, but CompletionOutput.token_ids contains the generated token IDs.

One thing I don't know is how (if at all) we could make this work with the OpenAI API server, at least in combination with the openai python client library. If anyone has ideas on that, or of course any other feedback, I'd be happy to make changes!

@mgerstgrasser
Contributor Author

Quick code snippet to test this:

from vllm import LLM, SamplingParams

llm = LLM(
    model="gpt2",
)

prompt = "Today is a good day to"

print("Now with detokenize=False")
sampling_params = SamplingParams(max_tokens=10, temperature=0.0, detokenize=False)
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0])

print("Now with detokenize=True")
sampling_params = SamplingParams(max_tokens=10, temperature=0.0, detokenize=True)
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0])

yields (skipping the tqdm bar):

Now with detokenize=False
CompletionOutput(index=0, text='', token_ids=[307, 257, 636, 286, 262, 995, 13, 198, 198, 40], cumulative_logprob=-16.733878153841943, logprobs=None, finish_reason=length, stop_reason=None)
Now with detokenize=True
CompletionOutput(index=0, text=' be a part of the world.\n\nI', token_ids=[307, 257, 636, 286, 262, 995, 13, 198, 198, 40], cumulative_logprob=-16.733878153841943, logprobs=None, finish_reason=length, stop_reason=None)

With a larger list of prompts and larger max_tokens I see a roughly 20% speedup on an A100 when disabling detokenization.

@ywang96
Member

ywang96 commented Mar 31, 2024

+1 on this feature - in practice this can benefit a lot of use cases where detokenization is not necessary and only the token IDs themselves are needed for downstream tasks.

We also observe a roughly 10-15% speedup on ShareGPT by simply commenting out

self.detokenizer.decode_prompt_logprobs_inplace(
    seq_group, prompt_logprobs)
seq_group.prompt_logprobs = prompt_logprobs

There's also #3748, so I think we should consolidate the efforts for this feature.
Edit: the other PR simply disables the use of the tokenizer completely, so these two features should technically be able to co-exist.

@mgerstgrasser
Contributor Author

mgerstgrasser commented Mar 31, 2024

After digging some more, I don't see an easy way to support this through the OpenAI server, at least while maintaining close compatibility with their API. Instead, I propose adding token input and output to the (non-OpenAI) API server. I've added a proposed implementation to this PR. The API server now takes either prompt or prompt_token_ids as input. If detokenize is set to True it will return text, otherwise it will return token_ids. (I've also added a --uvicorn-log-level CLI argument to the API server, mirroring the OpenAI server.)
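For illustration, a call against that demo API server might look roughly like the following. This is only a sketch: the /generate endpoint exists in the demo server, but the field names here follow the description above and are assumptions rather than a guaranteed contract.

import requests

# Hypothetical call to the (non-OpenAI) demo API server with pre-tokenized
# input and token-only output; field names are assumptions, not a fixed schema.
payload = {
    "prompt_token_ids": [8888, 318, 257, 922, 1110, 284],  # example token IDs
    "max_tokens": 10,
    "temperature": 0.0,
    "detokenize": False,  # ask for token_ids instead of text
}
response = requests.post("http://localhost:8000/generate", json=payload)
print(response.json())  # expected to contain token_ids rather than text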

@mgerstgrasser
Contributor Author

mgerstgrasser commented Mar 31, 2024

+1 on this feature - in practice this can benefit a lot of use cases where detokenization is not necessary and only the token IDs themselves are needed for downstream tasks.

We also observe a roughly 10-15% speedup on ShareGPT by simply commenting out

self.detokenizer.decode_prompt_logprobs_inplace(
    seq_group, prompt_logprobs)
seq_group.prompt_logprobs = prompt_logprobs

There's also #3748, so I think we should consolidate the efforts for this feature. Edit: the other PR simply disables the use of the tokenizer completely, so these two features should technically be able to co-exist.

Thank you! And yeah, the other PR was opened at almost the same minute as mine; otherwise we could have coordinated. They are indeed somewhat orthogonal, though.

@ywang96
Member

ywang96 commented Apr 1, 2024

After digging some more, I don't see an easy way to support this through the OpenAI server, at least while maintaining close compatibility with their API. Instead, I propose adding token input and output to the (non-OpenAI) API server.

@mgerstgrasser The non-OpenAI API server is actually deprecated already and no longer accepting changes. IMO if the OpenAI API does not support token-IDs-only output, then we should keep this as a feature at the engine level instead of the server level (at least for this PR).

I believe the goal of the OpenAI API server is the ability to serve an OSS model as a drop-in replacement for OpenAI endpoints, so I think we should try to include as little vLLM custom logic there as possible. Plus, there's no restriction on users building their own API servers with just AsyncLLMEngine. WDYT?
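As a reference point for that last suggestion, a custom server could drive AsyncLLMEngine directly along these lines. This is a minimal sketch assuming the AsyncLLMEngine interface around the time of this PR; it is not part of the change itself.

import asyncio
import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Sketch: drive AsyncLLMEngine directly (e.g. from a custom FastAPI handler)
# and return only token IDs, with detokenization disabled per this PR.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="gpt2"))

async def generate_token_ids(prompt: str):
    params = SamplingParams(max_tokens=10, temperature=0.0, detokenize=False)
    request_id = str(uuid.uuid4())
    final_output = None
    async for output in engine.generate(prompt, params, request_id):
        final_output = output  # keep the last (finished) RequestOutput
    return final_output.outputs[0].token_ids

print(asyncio.run(generate_token_ids("Today is a good day to")))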

@mgerstgrasser
Contributor Author

After digging some more, I don't see an easy way to support this through the OpenAI server, at least while maintaining close compatibility with their API. Instead, I propose adding token input and output to the (non-OpenAI) API server.

@mgerstgrasser The non-OpenAI API server is actually deprecated already and no longer accepting changes. IMO if the OpenAI API does not support token-IDs-only output, then we should keep this as a feature at the engine level instead of the server level (at least for this PR).

I believe the goal of the OpenAI API server is the ability to serve an OSS model as a drop-in replacement for OpenAI endpoints, so I think we should try to include as little vLLM custom logic there as possible. Plus, there's no restriction on users building their own API servers with just AsyncLLMEngine. WDYT?

Yes, absolutely! I primarily did the API server changes for my own project anyway (so exactly as you suggest: it's easy to build your own API server if needed). I figured I'd share that here, but I also don't mind removing it from the PR!

@ywang96 self-assigned this Apr 1, 2024
@njhill self-requested a review on April 3, 2024 14:02
@ywang96
Member

ywang96 commented Apr 3, 2024

@mgerstgrasser Hello! Thanks for making this PR - IMO overall the logic looks good to me, but do you mind adding some tests just as a sanity check?

The other thing I'm not sure about is whether we should enforce detokenize=True if the request is coming from a call to the OAI server in serving_chat.py and serving_completion.py, since the OpenAI API itself does not support token ids only as output.

@mgerstgrasser
Contributor Author

mgerstgrasser commented Apr 3, 2024

@mgerstgrasser Hello! Thanks for making this PR - IMO overall the logic looks good to me, but do you mind adding some tests just as a sanity check?

Yes, absolutely! Is there anywhere existing where these could fit in well? I don't see any end-to-end tests where I could easily add something.

The other thing I'm not sure about is whether we should enforce detokenize=True if the request is coming from a call to the OAI server in serving_chat.py and serving_completion.py, since the OpenAI API itself does not support token ids only as output.

Hm, it's already implicitly enforced, because detokenize=True is the default in the SamplingParams constructor. Since CompletionRequest and ChatCompletionRequest don't include any logic to override it, it can't actually be changed through the API server as it stands. If we wanted to be more explicit about it, we could set it to True explicitly when calling the SamplingParams constructor in .to_sampling_params() in those Request classes?
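To make that implicit enforcement concrete, with this PR applied a quick sanity check along these lines should hold (a sketch relying only on the default described above):

from vllm import SamplingParams

# detokenize defaults to True, so OpenAI-server requests (which never set it)
# keep returning text unless the default is overridden at the engine level.
params = SamplingParams(max_tokens=10, temperature=0.0)
assert params.detokenize is True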

@ywang96
Member

ywang96 commented Apr 3, 2024

Yes, absolutely! Is there anywhere existing where these could fit in well? I don't see any end-to-end tests where I could easily add something.

Perhaps a new file under tests/engine?

Since CompletionRequest and ChatCompletionRequest don't include any logic to override it

Oh yes that's right - could you leave a # NOTE like the following to indicate that this parameter is only exposed at the engine level for now? Thanks!

vllm/vllm/utils.py, lines 141 to 143 (at b95047f):

# NOTE: This import statement should be executed lazily since
# the Neuron-X backend does not have the `cuda_utils` module.
from vllm._C import cuda_utils
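For reference, such a tests/engine file could be as small as the following sketch. The file name, test name, and model choice are hypothetical, and this is not necessarily the test that ended up in the PR.

# Hypothetical tests/engine/test_detokenize.py
from vllm import LLM, SamplingParams


def test_detokenize_false_returns_only_token_ids():
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(max_tokens=10, temperature=0.0, detokenize=False)

    output = llm.generate("Today is a good day to", params)[0].outputs[0]

    # With detokenization disabled, text stays empty but token IDs are returned.
    assert output.text == ""
    assert len(output.token_ids) > 0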

@mgerstgrasser force-pushed the make_detokenization_optional branch from b98a52b to fedea30 on April 4, 2024 00:06
@mgerstgrasser
Contributor Author

Yes, absolutely! Is there anywhere existing where these could fit in well? I don't see any end-to-end tests where I could easily add something.

Perhaps a new file under tests/engine?

Since CompletionRequest and ChatCompletionRequest don't include any logic to override it

Oh yes that's right - could you leave a # NOTE like the following to indicate that this parameter is only exposed at the engine level for now? Thanks!

vllm/vllm/utils.py, lines 141 to 143 (at b95047f):

# NOTE: This import statement should be executed lazily since
# the Neuron-X backend does not have the `cuda_utils` module.
from vllm._C import cuda_utils

Done, and done! I see a few checks failed now, but as far as I can tell that's due to internal errors?

@njhill (Member) left a comment


Looks great to me, thanks @mgerstgrasser @ywang96

Review comment on vllm/sampling_params.py (outdated, resolved)
@njhill
Member

njhill commented Apr 4, 2024

@mgerstgrasser looks like you need to run format.sh too

Co-authored-by: Nick Hill <[email protected]>
@mgerstgrasser
Contributor Author

@mgerstgrasser looks like you need to run format.sh too

@njhill I did; I'm not sure why it failed earlier, but the simplification commit triggered a re-run and it seems to be good now.

@ywang96 (Member) left a comment


@mgerstgrasser Thank you again for working on this PR!

@ywang96 merged commit aabe8f4 into vllm-project:main on Apr 4, 2024
34 checks passed
@mgerstgrasser
Contributor Author

@mgerstgrasser Thank you again for working on this PR!

Of course! Thank you both @ywang96 @njhill for being so quick and responsive with reviewing, and for all your work on this fantastic project in general!

z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024

Successfully merging this pull request may close these issues.

[Usage]: Is it possible to generate without detokenizing?