
[Core] [Frontend] Make detokenization optional #3749

Merged

Conversation

mgerstgrasser
Contributor

FIX #3635

This PR makes detokenization optional during generation. The changes here are the minimal ones required to enable this: We add an optional detokenize argument to SamplingParams which defaults to True. In LLMEngine we then skip detokenization if detokenize is set to False for the current request. In that case, CompletionOutput.text is returned empty, but CompletionOutput.token_ids contains the generated token IDs.

One thing I don't know is how (if at all) we could make this work with the OpenAI API server, at least in combination with the openai python client library. If anyone has ideas on that, or of course any other feedback, I'd be happy to make changes!

@mgerstgrasser
Contributor Author

Quick code snippet to test this:

from vllm import LLM, SamplingParams

llm = LLM(
    model="gpt2",
)

prompt = "Today is a good day to"

print("Now with detokenize=False")
sampling_params = SamplingParams(max_tokens=10, temperature=0.0, detokenize=False)
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0])

print("Now with detokenize=True")
sampling_params = SamplingParams(max_tokens=10, temperature=0.0, detokenize=True)
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0])

yields (skipping the tqdm bar):

Now with detokenize=False
CompletionOutput(index=0, text='', token_ids=[307, 257, 636, 286, 262, 995, 13, 198, 198, 40], cumulative_logprob=-16.733878153841943, logprobs=None, finish_reason=length, stop_reason=None)
Now with detokenize=True
CompletionOutput(index=0, text=' be a part of the world.\n\nI', token_ids=[307, 257, 636, 286, 262, 995, 13, 198, 198, 40], cumulative_logprob=-16.733878153841943, logprobs=None, finish_reason=length, stop_reason=None)

With a larger list of prompts and larger max_tokens I see a roughly 20% speedup on an A100 when disabling detokenization.

@ywang96
Member

ywang96 commented Mar 31, 2024

+1 on this feature - in practice this can benefit a lot of use cases where detokenization is not necessary and only the token IDs themselves are needed for downstream tasks.

We also observe a roughly 10-15% speedup on ShareGPT by simply commenting out

self.detokenizer.decode_prompt_logprobs_inplace(
    seq_group, prompt_logprobs)
seq_group.prompt_logprobs = prompt_logprobs

There's also #3748, so I think we should consolidate the efforts for this feature.
Edit: the other PR simply disables the use of the tokenizer completely, so these two features should technically be able to co-exist.

@mgerstgrasser
Contributor Author

mgerstgrasser commented Mar 31, 2024

After digging some more, I don't see an easy way to support this through the OpenAI server, at least while maintaining close compatibility with their API. Instead, I propose adding token input and output to the (non-OpenAI) API server. I've added a proposed implementation to this PR. The API server now takes either prompt or prompt_token_ids as input. If detokenize is set to True it will return text, otherwise it will return token_ids. (I've also added a --uvicorn-log-level CLI argument to the API server, mirroring the OpenAI server.)
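For illustration, a call against that demo API server might look roughly like the following. This is only a sketch: the /generate endpoint exists in the demo server, but the field names here follow the description above and are assumptions rather than a guaranteed contract.

import requests

# Hypothetical call to the (non-OpenAI) demo API server with pre-tokenized
# input and token-only output; field names are assumptions, not a fixed schema.
payload = {
    "prompt_token_ids": [8888, 318, 257, 922, 1110, 284],  # example token IDs
    "max_tokens": 10,
    "temperature": 0.0,
    "detokenize": False,  # ask for token_ids instead of text
}
response = requests.post("http://localhost:8000/generate", json=payload)
print(response.json())  # expected to contain token_ids rather than text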

@mgerstgrasser
Contributor Author

mgerstgrasser commented Mar 31, 2024

+1 on this feature - in practice this can benefit a lot of use cases where detokenization is not necessary and only the token IDs themselves are needed for downstream tasks.

We also observe a roughly 10-15% speedup on ShareGPT by simply commenting out

self.detokenizer.decode_prompt_logprobs_inplace(
    seq_group, prompt_logprobs)
seq_group.prompt_logprobs = prompt_logprobs

There's also #3748, so I think we should consolidate the efforts for this feature. Edit: the other PR simply disables the use of the tokenizer completely, so these two features should technically be able to co-exist.

Thank you! And yeah, the other PR was opened at almost the same minute as mine; otherwise we could have coordinated. They are indeed somewhat orthogonal, though.

@ywang96
Member

ywang96 commented Apr 1, 2024

After digging some more, I don't see an easy way to support this through the OpenAI server, at least while maintaining close compatibility with their API. Instead, I propose adding token input and output to the (non-OpenAI) API server.

@mgerstgrasser The non-OpenAI API server is actually deprecated already and no longer accepting changes. IMO if the OpenAI API does not support token-IDs-only output, then we should keep this as a feature at the engine level instead of the server level (at least for this PR).

I believe the goal of the OpenAI API server is the ability to serve an OSS model as a drop-in replacement for OpenAI endpoints, so I think we should try to include as little vLLM custom logic there as possible. Plus, there's no restriction on users building their own API servers with just AsyncLLMEngine. WDYT?
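As a reference point for that last suggestion, a custom server could drive AsyncLLMEngine directly along these lines. This is a minimal sketch assuming the AsyncLLMEngine interface around the time of this PR; it is not part of the change itself.

import asyncio
import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Sketch: drive AsyncLLMEngine directly (e.g. from a custom FastAPI handler)
# and return only token IDs, with detokenization disabled per this PR.
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="gpt2"))

async def generate_token_ids(prompt: str):
    params = SamplingParams(max_tokens=10, temperature=0.0, detokenize=False)
    request_id = str(uuid.uuid4())
    final_output = None
    async for output in engine.generate(prompt, params, request_id):
        final_output = output  # keep the last (finished) RequestOutput
    return final_output.outputs[0].token_ids

print(asyncio.run(generate_token_ids("Today is a good day to")))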

@mgerstgrasser
Contributor Author

After digging some more, I don't see an easy way to support this through the OpenAI server, at least while maintaining close compatibility with their API. Instead, I propose adding token input and output to the (non-OpenAI) API server.

@mgerstgrasser The non-OpenAI API server is actually deprecated already and no longer accepting changes. IMO if the OpenAI API does not support token-IDs-only output, then we should keep this as a feature at the engine level instead of the server level (at least for this PR).

I believe the goal of the OpenAI API server is the ability to serve an OSS model as a drop-in replacement for OpenAI endpoints, so I think we should try to include as little vLLM custom logic there as possible. Plus, there's no restriction on users building their own API servers with just AsyncLLMEngine. WDYT?

Yes, absolutely! I primarily did the API server changes for my own project anyway (so exactly as you suggest: it's easy to build your own API server if needed). I figured I'd share that here, but I also don't mind removing it from the PR!

@ywang96 self-assigned this Apr 1, 2024
@njhill self-requested a review on April 3, 2024 14:02
@ywang96
Member

ywang96 commented Apr 3, 2024

@mgerstgrasser Hello! Thanks for making this PR - IMO overall the logic looks good to me, but do you mind adding some tests just as a sanity check?

The other thing I'm not sure about is whether we should enforce detokenize=True if the request is coming from a call to the OAI server in serving_chat.py and serving_completion.py, since the OpenAI API itself does not support token ids only as output.

@mgerstgrasser
Contributor Author

mgerstgrasser commented Apr 3, 2024

@mgerstgrasser Hello! Thanks for making this PR - IMO overall the logic looks good to me, but do you mind adding some tests just as a sanity check?

Yes, absolutely! Is there anywhere existing where these could fit in well? I don't see any end-to-end tests where I could easily add something.

The other thing I'm not sure about is whether we should enforce detokenize=True if the request is coming from a call to the OAI server in serving_chat.py and serving_completion.py, since the OpenAI API itself does not support token ids only as output.

Hm, it's already implicitly enforced, because detokenize=True is the default in the SamplingParams constructor. Since CompletionRequest and ChatCompletionRequest don't include any logic to override it, it can't actually be changed through the API server as it stands. If we wanted to be more explicit about it, we could set it to True explicitly when calling the SamplingParams constructor in .to_sampling_params() in those Request classes?
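To make that implicit enforcement concrete, with this PR applied a quick sanity check along these lines should hold (a sketch relying only on the default described above):

from vllm import SamplingParams

# detokenize defaults to True, so OpenAI-server requests (which never set it)
# keep returning text unless the default is overridden at the engine level.
params = SamplingParams(max_tokens=10, temperature=0.0)
assert params.detokenize is True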

@ywang96
Member

ywang96 commented Apr 3, 2024

Yes, absolutely! Is there anywhere existing where these could fit in well? I don't see any end-to-end tests where I could easily add something.

Perhaps a new file under tests/engine?

Since CompletionRequest and ChatCompletionRequest don't include any logic to override it

Oh yes that's right - could you leave a # NOTE like the following to indicate that this parameter is only exposed at the engine level for now? Thanks!

vllm/vllm/utils.py, lines 141 to 143 (at b95047f):

# NOTE: This import statement should be executed lazily since
# the Neuron-X backend does not have the `cuda_utils` module.
from vllm._C import cuda_utils
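For reference, such a tests/engine file could be as small as the following sketch. The file name, test name, and model choice are hypothetical, and this is not necessarily the test that ended up in the PR.

# Hypothetical tests/engine/test_detokenize.py
from vllm import LLM, SamplingParams


def test_detokenize_false_returns_only_token_ids():
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(max_tokens=10, temperature=0.0, detokenize=False)

    output = llm.generate("Today is a good day to", params)[0].outputs[0]

    # With detokenization disabled, text stays empty but token IDs are returned.
    assert output.text == ""
    assert len(output.token_ids) > 0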

@mgerstgrasser force-pushed the make_detokenization_optional branch from b98a52b to fedea30 on April 4, 2024 00:06
@mgerstgrasser
Contributor Author

Yes, absolutely! Is there anywhere existing where these could fit in well? I don't see any end-to-end tests where I could easily add something.

Perhaps a new file under tests/engine?

Since CompletionRequest and ChatCompletionRequest don't include any logic to override it

Oh yes that's right - could you leave a # NOTE like the following to indicate that this parameter is only exposed at the engine level for now? Thanks!

vllm/vllm/utils.py, lines 141 to 143 (at b95047f):

# NOTE: This import statement should be executed lazily since
# the Neuron-X backend does not have the `cuda_utils` module.
from vllm._C import cuda_utils

Done, and done! I see a few checks failed now, but as far as I can tell that's due to internal errors?

@njhill (Member) left a comment


Looks great to me, thanks @mgerstgrasser @ywang96

Review comment on vllm/sampling_params.py (outdated, resolved)
@njhill
Member

njhill commented Apr 4, 2024

@mgerstgrasser looks like you need to run format.sh too

Co-authored-by: Nick Hill <[email protected]>
@mgerstgrasser
Contributor Author

@mgerstgrasser looks like you need to run format.sh too

@njhill I did; I'm not sure why it failed earlier, but the simplification commit triggered a re-run and it seems to be good now.

@ywang96 (Member) left a comment


@mgerstgrasser Thank you again for working on this PR!

@ywang96 merged commit aabe8f4 into vllm-project:main on Apr 4, 2024
34 checks passed
@mgerstgrasser
Contributor Author

@mgerstgrasser Thank you again for working on this PR!

Of course! Thank you both @ywang96 @njhill for being so quick and responsive with reviewing, and for all your work on this fantastic project in general!

z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024

Successfully merging this pull request may close these issues.

[Usage]: Is it possible to generate without detokenizing?