Proposal to improve performance
We have two concepts in vLLM:
SequenceGroup: a group of sequences that originate from the same request. In most use cases, a sequence group contains only one sequence. In parallel sampling, a request can fork into many sequences, depending on the sampling parameter n. In beam search, the sequences in a sequence group can change, grow, and die.
Sequence: a single sequence as seen by the inference engine. It has a prompt, generated tokens, a KV cache, and so on.
In order to support diverse sampling algorithms, vLLM currently takes a SequenceGroup-native approach: many functions operate at the SequenceGroup level, e.g. prepare_input takes in a list of SequenceGroup.
The problem is that many functions in an inference engine naturally fit Sequence-level operations. For example, when we talk about the batch size for decoding, it is the number of Sequence objects we are running for decoding, not the number of SequenceGroup objects.
To fill the gap, there is a lot of code in vLLM that receives SequenceGroup objects and unpacks them into Sequence objects for further operations. A notable example is input preparation, at vllm/vllm/worker/model_runner.py, lines 507 to 510 in commit 825b044.
This turns out to be very inefficient, and it makes the code difficult to read and maintain.
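As a rough illustration of the unpacking pattern described above, here is a hypothetical sketch. The class and function names (`Seq`, `SeqGroup`, `last_tokens_group_native`, `last_tokens_sequence_native`) are made up for this example and are not the actual vLLM code in model_runner.py. It contrasts a function that takes groups and unpacks them with one that takes a flat list of sequences:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Seq:  # hypothetical stand-in for a Sequence
    seq_id: int
    token_ids: List[int] = field(default_factory=list)


@dataclass
class SeqGroup:  # hypothetical stand-in for a SequenceGroup
    request_id: str
    seqs: List[Seq]

    def get_seqs(self) -> List[Seq]:
        # Builds a fresh list on every call: cheap once, costly in a hot loop.
        return list(self.seqs)


def last_tokens_group_native(groups: List[SeqGroup]) -> List[int]:
    """Group-level input: every group is unpacked before any real work."""
    tokens = []
    for group in groups:
        for seq in group.get_seqs():
            tokens.append(seq.token_ids[-1])
    return tokens


def last_tokens_sequence_native(seqs: List[Seq]) -> List[int]:
    """Sequence-level input: the batch is already flat."""
    return [seq.token_ids[-1] for seq in seqs]


groups = [
    SeqGroup("req-0", [Seq(0, [1, 2, 3])]),
    SeqGroup("req-1", [Seq(1, [4, 5])]),
]
flat = [seq for group in groups for seq in group.seqs]
assert last_tokens_group_native(groups) == last_tokens_sequence_native(flat)
```

Both functions compute the same thing; the second one simply never has to touch the group objects, which is the property the proposal aims for in the model runner.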
To get a rough impression of how inefficient these conversions can be, take a look at #7051, where simply removing some get_seqs calls in SequenceGroup can lead to a 1% end-to-end throughput gain.
Per the discussion in #6226, we will not directly drop beam search support. However, we should figure out a way to support it without hurting the performance of the majority use case.
The proposal I want to discuss is to move the vLLM code to a Sequence-native approach. It is inspired by the lightllm approach:
each request will have a request id and a sequence group id
a sequence in the sequence group will have a sequence group id and a sequence id
there will be a global mapping Dict[int, List[int]] that maps the sequence group id to the ids of the sequences inside the group, kept only for sequence groups that use parallel sampling or beam search
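The bookkeeping listed above could be sketched roughly as follows. This is a minimal illustration under assumptions, not vLLM's or lightllm's implementation: the `Sequence` fields, the `add_request` helper, its `num_seqs` and `use_beam_search` parameters, and the consecutive id scheme are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sequence:  # hypothetical, reduced to the two ids from the proposal
    seq_id: int
    group_id: int  # id of the sequence group this sequence belongs to


# Global mapping: sequence group id -> ids of the sequences in that group.
# Only groups using parallel sampling or beam search are registered.
group_to_seq_ids: Dict[int, List[int]] = {}


def add_request(
    group_id: int,
    first_seq_id: int,
    num_seqs: int = 1,
    use_beam_search: bool = False,
) -> List[Sequence]:
    """Create the sequences of one request (assumed consecutive seq ids)."""
    seq_ids = list(range(first_seq_id, first_seq_id + num_seqs))
    if num_seqs > 1 or use_beam_search:
        group_to_seq_ids[group_id] = seq_ids
    return [Sequence(seq_id=i, group_id=group_id) for i in seq_ids]


plain = add_request(group_id=0, first_seq_id=0)                # ordinary request
sampled = add_request(group_id=1, first_seq_id=1, num_seqs=3)  # parallel sampling
assert group_to_seq_ids == {1: [1, 2, 3]}  # the ordinary request left no entry
```

Because ordinary requests never touch the mapping, a workload without parallel sampling or beam search keeps it empty.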
All functions that operate at the Sequence level (mainly the model runner part) will natively receive a list of Sequence. They don't need to unpack SequenceGroup any more.
Functions that operate at the SequenceGroup level (mainly the scheduler logic for gang-scheduling a sequence group, and the output processor logic that creates or removes sequences in the group) have to reconstruct the sequence group from the given list of sequences, using the global mapping. Note that an important optimization is that we can skip all the sequence group logic when the global mapping is empty, meaning that we don't have any parallel sampling or beam search.
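The reconstruction step and its fast path might look like the sketch below. It assumes a hypothetical `Sequence` with `seq_id` and `group_id` fields and a hypothetical `regroup` helper; it also assumes the mapping is consistent, so every sequence in a registered group appears in that group's id list.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Sequence:  # hypothetical, reduced to the two ids from the proposal
    seq_id: int
    group_id: int


def regroup(
    seqs: List[Sequence],
    group_to_seq_ids: Dict[int, List[int]],
) -> List[List[Sequence]]:
    """Rebuild gang-scheduling units from a flat list of sequences."""
    if not group_to_seq_ids:
        # Fast path: no parallel sampling or beam search anywhere,
        # so every sequence is its own unit.
        return [[seq] for seq in seqs]

    by_id = {seq.seq_id: seq for seq in seqs}
    seen: Set[int] = set()
    units: List[List[Sequence]] = []
    for seq in seqs:
        if seq.seq_id in seen:
            continue  # already emitted as part of its group
        member_ids = group_to_seq_ids.get(seq.group_id)
        if member_ids is None:
            members = [seq]  # single-sequence request
        else:
            members = [by_id[i] for i in member_ids if i in by_id]
        units.append(members)
        seen.update(member.seq_id for member in members)
    return units


seqs = [Sequence(0, 0), Sequence(1, 1), Sequence(2, 1), Sequence(3, 2)]
units = regroup(seqs, {1: [1, 2]})
assert [[s.seq_id for s in unit] for unit in units] == [[0], [1, 2], [3]]
```

With an empty mapping the function does no lookups at all, which is the case the proposal expects to dominate.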
When we do have parallel sampling or beam search, this will incur some performance drop. However, with the greatly simplified code in the model runner, we can expect the other parts of vLLM to be greatly accelerated. Therefore, beam search or parallel sampling can also be faster at the end of the day.
An example benefit is that the function at vllm/vllm/engine/output_processor/single_step.py, line 82 in commit 825b044, can be greatly simplified (we can return early).
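To make the early-return idea concrete, here is a hedged sketch of an output-processing step. It is not the function referenced in single_step.py: `append_token` and all of its parameters are hypothetical, and the group-level work itself (forking children, pruning beams) is deliberately left out.

```python
from typing import Dict, List


def append_token(
    seq_tokens: Dict[int, List[int]],        # seq id -> generated token ids
    seq_to_group: Dict[int, int],            # seq id -> sequence group id
    group_to_seq_ids: Dict[int, List[int]],  # the global mapping
    seq_id: int,
    new_token: int,
) -> bool:
    """Append a sampled token; return True if group-level work is still needed."""
    seq_tokens[seq_id].append(new_token)
    if not group_to_seq_ids:
        # Early return: no parallel sampling or beam search is active.
        return False
    # Only sequences in a registered group need the expensive group logic.
    return seq_to_group[seq_id] in group_to_seq_ids


tokens = {0: [11], 1: [21]}
owner = {0: 0, 1: 1}
assert append_token(tokens, owner, {}, seq_id=0, new_token=12) is False
assert append_token(tokens, owner, {1: [1, 2]}, seq_id=1, new_token=22) is True
assert tokens == {0: [11, 12], 1: [21, 22]}
```

In the common single-sequence case the function appends the token and returns immediately.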