Fix performance of top_p and top_k calculations (#449)
This change fixes the performance issue I introduced in PR #414: because
`torch.where` receives both of its input tensors already computed, both the
optimized and the generic top-p/top-k functions were always called. Now only
the selected one runs.
kdamaszk authored Oct 30, 2024
1 parent 94858b5 commit d3257b2
Showing 1 changed file with 7 additions and 6 deletions.
13 changes: 7 additions & 6 deletions vllm/model_executor/layers/sampler.py
@@ -267,12 +267,13 @@ def forward(
 
         if do_top_p_top_k and flashinfer_top_k_top_p_sampling is None:
             # If we have a scalar p and k, we can use the optimized version.
-            logits = torch.where(
-                self._scalar_p_and_k,
-                self._apply_top_k_top_p_opt(logits, self._top_p_scalar,
-                                            self._top_k_scalar),
-                _apply_top_k_top_p(logits, sampling_tensors.top_ps,
-                                   sampling_tensors.top_ks))
+            if self._scalar_p_and_k.any():
+                logits = self._apply_top_k_top_p_opt(
+                    logits, self._top_p_scalar.item(),
+                    self._top_k_scalar.item())
+            else:
+                logits = _apply_top_k_top_p(logits, sampling_tensors.top_ps,
+                                            sampling_tensors.top_ks)
 
         if do_min_p:
             logits = _apply_min_p(logits, sampling_tensors.min_ps)
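Why the old code paid for both paths: like any Python function call,
`torch.where(cond, a, b)` is given `a` and `b` only after both have been fully
computed, so both sampling kernels had already run before the selection
happened. A minimal sketch of the difference follows; the `slow_opt` and
`slow_generic` helpers are hypothetical stand-ins for the two sampling
functions, not code from this commit:

```python
import torch

def slow_opt(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for the optimized scalar top-p/top-k path.
    print("optimized path ran")
    return x * 2

def slow_generic(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for the generic per-request top-p/top-k path.
    print("generic path ran")
    return x * 3

x = torch.ones(4)
cond = torch.tensor(True)

# torch.where selects element-wise between two already-built tensors;
# both helpers execute before the selection, regardless of `cond`.
y = torch.where(cond, slow_opt(x), slow_generic(x))
# prints "optimized path ran" AND "generic path ran"

# Host-side branching, as in this commit, runs exactly one path.
if cond.any():
    y = slow_opt(x)      # prints only "optimized path ran"
else:
    y = slow_generic(x)
```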
