[Speculative decoding 2/9] Multi-step worker #1

cadedaniel · 2024-01-05T23:45:42Z

Speculative decoding requires running a draft model several times. If the LLMEngine process orchestrates each of those runs, we pay a round trip to the workers that hold the draft model for every forward pass, which adds significant overhead.

To fix this, we instead allow a worker to execute multiple forward passes in a single call.
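As a rough illustration of the idea (not the code in this PR), the sketch below contrasts an engine-driven loop with a worker-side loop. `ToyWorker`, `toy_draft_model`, `execute_model_multi_step`, and `draft_from_engine` are hypothetical names invented for this example, and the "model" is a trivial function rather than a real draft model. The only point is that both paths produce the same draft tokens, while the multi-step path needs one engine-to-worker call instead of one per token.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a draft model: maps a token sequence to the next token.
DraftModel = Callable[[List[int]], int]


def toy_draft_model(tokens: List[int]) -> int:
    """Toy 'forward pass': the next token is (last token + 1) mod 100."""
    return (tokens[-1] + 1) % 100


@dataclass
class ToyWorker:
    """Illustrative worker holding the draft model (an assumption, not vLLM's Worker)."""
    model: DraftModel
    calls_received: int = 0  # number of engine -> worker round trips

    def execute_model(self, tokens: List[int]) -> int:
        """Single-step path: one forward pass per engine call."""
        self.calls_received += 1
        return self.model(tokens)

    def execute_model_multi_step(self, tokens: List[int], num_steps: int) -> List[int]:
        """Multi-step path: run num_steps forward passes inside one engine call."""
        self.calls_received += 1
        context = list(tokens)
        drafted: List[int] = []
        for _ in range(num_steps):
            next_token = self.model(context)
            context.append(next_token)  # feed the drafted token back in
            drafted.append(next_token)
        return drafted


def draft_from_engine(worker: ToyWorker, prompt: List[int], num_steps: int) -> List[int]:
    """Engine-orchestrated drafting: the loop lives outside the worker."""
    context = list(prompt)
    drafted: List[int] = []
    for _ in range(num_steps):
        next_token = worker.execute_model(context)  # one round trip per token
        context.append(next_token)
        drafted.append(next_token)
    return drafted


if __name__ == "__main__":
    engine_driven = ToyWorker(toy_draft_model)
    multi_step = ToyWorker(toy_draft_model)
    print(draft_from_engine(engine_driven, [1, 2, 3], 4), engine_driven.calls_received)
    print(multi_step.execute_model_multi_step([1, 2, 3], 4), multi_step.calls_received)
```

In a real deployment the per-call cost would come from inter-process communication and scheduling rather than a counter, so the counter here is only a proxy for that overhead.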

cadedaniel changed the base branch from rejection-sampler to main January 5, 2024 23:46

cadedaniel changed the base branch from main to rejection-sampler January 5, 2024 23:46

cadedaniel changed the title ~~Multi step worker~~ [Speculative decoding 2/9] Multi-step worker Jan 5, 2024

cadedaniel mentioned this pull request Jan 6, 2024

[Speculative decoding 1/9] Optimized rejection sampler vllm-project/vllm#2336

Merged

NadavShmayo and others added 7 commits January 7, 2024 09:48

Changed scheduler to use deques instead of lists (vllm-project#2290)

05921a9

Co-authored-by: Woosuk Kwon

Fix eager mode performance (vllm-project#2377)

c884819

[Minor] Remove unused code in attention (vllm-project#2384)

28c3f12

Add baichuan chat template jinjia file (vllm-project#2390)

74cd5ab

[Speculative decoding 1/9] Optimized rejection sampler (vllm-project#2336)

79d64c4

get_ip(): Fix ipv4 ipv6 dualstack (vllm-project#2408)

4b61c6b

Rename phi_1_5 -> phi (vllm-project#2385)

50376fa

cadedaniel force-pushed the multi-step-worker branch 2 times, most recently from 391f76a to 4269c84 Compare January 12, 2024 00:32

multi step worker

b20ed29

cadedaniel force-pushed the multi-step-worker branch from 4269c84 to b20ed29 Compare January 12, 2024 00:35

litone01 and others added 15 commits January 11, 2024 19:26

[DOC] Add additional comments for LLMEngine and AsyncLLMEngine (vllm-project#1011)

6549aef

[Minor] Fix the format in quick start guide related to Model Scope (vllm-project#2425)

f745847

Add gradio chatbot for openai webserver (vllm-project#2307)

9746058

fix: deque mutated during iteration in abort_seq_group (vllm-project#2371)

48cf1e4

Allow setting fastapi root_path argument (vllm-project#2341)

ce03624

Address Phi modeling update 2 (vllm-project#2428)

7878958

Suggest using dtype=half when OOM.

cb7a1c1

Update quickstart.rst (vllm-project#2369)

827cbcd

Aligning top_p and top_k Sampling (vllm-project#1885)

218dc2c

* Align top_p and top_k with huggingface
* remove _get_prompt_and_output_tokens
* rename _apply_top_p_top_k
* compare top_p top_k with hf
* fix test errors

[Minor] Fix err msg (vllm-project#2431)

35c4bc2

[Minor] Optimize cuda graph memory usage (vllm-project#2437)

9f659bf

[CI] Add Buildkite (vllm-project#2355)

6e01e8c

Announce the second vLLM meetup (vllm-project#2444)

2a18da2

Allow buildkite to retry build on agent lost (vllm-project#2446)

bfc072a

fix weigit loading for GQA with TP (vllm-project#2379)

f780504

simon-mo and others added 7 commits January 16, 2024 09:50

CI: make sure benchmark script exit on error (vllm-project#2449)

947f0b2

ci: retry on build failure as well (vllm-project#2457)

8cd5a99

Add StableLM3B model (vllm-project#2372)

e1957c6

OpenAI Server refactoring (vllm-project#2360)

14cc317

fix block table size miscalculation

5ffceca

Merge remote-tracking branch 'upstream/main' into public-vllm-upstream-multi-step-worker-pr

6a22d07

lint

0e1c3b3

cadedaniel closed this Apr 22, 2024
