[Feature]: Benchmark script with speculative decode metrics #7586

Open · cermeng opened this issue Aug 16, 2024 · 6 comments

cermeng (Contributor) commented Aug 16, 2024

🚀 The feature, motivation and pitch

I am looking to assess vLLM's speculative decoding performance, but I have been unable to find an offline benchmark script similar to benchmark_latency.py for this purpose. While I can use benchmark_latency.py to obtain end-to-end latency, it does not provide the spec-decode metrics, such as the time spent on scoring, verifying, and proposing, or the acceptance rate.

Thanks to @cadedaniel's excellent contributions such as #6963 and #3103, we are now able to display spec-decode metrics, including scoring time, verification time, proposal time, and acceptance rate, in the server logs.

However, these metrics can only be viewed in online server logs and are collected through an asynchronous collector, which could introduce inaccuracies. I am considering adding a script called benchmark_spec_decode.py for spec-decode benchmarking in order to capture more spec-decode-related metrics.

Proposal

Add a new field spec_decode_metrics to RequestMetrics:

vllm/vllm/sequence.py, lines 87 to 112 (at commit 9587b05):

class RequestMetrics:
    """Metrics associated with a request.

    Attributes:
        arrival_time: The time when the request arrived.
        first_scheduled_time: The time when the request was first scheduled.
        first_token_time: The time when the first token was generated.
        time_in_queue: The time the request spent in the queue.
        finished_time: The time when the request was finished.
        scheduler_time: The time spent in the scheduler when this request was
                        being considered by the scheduler.
        model_forward_time: The time spent in the model forward pass when this
                            request was in the batch.
        model_execute_time: The time spent in the model execute function. This
                            will include model forward, block/sync across
                            workers, cpu-gpu sync time and sampling time.
    """
    arrival_time: float
    last_token_time: float
    first_scheduled_time: Optional[float]
    first_token_time: Optional[float]
    time_in_queue: Optional[float]
    finished_time: Optional[float] = None
    scheduler_time: Optional[float] = None
    model_forward_time: Optional[float] = None
    model_execute_time: Optional[float] = None
We can also reuse the existing SpecDecodeWorkerMetrics class to surface additional spec-decode-related metrics:
class SpecDecodeWorkerMetrics:
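
A minimal sketch of what this proposal could look like, assuming the existing SpecDecodeWorkerMetrics class is reused as the field type (the import path and the wiring shown here are assumptions for illustration, not actual vLLM code):

from dataclasses import dataclass
from typing import Optional

# Existing class that aggregates acceptance rate and draft/accepted token
# counts; the import path is an assumption for this sketch.
from vllm.spec_decode.metrics import SpecDecodeWorkerMetrics


@dataclass
class RequestMetrics:
    """Metrics associated with a request, extended with spec-decode info."""
    arrival_time: float
    last_token_time: float
    first_scheduled_time: Optional[float]
    first_token_time: Optional[float]
    time_in_queue: Optional[float]
    finished_time: Optional[float] = None
    scheduler_time: Optional[float] = None
    model_forward_time: Optional[float] = None
    model_execute_time: Optional[float] = None
    # Proposed addition: acceptance rate and related spec-decode statistics
    # for this request; None when speculative decoding is disabled.
    spec_decode_metrics: Optional[SpecDecodeWorkerMetrics] = None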

Alternatives

No response

Additional context

No response

cermeng (Contributor, Author) commented Aug 16, 2024

@cadedaniel I'm wondering if you could provide some feedback and suggestions. I'd be glad to contribute.

cadedaniel (Collaborator) commented:
The idea is good and a contribution here is welcome. My primary concern is latency overheads from metrics collection; i.e. the additional logic required to parse the acceptance rate into per-sequence acceptance info.

Suggestion (either/or):

  • Improve the metrics-emitting logic so that a benchmark script like the one you suggest can easily read the logs, building on the existing metrics logic (which has low overhead). This can be on by default as long as the overhead stays low.
  • Add an opt-in flag for fine-grained speculative metrics (can be per sequence). When it's disabled, there is ~zero latency overhead. When it's enabled, it can print metrics to the log in SpecDecodeWorker; the user can then scrape the logs to get the information they need (a rough sketch of this pattern follows below).
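
A rough sketch of that opt-in pattern (not actual vLLM code; the flag name, class, and call sites are hypothetical): when the flag is off, the recording path returns immediately, so the overhead is essentially zero.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PerSequenceSpecMetrics:
    # Hypothetical per-sequence record: accepted draft tokens per decode step.
    accepted_per_step: List[int] = field(default_factory=list)


class FineGrainedSpecMetricsRecorder:
    """Collects per-sequence spec-decode metrics only when explicitly enabled."""

    def __init__(self, enable_fine_grained_spec_metrics: bool = False):
        self.enabled = enable_fine_grained_spec_metrics
        self.per_sequence: Dict[int, PerSequenceSpecMetrics] = {}

    def record_step(self, seq_id: int, num_accepted: int) -> None:
        # Disabled by default: a single branch, so ~zero latency overhead.
        if not self.enabled:
            return
        self.per_sequence.setdefault(
            seq_id, PerSequenceSpecMetrics()).accepted_per_step.append(num_accepted)

    def log_summary(self) -> None:
        # Output like this could be printed from SpecDecodeWorker and
        # scraped from the logs by a benchmark script.
        for seq_id, m in self.per_sequence.items():
            steps = len(m.accepted_per_step)
            total = sum(m.accepted_per_step)
            print(f"seq {seq_id}: {total} draft tokens accepted "
                  f"over {steps} steps ({total / steps:.2f}/step)")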

BTW, similar discussion happening here #7522

cadedaniel (Collaborator) commented:

also, you can enable the metrics to be printed in benchmark_latency via:

diff --git a/benchmarks/benchmark_latency.py b/benchmarks/benchmark_latency.py
index 97afd301c8f..0ee2bfabb82 100644
--- a/benchmarks/benchmark_latency.py
+++ b/benchmarks/benchmark_latency.py
@@ -47,6 +47,7 @@ def main(args: argparse.Namespace):
         distributed_executor_backend=args.distributed_executor_backend,
         otlp_traces_endpoint=args.otlp_traces_endpoint,
         enable_prefix_caching=args.enable_prefix_caching,
+        disable_log_stats=False,
     )
 
     sampling_params = SamplingParams(
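
For reference, a rough offline usage sketch along the same lines (not from the original comment): the model and draft settings below are illustrative assumptions, and the keyword arguments are assumed to be forwarded to the engine arguments. With disable_log_stats=False, the spec-decode metrics are written to the engine log, which a benchmark script could scrape.

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="[ngram]",     # or a small draft model
    num_speculative_tokens=3,
    ngram_prompt_lookup_max=3,
    use_v2_block_manager=True,
    disable_log_stats=False,         # same effect as the diff above
)

outputs = llm.generate(
    ["Summarize the following document: ..."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
# The acceptance rate and related spec-decode metrics are then emitted
# (asynchronously) to the log rather than returned from generate().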

ccamacho commented Aug 16, 2024

Hi, I'm evaluating speculative decoding and I'm not able to get any gain from it.

I tested opt 2.7B/llama 3.1 8B/llama 3 8B with the following server configuration parameters:

Using a draft model

--port=8080
--model=/mnt/models
--served-model-name={{.Name}}
--distributed-executor-backend=mp
--use-v2-block-manager
--enforce-eager
--speculative-model=/mnt/models/-accelerator
--num-speculative-tokens=3
--ngram-prompt-lookup-max=3

Using n-gram

--port=8080
--model=/mnt/models
--served-model-name={{.Name}}
--distributed-executor-backend=mp
--use-v2-block-manager
--enforce-eager
--speculative-model=[ngram]
--num-speculative-tokens=3
--ngram-prompt-lookup-max=3

Speculative decoding disabled

--port=8080
--model=/mnt/models
--served-model-name={{.Name}}
--distributed-executor-backend=mp
--use-v2-block-manager
--enforce-eager

And the overall behavior is that the least performant approach is with the draft model, then n-gram, and the best case is with speculative decoding off. These results are with an A100-40GB and vLLM 0.5.3.post1.

(screenshot of benchmark results attached in the original comment)

Is there any guide on the best configuration or scenarios where we can get the most out of this feature?

Thanks!

LiuXiaoxuanPKU (Collaborator) commented, quoting @ccamacho's comment above:

Hey! Thanks for the interest. What's the draft model you are using for opt 2.7B/llama 3.1 8B/llama 3 8B?

n-gram speculation is normally good for document QA or summarization; it's not as good for online chatting. The performance of speculative decoding is workload-, model-, and hardware-dependent.

github-actions bot commented:

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Nov 18, 2024