Error in benchmark model with vllm backend #3230

Closed
hahmad2008 opened this issue Mar 6, 2024 · 14 comments

@hahmad2008

I couldn't benchmark my model. It seems the benchmark sends requests without waiting for the responses, so the following error is raised:

python benchmark_serving.py \
               --backend vllm \
               --model "MYMODEL/PATH" \
               --port 8000 --host 0.0.0.0 \
               --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
               --num-prompts 400 \
               --endpoint /generate \
               --save-result 

I also run the service and I can access it through http://0.0.0.0:8000

but I got this error:

Namespace(backend='vllm', version='N/A', base_url=None, host='0.0.0.0', port=8000, endpoint='/generate', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', model='MYMODEL/PATH', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=400, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=True)
Token indices sequence length is longer than the specified maximum sequence length for this model (515 > 255). Running this sequence through the model will result in indexing errors
  0%|                                                                                                                     | 0/400 [00:00<?, ?it/s]Traffic request rate: inf
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [00:05<00:00, 68.13it/s]
completed:  0
myconda/lib/python3.9/site-packages/numpy/core/fromnumeric.py:3504: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
myconda/lib/python3.9/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "~/benchmark/benchmark_serving.py", line 389, in <module>
    main(args)
  File "~/benchmark/benchmark_serving.py", line 261, in main
    benchmark_result = asyncio.run(
  File "myconda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "myconda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "~/benchmark/benchmark_serving.py", line 204, in benchmark
    metrics = calculate_metrics(
  File "~/benchmark/benchmark_serving.py", line 151, in calculate_metrics
    p99_ttft_ms=np.percentile(ttfts, 99) * 1000,
  File "myconda/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4283, in percentile
    return _quantile_unchecked(
  File "myconda/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4555, in _quantile_unchecked
    return _ureduce(a,
  File "myconda/lib/python3.9/site-packages/numpy/lib/function_base.py", line 3823, in _ureduce
    r = func(a, **kwargs)
  File "myconda/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4722, in _quantile_ureduce_func
    result = _quantile(arr,
  File "myconda/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4831, in _quantile
    slices_having_nans = np.isnan(arr[-1, ...])
IndexError: index -1 is out of bounds for axis 0 with size 0

Am I missing anything here?

ywang96 commented Mar 6, 2024

@hahmad2008 Hello! It seems that none of the requests went through, as you can tell from RuntimeWarning: Mean of empty slice, which means the metrics lists are empty.

Are you benchmarking against the vLLM API server instead of the OpenAI-compatible server?

@hahmad2008

@ywang96 Thanks for your response. I run the vLLM service with serve run vllm_service:service --port 8000 --host 0.0.0.0
and this is the log from the service:

(ServeReplica:default:OpenLLMDeployment pid=14442) INFO 2024-03-06 11:47:41,540 default_OpenLLMDeployment wMWRGW f70ceb65-1bfb-446a-a125-7e68ada39198 /generate replica.py:772 - __CALL__ OK 0.8ms
(ServeReplica:default:OpenLLMDeployment pid=14442) INFO 2024-03-06 11:47:41,546 default_OpenLLMDeployment wMWRGW a28f82e1-4db1-423f-a2c6-eef4ab8f947d /generate replica.py:772 - __CALL__ OK 2.9ms
(ServeReplica:default:OpenLLMDeployment pid=14442) INFO 2024-03-06 11:47:41,547 default_OpenLLMDeployment wMWRGW

It seems the requests are sent, but the benchmark script doesn't wait for the responses.

ywang96 commented Mar 6, 2024

Could you clarify what serve run vllm_service:service is (I'm guessing you're using something like BentoML to launch the server)?

It's possible that there's an additional layer of output processing on top of the actual vLLM container running inside this service, so the benchmark script doesn't know how to process the output coming out of it.

Also, I would suggest using the OpenAI-compatible server if possible, since the /generate API server has already been locked from any further development.
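
For reference, a rough sketch of that setup (the server entrypoint shown here is the usual one for vLLM around this time and the benchmark flags mirror the Namespace dump above, but both may vary by version):

# Start the OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model "MYMODEL/PATH" \
    --host 0.0.0.0 --port 8000

# Benchmark against its completions endpoint
python benchmark_serving.py \
    --backend openai \
    --model "MYMODEL/PATH" \
    --host 0.0.0.0 --port 8000 \
    --endpoint /v1/completions \
    --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 400 \
    --save-result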

hahmad2008 commented Mar 6, 2024

It defines a vLLM class to serve the LLM. I can access the service while it is running at 'http://0.0.0.0:8000/generate?prompt=REQUESTED_PROMPT'.

ywang96 commented Mar 6, 2024

My suggestion is to add a line print(data.decode("utf-8")) right after where we iterate over the streaming output, to see what it looks like:

async for data in response.content.iter_any():

It would also help if you could add print(response.reason) after

output.success = False

to see whether the client session is receiving an unsuccessful response.
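
A minimal sketch of where those two prints would go inside the vLLM request function (the surrounding structure here is paraphrased rather than copied from backend_request_func.py):

async with session.post(url=api_url, json=payload) as response:
    if response.status == 200:
        async for data in response.content.iter_any():
            print(data.decode("utf-8"))  # inspect each raw streaming chunk
            # ... existing chunk handling continues here ...
    else:
        output.success = False
        print(response.reason)  # inspect why the request was rejected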

hahmad2008 commented Mar 6, 2024

@ywang96
response.reason: Unprocessable Entity

I printed the response itself right before this line:

if response.status == 200:

I got:

response:  <ClientResponse(http://0.0.0.0:8000/generate) [422 Unprocessable Entity]>
<CIMultiDictProxy('Date': 'Wed, 06 Mar 2024 19:03:55 GMT', 'Server': 'uvicorn', 'Content-Length': '142', 'Content-Type': 'application/json', 'x-request-id': '6b904b36-3c2c-4cd4-94bb-687dc9aae818')>

hahmad2008 commented Mar 6, 2024

@ywang96
I changed the payload to only have the prompt field. The payload contains something really weird:


api_url:  http://0.0.0.0:8000/generate

payload:  {'prompt': 'Here\'s an example of log-level transformation in a panel regression model using R:\n```perl\n# Load the necessary libraries\nlibrary(plm)\nlibrary(ggplot2)\n\n# Load the data and create the log-level transformation\ndata("Grunfeld", package = "plm")\ndata <- Grunfeld\ndata$invest_log <- log(data$invest)\n\n# Fit the log-level transformed panel regression model\nmodel <- plm(invest_log ~ value + capital, data = data, index = c("firm", "year"))\nsummary(model)\n\n# Plot the results\nggplot(data, aes(x = value, y = invest_log, color = factor(firm))) +\n  geom_point() +\n  geom_smooth(method = "lm", se = FALSE) +\n  labs(x = "Value", y = "Investment (Log-Level)", color = "Firm")\n```\nIn this example, the "Grunfeld" data set is loaded from the "plm" library, and a log-level transformation is created by taking the logarithm of the "invest" variable. The panel regression model is fit using the "plm" function, with "firm" and "year" as the panel index variables, and "value" and "capital" as the independent variables. The summary of the model shows the coefficients, standard errors, t-values, and p-values for each variable. Finally, a plot is created using the "ggplot2" library to visualize the relationship between "value" and the log-level transformed "investment."'}

response:  <ClientResponse(http://0.0.0.0:8000/generate) [422 Unprocessable Entity]>
<CIMultiDictProxy('Date': 'Wed, 06 Mar 2024 19:14:47 GMT', 'Server': 'uvicorn', 'Content-Length': '142', 'Content-Type': 'application/json', 'x-request-id': 'ac548d8e-eb39-48c0-b9b0-5a1419b8a058')>

 response.reason:  Unprocessable Entity

ywang96 commented Mar 6, 2024

Unprocessable Entity is an input error - have you tried sending a sample request to the service to see how it responds?

hahmad2008 commented Mar 6, 2024

This works for me and I get a response:

curl -X 'POST' \
  'http://localhost:8000/generate?prompt=what%27s%20your%20name%3F' \
  -H 'accept: application/json' \
  -d ''

but this way, it doesn't work:

$ curl -X 'POST' 'http://0.0.0.0:8000/generate'  -H 'accept: application/json' -H 'Content-Type: application/json'  -d '{ "prompt": "What is the meaning of life?"}'

{"detail":[{"type":"missing","loc":["query","prompt"],"msg":"Field required","input":null,"url":"https://errors.pydantic.dev/2.6/v/missing"}]}

ywang96 commented Mar 6, 2024

Yep - so your service indeed behaves very differently from the vanilla vLLM server: it parses the input from the API URL itself, which is an anti-pattern IMO, so this should not be supported by our benchmark script in the first place.
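
For illustration only (this is not the actual service code), the 422 above is what FastAPI returns when an endpoint declares the prompt as a plain query parameter instead of a JSON body:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Query-parameter style: behaves like the service in this thread, so
# POST /generate?prompt=... works but a JSON body triggers a 422.
@app.post("/generate")
async def generate(prompt: str):
    return {"text": f"echo: {prompt}"}

# JSON-body style: what the vanilla vLLM /generate server (and the
# benchmark script) expects instead.
class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate_json")
async def generate_json(req: GenerateRequest):
    return {"text": f"echo: {req.prompt}"}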

I'd suggest you take a look at backend_request_func.py and add your own request function to work with your service. We made it so that the main benchmark script benchmark_serving.py won't need to be touched at all.

#2992 is a good example of how to add your own request function.
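
As a rough illustration only (the function name here is hypothetical and the details would need adapting), a non-streaming request function for a service like yours - prompt passed as a query parameter, response of the form {"text": "..."} - could follow the same pattern as the deepspeed-mii function shown further down in this thread:

import time
from typing import Optional

import aiohttp
from tqdm import tqdm

# RequestFuncInput, RequestFuncOutput and AIOHTTP_TIMEOUT are the existing
# helpers defined in backend_request_func.py.
async def async_request_query_generate(
    request_func_input: RequestFuncInput,
    pbar: Optional[tqdm] = None,
) -> RequestFuncOutput:
    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        output = RequestFuncOutput()
        output.prompt_len = request_func_input.prompt_len
        output.ttft = 0  # no streaming, so TTFT cannot be measured
        st = time.perf_counter()
        try:
            # This service reads the prompt from the query string, not the JSON body.
            async with session.post(
                    url=request_func_input.api_url,
                    params={"prompt": request_func_input.prompt}) as resp:
                if resp.status == 200:
                    parsed = await resp.json()
                    output.latency = time.perf_counter() - st
                    output.generated_text = parsed["text"]
                    output.success = True
                else:
                    output.success = False
        except (aiohttp.ClientOSError, aiohttp.ServerDisconnectedError):
            output.success = False
        if pbar:
            pbar.update(1)
        return output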

@hahmad2008

@ywang96
I fixed this part and now the HTTP status code for the response is 200, but the returned data is empty!

response:  <ClientResponse(http://0.0.0.0:8000/generate) [200 OK]>
<CIMultiDictProxy('Date': 'Wed, 06 Mar 2024 20:07:46 GMT', 'Server': 'uvicorn', 'Content-Length': '2', 'Content-Type': 'application/json', 'x-request-id': '6963f57a-1910-4399-8650-e0d92a10f475')>

response.reason:  OK
data:  b'{"text":""}'
body:  {"text":""}

Traceback (most recent call last):
  File "benchmark/benchmark_serving.py", line 387, in <module>
    main(args)
  File "benchmark/benchmark_serving.py", line 259, in main
    benchmark_result = asyncio.run(
  File "myenv/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "myenv/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "enchmark/benchmark_serving.py", line 195, in benchmark
    outputs = await asyncio.gather(*tasks)
  File "benchmark/backend_request_func.py", line 125, in async_request_vllm
    output.generated_text = json.loads(
IndexError: string index out of range
  0%|                                                                                                                             | 0/400 [00:00<?, ?it/s]

and I got a response for this:

$ curl -X 'POST' 'http://0.0.0.0:5000/generate'  -H 'accept: application/json' -H 'Content-Type: application/json'  -d '{ "prompt": "What is the meaning of life?"}'

{"text":"\n\nThe question of the meaning of life has puzzled humanity for millennia and remains a highly subjective and philosophical matter. Different individuals, cultures, and philosophies offer various interpretations."}

ywang96 commented Mar 6, 2024

Does your service support streaming at all? It seems to me that the server does not support streaming, but we do need it to calculate TTFT (time to first token). You can look at the request function for deepspeed-mii (which doesn't support streaming either, so TTFT is always set to 0):

async def async_request_deepspeed_mii(
    request_func_input: RequestFuncInput,
    pbar: Optional[tqdm] = None,
) -> RequestFuncOutput:
    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        assert request_func_input.best_of == 1
        assert not request_func_input.use_beam_search

        payload = {
            "prompts": request_func_input.prompt,
            "max_new_tokens": request_func_input.output_len,
            "ignore_eos": True,
            "do_sample": True,
            "temperature": 0.01,  # deepspeed-mii does not accept 0.0 temperature.
            "top_p": 1.0,
        }
        output = RequestFuncOutput()
        output.prompt_len = request_func_input.prompt_len

        # DeepSpeed-MII doesn't support streaming as of Jan 28 2024, will use 0 as placeholder.
        # https://github.com/microsoft/DeepSpeed-MII/pull/311
        output.ttft = 0

        st = time.perf_counter()
        try:
            async with session.post(url=request_func_input.api_url,
                                    json=payload) as resp:
                if resp.status == 200:
                    parsed_resp = await resp.json()
                    output.latency = time.perf_counter() - st
                    output.generated_text = parsed_resp[0]["generated_text"]
                    output.success = True
                else:
                    output.success = False
        except (aiohttp.ClientOSError, aiohttp.ServerDisconnectedError):
            output.success = False

        if pbar:
            pbar.update(1)
        return output

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 30, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

@github-actions github-actions bot closed this as not planned Nov 30, 2024