Error in benchmark model with vllm backend #3230

Closed
hahmad2008 opened this issue Mar 6, 2024 · 14 comments

@hahmad2008

I couldn't benchmark my model. It seems the benchmark sends requests without waiting for the responses, so the following error is raised:

python benchmark_serving.py \
               --backend vllm \
               --model "MYMODEL/PATH" \
               --port 8000 --host 0.0.0.0 \
               --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
               --num-prompts 400 \
               --endpoint /generate \
               --save-result 

I also run the service and I can access it through http://0.0.0.0:8000

but I got this error:

Namespace(backend='vllm', version='N/A', base_url=None, host='0.0.0.0', port=8000, endpoint='/generate', dataset='ShareGPT_V3_unfiltered_cleaned_split.json', model='MYMODEL/PATH', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=400, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=True)
Token indices sequence length is longer than the specified maximum sequence length for this model (515 > 255). Running this sequence through the model will result in indexing errors
  0%|                                                                                                                     | 0/400 [00:00<?, ?it/s]Traffic request rate: inf
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 400/400 [00:05<00:00, 68.13it/s]
completed:  0
myconda/lib/python3.9/site-packages/numpy/core/fromnumeric.py:3504: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
myconda/lib/python3.9/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "~/benchmark/benchmark_serving.py", line 389, in <module>
    main(args)
  File "~/benchmark/benchmark_serving.py", line 261, in main
    benchmark_result = asyncio.run(
  File "myconda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "myconda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "~/benchmark/benchmark_serving.py", line 204, in benchmark
    metrics = calculate_metrics(
  File "~/benchmark/benchmark_serving.py", line 151, in calculate_metrics
    p99_ttft_ms=np.percentile(ttfts, 99) * 1000,
  File "myconda/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4283, in percentile
    return _quantile_unchecked(
  File "myconda/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4555, in _quantile_unchecked
    return _ureduce(a,
  File "myconda/lib/python3.9/site-packages/numpy/lib/function_base.py", line 3823, in _ureduce
    r = func(a, **kwargs)
  File "myconda/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4722, in _quantile_ureduce_func
    result = _quantile(arr,
  File "myconda/lib/python3.9/site-packages/numpy/lib/function_base.py", line 4831, in _quantile
    slices_having_nans = np.isnan(arr[-1, ...])
IndexError: index -1 is out of bounds for axis 0 with size 0

Am I missing anything here?

ywang96 commented Mar 6, 2024

@hahmad2008 Hello! It seems that none of the requests went through, as you can tell from RuntimeWarning: Mean of empty slice, which means the metrics lists are empty.

Are you benchmarking against the vLLM API server instead of the OpenAI-compatible server?

@hahmad2008

@ywang96 Thanks for your response. I run the vLLM service with serve run vllm_service:service --port 8000 --host 0.0.0.0
and this is the log from the service:

(ServeReplica:default:OpenLLMDeployment pid=14442) INFO 2024-03-06 11:47:41,540 default_OpenLLMDeployment wMWRGW f70ceb65-1bfb-446a-a125-7e68ada39198 /generate replica.py:772 - __CALL__ OK 0.8ms
(ServeReplica:default:OpenLLMDeployment pid=14442) INFO 2024-03-06 11:47:41,546 default_OpenLLMDeployment wMWRGW a28f82e1-4db1-423f-a2c6-eef4ab8f947d /generate replica.py:772 - __CALL__ OK 2.9ms
(ServeReplica:default:OpenLLMDeployment pid=14442) INFO 2024-03-06 11:47:41,547 default_OpenLLMDeployment wMWRGW

It seems the requests are sent, but the benchmark script doesn't wait for the responses.

ywang96 commented Mar 6, 2024

Could you clarify what serve run vllm_service:service is (I'm guessing you're using something like BentoML to launch the server)?

It's possible that there's an additional layer of output processing on top of the actual vLLM container running inside this service, so the benchmark script doesn't know how to process the output coming out of it.

Also, I would suggest using the OpenAI-compatible server if possible, since the /generate API server has already been locked from any further development.
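
For reference, a rough sketch of that setup (the server entrypoint shown here is the usual one for vLLM around this time and the benchmark flags mirror the Namespace dump above, but both may vary by version):

# Start the OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model "MYMODEL/PATH" \
    --host 0.0.0.0 --port 8000

# Benchmark against its completions endpoint
python benchmark_serving.py \
    --backend openai \
    --model "MYMODEL/PATH" \
    --host 0.0.0.0 --port 8000 \
    --endpoint /v1/completions \
    --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 400 \
    --save-result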

hahmad2008 commented Mar 6, 2024

It defines a vLLM class to serve the LLM. I can access the service while it is running at 'http://0.0.0.0:8000/generate?prompt=REQUESTED_PROMPT'.

ywang96 commented Mar 6, 2024

My suggestion is to add a line print(data.decode("utf-8")) right after where we iterate over the streaming output, to see what it looks like:

async for data in response.content.iter_any():

It would also help if you could add print(response.reason) after

output.success = False

to see whether the client session is receiving an unsuccessful response.
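
A minimal sketch of where those two prints would go inside the vLLM request function (the surrounding structure here is paraphrased rather than copied from backend_request_func.py):

async with session.post(url=api_url, json=payload) as response:
    if response.status == 200:
        async for data in response.content.iter_any():
            print(data.decode("utf-8"))  # inspect each raw streaming chunk
            # ... existing chunk handling continues here ...
    else:
        output.success = False
        print(response.reason)  # inspect why the request was rejected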

hahmad2008 commented Mar 6, 2024

@ywang96
response.reason: Unprocessable Entity

I printed the response itself right before this line:

if response.status == 200:

I got:

response:  <ClientResponse(http://0.0.0.0:8000/generate) [422 Unprocessable Entity]>
<CIMultiDictProxy('Date': 'Wed, 06 Mar 2024 19:03:55 GMT', 'Server': 'uvicorn', 'Content-Length': '142', 'Content-Type': 'application/json', 'x-request-id': '6b904b36-3c2c-4cd4-94bb-687dc9aae818')>

hahmad2008 commented Mar 6, 2024

@ywang96
I changed the payload to only have the prompt field. The payload contains something really weird:


api_url:  http://0.0.0.0:8000/generate

payload:  {'prompt': 'Here\'s an example of log-level transformation in a panel regression model using R:\n```perl\n# Load the necessary libraries\nlibrary(plm)\nlibrary(ggplot2)\n\n# Load the data and create the log-level transformation\ndata("Grunfeld", package = "plm")\ndata <- Grunfeld\ndata$invest_log <- log(data$invest)\n\n# Fit the log-level transformed panel regression model\nmodel <- plm(invest_log ~ value + capital, data = data, index = c("firm", "year"))\nsummary(model)\n\n# Plot the results\nggplot(data, aes(x = value, y = invest_log, color = factor(firm))) +\n  geom_point() +\n  geom_smooth(method = "lm", se = FALSE) +\n  labs(x = "Value", y = "Investment (Log-Level)", color = "Firm")\n```\nIn this example, the "Grunfeld" data set is loaded from the "plm" library, and a log-level transformation is created by taking the logarithm of the "invest" variable. The panel regression model is fit using the "plm" function, with "firm" and "year" as the panel index variables, and "value" and "capital" as the independent variables. The summary of the model shows the coefficients, standard errors, t-values, and p-values for each variable. Finally, a plot is created using the "ggplot2" library to visualize the relationship between "value" and the log-level transformed "investment."'}

response:  <ClientResponse(http://0.0.0.0:8000/generate) [422 Unprocessable Entity]>
<CIMultiDictProxy('Date': 'Wed, 06 Mar 2024 19:14:47 GMT', 'Server': 'uvicorn', 'Content-Length': '142', 'Content-Type': 'application/json', 'x-request-id': 'ac548d8e-eb39-48c0-b9b0-5a1419b8a058')>

 response.reason:  Unprocessable Entity

ywang96 commented Mar 6, 2024

Unprocessable Entity is an input error - have you tried sending a sample request to the service to see how it responds?

hahmad2008 commented Mar 6, 2024

This works for me and I get a response:

curl -X 'POST' \
  'http://localhost:8000/generate?prompt=what%27s%20your%20name%3F' \
  -H 'accept: application/json' \
  -d ''

but this way, it doesn't work:

$ curl -X 'POST' 'http://0.0.0.0:8000/generate'  -H 'accept: application/json' -H 'Content-Type: application/json'  -d '{ "prompt": "What is the meaning of life?"}'

{"detail":[{"type":"missing","loc":["query","prompt"],"msg":"Field required","input":null,"url":"https://errors.pydantic.dev/2.6/v/missing"}]}

ywang96 commented Mar 6, 2024

Yep - so your service indeed behaves very differently from the vanilla vLLM server: it parses the input from the API URL itself, which is an anti-pattern IMO, so this should not be supported by our benchmark script in the first place.
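
For illustration only (this is not the actual service code), the 422 above is what FastAPI returns when an endpoint declares the prompt as a plain query parameter instead of a JSON body:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Query-parameter style: behaves like the service in this thread, so
# POST /generate?prompt=... works but a JSON body triggers a 422.
@app.post("/generate")
async def generate(prompt: str):
    return {"text": f"echo: {prompt}"}

# JSON-body style: what the vanilla vLLM /generate server (and the
# benchmark script) expects instead.
class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate_json")
async def generate_json(req: GenerateRequest):
    return {"text": f"echo: {req.prompt}"}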

I'd suggest you take a look at backend_request_func.py and add your own request function to work with your service. We made it so that the main benchmark script benchmark_serving.py won't need to be touched at all.

#2992 is a good example of how to add your own request function.
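
As a rough illustration only (the function name here is hypothetical and the details would need adapting), a non-streaming request function for a service like yours - prompt passed as a query parameter, response of the form {"text": "..."} - could follow the same pattern as the deepspeed-mii function shown further down in this thread:

import time
from typing import Optional

import aiohttp
from tqdm import tqdm

# RequestFuncInput, RequestFuncOutput and AIOHTTP_TIMEOUT are the existing
# helpers defined in backend_request_func.py.
async def async_request_query_generate(
    request_func_input: RequestFuncInput,
    pbar: Optional[tqdm] = None,
) -> RequestFuncOutput:
    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        output = RequestFuncOutput()
        output.prompt_len = request_func_input.prompt_len
        output.ttft = 0  # no streaming, so TTFT cannot be measured
        st = time.perf_counter()
        try:
            # This service reads the prompt from the query string, not the JSON body.
            async with session.post(
                    url=request_func_input.api_url,
                    params={"prompt": request_func_input.prompt}) as resp:
                if resp.status == 200:
                    parsed = await resp.json()
                    output.latency = time.perf_counter() - st
                    output.generated_text = parsed["text"]
                    output.success = True
                else:
                    output.success = False
        except (aiohttp.ClientOSError, aiohttp.ServerDisconnectedError):
            output.success = False
        if pbar:
            pbar.update(1)
        return output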

@hahmad2008

@ywang96
I fixed this part and now the HTTP status code for the response is 200, but the returned data is empty!

response:  <ClientResponse(http://0.0.0.0:8000/generate) [200 OK]>
<CIMultiDictProxy('Date': 'Wed, 06 Mar 2024 20:07:46 GMT', 'Server': 'uvicorn', 'Content-Length': '2', 'Content-Type': 'application/json', 'x-request-id': '6963f57a-1910-4399-8650-e0d92a10f475')>

response.reason:  OK
data:  b'{"text":""}'
body:  {"text":""}

Traceback (most recent call last):
  File "benchmark/benchmark_serving.py", line 387, in <module>
    main(args)
  File "benchmark/benchmark_serving.py", line 259, in main
    benchmark_result = asyncio.run(
  File "myenv/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "myenv/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "enchmark/benchmark_serving.py", line 195, in benchmark
    outputs = await asyncio.gather(*tasks)
  File "benchmark/backend_request_func.py", line 125, in async_request_vllm
    output.generated_text = json.loads(
IndexError: string index out of range
  0%|                                                                                                                             | 0/400 [00:00<?, ?it/s]

and I got a response for this:

$ curl -X 'POST' 'http://0.0.0.0:5000/generate'  -H 'accept: application/json' -H 'Content-Type: application/json'  -d '{ "prompt": "What is the meaning of life?"}'

{"text":"\n\nThe question of the meaning of life has puzzled humanity for millennia and remains a highly subjective and philosophical matter. Different individuals, cultures, and philosophies offer various interpretations."}

ywang96 commented Mar 6, 2024

Does your service support streaming at all? It seems to me that the server does not support streaming, but we do need it to calculate TTFT (time to first token). You can look at the request function for deepspeed-mii (which doesn't support streaming either, so TTFT is always set to 0):

async def async_request_deepspeed_mii(
    request_func_input: RequestFuncInput,
    pbar: Optional[tqdm] = None,
) -> RequestFuncOutput:
    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
        assert request_func_input.best_of == 1
        assert not request_func_input.use_beam_search

        payload = {
            "prompts": request_func_input.prompt,
            "max_new_tokens": request_func_input.output_len,
            "ignore_eos": True,
            "do_sample": True,
            "temperature": 0.01,  # deepspeed-mii does not accept 0.0 temperature.
            "top_p": 1.0,
        }
        output = RequestFuncOutput()
        output.prompt_len = request_func_input.prompt_len

        # DeepSpeed-MII doesn't support streaming as of Jan 28 2024, will use 0 as placeholder.
        # https://github.com/microsoft/DeepSpeed-MII/pull/311
        output.ttft = 0

        st = time.perf_counter()
        try:
            async with session.post(url=request_func_input.api_url,
                                    json=payload) as resp:
                if resp.status == 200:
                    parsed_resp = await resp.json()
                    output.latency = time.perf_counter() - st
                    output.generated_text = parsed_resp[0]["generated_text"]
                    output.success = True
                else:
                    output.success = False
        except (aiohttp.ClientOSError, aiohttp.ServerDisconnectedError):
            output.success = False

        if pbar:
            pbar.update(1)
        return output

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 30, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

@github-actions github-actions bot closed this as not planned Nov 30, 2024