
[Bugfix] Guard for negative counter metrics to prevent crash #10430

Merged

Conversation

@tjohnson31415 (Contributor) commented Nov 18, 2024

I'm not sure how it happens, but we have observed crashes when running vLLM in online mode due to a negative value being sent to increment a Prometheus counter:

ERROR 11-15 03:59:17 engine.py:165] ValueError('Error in model execution: Counters can only be incremented by non-negative amounts.')
ERROR 11-15 03:59:17 engine.py:165] Traceback (most recent call last):
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 11-15 03:59:17 engine.py:165]     return func(*args, **kwargs)
ERROR 11-15 03:59:17 engine.py:165]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/worker/model_runner.py", line 1697, in execute_model
ERROR 11-15 03:59:17 engine.py:165]     model_input.async_callback()
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/utils.py", line 1150, in weak_bound
ERROR 11-15 03:59:17 engine.py:165]     unbound(inst, *args, **kwargs)
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 1243, in _process_model_outputs
ERROR 11-15 03:59:17 engine.py:165]     self.do_log_stats(scheduler_outputs, outputs, finished_before,
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/llm_engine.py", line 1581, in do_log_stats
ERROR 11-15 03:59:17 engine.py:165]     logger.log(stats)
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/metrics.py", line 519, in log
ERROR 11-15 03:59:17 engine.py:165]     self._log_prometheus(stats)
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/metrics.py", line 478, in _log_prometheus
ERROR 11-15 03:59:17 engine.py:165]     self._log_counter(self.metrics.counter_generation_tokens,
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/vllm/engine/metrics.py", line 429, in _log_counter
ERROR 11-15 03:59:17 engine.py:165]     counter.labels(**self.labels).inc(data)
ERROR 11-15 03:59:17 engine.py:165]   File "/opt/vllm/lib64/python3.12/site-packages/prometheus_client/metrics.py", line 313, in inc
ERROR 11-15 03:59:17 engine.py:165]     raise ValueError('Counters can only be incremented by non-negative amounts.')
ERROR 11-15 03:59:17 engine.py:165] ValueError: Counters can only be incremented by non-negative amounts.
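For context, the constraint comes from prometheus_client itself: a Counter refuses negative increments. A minimal standalone reproduction (the metric and label names here are illustrative, not vLLM's actual metric):

```python
# Standalone reproduction of the prometheus_client constraint behind the
# crash: Counter.inc() raises ValueError for negative amounts.
from prometheus_client import Counter

generation_tokens = Counter(
    "demo_generation_tokens", "Number of generated tokens (demo)",
    ["model_name"])

generation_tokens.labels(model_name="demo").inc(42)       # OK
try:
    generation_tokens.labels(model_name="demo").inc(-1)   # negative increment
except ValueError as err:
    print(err)  # Counters can only be incremented by non-negative amounts.
```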

This PR adds a check on the counter value before calling the Prometheus client so that the crash is avoided, but the root cause of the negative value needs more investigation.
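For illustration, a minimal standalone sketch of such a guard, modeled on the `_log_counter` helper visible in the traceback; the function name, signature, and warning log below are assumptions for this example, not necessarily the exact change that was merged:

```python
import logging

logger = logging.getLogger(__name__)

def log_counter_guarded(counter, labels: dict, data: float) -> None:
    """Increment a Prometheus counter, skipping negative values.

    Hypothetical standalone adaptation of the _log_counter helper seen in
    the traceback (vllm/engine/metrics.py); the merged change may differ.
    """
    if data < 0:
        # prometheus_client Counters raise ValueError on negative increments,
        # which would otherwise propagate and crash the engine's logging path.
        logger.warning("Skipping negative counter increment: %s", data)
        return
    counter.labels(**labels).inc(data)
```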

FIX #6642

#6325 is related and shows the same error.


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI will run, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge

🚀

@prashantgupta24 (Contributor) commented Nov 18, 2024

Not sure if it's worth adding a test to test_cancellation within tests/async_engine/test_async_llm_engine.py (or other cancellation tests) to make sure metric readings are recorded correctly in case of cancellations?

@DarkLight1337 (Member) commented:

Let's fix #6325 in another PR.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 19, 2024 03:28
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 19, 2024
@DarkLight1337 DarkLight1337 merged commit 272e31c into vllm-project:main Nov 19, 2024
62 of 64 checks passed
mikejuliet13 pushed a commit to mikejuliet13/vllm that referenced this pull request Nov 19, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)

Linked issue that merging may close: [Bug]: error Counters can only be incremented by non-negative amounts. in metrics module

3 participants