
[Bug]: v0.5.5 crash: "AssertionError: expected running sequences" #8016

Open

zoltan-fedor opened this issue Aug 30, 2024 · 32 comments · Fixed by #8059
Labels: bug (Something isn't working)

Comments


zoltan-fedor commented Aug 30, 2024

Your current environment

Running the standard v0.5.5 Docker image from your Docker Hub repo, with nothing additional added to it.

🐛 Describe the bug

When running the Llama 3.1 70B AWQ model on 4× A10G 24 GB GPUs with the following args (a launch sketch follows the flag list):

--model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
--tensor-parallel-size 4
--gpu-memory-utilization 0.95
--enforce-eager
--trust-remote-code
--worker-use-ray
--enable-prefix-caching
--num-scheduler-steps 8
--dtype half
--max-model-len 32768
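
For reference, a launch sketch of this setup (assumptions: the vllm/vllm-openai image from Docker Hub and the default port 8000; the flags are exactly the ones listed above):

# A minimal sketch, assuming the vllm/vllm-openai image and default port 8000
docker run --gpus all -p 8000:8000 vllm/vllm-openai:v0.5.5 \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager \
    --trust-remote-code \
    --worker-use-ray \
    --enable-prefix-caching \
    --num-scheduler-steps 8 \
    --dtype half \
    --max-model-len 32768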

vLLM crashes and requires a full restart. Error:

INFO 08-29 19:33:37 server.py:222] vLLM ZMQ RPC Server was interrupted.
Future exception was never retrieved
future: <Future finished exception=AssertionError('expected running sequences')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 111, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 1064, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 113, in generator
    raise result
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 930, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 873, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 356, in step_async
    request_outputs = self._process_model_outputs(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 1232, in _process_model_outputs
    self.output_processor.process_outputs(seq_group, outputs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/output_processor/multi_step.py", line 73, in process_outputs
    assert seqs, "expected running sequences"
AssertionError: expected running sequences
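
For context, the assertion at the bottom of the trace lives in multi-step output processing: when a step's outputs arrive, the processor expects the sequence group to still contain RUNNING sequences. A paraphrased sketch of the failing check (not the exact v0.5.5 source; names follow the traceback above):

# Paraphrased sketch of the check behind multi_step.py:73
from vllm.sequence import SequenceStatus

def process_outputs(self, sequence_group, outputs):
    # Collect the sequences the scheduler still considers RUNNING.
    seqs = sequence_group.get_seqs(status=SequenceStatus.RUNNING)
    # If the group finished or was aborted between scheduler steps
    # (e.g. on a client disconnect), this list is empty and the
    # assertion below is what crashes the engine loop.
    assert seqs, "expected running sequences"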

The issue is random; the same query does NOT reproduce it.

We upgraded 6 hours ago, and it has happened 3 times since.

We now need to downgrade and consider v0.5.5 a buggy release.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
zoltan-fedor added the bug label on Aug 30, 2024
@WoosukKwon
Collaborator

@zoltan-fedor Thanks for reporting the bug. Could you please try running without --num-scheduler-steps 8? I think there were several bug fixes for it after v0.5.5.
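
That is, the same launch with only that flag removed (a sketch; all other flags unchanged from the report above):

--model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
--tensor-parallel-size 4
--gpu-memory-utilization 0.95
--enforce-eager
--trust-remote-code
--worker-use-ray
--enable-prefix-caching
--dtype half
--max-model-len 32768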

@zoltan-fedor
Author

zoltan-fedor commented Aug 31, 2024

Thanks @WoosukKwon. Unfortunately, even without the --num-scheduler-steps 8 flag it still failed (although with a different error):

ERROR 08-30 19:19:10 client.py:412]     await self.check_health(socket=socket)
ERROR 08-30 19:19:10 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 08-30 19:19:10 client.py:412]     await self._send_one_way_rpc_request(
ERROR 08-30 19:19:10 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 266, in _send_one_way_rpc_request
ERROR 08-30 19:19:10 client.py:412]     raise response
ERROR 08-30 19:19:10 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
ERROR 08-30 19:19:10 client.py:265] Got Unhealthy response from RPC Server
ERROR 08-30 19:19:10 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 08-30 19:19:10 client.py:412] Traceback (most recent call last):
ERROR 08-30 19:19:10 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 08-30 19:19:10 client.py:412]     await self.check_health(socket=socket)
ERROR 08-30 19:19:10 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 08-30 19:19:10 client.py:412]     await self._send_one_way_rpc_request(
ERROR 08-30 19:19:10 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 266, in _send_one_way_rpc_request
ERROR 08-30 19:19:10 client.py:412]     raise response
ERROR 08-30 19:19:10 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
CRITICAL 08-30 19:19:10 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO:     10.94.90.10:51000 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 08-30 19:19:10 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO:     10.94.88.168:38032 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
[2024-08-30 19:19:10,138 E 64 3526] logging.cc:115: Stack trace:
 /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10b96aa) [0x7fc380d896aa] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10bc932) [0x7fc380d8c932] ray::TerminateHandler()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7fc4ce2cc37c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7fc4ce2cc3e7]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7fc4ce2cc36f]
/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe5ab35) [0x7fc4807e0b35] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7fc4ce2f8df4]
/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fc4cf4c9609] start_thread
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fc4cf603353] __clone

*** SIGABRT received at time=1725070750 on cpu 13 ***
PC: @     0x7fc4cf52700b  (unknown)  raise
    @     0x7fc4cf527090       3216  (unknown)
    @     0x7fc4ce2cc37c  (unknown)  (unknown)
    @     0x7fc4ce2cc090  (unknown)  (unknown)
[2024-08-30 19:19:10,139 E 64 3526] logging.cc:440: *** SIGABRT received at time=1725070750 on cpu 13 ***
[2024-08-30 19:19:10,139 E 64 3526] logging.cc:440: PC: @     0x7fc4cf52700b  (unknown)  raise
[2024-08-30 19:19:10,139 E 64 3526] logging.cc:440:     @     0x7fc4cf527090       3216  (unknown)
[2024-08-30 19:19:10,140 E 64 3526] logging.cc:440:     @     0x7fc4ce2cc37c  (unknown)  (unknown)
[2024-08-30 19:19:10,140 E 64 3526] logging.cc:440:     @     0x7fc4ce2cc090  (unknown)  (unknown)
Fatal Python error: Aborted

Back to version v0.5.4 again.

@zoltan-fedor
Author

zoltan-fedor commented Aug 31, 2024

Two minutes later the next error:

    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 110, in forward
    self._init_sampling_tensors(logits, sampling_metadata)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 87, in _init_sampling_tensors
    do_min_p) = SamplingTensors.from_sampling_metadata(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/sampling_metadata.py", line 520, in from_sampling_metadata
    sampling_tensors = SamplingTensors.from_lists(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/sampling_metadata.py", line 564, in from_lists
    temperatures_t = torch.tensor(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 67, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR 08-30 19:24:28 client.py:265] Got Unhealthy response from RPC Server
ERROR 08-30 19:24:28 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 08-30 19:24:28 client.py:412] Traceback (most recent call last):
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 08-30 19:24:28 client.py:412]     await self.check_health(socket=socket)
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 08-30 19:24:28 client.py:412]     await self._send_one_way_rpc_request(
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 266, in _send_one_way_rpc_request
ERROR 08-30 19:24:28 client.py:412]     raise response
ERROR 08-30 19:24:28 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
CRITICAL 08-30 19:24:28 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO:     10.94.90.10:46306 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 08-30 19:24:28 client.py:265] Got Unhealthy response from RPC Server
ERROR 08-30 19:24:28 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 08-30 19:24:28 client.py:412] Traceback (most recent call last):
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 08-30 19:24:28 client.py:412]     await self.check_health(socket=socket)
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 08-30 19:24:28 client.py:412]     await self._send_one_way_rpc_request(
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 266, in _send_one_way_rpc_request
ERROR 08-30 19:24:28 client.py:412]     raise response
ERROR 08-30 19:24:28 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
ERROR 08-30 19:24:28 client.py:265] Got Unhealthy response from RPC Server
ERROR 08-30 19:24:28 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 08-30 19:24:28 client.py:412] Traceback (most recent call last):
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 08-30 19:24:28 client.py:412]     await self.check_health(socket=socket)
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 08-30 19:24:28 client.py:412]     await self._send_one_way_rpc_request(
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 266, in _send_one_way_rpc_request
ERROR 08-30 19:24:28 client.py:412]     raise response
ERROR 08-30 19:24:28 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
CRITICAL 08-30 19:24:28 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO:     10.94.91.181:55960 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 08-30 19:24:28 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO:     10.94.88.168:48038 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
[2024-08-30 19:24:28,686 E 65 3518] logging.cc:115: Stack trace:
 /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10b96aa) [0x7f7b2e5c76aa] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10bc932) [0x7f7b2e5ca932] ray::TerminateHandler()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f7c7bb0a37c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f7c7bb0a3e7]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f7c7bb0a36f]
/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe5ab35) [0x7f7c2e01eb35] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f7c7bb36df4]
/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f7c7cd07609] start_thread
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f7c7ce41353] __clone

*** SIGABRT received at time=1725071068 on cpu 16 ***
PC: @     0x7f7c7cd6500b  (unknown)  raise
    @     0x7f7c7cd65090       3216  (unknown)
    @     0x7f7c7bb0a37c  (unknown)  (unknown)
    @     0x7f7c7bb0a090  (unknown)  (unknown)
[2024-08-30 19:24:28,688 E 65 3518] logging.cc:440: *** SIGABRT received at time=1725071068 on cpu 16 ***
[2024-08-30 19:24:28,688 E 65 3518] logging.cc:440: PC: @     0x7f7c7cd6500b  (unknown)  raise
[2024-08-30 19:24:28,688 E 65 3518] logging.cc:440:     @     0x7f7c7cd65090       3216  (unknown)
[2024-08-30 19:24:28,688 E 65 3518] logging.cc:440:     @     0x7f7c7bb0a37c  (unknown)  (unknown)
[2024-08-30 19:24:28,689 E 65 3518] logging.cc:440:     @     0x7f7c7bb0a090  (unknown)  (unknown)
Fatal Python error: Aborted

This seems to be the same illegal memory access issue as #8025.
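
As the CUDA message in the log notes, kernel errors are reported asynchronously, so the Python stack above may not point at the real faulting kernel. A debugging sketch (CUDA_LAUNCH_BLOCKING comes straight from the error text; the docker invocation mirrors the assumed launch sketch earlier in the thread):

# Sketch: re-run with synchronous CUDA launches so the traceback lands on the faulting kernel
docker run --gpus all -e CUDA_LAUNCH_BLOCKING=1 -p 8000:8000 vllm/vllm-openai:v0.5.5 \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --tensor-parallel-size 4 --gpu-memory-utilization 0.95 \
    --enforce-eager --trust-remote-code --worker-use-ray \
    --enable-prefix-caching --dtype half --max-model-len 32768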

@ashgold

ashgold commented Aug 31, 2024

> @zoltan-fedor Thanks for reporting the bug. Could you please try running without --num-scheduler-steps 8? I think there were several bug fixes for it after v0.5.5.

Hi.

I had the exact same issue.

There is an obvious behavior that triggers this issue.
It doesn't matter how long a steady load is maintained; but if I stop the load in the middle of a request, so the connection is dropped between sending the request and receiving the response, the issue appears. (A minimal repro sketch follows.)
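
A minimal repro sketch of that pattern (assumptions: an OpenAI-compatible endpoint on localhost:8000; the model name matches the --model path below): start a streaming completion, read a few chunks, then drop the connection while generation is still in flight.

# Hypothetical repro sketch: abort a streaming /v1/completions request mid-response
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # assumed endpoint
    json={
        "model": "/data/models/llama-3-1-70b-instruct/base",  # as served below
        "prompt": "Write a very long story.",
        "max_tokens": 2048,
        "stream": True,
    },
    stream=True,
)
for i, chunk in enumerate(resp.iter_lines()):
    if i >= 5:
        resp.close()  # simulates the client disconnect described above
        break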

Below are the options I started vLLM with.

    - args:
      - --model
      - /data/models/llama-3-1-70b-instruct/base
      - --tensor-parallel-size
      - "4"
      - --load-format
      - "auto"
      - --max-model-len
      - "16384"
      - --disable-log-requests
      - --uvicorn-log-level
      - "warning"
      - --gpu-memory-utilization
      - "0.9"
      - --enable-prefix-caching
      - --num-scheduler-steps
      - "8"

Below is the log when the bug occurred.

ERROR 08-30 22:11:29 async_llm_engine.py:65] Engine background task failed
ERROR 08-30 22:11:29 async_llm_engine.py:65] Traceback (most recent call last):
ERROR 08-30 22:11:29 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
ERROR 08-30 22:11:29 async_llm_engine.py:65]     return_value = task.result()
ERROR 08-30 22:11:29 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 930, in run_engine_loop
ERROR 08-30 22:11:29 async_llm_engine.py:65]     result = task.result()
ERROR 08-30 22:11:29 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 873, in engine_step
ERROR 08-30 22:11:29 async_llm_engine.py:65]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 08-30 22:11:29 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 356, in step_async
ERROR 08-30 22:11:29 async_llm_engine.py:65]     request_outputs = self._process_model_outputs(
ERROR 08-30 22:11:29 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 1232, in _process_model_outputs
ERROR 08-30 22:11:29 async_llm_engine.py:65]     self.output_processor.process_outputs(seq_group, outputs)
ERROR 08-30 22:11:29 async_llm_engine.py:65]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/output_processor/multi_step.py", line 73, in process_outputs
ERROR 08-30 22:11:29 async_llm_engine.py:65]     assert seqs, "expected running sequences"
ERROR 08-30 22:11:29 async_llm_engine.py:65] AssertionError: expected running sequences
ERROR:asyncio:Exception in callback functools.partial(<function _log_task_completion at 0x7f00de437be0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f00c643a320>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f00de437be0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f00c643a320>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 55, in _log_task_completion
    return_value = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 930, in run_engine_loop
    result = task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 873, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 356, in step_async
    request_outputs = self._process_model_outputs(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 1232, in _process_model_outputs
    self.output_processor.process_outputs(seq_group, outputs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/output_processor/multi_step.py", line 73, in process_outputs
    assert seqs, "expected running sequences"
AssertionError: expected running sequences

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 67, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR 08-30 22:11:29 client.py:265] Got Unhealthy response from RPC Server
ERROR 08-30 22:11:29 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 08-30 22:11:29 client.py:412] Traceback (most recent call last):
ERROR 08-30 22:11:29 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 08-30 22:11:29 client.py:412]     await self.check_health(socket=socket)
ERROR 08-30 22:11:29 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 08-30 22:11:29 client.py:412]     await self._send_one_way_rpc_request(
ERROR 08-30 22:11:29 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 266, in _send_one_way_rpc_request
ERROR 08-30 22:11:29 client.py:412]     raise response
ERROR 08-30 22:11:29 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
ERROR 08-30 22:11:29 client.py:265] Got Unhealthy response from RPC Server
ERROR 08-30 22:11:29 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 08-30 22:11:29 client.py:412] Traceback (most recent call last):
ERROR 08-30 22:11:29 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 08-30 22:11:29 client.py:412]     await self.check_health(socket=socket)
ERROR 08-30 22:11:29 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 08-30 22:11:29 client.py:412]     await self._send_one_way_rpc_request(
ERROR 08-30 22:11:29 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 266, in _send_one_way_rpc_request
ERROR 08-30 22:11:29 client.py:412]     raise response
ERROR 08-30 22:11:29 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f55771d0ca0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 754, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 774, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 295, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 75, in app
    await response(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
ERROR 08-30 22:11:29 client.py:265] Got Unhealthy response from RPC Server
ERROR 08-30 22:11:29 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 08-30 22:11:29 client.py:412] Traceback (most recent call last):
ERROR 08-30 22:11:29 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 08-30 22:11:29 client.py:412]     await self.check_health(socket=socket)
ERROR 08-30 22:11:29 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 08-30 22:11:29 client.py:412]     await self._send_one_way_rpc_request(
ERROR 08-30 22:11:29 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 266, in _send_one_way_rpc_request
ERROR 08-30 22:11:29 client.py:412]     raise response
ERROR 08-30 22:11:29 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 555, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f557ebc4580


@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Aug 31, 2024

The source of AssertionError: expected running sequences is that aborting a request is not yet supported with multi-step scheduling. Multi-step scheduling is a new feature we are still working on, so I would not yet recommend using it in production until the feature is finalized. The tracking issue for multi-step scheduling development is here:
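
For reference, the assertion in the traceback comes from the multi-step output processor. A simplified sketch of the failing path (paraphrased from the v0.5.5 source, not a verbatim copy):

```python
from vllm.sequence import SequenceStatus

def process_outputs(seq_group, outputs):
    # Multi-step scheduling buffers the outputs of several model steps and
    # applies them to the sequence group here in one go. If the request was
    # aborted in between (e.g. the client disconnected), the scheduler has
    # already moved the group's sequences out of RUNNING, so this lookup
    # comes back empty.
    seqs = seq_group.get_seqs(status=SequenceStatus.RUNNING)
    assert seqs, "expected running sequences"  # multi_step.py:73 in the traceback
    for seq in seqs:
        ...  # append the buffered tokens from each step to the sequence
```

Until the abort path is handled there, any client-side cancellation while running with --num-scheduler-steps > 1 can take down the whole engine loop, which is what the AsyncEngineDeadError cascade above shows.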

@zoltan-fedor re: the illegal memory access crashes you are seeing on v0.5.4 / v0.5.5, we have seen intermittent reports of this with --enable-prefix-caching and have been working on reproducing the issue. If possible, sharing:

  • the full logs
  • anything you can about your access patterns (the client code that triggers the issue; a sketch of one such pattern follows below)

would help us a lot in reproducing and resolving it.
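
For what it's worth, one access pattern that exercises the abort path discussed above is a streaming completion the client abandons mid-generation. A minimal, hypothetical reproducer sketch (host, port, model, and prompt are placeholders) using plain requests against the OpenAI-compatible endpoint:

```python
import requests

# Hypothetical endpoint; adjust host/port to match your deployment.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        "prompt": "Write a long story about a dragon.",
        "max_tokens": 2048,
        "stream": True,
    },
    stream=True,
)

# Read a few SSE chunks, then drop the connection. The server observes the
# disconnect and aborts the request -- the code path that is not yet
# supported under multi-step scheduling.
for i, line in enumerate(resp.iter_lines()):
    if i >= 5:
        resp.close()
        break
```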

@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Aug 31, 2024

Two minutes later the next error:

    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 110, in forward
    self._init_sampling_tensors(logits, sampling_metadata)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 87, in _init_sampling_tensors
    do_min_p) = SamplingTensors.from_sampling_metadata(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/sampling_metadata.py", line 520, in from_sampling_metadata
    sampling_tensors = SamplingTensors.from_lists(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/sampling_metadata.py", line 564, in from_lists
    temperatures_t = torch.tensor(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 67, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR 08-30 19:24:28 client.py:265] Got Unhealthy response from RPC Server
ERROR 08-30 19:24:28 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 08-30 19:24:28 client.py:412] Traceback (most recent call last):
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 08-30 19:24:28 client.py:412]     await self.check_health(socket=socket)
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 08-30 19:24:28 client.py:412]     await self._send_one_way_rpc_request(
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 266, in _send_one_way_rpc_request
ERROR 08-30 19:24:28 client.py:412]     raise response
ERROR 08-30 19:24:28 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
CRITICAL 08-30 19:24:28 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO:     10.94.90.10:46306 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
ERROR 08-30 19:24:28 client.py:265] Got Unhealthy response from RPC Server
ERROR 08-30 19:24:28 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 08-30 19:24:28 client.py:412] Traceback (most recent call last):
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 08-30 19:24:28 client.py:412]     await self.check_health(socket=socket)
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 08-30 19:24:28 client.py:412]     await self._send_one_way_rpc_request(
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 266, in _send_one_way_rpc_request
ERROR 08-30 19:24:28 client.py:412]     raise response
ERROR 08-30 19:24:28 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
ERROR 08-30 19:24:28 client.py:265] Got Unhealthy response from RPC Server
ERROR 08-30 19:24:28 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 08-30 19:24:28 client.py:412] Traceback (most recent call last):
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 08-30 19:24:28 client.py:412]     await self.check_health(socket=socket)
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health
ERROR 08-30 19:24:28 client.py:412]     await self._send_one_way_rpc_request(
ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 266, in _send_one_way_rpc_request
ERROR 08-30 19:24:28 client.py:412]     raise response
ERROR 08-30 19:24:28 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
CRITICAL 08-30 19:24:28 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO:     10.94.91.181:55960 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 08-30 19:24:28 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO:     10.94.88.168:48038 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
[2024-08-30 19:24:28,686 E 65 3518] logging.cc:115: Stack trace:
 /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10b96aa) [0x7f7b2e5c76aa] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10bc932) [0x7f7b2e5ca932] ray::TerminateHandler()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f7c7bb0a37c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f7c7bb0a3e7]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f7c7bb0a36f]
/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe5ab35) [0x7f7c2e01eb35] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f7c7bb36df4]
/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f7c7cd07609] start_thread
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f7c7ce41353] __clone

*** SIGABRT received at time=1725071068 on cpu 16 ***
PC: @     0x7f7c7cd6500b  (unknown)  raise
    @     0x7f7c7cd65090       3216  (unknown)
    @     0x7f7c7bb0a37c  (unknown)  (unknown)
    @     0x7f7c7bb0a090  (unknown)  (unknown)
[2024-08-30 19:24:28,688 E 65 3518] logging.cc:440: *** SIGABRT received at time=1725071068 on cpu 16 ***
[2024-08-30 19:24:28,688 E 65 3518] logging.cc:440: PC: @     0x7f7c7cd6500b  (unknown)  raise
[2024-08-30 19:24:28,688 E 65 3518] logging.cc:440:     @     0x7f7c7cd65090       3216  (unknown)
[2024-08-30 19:24:28,688 E 65 3518] logging.cc:440:     @     0x7f7c7bb0a37c  (unknown)  (unknown)
[2024-08-30 19:24:28,689 E 65 3518] logging.cc:440:     @     0x7f7c7bb0a090  (unknown)  (unknown)
Fatal Python error: Aborted

This seems to be the same illegal memory access issue as #8025

Are these logs from v0.5.4 or v0.5.5?

Update: looks like v0.5.5, based on the line numbers in the traceback

│ ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 431, in check_health                                                                                                                                                                                                                                                 │
│ ERROR 08-30 19:24:28 client.py:412]     await self._send_one_way_rpc_request(                                                                                                                                                                                                                                                                                                                         │
│ ERROR 08-30 19:24:28 client.py:412]   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 266, in _send_one_way_rpc_request                                                                                                                                                                                                                                    │
│ ERROR 08-30 19:24:28 client.py:412]     raise response                                                                                                                                                                                                                                                                                                                                                │
│ ERROR 08-30 19:24:28 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.                                                                                                                                                                                                                                                                                    │
│ CRITICAL 08-30 19:24:28 launcher.py:82] AsyncLLMEngine has failed, terminating server process                                                                                                                                                                                                                                                                                                         │
│ INFO:     10.94.91.181:55960 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error                                                                                                                                                                                                                                                                                                              │
│ CRITICAL 08-30 19:24:28 launcher.py:82] AsyncLLMEngine has failed, terminating server process                                                                                                                                                                                                                                                                                                         │
│ INFO:     10.94.88.168:48038 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error                                                                                                                                                                                                                                                                                                              │
│ [2024-08-30 19:24:28,686 E 65 3518] logging.cc:115: Stack trace:                                                                                                                                                                                                                                                                                                                                      │
│  /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10b96aa) [0x7f7b2e5c76aa] ray::operator<<()                                                                                                                                                                                                                                                                                                │
│ /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10bc932) [0x7f7b2e5ca932] ray::TerminateHandler()                                                                                                                                                                                                                                                                                           │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f7c7bb0a37c]                                                                                                                                                                                                                                                                                                                                   │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f7c7bb0a3e7]                                                                                                                                                                                                                                                                                                                                   │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f7c7bb0a36f]                                                                                                                                                                                                                                                                                                                                   │
│ /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe5ab35) [0x7f7c2e01eb35] c10d::ProcessGroupNCCL::ncclCommWatchdog()                                                                                                                                                                                                                                                             │
│ /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f7c7bb36df4]                                                                                                                                                                                                                                                                                                                                   │
│ /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f7c7cd07609] start_thread                                                                                                                                                                                                                                                                                                                      │
│ /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f7c7ce41353] __clone                                                                                                                                                                                                                                                                                                                              │
│                                                                                                                                                                                                                                                                                                                                                                                                       │
│ *** SIGABRT received at time=1725071068 on cpu 16 ***                                                                                                                                                                                                                                                                                                                                                 │
│ PC: @     0x7f7c7cd6500b  (unknown)  raise                                                                                                                                                                                                                                                                                                                                                            │
│     @     0x7f7c7cd65090       3216  (unknown)                                                                                                                                                                                                                                                                                                                                                        │
│     @     0x7f7c7bb0a37c  (unknown)  (unknown)                                                                                                                                                                                                                                                                                                                                                        │
│     @     0x7f7c7bb0a090  (unknown)  (unknown)                                                                                                                                                                                                                                                                                                                                                        │
│ [2024-08-30 19:24:28,688 E 65 3518] logging.cc:440: *** SIGABRT received at time=1725071068 on cpu 16 ***                                                                                                                                                                                                                                                                                             │
│ [2024-08-30 19:24:28,688 E 65 3518] logging.cc:440: PC: @     0x7f7c7cd6500b  (unknown)  raise                                                                                                                                                                                                                                                                                                        │
│ [2024-08-30 19:24:28,688 E 65 3518] logging.cc:440:     @     0x7f7c7cd65090       3216  (unknown)                                                                                                                                                                                                                                                                                                    │
│ [2024-08-30 19:24:28,688 E 65 3518] logging.cc:440:     @     0x7f7c7bb0a37c  (unknown)  (unknown)                                                                                                                                                                                                                                                                                                    │
│ [2024-08-30 19:24:28,689 E 65 3518] logging.cc:440:     @     0x7f7c7bb0a090  (unknown)  (unknown)                                                                                                                                                                                                                                                                                                    │
│ Fatal Python error: Aborted         

This seems to be the same illegal memory access issue as #8025

Are these logs from v0.5.4 or v0.5.5?

Update: looks like v0.5.5 based on line numbers

Update: I am able to reproduce the issue sporadically

@robertgshaw2-neuralmagic
Collaborator

Update on AssertionError: expected running sequences:

@robertgshaw2-neuralmagic
Collaborator

@zoltan-fedor - I reproed the issue once, but have not been able to retrigger with CUDA_LAUNCH_BLOCKING=1 after running for several hours. Will leave on overnight, but any data / request pattern you can share would help a lot
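For anyone else chasing this, a minimal sketch of launching the stock image with synchronous kernel launches (the image tag and flags below are illustrative, not a confirmed repro setup):

docker run --gpus all -e CUDA_LAUNCH_BLOCKING=1 vllm/vllm-openai:v0.5.5 \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --tensor-parallel-size 4

This serializes every kernel launch, so the illegal access is reported at the offending call rather than asynchronously, at the cost of throughput.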

@zoltan-fedor
Author

@robertgshaw2-neuralmagic , sorry, we do not have a way to reproduce it either.

@robertgshaw2-neuralmagic
Collaborator

No worries. I am sure it will occur soon + I can look into it further.

@robertgshaw2-neuralmagic
Collaborator

Is there anything more you can share about your environment?

E.g. can you run collect_env.py?
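For reference, the script can be fetched and run standalone (assuming its usual location at the repo root):

wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
python collect_env.py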

@zoltan-fedor
Author

There isn't much to share.
We are using your docker image from Dockerhub: https://hub.docker.com/r/vllm/vllm-openai/tags

No modification, we run it as-is.
At the top of this ticket you can see the parameters we use and the GPUs it is running on.

@robertgshaw2-neuralmagic
Collaborator

sounds good. thanks

@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Sep 1, 2024

Indeed it reproduced. As expected, it is an illegal memory access in flash attention due to prefix caching. I will dig in further. Took about 4 hours to trigger it.

@zoltan-fedor
Author

zoltan-fedor commented Sep 3, 2024

The AssertionError: expected running sequences occurs because abort is not yet supported with multi-step scheduling. Multi-step scheduling is a new feature we are still working on - I would not yet recommend using multi-step in production use cases until the feature is finalized. The tracking issue for development of multi-step scheduling is here:

* [[Tracking issue] [Help wanted]: Multi-step scheduling follow-ups #7528](https://github.com/vllm-project/vllm/issues/7528)

@zoltan-fedor re: the issues you are seeing with illegal memory access on v0.5.4 / v0.5.5, we have seen intermittent reports of this with --enable-prefix-caching. We have been working on trying to reproduce the issue. If possible, sharing:

* the full logs

* anything you can re: access patterns (the client code which generates the issue)

would help us a lot to reproduce and resolve the issue.

@robertgshaw2-neuralmagic , I have also seen the same illegal memory access error with v0.5.4 WITHOUT the --enable-prefix-caching flag!

    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 204, in create_completion
    generator = await openai_serving_completion.create_completion(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 170, in create_completion
    async for i, res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 346, in consumer
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 337, in consumer
    raise item
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 312, in producer
    async for item in iterator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 216, in generate
    raise request_output
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[2024-09-02 12:11:54,463 E 61 3464] logging.cc:115: Stack trace:
 /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10b96aa) [0x7f3da67f26aa] ray::operator<<()
/usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0x10bc932) [0x7f3da67f5932] ray::TerminateHandler()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa37c) [0x7f3eecc6e37c]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3e7) [0x7f3eecc6e3e7]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa36f) [0x7f3eecc6e36f]
/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xe5ab35) [0x7f3e9f182b35] c10d::ProcessGroupNCCL::ncclCommWatchdog()
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6df4) [0x7f3eecc9adf4]
/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f3eede5c609] start_thread
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f3eedf96353] __clone

*** SIGABRT received at time=1725279114 on cpu 19 ***
PC: @     0x7f3eedeba00b  (unknown)  raise
    @     0x7f3eedeba090       3216  (unknown)
    @     0x7f3eecc6e37c  (unknown)  (unknown)
    @     0x7f3eecc6e090  (unknown)  (unknown)
[2024-09-02 12:11:54,465 E 61 3464] logging.cc:440: *** SIGABRT received at time=1725279114 on cpu 19 ***
[2024-09-02 12:11:54,465 E 61 3464] logging.cc:440: PC: @     0x7f3eedeba00b  (unknown)  raise
[2024-09-02 12:11:54,465 E 61 3464] logging.cc:440:     @     0x7f3eedeba090       3216  (unknown)
[2024-09-02 12:11:54,465 E 61 3464] logging.cc:440:     @     0x7f3eecc6e37c  (unknown)  (unknown)
[2024-09-02 12:11:54,466 E 61 3464] logging.cc:440:     @     0x7f3eecc6e090  (unknown)  (unknown)
Fatal Python error: Aborted
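For context on the multi-step abort issue described earlier in the thread: the assertion fires when a request is aborted, for example by a client disconnecting mid-stream, while multi-step scheduling still has steps in flight for that sequence. A hypothetical sketch of hammering that path (endpoint, model name, and counts are placeholders, not a confirmed trigger):

import asyncio

import httpx


async def abort_midstream(client: httpx.AsyncClient) -> None:
    # Start a streaming completion, read a single chunk, then drop the
    # connection; the server-side abort this causes is the code path that
    # multi-step scheduling does not yet handle.
    payload = {
        "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        "prompt": "Hello",
        "max_tokens": 512,
        "stream": True,
    }
    async with client.stream("POST", "/v1/completions", json=payload) as resp:
        async for _ in resp.aiter_lines():
            break  # disconnect while tokens are still being generated


async def main() -> None:
    async with httpx.AsyncClient(base_url="http://localhost:8000",
                                 timeout=None) as client:
        # Many concurrent aborts raise the odds of one landing mid-batch.
        await asyncio.gather(*(abort_midstream(client) for _ in range(64)))


asyncio.run(main())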

@robertgshaw2-neuralmagic
Collaborator

@robertgshaw2-neuralmagic , I have also seen the same error with v0.5.4 WITHOUT the --enable-prefix-caching flag!

So, with the following command?:

--model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
--tensor-parallel-size 4
--gpu-memory-utilization 0.95
--enforce-eager
--trust-remote-code
--worker-use-ray
--dtype half
--max-model-len 32768

@zoltan-fedor
Author

  - "32768"

The source of AssertionError: expected running sequences is due to abort not yet being supported with multi-step scheduling. multi-step scheduling is a new feature we are still working on - I would not yet recommend using multi-step in production use cases until the feature is finalized. The tracking issue for development of multi-step scheduling is here:

* [[Tracking issue] [Help wanted]: Multi-step scheduling follow-ups #7528](https://github.com/vllm-project/vllm/issues/7528)

@zoltan-fedor re: the issues you are seeing with illegal memory access on v0.5.4 / v0.5.5, we have seen intermittent reports of this with --enable-prefix-caching. We have been working on trying to reproduce the issue. If possible, sharing:

* the full logs

* anything you can re: access patterns (the client code which generates the issue)

Would help us a lot of reproduce and resolve the issue

@robertgshaw2-neuralmagic , I have also seen the same error with v0.5.4 WITHOUT the --enable-prefix-caching flag!

So, with the following command?:

--model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
--tensor-parallel-size 4
--gpu-memory-utilization 0.95
--enforce-eager
--trust-remote-code
--worker-use-ray
--dtype half
--max-model-len 32768

That is correct.
That was the command, so no --enable-prefix-caching

@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Sep 3, 2024

  - "32768"

The source of AssertionError: expected running sequences is due to abort not yet being supported with multi-step scheduling. multi-step scheduling is a new feature we are still working on - I would not yet recommend using multi-step in production use cases until the feature is finalized. The tracking issue for development of multi-step scheduling is here:

* [[Tracking issue] [Help wanted]: Multi-step scheduling follow-ups #7528](https://github.com/vllm-project/vllm/issues/7528)

@zoltan-fedor re: the issues you are seeing with illegal memory access on v0.5.4 / v0.5.5, we have seen intermittent reports of this with --enable-prefix-caching. We have been working on trying to reproduce the issue. If possible, sharing:

* the full logs

* anything you can re: access patterns (the client code which generates the issue)

Would help us a lot of reproduce and resolve the issue

@robertgshaw2-neuralmagic , I have also seen the same error with v0.5.4 WITHOUT the --enable-prefix-caching flag!

So, with the following command?:

--model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
--tensor-parallel-size 4
--gpu-memory-utilization 0.95
--enforce-eager
--trust-remote-code
--worker-use-ray
--dtype half
--max-model-len 32768

That is correct. That was the command, so no --enable-prefix-caching

Thanks. Will run this in the background today and see if I can reproduce. The illegal memory access I had before seemed to occur in attention, so perhaps it is something related to chunked prefill rather than prefix caching (since chunked prefill is on by default if max-model-len > 32k)
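As a paraphrase of the default just described (an illustration only, not the actual vLLM source; the exact threshold and whether the comparison is strict may differ by version):

# Illustration of the rule described above, not actual vLLM source.
LONG_CONTEXT_THRESHOLD = 32 * 1024  # assumed cutoff per the comment above

def chunked_prefill_enabled(max_model_len: int, explicit_flag=None) -> bool:
    # An explicit --enable-chunked-prefill setting always wins; otherwise
    # long-context models get chunked prefill automatically.
    if explicit_flag is not None:
        return bool(explicit_flag)
    return max_model_len > LONG_CONTEXT_THRESHOLD

Note that --max-model-len 32768 sits exactly at 32 * 1024, so whether it crosses the cutoff depends on the version; the resolved setting should be visible in the server startup logs.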

@liulisi16323

liulisi16323 commented Sep 5, 2024

  - "32768"

The source of AssertionError: expected running sequences is due to abort not yet being supported with multi-step scheduling. multi-step scheduling is a new feature we are still working on - I would not yet recommend using multi-step in production use cases until the feature is finalized. The tracking issue for development of multi-step scheduling is here:

* [[Tracking issue] [Help wanted]: Multi-step scheduling follow-ups #7528](https://github.com/vllm-project/vllm/issues/7528)

@zoltan-fedor re: the issues you are seeing with illegal memory access on v0.5.4 / v0.5.5, we have seen intermittent reports of this with --enable-prefix-caching. We have been working on trying to reproduce the issue. If possible, sharing:

* the full logs

* anything you can re: access patterns (the client code which generates the issue)

Would help us a lot of reproduce and resolve the issue

@robertgshaw2-neuralmagic , I have also seen the same error with v0.5.4 WITHOUT the --enable-prefix-caching flag!

So, with the following command?:

--model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
--tensor-parallel-size 4
--gpu-memory-utilization 0.95
--enforce-eager
--trust-remote-code
--worker-use-ray
--dtype half
--max-model-len 32768

That is correct. That was the command, so no --enable-prefix-caching

Thanks. Will run this in the background today and see if I can reproduce. The illegal memory access I had before seemed to occur in attention, so perhaps it is something related to chunked prefill rather than prefix caching (since chunked prefill is on by default if max-model-len > 32k)

I've encountered this error too.
Error message:
Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered.
It's very likely to occur under high-concurrency conditions. In my case, chunked prefill is disabled.
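Since high concurrency seems to matter, a minimal load sketch that could help others try to reproduce (endpoint, model name, and prompt are placeholders):

import asyncio

import httpx


async def one_request(client: httpx.AsyncClient, i: int) -> None:
    # Plain non-streaming completion; the point is sustained parallel load.
    payload = {
        "model": "qwen",
        "prompt": f"Request {i}: write a short story.",
        "max_tokens": 256,
    }
    resp = await client.post("/v1/completions", json=payload)
    resp.raise_for_status()


async def main(concurrency: int = 128) -> None:
    async with httpx.AsyncClient(base_url="http://localhost:8000",
                                 timeout=None) as client:
        while True:  # run until the server crashes or you stop it
            await asyncio.gather(*(one_request(client, i)
                                   for i in range(concurrency)))


asyncio.run(main())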

@robertgshaw2-neuralmagic
Collaborator

  - "32768"

The source of AssertionError: expected running sequences is due to abort not yet being supported with multi-step scheduling. multi-step scheduling is a new feature we are still working on - I would not yet recommend using multi-step in production use cases until the feature is finalized. The tracking issue for development of multi-step scheduling is here:

* [[Tracking issue] [Help wanted]: Multi-step scheduling follow-ups #7528](https://github.com/vllm-project/vllm/issues/7528)

@zoltan-fedor re: the issues you are seeing with illegal memory access on v0.5.4 / v0.5.5, we have seen intermittent reports of this with --enable-prefix-caching. We have been working on trying to reproduce the issue. If possible, sharing:

* the full logs

* anything you can re: access patterns (the client code which generates the issue)

Would help us a lot of reproduce and resolve the issue

@robertgshaw2-neuralmagic , I have also seen the same error with v0.5.4 WITHOUT the --enable-prefix-caching flag!

So, with the following command?:

--model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
--tensor-parallel-size 4
--gpu-memory-utilization 0.95
--enforce-eager
--trust-remote-code
--worker-use-ray
--dtype half
--max-model-len 32768

That is correct. That was the command, so no --enable-prefix-caching

Thanks. Will run this in the background today and see if I can reproduce. The illegal memory access I had before seemed to occur in attention, so perhaps it is something related to chunked-prefill rather than prefix caching (since chunked-prefill is on by default is max-len >32k)

I've encountered this error too.
Error message:
Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered.
It's very likely to occur under high-concurrency conditions. In my case, chunked prefill is disabled.

@liulisi16323 can you share your launch command? I ran @zoltan-fedor’s launch command for about 1 day (without prefix caching) and have not been able to trigger this issue.

@zoltan-fedor - can you share driver and CUDA version? I will try to make an env that more closely matches yours.

@zoltan-fedor
Author

zoltan-fedor commented Sep 5, 2024

@robertgshaw2-neuralmagic

can you share driver and CUDA version? I will try to make an env that more closely matches yours.

Driver Version: 535.183.01 CUDA Version: 12.4
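Both values are from the banner line of nvidia-smi, for example:

NVIDIA-SMI 535.183.01              Driver Version: 535.183.01    CUDA Version: 12.4

Note that the CUDA version printed there is the highest version the driver supports, not necessarily the toolkit version inside the container.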

@liulisi16323

  - "32768"

The source of AssertionError: expected running sequences is due to abort not yet being supported with multi-step scheduling. multi-step scheduling is a new feature we are still working on - I would not yet recommend using multi-step in production use cases until the feature is finalized. The tracking issue for development of multi-step scheduling is here:

* [[Tracking issue] [Help wanted]: Multi-step scheduling follow-ups #7528](https://github.com/vllm-project/vllm/issues/7528)

@zoltan-fedor re: the issues you are seeing with illegal memory access on v0.5.4 / v0.5.5, we have seen intermittent reports of this with --enable-prefix-caching. We have been working on trying to reproduce the issue. If possible, sharing:

* the full logs

* anything you can re: access patterns (the client code which generates the issue)

Would help us a lot of reproduce and resolve the issue

@robertgshaw2-neuralmagic , I have also seen the same error with v0.5.4 WITHOUT the --enable-prefix-caching flag!

So, with the following command?:

--model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
--tensor-parallel-size 4
--gpu-memory-utilization 0.95
--enforce-eager
--trust-remote-code
--worker-use-ray
--dtype half
--max-model-len 32768

That is correct. That was the command, so no --enable-prefix-caching

Thanks. Will run this in the background today and see if I can reproduce. The illegal memory access I had before seemed to occur in attention, so perhaps it is something related to chunked-prefill rather than prefix caching (since chunked-prefill is on by default is max-len >32k)

I've encountered this error too.
Error message:
Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered.
It's very likely to occur under high concurrency conditions. In my case, chunked-prefill is disable.

@liulisi16323 can you share your launch command? I ran @zoltan-fedor’s launch command for about 1 day (without prefix caching) and have not been able to trigger this issue

@zoltan-fedor - can you share driver and CUDA version? I will try to make an env that more closely matches yours.

NVIDIA A800, Driver Version: 525.105.17, Docker image: vllm/vllm-openai:v0.6.0, launch command: --model /Qwen2-72B-Instruct-GPTQ-Int4 --served-model-name qwen --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.65 --swap-space 0 --tensor-parallel-size 2 --enable-prefix-caching
max-model-len is left at the default 32k.

@ashgold

ashgold commented Sep 13, 2024

problem solved in v0.6.1.post1.

@zoltan-fedor
Author

Thanks @ashgold, I have upgraded to this latest version and will monitor whether the issue arises again.

@yaronr

yaronr commented Sep 15, 2024

I just encountered the same issue (I think) on 0.6.1.post2. I only see the log output below, no stack trace.

 vLLM ZMQ RPC Server was interrupted.
INFO 09-15 04:17:46 async_llm_engine.py:55] Engine is gracefully shutting down.
ERROR 09-15 04:17:49 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 215 died, exit code: -15
INFO 09-15 04:17:49 multiproc_worker_utils.py:123] Killing local vLLM worker processes
[root@llmatrix-nvda-5f57e4b2-629d-4b74-8a56-f1678de39a46 multicloud]# 
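When the engine exits "gracefully" like this with no traceback, the underlying worker error is sometimes only visible at higher verbosity. Assuming these variables still exist in your version (a hedged suggestion, not a confirmed fix), restarting with:

VLLM_LOGGING_LEVEL=DEBUG CUDA_LAUNCH_BLOCKING=1

set in the container environment may surface the actual worker failure in the logs.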

@TangJiakai

problem solved in v0.6.1.post1.

No, I still get the error CUDA error: an illegal memory access was encountered

@ashgold

ashgold commented Sep 16, 2024

problem solved in v0.6.1.post1.

No, I still get the error CUDA error: an illegal memory access was encountered

Can you share the details of your test environment and of the issue when it occurs? If possible, I would like to reproduce it.
I ran a long-running test for more than 48 hours, but no issues occurred.

@TangJiakai

@ashgold
I was executing requests concurrently on more than 3 GPU cards. It started off fine, but soon began to throw errors:

Exception in callback _log_task_completion(error_callback=<bound method...7fc9b858ee10>>)(<Task finishe...sertions.\n')>) at /data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:38
handle: <Handle _log_task_completion(error_callback=<bound method...7fc9b858ee10>>)(<Task finishe...sertions.\n')>) at /data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py:38>
Traceback (most recent call last):
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/vllm/worker/model_runner_base.py", line 112, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1579, in execute_model
    logits = self.model.compute_logits(hidden_or_intermediate_states,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 457, in compute_logits
    logits = self.logits_processor(self.lm_head, hidden_states,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/vllm/lora/layers.py", line 1211, in forward
    return type(self.base_layer).forward(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/vllm/model_executor/layers/logits_processor.py", line 72, in forward
    logits = _apply_logits_processors(logits, sampling_metadata)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/vllm/model_executor/layers/logits_processor.py", line 142, in _apply_logits_processors
    logits_row = logits_processor(past_tokens_ids,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/tangjiakai/anaconda3/envs/agentscope/lib/python3.11/site-packages/lmformatenforcer/integrations/vllm.py", line 29, in __call__
    self.mask[allowed_tokens] = 0
    ~~~~~~~~~^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
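
As the last two lines of the traceback suggest, CUDA reports illegal accesses asynchronously, so the Python frames above (the lmformatenforcer mask update) may not be where the fault actually happened. A minimal debugging sketch using the standard PyTorch environment variable; <your usual flags> is a placeholder for your real launch arguments:

# Synchronous kernel launches make the traceback point at the faulting kernel.
# This slows inference significantly, so use it only while reproducing the bug.
CUDA_LAUNCH_BLOCKING=1 python -m vllm.entrypoints.openai.api_server <your usual flags>

Rerunning the same concurrent workload under this setting should produce a trace that names the kernel that actually faulted rather than a downstream op.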

@TangJiakai

At one point, the LLM on one card crashed and exited outright, and at the same time the requests I was sending to the other LLM instances also stopped working. It's very strange.

@ashgold

ashgold commented Sep 23, 2024

@ashgold I was executing requests concurrently on more than 3 GPU cards. It started off fine, but soon began to throw errors:

This seems to be a different issue to the one I'm experiencing, and I'd suggest opening a separate bug to follow up on it.

@liulisi16323

On v0.6.1.post2 I still get the error under high-concurrency conditions:

[rank1]:[E924 15:10:22.264419715 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f620d546f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f620d4f5d10 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f620d621f08 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f620e83e3e6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f620e843600 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f620e84a2ba in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f620e84c6fc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd6df4 (0x7f625bff0df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x8609 (0x7f625d210609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f625d34a353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f620d546f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f620d4f5d10 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f620d621f08 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f620e83e3e6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7f620e843600 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7f620e84a2ba in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f620e84c6fc in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd6df4 (0x7f625bff0df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: + 0x8609 (0x7f625d210609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f625d34a353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f620d546f86 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe5aa84 (0x7f620e4d5a84 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd6df4 (0x7f625bff0df4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x8609 (0x7f625d210609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f625d34a353 in /usr/lib/x86_64-linux-gnu/libc.so.6)
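
For these watchdog crashes under high concurrency, verbose NCCL logging can at least narrow down which collective dies before the next repro. A minimal sketch; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, not vLLM-specific ones:

# Per-rank NCCL lifecycle and error logs, restricted to collective operations.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL

Relaunch the same vLLM command with these set and capture the per-rank output from just before the crash.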
