[Bug]: VllmWorkerProcess does not exit correctly when TP > 1 #6219
Comments
I also encountered the same problem when profiling. After some investigation, here is what I found:
I believe the original line of code was meant to log a worker process that terminated because of an error while the server was running normally. In this case (running offline inference), however, the termination is part of normal garbage collection at exit. This is not a bug, but rather inappropriate error logging: you can check the exit code of the script, which is 0.
I found a temporary workaround for nsys profiling that meets my requirements: only trace CUDA events in …
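To illustrate the exit-code check, here is a minimal sketch that is not vLLM code. A hypothetical stand-in script (`CHILD_SCRIPT`) starts a daemonic worker with Python's `multiprocessing` and returns without stopping it. At interpreter exit, `multiprocessing` terminates the leftover daemonic child, yet the script itself still exits with code 0. The helper `run_script` is also hypothetical, and the `fork` start method assumes Linux or another POSIX system.

```python
import subprocess
import sys
import textwrap

# Hypothetical stand-in for an offline-inference script: it starts a daemonic
# worker process and returns without stopping it. At interpreter exit,
# multiprocessing terminates the still-running daemonic child (SIGTERM on POSIX).
CHILD_SCRIPT = textwrap.dedent(
    """
    import multiprocessing as mp
    import time

    def worker():
        while True:
            time.sleep(0.1)

    if __name__ == "__main__":
        ctx = mp.get_context("fork")  # POSIX-only start method
        ctx.Process(target=worker, daemon=True).start()
        print("main work done")
    """
)


def run_script(source: str) -> int:
    """Run Python source in a fresh interpreter and return its exit code."""
    completed = subprocess.run([sys.executable, "-c", source], timeout=60)
    return completed.returncode


if __name__ == "__main__":
    print("script exit code:", run_script(CHILD_SCRIPT))  # expected: 0
```

The same check works from a shell with `echo $?` right after the script finishes.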
Thanks @cermeng @LiuXiaoxuanPKU, this should hopefully be addressed by #7041.
Does not appear to be fixed by #8492.
Your current environment
🐛 Describe the bug
Reproduce:
Error message:
Error highlight:
ERROR 07-08 11:13:55 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 2481703 died, exit code: -15
The error may be harmless in normal cases, since it is part of the exit logic. However, it matters when profiling with nsys: nsys will hang if the exit logic is incorrect.
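The `exit code: -15` in the log follows Python's `multiprocessing` convention: a negative `exitcode` of -N means the process was killed by signal N, and 15 is SIGTERM, which `Process.terminate()` sends on POSIX. That is consistent with the worker being asked to stop during shutdown rather than crashing, though this sketch does not show which vLLM code path sends the signal. It assumes a POSIX system and uses a hypothetical busy-loop `worker` and helper names (`terminate_and_report`, `describe_exit`) that are not vLLM's.

```python
import multiprocessing as mp
import signal
import time


def worker() -> None:
    # Hypothetical stand-in for a long-running worker loop (not vLLM code).
    while True:
        time.sleep(0.1)


def terminate_and_report() -> int:
    """Start a worker, stop it with terminate(), and return its exitcode."""
    ctx = mp.get_context("fork")  # POSIX-only start method
    proc = ctx.Process(target=worker, daemon=True)
    proc.start()
    proc.terminate()  # sends SIGTERM on POSIX
    proc.join()
    return proc.exitcode


def describe_exit(exitcode: int) -> str:
    """Translate a multiprocessing exitcode into a readable description."""
    if exitcode == 0:
        return "clean exit"
    if exitcode < 0:
        # A negative exitcode -N means the process was killed by signal N.
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"error exit ({exitcode})"


if __name__ == "__main__":
    code = terminate_and_report()
    print(code, describe_exit(code))  # on Linux prints: -15 killed by SIGTERM
```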
@njhill any context here?