[Bug] tp=4 tp=8 no response #1755
Comments
This may be the same kind of problem as #1750. Try export NCCL_P2P_DISABLE=1. If that does not fix it, please add --log-level INFO to the launch command and paste the logs here. |
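A minimal sketch of that suggestion, reusing the launch command from the Reproduction section of this issue with --log-level INFO appended. The wrapper name launch_server is only illustrative and is not part of lmdeploy; the model path and the other flags are copied from the report.

```shell
# Disable NCCL peer-to-peer transfers, as suggested in the comment above.
export NCCL_P2P_DISABLE=1

# Hypothetical wrapper (not an lmdeploy feature) so the command is easy to re-run.
# Path and flags come from the reporter's command; only --log-level INFO is new.
launch_server() {
  CUDA_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server \
    /home/nlp/pretrain_models/Qwen2-72B-Instruct-AWQ \
    --model-name qwen \
    --server-name 0.0.0.0 \
    --server-port 23334 \
    --tp 4 \
    --cache-max-entry-count 0.1 \
    --quant-policy 4 \
    --model-format awq \
    --log-level INFO
}
```

Running launch_server in the same shell starts the server with the variable set; the INFO-level output printed to the terminal is what would be pasted into the issue.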
@
As the screenshot shows, the result is the same after adding export NCCL_P2P_DISABLE=1: the server hangs indefinitely, and the last GPU stays at 100% utilization. |
I haven't reproduced this issue. My device is A100-80G(x8) |
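I think we need to use gdb to find where the problem is.
After carrying out the steps above, a gdb.txt file will be created in the current working directory. Please upload that file to this issue.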
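The gdb steps this comment refers to are not shown in the thread. As an assumption only, one common way to get a gdb.txt containing backtraces of every thread in a hung process is sketched below. The helper name dump_backtraces and its PID argument are hypothetical, and this is not necessarily the procedure the maintainer had in mind.

```shell
# Hypothetical helper: attach to a hung process and log all thread backtraces.
# Usage: dump_backtraces <pid>
dump_backtraces() {
  # "set logging on" writes to ./gdb.txt by default; recent gdb releases
  # prefer "set logging enabled on" but still accept the older spelling.
  gdb -p "$1" -batch \
    -ex "set pagination off" \
    -ex "set logging on" \
    -ex "thread apply all bt"
}
```

The PID would be that of the hung lmdeploy process; attaching usually needs the same user or root privileges.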
|
Same error on H800*8, using llama3.1-70B-instruct. The system hangs after about 900 calls. |
cc @lzhangzz |
@lvhan028 Hi, I tried that branch last night, but it still does not work. |
Checklist
Describe the bug
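I found a problem: on an rtx4090 * 8 machine, setting --tp 8 for qwen1.5-110b-awq or --tp 4 for qwen2-72b-awq makes the server hang with no response. This kind of hang seems to happen almost whenever tensor parallelism is set to a large value.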
Reproduction
CUDA_VISIBLE_DEVICES=0,1,2,3 lmdeploy serve api_server /home/nlp/pretrain_models/Qwen2-72B-Instruct-AWQ \
  --model-name qwen \
  --server-name 0.0.0.0 \
  --server-port 23334 \
  --tp 4 \
  --cache-max-entry-count 0.1 \
  --quant-policy 4 \
  --model-format awq
Environment
Error traceback
No response