qwen_1.8b SFT training hangs indefinitely and resource monitoring shows 0% GPU utilization. What could be causing this? #2672

Closed
1 task done
Julylmm opened this issue Mar 2, 2024 · 4 comments
Labels
solved This problem has been already solved

Comments

@Julylmm
Julylmm commented Mar 2, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

No error is reported, but the training process hangs and makes no progress.
image

Expected behavior

No response

System Info

No response

Others

No response

@hiyouga
Owner

hiyouga commented Mar 2, 2024

Try #1683

@hiyouga hiyouga added the pending This problem is yet to be addressed label Mar 2, 2024
@Julylmm
Author

Julylmm commented Mar 4, 2024

> Try #1683

Adding export NCCL_P2P_LEVEL=NVL to the script still does not help.

@Julylmm
Author

Julylmm commented Mar 4, 2024

> > Try #1683
>
> Adding export NCCL_P2P_LEVEL=NVL to the script still does not help.

Switching to a newer CUDA version got training running. Solved.
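For anyone hitting the same hang, it may help to record which CUDA versions are in play before and after the upgrade. The thread does not say which version fixed it. What follows is only a minimal sketch: the helper names `parse_driver_cuda_version`, `nccl_debug_env`, and `report` are hypothetical, and it assumes `nvidia-smi` prints its usual "CUDA Version: X.Y" header field. `NCCL_DEBUG=INFO` is a standard NCCL variable that makes communication setup visible in the logs, which can show where a multi-GPU run stalls.

```python
import os
import re
import shutil
import subprocess


def parse_driver_cuda_version(smi_output):
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_output)
    return match.group(1) if match else None


def nccl_debug_env(base=None):
    """Return a copy of the environment with NCCL logging enabled (hypothetical helper)."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"      # verbose NCCL initialization and transport logs
    env["NCCL_P2P_LEVEL"] = "NVL"   # the setting already tried in this thread
    return env


def report():
    """Print the driver-side and PyTorch-side CUDA versions where available."""
    if shutil.which("nvidia-smi"):
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
        # Highest CUDA version the installed driver supports
        print("driver CUDA:", parse_driver_cuda_version(out))
    else:
        print("nvidia-smi not found")
    try:
        import torch  # optional; only inspected if installed
        print("torch CUDA build:", torch.version.cuda)
        print("CUDA available:", torch.cuda.is_available())
    except ImportError:
        print("torch not installed")


if __name__ == "__main__":
    report()
```

The dictionary returned by `nccl_debug_env()` can be passed as `env=` to `subprocess.run` when launching the training script, so the extra logging applies to that run only.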

@Julylmm Julylmm closed this as completed Mar 4, 2024
@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Mar 4, 2024
@chensongcan

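> > > Try #1683
> >
> > Adding export NCCL_P2P_LEVEL=NVL to the script still does not help.
>
> Switching to a newer CUDA version got training running. Solved.

Which CUDA version did you use?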
