Question about multi-GPU training #96
Hi, I've run into a lot of strange problems with both single-GPU and multi-GPU training. For example:

For single-GPU training, I had to modify the code as shown in a screenshot (not included here) before training would run. But if I use that same modified code for multi-GPU training, it errors out, telling me the tensors are not all on the same device. What's going on here? Is this a device_map problem? Also, if I run multi-GPU training using the commented-out code path from the screenshot instead, I get CUDA out of memory. I've already reduced batch_size to 16 and it still runs out of memory, even though a single GPU handles a batch size of 128 without any problem.
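For anyone hitting the same symptoms, here is a minimal sketch of a standard DistributedDataParallel (DDP) setup that keeps everything on one device per process. The tiny Linear model, the random data, and the file name train_ddp.py are placeholders, since the original code is only visible in the missing screenshots. The key point: each worker reads LOCAL_RANK (set by torchrun), pins itself to that GPU, and creates both the model and every batch on that same device; mixing CPU tensors with tensors on different CUDA devices is exactly what produces the "not on the same device" error.

```python
# train_ddp.py (placeholder name) -- launch with, e.g.:
#   torchrun --nproc_per_node=3 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for each worker
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    # placeholder model; a real model would follow the same pattern:
    # build it, move it to this rank's device, then wrap it in DDP
    model = torch.nn.Linear(16, 2).to(device)
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(10):
        # inputs are created on (or moved to) the same device as the model
        x = torch.randn(8, 16, device=device)
        y = torch.randint(0, 2, (8,), device=device)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On the out-of-memory question, one hedged observation: DDP keeps a full replica of the model on every GPU, so per-GPU memory use does not drop as you add GPUs. By contrast, device_map="auto" shards a single copy of the model across the cards. If a model only fits when sharded, launching multiple DDP processes can OOM even at a small batch size, and combining device_map sharding with a multi-process launch (each process trying to spread its own copy over all the cards) is a common source of both cross-device errors and OOM. Whether that is what happened here can't be confirmed without the code in the screenshots.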
Comments

I'm getting the same error: torch.distributed.elastic.multiprocessing.errors.ChildFailedError
@Tian14267 How did you solve it?
I hit the same problem on three 3090s. How can it be resolved?
torch.distributed.elastic.multiprocessing.errors.ChildFailedError
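A debugging note that may help everyone here: ChildFailedError is only the elastic launcher's wrapper around whatever exception killed a worker process, so the real cause is the worker's own traceback, often further up in the log. Decorating the training entry point with the record decorator from torch.distributed.elastic makes torchrun capture and report that underlying traceback. A minimal sketch, with train.py as a hypothetical entry point:

```python
# train.py (hypothetical) -- launch with, e.g.:
#   torchrun --nproc_per_node=3 train.py
from torch.distributed.elastic.multiprocessing.errors import record

@record  # have the elastic launcher record this worker's real traceback
def main():
    # real training code would go here; any exception raised in the worker
    # is now written to the error file and surfaced, instead of only the
    # opaque ChildFailedError summary from the launcher
    raise RuntimeError("placeholder failure to demonstrate error reporting")

if __name__ == "__main__":
    main()
```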