About GPU utilization #965

Closed
Hzj199 opened this issue Nov 22, 2023 · 3 comments

Hzj199 commented Nov 22, 2023

In multi-machine, multi-GPU training, the InfiniBand (IB) network shows no traffic, which suggests that the cross-process gather in avg_loss = accelerator.gather(loss).mean() may not actually be executed.
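
For context, here is a minimal sketch of what a gather-then-mean step computes, simulated in plain Python with no real process group. The function `simulated_gather` and the sample loss values are hypothetical stand-ins, not Accelerate's API. The point is only that every rank needs the values held by the other ranks, so in a real multi-node run this step would be expected to move data over the interconnect.

```python
# Plain-Python simulation: no torch, no accelerate, no real communication.
# simulated_gather is a hypothetical stand-in for a collective gather.

def simulated_gather(per_rank_values):
    """Return, for each rank, a copy of the values held by all ranks."""
    gathered = list(per_rank_values)
    return [list(gathered) for _ in per_rank_values]

def mean(values):
    return sum(values) / len(values)

# Made-up per-rank losses for a 4-process run.
local_losses = [0.5, 1.0, 1.5, 2.0]

# After the gather, every rank holds all four losses ...
views = simulated_gather(local_losses)

# ... so every rank computes the same global average.
avg_losses = [mean(view) for view in views]
print(avg_losses)
```

Because each rank ends up with the same averaged value, this pattern is typically used for logging a global loss rather than for the backward pass itself.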

Author

Hzj199 commented Nov 23, 2023

Why is it necessary to use transform_models_if_DDP(models) after accelerator.prepare?

Contributor

Isotr0py commented Dec 6, 2023

The transform_models_if_DDP(models) call was added by mistake, and we had not noticed it until now. It will be removed in the new PR because it breaks gradient synchronization.

I think it may also break multi-machine communication. The new PR may fix that as well.

I only have two GPUs on one machine, so I can't test multi-machine training. Could you check whether the update works in a multi-machine setup?
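
To illustrate why broken gradient synchronization matters, here is a rough sketch in plain Python. It does not use PyTorch or the project's code, and every name in it (`local_grad`, `all_reduce_mean`, `train`) is hypothetical. The thread does not show how transform_models_if_DDP breaks synchronization; one plausible mechanism, stated here purely as an assumption, is that the training loop ends up calling the inner model directly so the data-parallel wrapper's gradient averaging never runs. The simulation compares two workers that average their gradients each step against two workers that do not.

```python
# Plain-Python simulation of data-parallel SGD on two workers.
# Model: y_hat = w * x, loss = (w * x - y) ** 2 / 2, gradient = (w * x - y) * x.
# All names are hypothetical; nothing here is the project's or PyTorch's API.

def local_grad(w, x, y):
    """Gradient of the squared-error loss on one worker's local sample."""
    return (w * x - y) * x

def all_reduce_mean(grads):
    """Stand-in for the gradient averaging a data-parallel wrapper performs."""
    avg = sum(grads) / len(grads)
    return [avg for _ in grads]

def train(sync, steps=3, lr=0.1):
    weights = [0.0, 0.0]               # one replica per worker, same start
    shards = [(1.0, 2.0), (2.0, 2.0)]  # each worker sees a different (x, y)
    for _ in range(steps):
        grads = [local_grad(w, x, y) for w, (x, y) in zip(weights, shards)]
        if sync:
            grads = all_reduce_mean(grads)  # the only cross-worker step
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

synced = train(sync=True)     # replicas stay identical
unsynced = train(sync=False)  # replicas drift apart
print(synced, unsynced)
```

In the unsynchronized run no step exchanges data between workers, which would be consistent with an idle interconnect; each worker is effectively training its own separate model.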

Author

Hzj199 commented Dec 7, 2023

I removed transform_models_if_DDP(models), and training now runs normally on four machines with 8 GPUs, with GPU utilization of around 95%.

Hzj199 closed this as completed Dec 7, 2023