
PyTorch multi-GPU training, multi-node or single node, async or synced #1332

Closed
albertz opened this issue May 19, 2023 · 2 comments · Fixed by #1335
albertz commented May 19, 2023

For the PyTorch engine (#1120).

Multi-GPU training, either multi-node or single-node, and either async or synced. It should also work well with consumer GPUs, including in the multi-node case. Currently we do distributed data-parallel synchronized training, using DistributedDataParallel, together with our own dataset sampling logic, extending Dataset._get_default_random_seed_offset.
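
For illustration, a minimal sketch (not RETURNN code) of the general idea of offsetting the dataset random seed by the distributed rank, so that each worker samples its data differently. The helper name and `base_seed` parameter are hypothetical; `Dataset._get_default_random_seed_offset` itself is RETURNN-internal.

```python
# Hypothetical sketch: derive a per-rank random seed offset for dataset sampling,
# so that each distributed worker shuffles/samples its data differently.
import torch.distributed as dist

def get_random_seed_offset(base_seed: int = 42) -> int:
    """Offset the base seed by the distributed rank, if distributed is initialized."""
    if dist.is_available() and dist.is_initialized():
        return base_seed + dist.get_rank()
    return base_seed
```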

This issue is to track progress and to discuss anything relevant.

albertz commented May 19, 2023

So far we mostly follow the PyTorch DDP tutorial, i.e. synchronous training where the gradients are synchronized (all-reduced) across workers.
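
For reference, a minimal sketch of the pattern from the PyTorch DDP tutorial (illustrative only, not our actual engine code; assumes launching via `torchrun`, which sets `LOCAL_RANK`):

```python
# Minimal DDP training loop following the PyTorch DDP tutorial pattern.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model: torch.nn.Module, data_loader, num_epochs: int = 1):
    dist.init_process_group(backend="nccl")  # "gloo" for CPU-only
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    ddp_model = DDP(model.to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _epoch in range(num_epochs):
        for inputs, targets in data_loader:
            inputs, targets = inputs.to(local_rank), targets.to(local_rank)
            optimizer.zero_grad()
            outputs = ddp_model(inputs)  # DDP forward sets up gradient sync
            loss = loss_fn(outputs, targets)
            loss.backward()  # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()
```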

albertz commented May 19, 2023

One issue was that when wrapping the module with DistributedDataParallel, it is important that its forward function gets called (so that DDP can set up the gradient synchronization). But this was not really compatible with our train_step function, where we want to pass the original (unwrapped) module. We solve this by setting up the right context ourselves, i.e. what DistributedDataParallel.forward does internally.
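
As an illustration of the problem (not the actual solution described above, which replicates the context setup of DistributedDataParallel.forward), one public-API way to keep DDP's own forward as the entry point while still handing the original module to train_step would be to wrap the train step in a small module. All names here are hypothetical:

```python
# Hypothetical sketch, not RETURNN's actual solution: wrap the train step in a
# small module so that DDP's forward() is still called, while the user code
# receives the original (unwrapped) module.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

class _WrappedTrainStep(torch.nn.Module):
    def __init__(self, module: torch.nn.Module, train_step):
        super().__init__()
        self.module = module            # registered, so DDP sees all parameters
        self._train_step = train_step   # plain attribute, not a submodule

    def forward(self, *args, **kwargs):
        # DDP.forward wraps this call and uses its outputs to set up
        # gradient synchronization for the following backward pass.
        return self._train_step(*args, model=self.module, **kwargs)

# Usage (illustrative):
#   wrapped = DDP(_WrappedTrainStep(model, train_step), device_ids=[local_rank])
#   loss = wrapped(extern_data=batch)
```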
