
PyTorch multi-GPU training, multi-node or single node, async or synced #1332

Closed
albertz opened this issue May 19, 2023 · 2 comments · Fixed by #1335
albertz commented May 19, 2023

For the PyTorch engine (#1120).

Multi-GPU training, either multi-node or single-node, and either async or synced. It should also work well with consumer GPUs, including in the multi-node case. Currently we do distributed data-parallel synchronized training, using DistributedDataParallel, together with our own dataset sampling logic, extending Dataset._get_default_random_seed_offset.
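
For illustration, a minimal sketch (not RETURNN code) of the general idea of offsetting the dataset random seed by the distributed rank, so that each worker samples its data differently. The helper name and `base_seed` parameter are hypothetical; `Dataset._get_default_random_seed_offset` itself is RETURNN-internal.

```python
# Hypothetical sketch: derive a per-rank random seed offset for dataset sampling,
# so that each distributed worker shuffles/samples its data differently.
import torch.distributed as dist

def get_random_seed_offset(base_seed: int = 42) -> int:
    """Offset the base seed by the distributed rank, if distributed is initialized."""
    if dist.is_available() and dist.is_initialized():
        return base_seed + dist.get_rank()
    return base_seed
```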

This issue is to track progress and to discuss anything relevant.

albertz commented May 19, 2023

So far we mostly follow the PyTorch DDP tutorial, i.e. synchronous training where the gradients are synchronized (all-reduced) across workers.
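
For reference, a minimal sketch of the pattern from the PyTorch DDP tutorial (illustrative only, not our actual engine code; assumes launching via `torchrun`, which sets `LOCAL_RANK`):

```python
# Minimal DDP training loop following the PyTorch DDP tutorial pattern.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model: torch.nn.Module, data_loader, num_epochs: int = 1):
    dist.init_process_group(backend="nccl")  # "gloo" for CPU-only
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    ddp_model = DDP(model.to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _epoch in range(num_epochs):
        for inputs, targets in data_loader:
            inputs, targets = inputs.to(local_rank), targets.to(local_rank)
            optimizer.zero_grad()
            outputs = ddp_model(inputs)  # DDP forward sets up gradient sync
            loss = loss_fn(outputs, targets)
            loss.backward()  # gradients are all-reduced across ranks here
            optimizer.step()

    dist.destroy_process_group()
```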

albertz commented May 19, 2023

One issue was that when wrapping the module with DistributedDataParallel, it is important that its forward function gets called (so that DDP can set up the gradient synchronization). But this was not really compatible with our train_step function, where we want to pass the original (unwrapped) module. We solve this by setting up the right context ourselves, i.e. what DistributedDataParallel.forward does internally.
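
As an illustration of the problem (not the actual solution described above, which replicates the context setup of DistributedDataParallel.forward), one public-API way to keep DDP's own forward as the entry point while still handing the original module to train_step would be to wrap the train step in a small module. All names here are hypothetical:

```python
# Hypothetical sketch, not RETURNN's actual solution: wrap the train step in a
# small module so that DDP's forward() is still called, while the user code
# receives the original (unwrapped) module.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

class _WrappedTrainStep(torch.nn.Module):
    def __init__(self, module: torch.nn.Module, train_step):
        super().__init__()
        self.module = module            # registered, so DDP sees all parameters
        self._train_step = train_step   # plain attribute, not a submodule

    def forward(self, *args, **kwargs):
        # DDP.forward wraps this call and uses its outputs to set up
        # gradient synchronization for the following backward pass.
        return self._train_step(*args, model=self.module, **kwargs)

# Usage (illustrative):
#   wrapped = DDP(_WrappedTrainStep(model, train_step), device_ids=[local_rank])
#   loss = wrapped(extern_data=batch)
```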
