For the PyTorch engine (#1120).

Multi-GPU training. Multi-node or single node. Async vs. sync. It should also work well with consumer GPUs, also in the multi-node case. Currently: synchronized distributed data-parallel training via `DistributedDataParallel`, with our own dataset sampling logic, extending `Dataset._get_default_random_seed_offset`.

This issue is to track the progress and discuss anything relevant.
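For reference, a minimal sketch of this kind of synchronized data-parallel setup (not the actual RETURNN code; the placeholder model and the rank-based `seed_offset_sketch` are illustrative assumptions about how per-worker sampling could differ):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel


def seed_offset_sketch() -> int:
    # Hypothetical analogue of Dataset._get_default_random_seed_offset:
    # give every worker a different offset so each rank samples different data.
    return dist.get_rank() if dist.is_initialized() else 0


def main():
    # One process per GPU; torchrun sets RANK / WORLD_SIZE / LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(64, 10).cuda()  # placeholder model
    ddp_model = DistributedDataParallel(model, device_ids=[local_rank])

    # ... build the dataset/sampler using seed_offset_sketch(), then the usual loop:
    x = torch.randn(8, 64, device="cuda")
    loss = ddp_model(x).sum()  # goes through DDP's forward
    loss.backward()            # gradients are all-reduced across workers


if __name__ == "__main__":
    main()
```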
One issue was that when wrapping the module with `DistributedDataParallel`, it is important that its `forward` function is actually called, because that is where DDP prepares the gradient synchronization for the following backward pass. This is not really compatible with our `train_step` function, where we want to pass the original (unwrapped) module. We solve this by setting up the right context ourselves, i.e. doing what `DistributedDataParallel.forward` does internally.
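The comment above does not spell out the exact mechanism, so here is a hedged alternative sketch rather than the actual implementation: one common way to keep `DistributedDataParallel.forward` in the loop while still handing `train_step` the original module is a thin wrapper module. `_TrainStepWrapper` and the `train_step(module, **batch) -> loss` signature are illustrative assumptions:

```python
from torch import nn
from torch.nn.parallel import DistributedDataParallel


class _TrainStepWrapper(nn.Module):
    """Hypothetical helper: DDP.forward() ends up calling train_step on the
    original (unwrapped) module, so DDP's gradient-sync machinery still runs."""

    def __init__(self, module: nn.Module, train_step):
        super().__init__()
        self.module = module           # the original user module
        self._train_step = train_step  # assumed signature: train_step(module, **batch) -> loss

    def forward(self, **batch):
        return self._train_step(self.module, **batch)


# Usage sketch (process group assumed to be initialized already):
#   ddp = DistributedDataParallel(_TrainStepWrapper(model, train_step), device_ids=[local_rank])
#   loss = ddp(**batch)   # goes through DDP.forward -> our train_step
#   loss.backward()       # gradients are all-reduced across workers
```

The approach described in the comment avoids such a wrapper by instead replicating the relevant parts of `DistributedDataParallel.forward` as a context around the plain `train_step` call.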