There is an issue in the implementation of DistributedDataParallel that triggers a deadlock of processes.
Specifically, the method flat_dist_call contains a for loop over a dictionary whose body calls collective operations (such as broadcast). Since the iteration order of the dictionary's keys is not guaranteed to be the same in every process, the processes can issue these collective calls in different orders. The non-matching calls then leave the processes waiting on each other, which results in a deadlock.
I have fixed this issue and created a pull request.
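
To illustrate the ordering problem described above, here is a minimal sketch in plain Python. It is not the code from the pull request, and it does not call any distributed backend. The helper collective_call_order and the bucket contents are hypothetical, and the two dictionaries are filled in different orders by hand to stand in for two processes whose key order differs. Iterating over the sorted keys is one possible way to make every process issue its collective calls in the same order.

```python
def collective_call_order(buckets, deterministic=True):
    """Return the order in which a process would issue one collective call per bucket.

    Hypothetical helper: a real implementation would broadcast or all-reduce
    each bucket at this point instead of only recording the key.
    """
    # Sorting the keys gives every process the same order, regardless of
    # how its dictionary happens to be laid out.
    keys = sorted(buckets) if deterministic else list(buckets)
    return keys


# Two processes hold the same buckets, but their dictionaries iterate in
# different orders (simulated here with different insertion orders).
rank0_buckets = {"torch.FloatTensor": ["w1", "w2"], "torch.HalfTensor": ["w3"]}
rank1_buckets = {"torch.HalfTensor": ["w3"], "torch.FloatTensor": ["w1", "w2"]}

# Plain dictionary iteration: the call sequences differ, so the collectives
# would not match up across processes.
mismatched = (
    collective_call_order(rank0_buckets, deterministic=False)
    != collective_call_order(rank1_buckets, deterministic=False)
)

# Sorted iteration: both processes agree on the sequence.
matched = collective_call_order(rank0_buckets) == collective_call_order(rank1_buckets)

print("plain iteration mismatched:", mismatched)
print("sorted iteration matched:", matched)
```

Sorting requires keys that can be compared with each other, which is why the sketch uses string keys. If the keys are not comparable, sorting by a stable string representation of each key would be an alternative.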