In SDPA, the dropout applied right after softmax is parallelized across heads. For dropout to work correctly, each shard must be randomized with a different pattern. One possible solution, proposed in Section B.2 of https://arxiv.org/pdf/1909.08053, is to maintain a separate random number generator for dropout within model-parallel regions, uniquely seeded for each model-parallel worker.
To work around this limitation, our tests currently set the dropout probability in SDPA to 0:
Fuser/tests/cpp/test_multidevice_transformer.cpp, line 27 at commit 70263ee