In SDPA, the dropout applied right after softmax is parallelized across heads. For dropout to work correctly, each shard must be randomized with a different pattern. One possible solution, proposed in Section B.2 of https://arxiv.org/pdf/1909.08053, is to maintain a separate random number generator for dropout within model-parallel regions, uniquely seeded for each model-parallel worker.
To work around this limitation, our tests currently set the dropout probability in SDPA to 0:
Fuser/tests/cpp/test_multidevice_transformer.cpp, line 27 at commit 70263ee