
Sharded dropout. #3123

Open
wujingyue opened this issue Oct 7, 2024 · 0 comments

In SDPA, the dropout applied right after softmax runs head-parallel. For dropout to work correctly, each shard must draw a different random mask; otherwise the shards repeat the same pattern. One possible solution, proposed in Section B.2 of https://arxiv.org/pdf/1909.08053, is to maintain a separate random number generator for dropout within model-parallel regions, seeded uniquely for each model-parallel worker.
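
A minimal sketch of what that scheme could look like, independent of nvFuser's actual RNG plumbing (which manages philox seeds/offsets): `base_seed` and `tp_rank` are hypothetical stand-ins for a seed shared across workers and the worker's model-parallel rank. The rank offset is what makes each shard's mask distinct.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Apply inverted dropout to one shard's activations, using an RNG that is
// uniquely seeded per model-parallel worker (base seed + rank offset).
std::vector<float> shardedDropout(
    const std::vector<float>& x,
    double prob,
    uint64_t base_seed,
    int tp_rank) {
  // Each worker gets its own seed, so different shards draw different masks.
  std::mt19937_64 gen(base_seed + static_cast<uint64_t>(tp_rank) + 1);
  std::bernoulli_distribution drop(prob);

  const float scale = 1.0f / (1.0f - static_cast<float>(prob));
  std::vector<float> out(x.size());
  for (size_t i = 0; i < x.size(); ++i) {
    // Zero dropped elements; rescale survivors so the expectation is unchanged.
    out[i] = drop(gen) ? 0.0f : x[i] * scale;
  }
  return out;
}
```

Operations outside model-parallel regions would keep using the default generator, so their randomness stays identical across workers; only the sharded dropout draws from the per-worker generator.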

To work around this limitation, our tests currently set the dropout probability in SDPA to 0:

constexpr double kDropoutProb = 0.1, kParamScale = 0.02, kSdpaProb = 0.0,
