Fix -1e4 as attn mask #17306
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from ff7b92a to a6fd049.
The PyTorch implementation relies on self.device, which breaks model parallelism for big model inference, so we should avoid using it (I actually removed lots of instances where we used it recently, and will hunt down the remaining ones in another PR ;-) ).
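As an illustration only (ToyAttention and all names below are made up for this sketch, not code from the PR), here is the contrast between building a mask value on self.device and deriving it from a tensor that is already on the right device:

```python
import torch
import torch.nn as nn


class ToyAttention(nn.Module):
    """Hypothetical module, only to illustrate the device-handling discussion."""

    def __init__(self, hidden_size: int = 8):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        scores = self.proj(hidden_states)
        mask_value = torch.finfo(scores.dtype).min
        # Discouraged: building the tensor on `self.device` (a property of
        # transformers' PreTrainedModel) assumes the whole model sits on one
        # device, which no longer holds once a big model is sharded:
        #   mask_tensor = torch.tensor(mask_value, device=self.device)
        # Preferred: take dtype and device from a tensor that already lives
        # where this layer's computation happens.
        mask_tensor = torch.tensor(mask_value, dtype=scores.dtype, device=scores.device)
        keep = torch.ones_like(scores, dtype=torch.bool)
        keep[..., -1] = False  # toy rule: mask out the last feature
        return torch.where(keep, scores, mask_tensor)


x = torch.randn(2, 4, 8)
out = ToyAttention()(x)
print(out[..., -1])  # every entry is torch.finfo(torch.float32).min
```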
Generally, this looks good to me. I'd prefer, though, not to factor out a one-liner into a function (even if we have to add the one-liner 100+ times): it's not good for readability to have to jump to a separate definition just to see what a single line does.

Also, I'd advocate making three separate PRs (one for PT, one for TF, one for Flax). I think it should be both easier to maintain the PRs and to review them. A first test should then be that all slow tests pass. After that it would indeed be nice if we could run some fine-tuning for the most important models (BERT on GLUE, GPT2 on causal LM, T5 on translation maybe). Maybe it's not even necessary to verify that everything is correct with a training run if the slow tests all pass.
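As a hedged sketch of the "repeat the one-liner" suggestion (toy_attention_scores and its tensors are hypothetical, not code from this PR), the mask value is computed inline at each call site instead of through a shared helper:

```python
import torch


def toy_attention_scores(scores: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # The one-liner in question, written inline at the call site instead of
    # being factored into a shared helper the reader would have to jump to.
    mask_value = torch.finfo(scores.dtype).min  # e.g. -65504.0 for float16, instead of -1e4
    scores = scores.masked_fill(attention_mask == 0, mask_value)
    return torch.softmax(scores, dim=-1)


scores = torch.randn(1, 2, 4, 4)               # (batch, heads, query, key)
attention_mask = torch.tensor([[1, 1, 1, 0]])  # last key position is padding
out = toy_attention_scores(scores, attention_mask[:, None, None, :])
print(out[..., -1])  # the masked key gets (near-)zero attention probability
```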
Hi, regarding transformers/src/transformers/models/gpt2/modeling_gpt2.py, lines 202 to 205 in 195ef42: would this be a problem for model parallelism for big model inference? It is the weight's device (not self.device) that is used there.
It looks good in general. I have pretty much the same comments as Patrick. I would advocate doing some fine-tuning even if the slow tests pass, to make sure it doesn't break anything, especially with models like T5 which have had issues with attention_mask.
@ydshieh Using the weight device is perfectly fine, thanks for checking!
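To make that resolution concrete, a minimal sketch (hypothetical names, loosely modeled on the GPT-2 snippet referenced above, not the actual library code) of reading the device from one of the layer's own weights:

```python
import torch
import torch.nn as nn


class ToyCausalSelfAttention(nn.Module):
    def __init__(self, hidden_size: int = 8):
        super().__init__()
        self.c_attn = nn.Linear(hidden_size, hidden_size)

    def _mask_causal(self, attn_weights: torch.Tensor) -> torch.Tensor:
        seq_len = attn_weights.size(-1)
        causal_mask = torch.tril(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=attn_weights.device)
        )
        mask_value = torch.finfo(attn_weights.dtype).min
        # The device comes from a weight that takes part in this layer's
        # computation, so it stays correct when the model is sharded.
        mask_value = torch.tensor(
            mask_value, dtype=attn_weights.dtype, device=self.c_attn.weight.device
        )
        return torch.where(causal_mask, attn_weights, mask_value)


attn = ToyCausalSelfAttention()
weights = torch.randn(1, 4, 4)
print(attn._mask_causal(weights))  # upper-triangular positions are set to dtype min
```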
Cool, exciting!
Force-pushed from 40c0ce7 to 70eb792.
Hi @patrickvonplaten @patil-suraj @sgugger @LysandreJik, this PR is ready for review.
Thanks for fixing all of those!
Great work @ydshieh! Looks good to me.
Force-pushed from 797fb16 to e52f8f9.
Force-pushed from 267912c to 5dc2a3f.
* Use torch.finfo(self.dtype).min
* for GPTNeoX
* for Albert
* For Splinter
* Update src/transformers/models/data2vec/modeling_data2vec_audio.py

Co-authored-by: Patrick von Platen <[email protected]>

* fix -inf used in Bart-like models
* Fix a few remaining -inf
* more fix
* clean up
* For CLIP
* For FSMT
* clean up
* fix test
* Add dtype argument and use it for LayoutLMv3
* update FlaxLongT5Attention

Co-authored-by: ydshieh <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
What does this PR do?
Fix the issues regarding -1e4 as attention mask.

Fixes #17215, #17121, #14859
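For readers arriving from the linked issues, a minimal sketch of the pattern this PR standardizes (extend_attention_mask below is illustrative, not the library's actual get_extended_attention_mask): the additive mask is built from the dtype's minimum value instead of the hard-coded -1e4.

```python
import torch


def extend_attention_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Toy version: additive attention mask built from the dtype's minimum value."""
    # (batch, seq_len) -> (batch, 1, 1, seq_len) so it broadcasts over heads and queries
    extended = attention_mask[:, None, None, :].to(dtype)
    # 0.0 for positions to attend, dtype minimum for masked positions
    return (1.0 - extended) * torch.finfo(dtype).min


mask = torch.tensor([[1, 1, 1, 0]])
print(extend_attention_mask(mask, torch.float16))
# the masked position gets -65504.0 (float16 min) rather than -10000.0
```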