Fix -1e4 as attn mask #17306
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from ff7b92a to a6fd049.
The PyTorch implementation relies on self.device, which breaks model parallelism for big model inference, so we should avoid using it (I actually removed lots of instances where we used it recently, and will hunt down the remaining ones in another PR ;-) ).
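As an illustration only (ToyAttention and all names below are made up for this sketch, not code from the PR), here is the contrast between building a mask value on self.device and deriving it from a tensor that is already on the right device:

```python
import torch
import torch.nn as nn


class ToyAttention(nn.Module):
    """Hypothetical module, only to illustrate the device-handling discussion."""

    def __init__(self, hidden_size: int = 8):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        scores = self.proj(hidden_states)
        mask_value = torch.finfo(scores.dtype).min
        # Discouraged: building the tensor on `self.device` (a property of
        # transformers' PreTrainedModel) assumes the whole model sits on one
        # device, which no longer holds once a big model is sharded:
        #   mask_tensor = torch.tensor(mask_value, device=self.device)
        # Preferred: take dtype and device from a tensor that already lives
        # where this layer's computation happens.
        mask_tensor = torch.tensor(mask_value, dtype=scores.dtype, device=scores.device)
        keep = torch.ones_like(scores, dtype=torch.bool)
        keep[..., -1] = False  # toy rule: mask out the last feature
        return torch.where(keep, scores, mask_tensor)


x = torch.randn(2, 4, 8)
out = ToyAttention()(x)
print(out[..., -1])  # every entry is torch.finfo(torch.float32).min
```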
Generally, this looks good to me. I'd prefer, though, not to factor out a one-liner into a function (even if we have to add the one-liner 100+ times): it's not good for readability to have to jump to a separate definition just to see what a single line does.

Also, I'd advocate making three separate PRs (one for PT, one for TF, one for Flax). I think it should be both easier to maintain the PRs and to review them. A first test should then be that all slow tests pass. After that it would indeed be nice if we could run some fine-tuning for the most important models (BERT on GLUE, GPT2 on causal LM, T5 on translation maybe). Maybe it's not even necessary to verify that everything is correct with a training run if the slow tests all pass.
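As a hedged sketch of the "repeat the one-liner" suggestion (toy_attention_scores and its tensors are hypothetical, not code from this PR), the mask value is computed inline at each call site instead of through a shared helper:

```python
import torch


def toy_attention_scores(scores: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # The one-liner in question, written inline at the call site instead of
    # being factored into a shared helper the reader would have to jump to.
    mask_value = torch.finfo(scores.dtype).min  # e.g. -65504.0 for float16, instead of -1e4
    scores = scores.masked_fill(attention_mask == 0, mask_value)
    return torch.softmax(scores, dim=-1)


scores = torch.randn(1, 2, 4, 4)               # (batch, heads, query, key)
attention_mask = torch.tensor([[1, 1, 1, 0]])  # last key position is padding
out = toy_attention_scores(scores, attention_mask[:, None, None, :])
print(out[..., -1])  # the masked key gets (near-)zero attention probability
```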
Hi, regarding transformers/src/transformers/models/gpt2/modeling_gpt2.py, lines 202 to 205 in 195ef42: would this be a problem for model parallelism for big model inference? It is the weight's device (not self.device) that is used there.
It looks good in general. I have pretty much the same comments as Patrick. I would advocate doing some fine-tuning even if the slow tests pass, to make sure it doesn't break anything, especially with models like T5 which have had issues with attention_mask.
@ydshieh Using the weight device is perfectly fine, thanks for checking!
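To make that resolution concrete, a minimal sketch (hypothetical names, loosely modeled on the GPT-2 snippet referenced above, not the actual library code) of reading the device from one of the layer's own weights:

```python
import torch
import torch.nn as nn


class ToyCausalSelfAttention(nn.Module):
    def __init__(self, hidden_size: int = 8):
        super().__init__()
        self.c_attn = nn.Linear(hidden_size, hidden_size)

    def _mask_causal(self, attn_weights: torch.Tensor) -> torch.Tensor:
        seq_len = attn_weights.size(-1)
        causal_mask = torch.tril(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=attn_weights.device)
        )
        mask_value = torch.finfo(attn_weights.dtype).min
        # The device comes from a weight that takes part in this layer's
        # computation, so it stays correct when the model is sharded.
        mask_value = torch.tensor(
            mask_value, dtype=attn_weights.dtype, device=self.c_attn.weight.device
        )
        return torch.where(causal_mask, attn_weights, mask_value)


attn = ToyCausalSelfAttention()
weights = torch.randn(1, 4, 4)
print(attn._mask_causal(weights))  # upper-triangular positions are set to dtype min
```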
Cool, exciting!
Force-pushed from 40c0ce7 to 70eb792.
Hi @patrickvonplaten @patil-suraj @sgugger @LysandreJik, this PR is ready for review.
Thanks for fixing all of those!
Great work @ydshieh! Looks good to me.
Force-pushed from 797fb16 to e52f8f9.
Force-pushed from 267912c to 5dc2a3f.
* Use torch.finfo(self.dtype).min
* for GPTNeoX
* for Albert
* For Splinter
* Update src/transformers/models/data2vec/modeling_data2vec_audio.py

Co-authored-by: Patrick von Platen <[email protected]>

* fix -inf used in Bart-like models
* Fix a few remaining -inf
* more fix
* clean up
* For CLIP
* For FSMT
* clean up
* fix test
* Add dtype argument and use it for LayoutLMv3
* update FlaxLongT5Attention

Co-authored-by: ydshieh <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
What does this PR do?
Fix the issues regarding -1e4 as attention mask.

Fixes #17215, #17121, #14859
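For readers arriving from the linked issues, a minimal sketch of the pattern this PR standardizes (extend_attention_mask below is illustrative, not the library's actual get_extended_attention_mask): the additive mask is built from the dtype's minimum value instead of the hard-coded -1e4.

```python
import torch


def extend_attention_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Toy version: additive attention mask built from the dtype's minimum value."""
    # (batch, seq_len) -> (batch, 1, 1, seq_len) so it broadcasts over heads and queries
    extended = attention_mask[:, None, None, :].to(dtype)
    # 0.0 for positions to attend, dtype minimum for masked positions
    return (1.0 - extended) * torch.finfo(dtype).min


mask = torch.tensor([[1, 1, 1, 0]])
print(extend_attention_mask(mask, torch.float16))
# the masked position gets -65504.0 (float16 min) rather than -10000.0
```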