[GPT-Neo] Simplify local attention #13491
Conversation
# in the causal_mask.
causal_mask = causal_mask * attention_mask
self.register_buffer("bias", bias)
self.register_buffer("masked_bias", torch.tensor(-1e9))
This doesn't break in fp16? It's -1e4 in modeling_gpt2.py, I think.
query_length, key_length = query.size(-2), key.size(-2)
causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].bool()
attn_weights = torch.where(causal_mask, attn_weights, self.masked_bias.to(attn_weights.dtype))
Does this correctly convert to fp16? torch.tensor(-1e9).to(torch.float16) gives -inf, which could later lead to problems in fp16, no?
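For context, a quick check in plain PyTorch (purely illustrative, not part of this PR) shows why the choice of constant matters in half precision:

import torch

# fp16 tops out around ±65504, so -1e9 overflows to -inf when cast
print(torch.tensor(-1e9).to(torch.float16))  # tensor(-inf, dtype=torch.float16)
# -1e4, the constant used in modeling_gpt2.py, stays finite
print(torch.tensor(-1e4).to(torch.float16))  # tensor(-10000., dtype=torch.float16)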
-1e9 is the value used in the original codebase, and the attention weights are actually always computed in fp32:
transformers/src/transformers/models/gpt_neo/modeling_gpt_neo.py
Lines 274 to 286 in 09549aa
query = query.to(torch.float32)
key = key.to(torch.float32)
attn_weights = torch.matmul(query, key.transpose(-1, -2))
attn_weights = torch.where(causal_mask, attn_weights, masked_bias.to(attn_weights.dtype))
if attention_mask is not None:
    # Apply the attention mask
    attn_weights = attn_weights + attention_mask
attn_weights = nn.Softmax(dim=-1)(attn_weights)
attn_weights = attn_weights.to(value.dtype)
attn_weights = attn_dropout(attn_weights)
So the masked_bias is always fp32. The attn_weights are only cast back to the original dtype after softmax.

However, as the models are trained on TPU, possibly with bf16 (see #11076 (comment)), I'm not sure we can guarantee that the models will always work with fp16. See #11076.
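A minimal sketch of the reasoning above (toy shapes, illustrative only): because masking and softmax happen in fp32, the -1e9 bias never has to survive a cast to fp16; only the post-softmax probabilities are cast back.

import torch
from torch import nn

query = torch.randn(1, 1, 4, 8, dtype=torch.float16)
key = torch.randn(1, 1, 4, 8, dtype=torch.float16)
value = torch.randn(1, 1, 4, 8, dtype=torch.float16)
masked_bias = torch.tensor(-1e9)  # kept in fp32
causal_mask = torch.tril(torch.ones(4, 4, dtype=torch.bool))

# upcast before computing the scores, as in the snippet above
attn_weights = torch.matmul(query.to(torch.float32), key.to(torch.float32).transpose(-1, -2))
attn_weights = torch.where(causal_mask, attn_weights, masked_bias.to(attn_weights.dtype))
attn_weights = nn.Softmax(dim=-1)(attn_weights)  # fp32 softmax; masked positions -> ~0
attn_weights = attn_weights.to(value.dtype)      # only now cast back to fp16
print(torch.isnan(attn_weights).any())           # tensor(False)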
@@ -232,6 +230,86 @@ def create_and_check_gpt_neo_model_past(self, config, input_ids, input_mask, hea
        # test that outputs are equal for slice
        self.parent.assertTrue(torch.allclose(output_from_past_slice, output_from_no_past_slice, atol=1e-3))

    def create_and_check_gpt_neo_model_attention_mask_past(
Awesome new tests! Could we also add one or two fp16 tests, to make sure generation and the forward pass work correctly? :-)
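A sketch of what such a test could look like inside the existing model tester (the method name and exact assertions here are hypothetical, not necessarily what was merged):

def create_and_check_gpt_neo_for_causal_lm_fp16(self, config, input_ids, input_mask, *args):
    # hypothetical fp16 smoke test: run the forward pass and generation in half precision
    model = GPTNeoForCausalLM(config)
    model.to(torch_device)
    model.half()
    model.eval()

    # the forward pass should not overflow to NaN
    logits = model(input_ids, attention_mask=input_mask).logits
    self.parent.assertFalse(torch.isnan(logits).any().item())

    # greedy generation should also run without errors
    model.generate(input_ids, attention_mask=input_mask, do_sample=False, max_length=20)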
Great clean-up! I think this should solve the VRAM issue.

The only thing that would be good to check, IMO, is whether this implementation is 100% fp16 compatible (it would be great to add some tests for this as well).
Thanks for working on this! Agreed with Patrick on the FP16 tests to make sure it all works fine.
This looks good!
self.register_buffer("bias", bias) | ||
self.register_buffer("masked_bias", torch.tensor(-1e9)) |
This will be saved to the state dict. Is that intentional? If not, you can add the persistent=False flag (introduced in PyTorch 1.6).
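For reference, a minimal sketch of that suggestion, assuming PyTorch >= 1.6, which keeps both buffers out of the state dict:

self.register_buffer("bias", bias, persistent=False)
self.register_buffer("masked_bias", torch.tensor(-1e9), persistent=False)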
I had an error when running this. Need a patch for this PR.
What does this PR do?
Co-authored-by: finetuneanon [email protected]
This PR is a continuation of #11630, which simplifies GPT Neo's local attention implementation. All credit to @finetuneanon for finding this issue, providing the fix, and giving a detailed explanation. Thanks a lot for working on this :)
The issue is described in #11320 and performance evaluation results are available here:
#12106 (comment)
This PR does some cleanup and updates the tests on top of @finetuneanon's changes.
Fixes #11320, Fixes #11787, Fixes #12964, Fixes #11096