[Flax] Add remat (gradient checkpointing) #17843
Conversation
```diff
-                layer_head_mask=head_mask[i] if head_mask is not None else None,
-                encoder_hidden_states=encoder_hidden_states,
-                encoder_attention_mask=encoder_attention_mask,
-                init_cache=init_cache,
-                deterministic=deterministic,
-                output_attentions=output_attentions,
+                head_mask[i] if head_mask is not None else None,
+                encoder_hidden_states,
+                encoder_attention_mask,
+                init_cache,
+                deterministic,
+                output_attentions,
```
Note: remat does not support kwargs, hence the need to change to args.
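For illustration, here is a minimal standalone sketch (toy module, not the transformers code) of why the call sites switch to positional arguments: the lifted remat transform identifies static arguments by position via static_argnums, so the wrapped layer has to be called positionally.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

# Toy module: `deterministic` feeds Python control flow, so it must be a
# static (trace-time) argument once the module is remat'd.
class Block(nn.Module):
    @nn.compact
    def __call__(self, hidden_states, deterministic: bool = True):
        hidden_states = nn.Dense(16)(hidden_states)
        if not deterministic:
            hidden_states = hidden_states * 0.9
        return hidden_states


# static_argnums indexes positional arguments (self excluded): index 1 == `deterministic`.
CheckpointBlock = nn.remat(Block, static_argnums=(1,))

block = CheckpointBlock()
x = jnp.ones((2, 8))
params = block.init(jax.random.PRNGKey(0), x, False)
# Static arguments have to be passed positionally; the lifted remat transform
# does not accept keyword arguments, which is why the diff above drops them.
out = block.apply(params, x, False)
```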
ok!
The documentation is not available anymore as the PR was closed or merged.
Is there a downside to adding it to all layers? In my case I used it only on the transformer blocks (attention + feed-forward).
We wrap FlaxBertLayer with remat, and then use this remat'd layer to construct the Transformer block (layers collection): transformers/src/transformers/models/bert/modeling_flax_bert.py, lines 559 to 562 in ea8150a.
This means that each component of the Bert layer is checkpointed, and that all Bert layers in the Transformer block (layers collection) are checkpointed. Would you like to see ...
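A rough, self-contained sketch of the pattern described above (hypothetical toy names; the referenced lines in modeling_flax_bert.py are the authoritative version):

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

# Stand-in for FlaxBertLayer (the real layer takes more arguments; argnums 5-7
# are the static ones there, hence static_argnums=(5, 6, 7) in the actual code).
class ToyLayer(nn.Module):
    @nn.compact
    def __call__(self, hidden_states, deterministic: bool = True):
        return hidden_states + nn.Dense(hidden_states.shape[-1])(hidden_states)


class ToyLayerCollection(nn.Module):
    num_layers: int = 4
    gradient_checkpointing: bool = False

    def setup(self):
        # Wrap the layer class once, then build every layer of the stack from the
        # (possibly remat'd) class, so checkpointing covers all layers.
        layer_cls = nn.remat(ToyLayer, static_argnums=(1,)) if self.gradient_checkpointing else ToyLayer
        self.layers = [layer_cls(name=str(i)) for i in range(self.num_layers)]

    def __call__(self, hidden_states, deterministic: bool = True):
        for layer in self.layers:
            hidden_states = layer(hidden_states, deterministic)
        return hidden_states


model = ToyLayerCollection(gradient_checkpointing=True)
params = model.init(jax.random.PRNGKey(0), jnp.ones((2, 8, 16)), True)
```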
Very nice! Also cc @younesbelkada.
We could also look into implementing this for OPT and BLOOM in Flax :-) Great job @sanchit-gandhi
The only feedback from my side would be to remove the option to override the policy (also since we don't test it).
```python
    def setup(self):
        if self.gradient_checkpointing:
            FlaxBertCheckpointLayer = remat(FlaxBertLayer, static_argnums=(5, 6, 7), policy=self.remat_policy)
```
Suggested change:
```diff
-            FlaxBertCheckpointLayer = remat(FlaxBertLayer, static_argnums=(5, 6, 7), policy=self.remat_policy)
+            FlaxBertLayer = remat(FlaxBertLayer, static_argnums=(5, 6, 7), policy=self.remat_policy)
```
(nit) I'd just keep the FlaxBertLayer name. IMO it's easier to read the code and compare to PyTorch this way, but I'm also happy to leave it as is.
remat prevented re-use of the class name FlaxBertLayer: google/flax#2251
Can rename in a follow-up PR if we find a workaround!
```diff
@@ -617,9 +628,16 @@ def __call__(
 class FlaxBertEncoder(nn.Module):
     config: BertConfig
     dtype: jnp.dtype = jnp.float32  # the dtype of the computation
+    gradient_checkpointing: bool = False
+    remat_policy: Callable[..., bool] = (None,)  # the gradient checkpointing policy
```
Are there multiple policies? Would one ever use a different one than the default? I'm wondering if allowing this parameter to be customizable might be a bit scary for the user and make the whole functionality less understandable. I think I'd prefer to just use the default here and not allow the user to configure it.
The full list of remat policies can be found here. They dictate whether output value(s) are saved as residuals or whether they must be recomputed in the (co)tangent computation.

The advice for selecting an appropriate remat policy is empirically driven: try them all and see what works best! On paper, dot_with_no_batch_dims should work best for Transformer architectures, and indeed was the preference for T5x. However, for the Seq2Seq project, I found the default policy to be optimal!

I'm in agreement that including remat_policy as an attribute is probably too heavy and clutters the code. It's straightforward to add one's own policy choice by overriding the policy arg to the remat method, and users who wish to do so can easily access this.
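For reference, a sketch of what overriding the policy looks like at the remat call site (toy module; the built-in jax.checkpoint_policies entry used below is assumed to be the "dot with no batch dims" policy mentioned above):

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class ToyLayer(nn.Module):
    @nn.compact
    def __call__(self, hidden_states):
        return nn.Dense(hidden_states.shape[-1])(hidden_states)


# Default policy (policy=None): save nothing, recompute everything on the backward pass.
DefaultCheckpointLayer = nn.remat(ToyLayer)

# Custom policy: keep the outputs of non-batched matmuls as residuals instead
# of recomputing them (one of the built-in jax.checkpoint_policies).
CustomCheckpointLayer = nn.remat(
    ToyLayer,
    policy=jax.checkpoint_policies.dots_with_no_batch_dims_saveable,
)

layer = CustomCheckpointLayer()
x = jnp.ones((2, 8, 16))
params = layer.init(jax.random.PRNGKey(0), x)
out = layer.apply(params, x)
```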
No, actually I thought it was applied to all layers, but the way you did it is great!
Cool! Once the tests are green, happy to merge it here :-)
Force-pushed from c791574 to 9972f38.
* [Flax] Add remat (gradient checkpointing)
* fix variable naming in test
* flip: checkpoint using a method
* fix naming
* fix class naming
* apply PVP's suggestions from code review
* make fix-copies
* fix big-bird, electra, roberta
* cookie-cutter
* fix flax big-bird
* move test to common
What does this PR do?
Adds gradient checkpointing in Flax (cf. #17399). The API currently takes the form of a method:
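A sketch of the missing snippet; the method name is taken from the option listed in the TODO below and is an assumption, since the final name was one of the open questions of this PR:

```python
from transformers import FlaxBertModel

model = FlaxBertModel.from_pretrained("bert-base-uncased")

# Turn on remat for the underlying Flax module. The method name here follows
# the PyTorch-style name referenced in the TODO list below (assumption).
model.gradient_checkpointing_enable()
```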
Note: checkpointing has currently only been implemented for FlaxBert. Implementing for all Flax models is a TODO.
TODO:
- init
- from_pretrained
- test_modeling_flax_bert
- API: a flag (gradient_checkpointing=True) or a method (model.gradient_checkpointing_enable())?
- test_modeling_flax_common
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
cc @borisdayma