Support for Training with BF16 #13207
Conversation
Thanks for working on this, @JamesDeAntonis. Looks very solid.
I haven't had a chance to set up the latest pt-nightly to test it, so I will follow up on the correctness once I've done that.
A few general comments from a quick read:
- It's not half-precision but mixed half-precision in most of the proposed changes. But as I propose at the end of this comment, let's discuss the naming with Sylvain.
- This PR is not backward compatible, so the old args should still work and be deprecated instead: keep the old arg entries as before, deprecate them, add the new config options, and convert from old to new if users used the old ones. But to save time we can do this at the very end of this PR, once the naming has been figured out.
I also added a few suggestions in the diff.
Plus we need tests. We have to come up with some way to test whether the underlying hardware supports bf16. Last time I asked, the pytorch folks didn't quite have a way to do so. I suppose we could try/except autocast(bf16), which should fail if the hardware doesn't support it, and thus we could set a new flag is_bfloat16_hw_available, which can be used to skip tests. As a side-effect it'll also fail if pt < 1.10, so then it'd be is_bfloat16_available - in which case we don't know whether it's a hardware or a software limitation, but that's good enough for the test to be skipped either way.
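Something like this sketch is what I mean (all names here are hypothetical, and whether unsupported hardware actually raises is exactly what would need verifying):

```python
import torch

# Hypothetical sketch of the try/except idea above - not an actual transformers helper.
def is_bfloat16_available() -> bool:
    """True only if both the installed torch (>= 1.10) and the GPU can run bf16 autocast."""
    if not torch.cuda.is_available():
        return False
    try:
        # torch < 1.10 raises TypeError (no dtype kwarg); unsupported hardware should
        # fail once a bf16 op is attempted inside the context
        with torch.cuda.amp.autocast(dtype=torch.bfloat16):
            torch.matmul(torch.ones(2, 2, device="cuda"), torch.ones(2, 2, device="cuda"))
        return True
    except Exception:
        return False
```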
Comments to @sgugger - should we rethink the amp args naming, removing the explicit --fp16 and --bf16 options and having one arg instead, e.g. --amp_dtype? Since there might be other formats coming in the future. Or should we worry about that when we come to it?
Regardless of the above I propose that the fp16 or bf16 repetitive logic is best reduced to a single variable during TrainingArguments init.
And then there's --half_precision_backend vs. --mixed_half_precision_backend - both are way too long...
src/transformers/trainer.py
Outdated
@@ -2164,7 +2174,7 @@ def evaluation_loop(
         # if full fp16 is wanted on eval and this ``evaluation`` or ``predict`` isn't called while
         # ``train`` is running, halve it first and then put on device
-        if not self.is_in_train and self.args.fp16_full_eval:
+        if not self.is_in_train and self.args.half_precision_full_eval:
So if we name the new var "mixed_half_precision", this one won't work anymore, as it can't be "half_precision_full". If we name it "half_precision_mixed" then it works.
As we have 2 modes:
- mixed half-precision - AMP
- full half-precision - .half() / .to(bf16)
I suppose the latter sounds confusing as well - full half is a bit of an oxymoron here.
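For reference, the two modes side by side in a minimal sketch (toy model; assumes a bf16-capable GPU and torch >= 1.10):

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
x = torch.randn(4, 8, device="cuda")

# mixed half-precision (AMP): the stored weights stay fp32, supported ops run in bf16
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    out = model(x)                      # out.dtype == torch.bfloat16

# full half-precision: the weights themselves are cast, no autocast involved
model_bf16 = model.to(torch.bfloat16)
out = model_bf16(x.to(torch.bfloat16))  # everything, weights included, is bf16
```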
What about simply half_precision_eval?
Perhaps then:
- mixed_half_precision_backend (instead of the current fp16_backend)
- half_precision_eval (instead of the current fp16_full_eval)

On the other hand AMP = automatic mixed precision, so there is no "half" in there, so maybe these 2 then?
- mixed_precision_backend (instead of the current fp16_backend)
- half_precision_eval (instead of the current fp16_full_eval)

It'd be nice to find something shorter to type, but mp_ would be too confusing because of model parallel, so why not just use amp?
- amp_backend (instead of the current fp16_backend)
- half_precision_eval (instead of the current fp16_full_eval)
I agree with the last suggestion for naming.
OK, pt-nightly installed so that I could test the new functionality, so I added:

So now the tests should be expanded to actually validate that bf16 is happening, with some number checking - e.g. we could check that the numbers are indeed bf16.
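Something along these lines is what I have in mind - a hedged sketch only: the test name and the toy model are illustrative, and it assumes the require_torch_bf16 decorator added by this PR lives in testing_utils like the other require_* helpers:

```python
import unittest
import torch
from transformers.testing_utils import require_torch_bf16  # assumed location of the new decorator

class TrainerBf16Test(unittest.TestCase):
    @require_torch_bf16
    def test_bf16_full_eval_dtype(self):
        # roughly what --bf16_full_eval does: cast the model to bf16 before running eval
        model = torch.nn.Linear(16, 16).cuda().to(torch.bfloat16)
        out = model(torch.randn(2, 16, device="cuda", dtype=torch.bfloat16))
        # check that the numbers are indeed bf16
        self.assertEqual(out.dtype, torch.bfloat16)
```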
src/transformers/modeling_utils.py
Outdated
@@ -227,6 +227,7 @@ def invert_attention_mask(self, encoder_attention_mask: Tensor) -> Tensor:
         # /transformer/transformer_layers.py#L270
         # encoder_extended_attention_mask = (encoder_extended_attention_mask ==
         #                                    encoder_extended_attention_mask.transpose(-1, -2))
+        ###TODO
         encoder_extended_attention_mask = encoder_extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
@stas00 is self.dtype supposed to be 16-bit when using amp? My own print statements are saying float32.
autocast keeps everything in the original dtype (usually fp32) and only does ops that are supported in half precision when they are called. It keeps a cache of the half-precision version of the original weights for the scope of autocast.
Also, self.dtype in the context you highlighted only checks the first param.

Deepspeed does it differently: it keeps weights in fp16 and uses fp32 master weights only when needed. That's why we have those checks in the code, as we do get torch.float16 then - also when we use --fp16_full_eval.
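To illustrate both points - autocast leaves the stored weights untouched, and a check like self.dtype only looks at the first parameter - a small sketch, where first_param_dtype is a simplified stand-in for the property discussed above:

```python
import torch

def first_param_dtype(module: torch.nn.Module) -> torch.dtype:
    # simplified stand-in for the self.dtype check: report the first parameter's dtype
    return next(module.parameters()).dtype

model = torch.nn.Linear(8, 8).cuda()

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    out = model(torch.randn(2, 8, device="cuda"))
    print(first_param_dtype(model))  # torch.float32 - autocast never rewrites the weights
    print(out.dtype)                 # torch.bfloat16 - only the op outputs are half precision
```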
I'm not observing as much of a memory improvement as I expected. The memory improvements I'm seeing are 0-15%, whereas I expected around 40% (per derrick's observation here). Is there anywhere in master where autocast is disabled for some reason? For example, that was going to be the case here, but that change is not currently in master. The two questions I just commented are part of my digging into whether there is a bug somewhere.

EDIT: I found it interesting that
I saw some recent comments on the torch slack suggesting that bf16 hasn't quite been figured out performance-wise and can actually be slower depending on the hardware, one issue being that cuDNN has no

May I suggest asking this question on https://discuss.pytorch.org/ and hopefully some knowledgeable devs with experience could give us a definitive answer. Please share the link if you do.

I think on our side the main priority for providing bf16 support is to overcome the overflow issue in models pretrained in mixed bf16, performance being secondary. But of course, it'd be great to actually benefit from the new Ampere cards, which have a lot of power, yet almost a year has passed and we still can't quite harness that power.

BTW, which card are you testing it with?
RTX A6000. I'm pretty sure it's not related to this PR, per this
Thank you for posting there, @JamesDeAntonis. Let's see if Piotr has some encouraging feedback. Otherwise the whole thing is very opaque at the moment, as nobody has written any definitive answers.
@manuelciosici, FYI: we will deal with deepspeed in a separate PR #14569 - in particular since the ZeRO3 support hasn't been merged yet and we always need a new release from deepspeed to be able to update our integration side.
@sgugger, please kindly have a look. I merged 2 PRs, cleaned things up, and added a single deprecation. I also reverted the earlier attempt to use a shared

Since bf16 has a much larger dynamic range, most of the fp16 workarounds of that type aren't needed. So I grep'ed for

Note: I've updated the OP with the up-to-date list of changes, so please refer to it for an overview.

So I think we just need a couple of tests and, if everybody is happy, this is good to go. (tests added)

The CI failure is unrelated.
OK, a few tests added. @JamesDeAntonis and @manuelciosici - please have a look and let me know if anything else is needed in your opinion. Thanks.
Thanks for taking over, @stas00. I just have a few comments on the names/args added.

Re half_precision_full_eval, I still think it could be a better API if we find an intelligent default (always fp16? bfloat16 when it's supported?) but it can be done in another PR.
src/transformers/training_args.py
Outdated
bf16_full_eval (:obj:`bool`, `optional`, defaults to :obj:`False`):
    Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory but can harm
    metric values.
Still not overjoyed about adding that new argument but I understand your point on this.
I'm totally open to suggestions.
As I said, the other approach is to use these 2 combinations instead:
- --half_precision_full_eval + --fp16
- --half_precision_full_eval + --bf16
> (always fp16? bfloat16 when it's supported?) but it can be done in another PR.
That's not something that should be automated - as the 2 modes are very different in many ways. The user needs to make a deliberate choice.
So let's keep the two separate flags for now, then.
+1 on @stas00 suggestion that the data format decisions are not automated.
@@ -207,18 +207,26 @@ class TrainingArguments:
         Random seed that will be set at the beginning of training. To ensure reproducibility across runs, use the
         :func:`~transformers.Trainer.model_init` function to instantiate the model if it has some randomly
         initialized parameters.
+    bf16 (:obj:`bool`, `optional`, defaults to :obj:`False`):
Might be surprising as an argument name? Won't users expect --bfloat16?
This just mimics the --fp16 style - how could the --bfloat16 style fit in here? I.e., if the other argument were --float16, then --bfloat16 would fit.
Ok
I prefer --bfloat16, but @stas00 is right, --bf16 is consistent.
@sgugger, would it help to document the bf16 API as experimental and subject to change at a moment's notice?

Yes please!
I finished reading through the PR. I have small comments: a couple of typos and a couple of places where I think code can be simplified. Otherwise, it looks good.
Thanks a lot for the review and the suggestions, @manuelciosici - all integrated, plus I added a warning that this API is experimental, so if, once we start using it, we find that we could improve it, we can.
I have been working on a guide to all these new modes including tf32. @manuelciosici et al, if you get a chance to proofread, please have a look at #14579. Thank you!
             loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulation_steps, scaler=scaler)
             return loss_mb.reduce_mean().detach().to(self.args.device)

         if self.use_amp:
-            with autocast():
+            with autocast(dtype=self.amp_dtype):
I think this broke Trainer. autocast() doesn't accept the dtype=self.amp_dtype argument on older versions, I think.
File "/home/patrick/python_bin/transformers/trainer.py", line 1860, in training_step
with autocast(dtype=self.amp_dtype):
TypeError: __init__() got an unexpected keyword argument 'dtype'
Thank you, @patrickvonplaten - pushed a workaround here: 14cc50d
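The commit itself isn't reproduced here, but the general shape of such a workaround is to gate the dtype kwarg on the torch version - a sketch, assuming packaging is available, and not necessarily what 14cc50d does:

```python
import torch
from packaging import version
from torch.cuda.amp import autocast

# Sketch only: autocast() gained the dtype kwarg in torch 1.10,
# so only pass it when the running torch supports it.
if version.parse(torch.__version__) >= version.parse("1.10"):
    autocast_ctx = autocast(dtype=torch.bfloat16)
else:
    autocast_ctx = autocast()  # fp16-only autocast on older torch

with autocast_ctx:
    pass  # forward pass goes here
```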
What does this PR do?
As seen in this PR, there is demand for bf16 compatibility in training of transformers models. The PyTorch folks just added this feature to their master branch, so we are now able to work on adding it to this repo. This PR follows from this issue.
Fixes #13170
(OP edited by @stas00)
Also merged here and adapted changes proposed by @manuelciosici at #14448
This PR:
- adds require_torch_bf16 and is_torch_bf16_available
- updates invert_attention_mask and one forward in t5 to include bf16 mode switches

HF Trainer:
- adds --bf16 and --bf16_full_eval modes - same as the fp16 equivalents (a usage sketch follows at the end of this description)
- deprecates --fp16_backend and replaces it with --half_precision_backend, since we now have more than one half precision mode

Tests:
- adds --bf16 and --bf16_full_eval tests

@sgugger, @LysandreJik,
Also tagging @patrickvonplaten, @patil-suraj since once this is merged you can start sending users that have problems with bf16 pre-trained models and have Ampere hardware to use this --bf16 mode.

Deepspeed bf16 support will follow soon.
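For illustration, a sketch of how the new options fit together on the TrainingArguments side (other arguments omitted; the values shown are just an example, not recommended defaults):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,                     # mixed bf16 training via amp, mirrors fp16=True
    bf16_full_eval=True,           # cast the model to bf16 for evaluation/prediction
    half_precision_backend="amp",  # replaces the deprecated fp16_backend
)
```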