
Support for Training with BF16 #13207

Merged: 48 commits merged into huggingface:master on Dec 1, 2021
Conversation

@JamesDeAntonis (Contributor) commented Aug 20, 2021

What does this PR do?

As seen in this PR, there is demand for bf16 compatibility when training transformers models. The PyTorch folks just added this feature to their master branch, so we are now able to work on adding it to this repo. This PR follows from this issue.

Fixes #13170


(OP edited by @stas00)

Also merged here, and adapted, are the changes proposed by @manuelciosici at #14448.

This PR:

  • adds helper utils: require_torch_bf16 and is_torch_bf16_available (see the availability sketch below)
  • modifies invert_attention_mask and one forward in T5 to include bf16 mode switches
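For context, a minimal sketch of what such an availability check can look like. This is a hypothetical standalone function, not the exact helper added by this PR; it assumes the packaging library is installed and that "Ampere or newer" is approximated by CUDA compute capability 8.0+:

```python
from packaging import version

import torch


def bf16_supported() -> bool:
    """Best-effort check: torch >= 1.10 plus an Ampere-or-newer GPU."""
    if version.parse(torch.__version__) < version.parse("1.10"):
        return False  # autocast(dtype=torch.bfloat16) only landed in torch 1.10
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8  # compute capability 8.0+, e.g. A100, RTX 30xx, RTX A6000
```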

HF Trainer:

  • adds --bf16 and --bf16_full_eval modes, analogous to the fp16 equivalents (see the usage sketch below)
  • renames and deprecates --fp16_backend, replacing it with --half_precision_backend, since we now have more than one half-precision mode
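A usage sketch of the new arguments once this lands. The bf16 and bf16_full_eval names come from this PR; the output_dir value is just a placeholder:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output",     # placeholder path
    bf16=True,               # mixed-precision training via bf16 autocast
    bf16_full_eval=True,     # cast the whole model to bf16 for eval/predict
)
```

The same flags are available on the command line as --bf16 and --bf16_full_eval.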

Tests:

  • adds --bf16 and --bf16_full_eval tests

@sgugger, @LysandreJik,

Also tagging @patrickvonplaten and @patil-suraj, since once this is merged you can start pointing users who have problems with bf16 pre-trained models and have Ampere hardware to this --bf16 mode. DeepSpeed bf16 support will follow soon.

@stas00 (Contributor) left a comment:

Thanks for working on this, @JamesDeAntonis. Looks very solid.

I haven't had a chance to set up the latest pt-nightly to test it, so I will follow up on correctness once I have done that.

A few general comments from a quick read:

  1. It's not half-precision but mixed half-precision in most of the proposed changes. But as I propose at the end of this comment, let's discuss the naming with Sylvain.

  2. This PR is not backward compatible, so the old args should still work and be deprecated instead. So basically: keep the old arg entries as before, deprecate them, add the new config options, and convert from old to new if users used the old ones. But to save time we can do this at the very end of the PR, once the naming has been figured out.

and also added a few suggestions in the diff.

Plus we need tests. We have to come up with some way to test whether the underlying hardware supports bf16. Last I asked, the PyTorch folks didn't quite have a way to do so. I suppose we could try/except autocast(bf16), which should fail if the hardware doesn't support it, and thus set a new flag is_bfloat16_hw_available that can be used to skip tests. As a side-effect it'll also fail if pt < 1.10, so then it'd be is_bfloat16_available - in which case we don't know whether it's a hardware or a software limitation, but it's good enough for the test to be skipped either way.
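As a rough illustration of that try/except idea, a sketch under the assumption that entering a bf16 autocast context (or running an op inside it) errors out on unsupported setups - which varies by PyTorch version, so this is only a best-effort probe; the function name is made up:

```python
import torch


def bf16_probe() -> bool:
    """Best-effort probe for bf16 AMP support; not a guaranteed check."""
    if not torch.cuda.is_available():
        return False
    try:
        # Older torch raises TypeError (autocast has no dtype kwarg);
        # unsupported hardware may raise RuntimeError when the context is
        # entered or when an op runs inside it.
        with torch.cuda.amp.autocast(dtype=torch.bfloat16):
            torch.matmul(torch.ones(2, 2, device="cuda"),
                         torch.ones(2, 2, device="cuda"))
        return True
    except (TypeError, RuntimeError):
        return False
```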

Comments for @sgugger - should we rethink the amp args naming, removing the explicit --fp16 and --bf16 options and having one arg instead, e.g. --amp_dtype, since there might be other formats coming in the future? Or should we worry about that when we come to it?

Regardless of the above I propose that the fp16 or bf16 repetitive logic is best reduced to a single variable during TrainingArguments init.

And the --half_precision_backend vs. --mixed_half_precision_backend - both are way too long...

@@ -2164,7 +2174,7 @@ def evaluation_loop(

         # if full fp16 is wanted on eval and this ``evaluation`` or ``predict`` isn't called while
         # ``train`` is running, halve it first and then put on device
-        if not self.is_in_train and self.args.fp16_full_eval:
+        if not self.is_in_train and self.args.half_precision_full_eval:
Contributor:

So if we name the new var "mixed_half_precision", this one won't work anymore, as it can't be "half_precision_full".

If we name it "half_precision_mixed", then it works.

As we have 2 modes:

  • mixed half-precision - AMP
  • full half-precision - .half / .to(bf16)

I suppose the latter sounds confusing as well - full half is a bit of an oxymoron here.
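For concreteness in this naming discussion, a minimal toy sketch of the two modes side by side (assuming a CUDA device and torch >= 1.10; this is an illustration, not code from the PR):

```python
import torch
from torch import nn

model = nn.Linear(16, 16).cuda()       # weights are fp32
x = torch.randn(4, 16, device="cuda")

# Mixed half-precision (AMP): weights stay fp32, eligible ops run in bf16.
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    y_mixed = model(x)

# Full half-precision: the weights themselves are cast; everything runs in bf16.
model_bf16 = model.to(torch.bfloat16)
y_full = model_bf16(x.to(torch.bfloat16))
```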

Contributor Author:

What about simply half_precision_eval?

Contributor:

Perhaps then:

  1. mixed_half_precision_backend (instead of the current fp16_backend)
  2. half_precision_eval (instead of the current fp16_full_eval)

On the other hand, AMP = automatic mixed precision, so there is no "half" in there - so maybe these two:

  1. mixed_precision_backend (instead of the current fp16_backend)
  2. half_precision_eval (instead of the current fp16_full_eval)

It'd be nice to find something shorter to type, but a shorter mp_ would be too confusing because of model parallelism. Why not just use amp?

  1. amp_backend (instead of the current fp16_backend)
  2. half_precision_eval (instead of the current fp16_full_eval)

Collaborator:

I agree with the last suggestion for naming.

@stas00 self-assigned this on Aug 20, 2021
@stas00 (Contributor) commented Aug 21, 2021

OK, pt-nightly is installed so that I can test the new functionality.

so I added:

- `require_torch_bf16`
- `is_torch_bf16_available`
- super basic test that validates that `--bf16` doesn't fail
- placeholder for the future bf16 full eval test

So now the tests should be expanded to actually validate that bf16 is happening, with some number checking - e.g. we could check that the numbers are indeed bf16.
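A sketch of the simplest kind of number checking meant here - a standalone toy test, not the PR's actual test suite; it only exercises the full-eval-style cast, which needs no special hardware:

```python
import torch
from torch import nn


def test_full_bf16_cast():
    # What a --bf16_full_eval code path boils down to: cast the whole model
    # and verify the parameters really ended up in bfloat16.
    model = nn.Linear(8, 8).to(torch.bfloat16)
    assert all(p.dtype == torch.bfloat16 for p in model.parameters())
```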

@@ -227,6 +227,7 @@ def invert_attention_mask(self, encoder_attention_mask: Tensor) -> Tensor:
# /transformer/transformer_layers.py#L270
# encoder_extended_attention_mask = (encoder_extended_attention_mask ==
# encoder_extended_attention_mask.transpose(-1, -2))
###TODO
encoder_extended_attention_mask = encoder_extended_attention_mask.to(dtype=self.dtype) # fp16 compatibility
Contributor Author:

@stas00, is self.dtype supposed to be 16-bit when using amp? My own print statements are saying float32.

Contributor:

autocast keeps everything in the original dtype (usually fp32) and only does ops that are supported in half precision when they are called. It keeps a cache of the half-precision version of the original weights for the scope of autocast.

Also self.dtype in the context you highlighted only checks the first param.

DeepSpeed does it differently: it keeps weights in fp16 and uses fp32 master weights only when needed. That's why we have those "torch.float16" checks in the code - also when we use --fp16_full_eval.
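A quick way to see the autocast behavior described above (toy sketch assuming a CUDA device and torch >= 1.10): under autocast the parameters still report fp32, while the outputs of autocast-eligible ops come out as bf16.

```python
import torch
from torch import nn

model = nn.Linear(32, 32).cuda()
x = torch.randn(4, 32, device="cuda")

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    y = model(x)
    print(next(model.parameters()).dtype)  # torch.float32 - weights are untouched
    print(y.dtype)                         # torch.bfloat16 - linear ran under autocast
```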

@JamesDeAntonis (Contributor Author) commented Aug 23, 2021

I'm not observing as much of a memory improvement as I expected.

The memory improvements I'm seeing are 0-15%, whereas I expected around 40% (per Derrick's observation here). Is there anywhere in master where autocast is disabled for some reason? For example, that was going to be the case here, but that change is not currently in master.

The two questions I just commented are part of my digging into whether there is a bug somewhere.

EDIT: I found it interesting that fp16 was giving similarly lackluster gains to bf16. That suggests it's not a bf16-specific issue.
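For anyone reproducing this, a rough sketch of how peak memory could be compared with and without bf16 autocast - a hypothetical helper with arbitrary model and batch sizes; the savings depend heavily on which kernels actually run in bf16, which is the point of this thread:

```python
import torch
from torch import nn


def peak_memory(use_bf16_amp: bool) -> int:
    """Return peak CUDA memory (bytes) for one forward+backward pass."""
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
    x = torch.randn(256, 1024, device="cuda")
    torch.cuda.reset_peak_memory_stats()
    with torch.cuda.amp.autocast(enabled=use_bf16_amp, dtype=torch.bfloat16):
        model(x).sum().backward()
    return torch.cuda.max_memory_allocated()
```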

@stas00 (Contributor) commented Aug 24, 2021

I saw some recent comments on the torch Slack suggesting that bf16 hasn't quite been figured out performance-wise and can actually be slower depending on the hardware. One issue is that cuDNN has no bfloat16 support at the moment; the other is that many bf16 kernels are simply not there, so it falls back to some slower functionality.

May I suggest asking this question on https://discuss.pytorch.org/ - hopefully some knowledgeable devs with experience can give us a definitive answer. Please share the link if you do.

I think on our side the main priority for providing bf16 support is to overcome the overflow issue in models pretrained in mixed bf16, with performance being secondary. But of course it'd be great to actually benefit from the new Ampere cards, which have a lot of power - almost a year has passed and we still can't quite harness it.

BTW, which card are you testing it with?

@JamesDeAntonis (Contributor Author):

> BTW, which card are you testing it with?

RTX A6000. I'm pretty sure it's not related to this PR, per this.

@stas00 (Contributor) commented Aug 27, 2021

Thank you for posting there, @JamesDeAntonis. Let's see if Piotr has some encouraging feedback.

Otherwise the whole thing is still very much veiled at the moment, as nobody has written any definitive answers.

@stas00 (Contributor) commented Nov 30, 2021

@manuelciosici, FYI: we will deal with DeepSpeed in a separate PR, #14569 - in particular since the ZeRO3 support hasn't been merged yet, and we always need a new release from DeepSpeed to be able to update our integration side.

@stas00 (Contributor) commented Nov 30, 2021

@sgugger, please kindly have a look. I merged two PRs, cleaned things up, and added a single deprecation.

I also reverted the earlier attempt to use a shared --half_precision_full_eval since it didn't make sense: --fp16_full_eval and --bf16_full_eval are similar but are two different modes requiring different code. If we want a shared one, we'd have to additionally require either --fp16 or --bf16 and then adjust the logic accordingly. If you prefer that, let me know.

Since bf16 has a much larger dynamic range, most of the fp16 workarounds of that type aren't needed. So I grepped for torch.float16 checks and didn't see anything other than two places. I'm sure I may have missed some, but it'll surely make itself known once we start using it.
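For reference, the dynamic-range point can be seen directly from torch.finfo - a quick illustration, not code from the PR:

```python
import torch

print(torch.finfo(torch.float16).max)    # 65504.0 - easy to overflow in activations
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38 - roughly the same range as fp32

print(torch.tensor(70000.0).to(torch.float16))   # tensor(inf) - overflows
print(torch.tensor(70000.0).to(torch.bfloat16))  # stays finite
```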

Note, I've updated the OP with the up-to-date list of changes, so please refer to it for an overview.

So I think we just need a couple of tests and if everybody is happy this is good to go. (tests added)

The CI failure is unrelated.

@stas00 (Contributor) commented Nov 30, 2021

OK, a few tests added.

@JamesDeAntonis and @manuelciosici - please have a look - let me know if anything else is needed in your opinion. Thanks.

@sgugger (Collaborator) left a comment:

Thanks for taking over, @stas00. I just have a few comments on the names/args added.

Re half_precision_full_eval, I still think it could be a better API if we find an intelligent default (always fp16? bfloat16 when it's supported?), but it can be done in another PR.

Comment on lines 224 to 226
bf16_full_eval (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory but can harm
metric values.
Collaborator:

Still not overjoyed about adding that new argument but I understand your point on this.

Contributor:

I'm totally open to suggestions.

As I said the other approach is to use these 2 combinations instead:

  1. --half_precision_full_eval + --fp16
  2. --half_precision_full_eval + --bf16

Contributor:

> always fp16? bfloat16 when it's supported?) but it can be done in another PR.

That's not something that should be automated, as the two modes are very different in many ways. The user needs to make a deliberate choice.

Collaborator:

So let's keep the two separate flags for now, then.

Contributor:

+1 on @stas00's suggestion that the data format decisions not be automated.

@@ -207,18 +207,26 @@ class TrainingArguments:
Random seed that will be set at the beginning of training. To ensure reproducibility across runs, use the
:func:`~transformers.Trainer.model_init` function to instantiate the model if it has some randomly
initialized parameters.
bf16 (:obj:`bool`, `optional`, defaults to :obj:`False`):
Collaborator:

Might be surprising as an argument name? Won't users expect --bfloat16?

Contributor:

This just mimics the --fp16 style - how would the --bfloat16 style fit in here? I.e., if the other argument were --float16, then --bfloat16 would fit.

Collaborator:

Ok

Contributor:

I prefer --bfloat16, but @stas00 is right, --bf16 is consistent.

@stas00 (Contributor) commented Nov 30, 2021

@sgugger, would it help to document the bf16 API as experimental and subject to change at a moment's notice?

@sgugger (Collaborator) commented Nov 30, 2021

Yes please!

@manuelciosici (Contributor) left a comment:

I finished reading through the PR. I have a few small comments: a couple of typos and a couple of places where I think the code can be simplified. Otherwise, it looks good.

@stas00 (Contributor) commented Dec 1, 2021

Thanks a lot for the review and the suggestions, @manuelciosici - all integrated. I also added a warning that this API is experimental, so if, once we start using it, we find we could improve it, we can.

@stas00 merged commit 70996a5 into huggingface:master on Dec 1, 2021
@stas00 mentioned this pull request on Dec 1, 2021
@stas00 (Contributor) commented Dec 1, 2021

I have been working on a guide to all these new modes, including tf32. @manuelciosici et al., if you get a chance to proofread, please have a look at #14579.

Thank you!

             loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulation_steps, scaler=scaler)
             return loss_mb.reduce_mean().detach().to(self.args.device)

         if self.use_amp:
-            with autocast():
+            with autocast(dtype=self.amp_dtype):
Contributor:

I think this broke the Trainer: autocast() doesn't accept a dtype argument in older PyTorch versions, I think.

@sgugger @stas00 @LysandreJik

Contributor:

  File "/home/patrick/python_bin/transformers/trainer.py", line 1860, in training_step
    with autocast(dtype=self.amp_dtype):
TypeError: __init__() got an unexpected keyword argument 'dtype'

Contributor:

Thank you, @patrickvonplaten - pushed a workaround here: 14cc50d
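For context, a sketch of the kind of version guard such a workaround typically uses - this is an assumption about the general approach, not necessarily what commit 14cc50d does:

```python
from packaging import version

import torch
from torch.cuda.amp import autocast


def amp_context(amp_dtype: torch.dtype):
    # torch >= 1.10 accepts dtype=; older releases only support fp16 autocast
    # and raise TypeError if dtype= is passed.
    if version.parse(torch.__version__) >= version.parse("1.10"):
        return autocast(dtype=amp_dtype)
    return autocast()
```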

@stas00 changed the title from "WIP: Support for Training with BF16" to "Support for Training with BF16" on Dec 1, 2021
Labels: WIP (label your PR/Issue with WIP for long-outstanding Issues/PRs that are work in progress)
Projects: None yet
Development: Successfully merging this pull request may close these issues: "Using bf16 instead of fp16"
8 participants