Support for Training with BF16 #13207
Conversation
Thanks for working on this, @JamesDeAntonis. Looks very solid.
I haven't had a chance to set up the latest pt-nightly to test it, so I will follow up on the correctness once I've done that.
A few general comments from a quick read:
- It's not half-precision but mixed half-precision in most of the proposed changes. But as I propose at the end of this comment, let's discuss the naming with Sylvain.
- This PR is not backward compatible, so the old args should still work and be deprecated instead: keep the old arg entries as before, deprecate them, add the new config options, and convert from old to new if users used the old ones. But to save time we can do this at the very end of this PR, once the naming has been figured out.
I also added a few suggestions in the diff.
Plus we need tests. We have to come up with some way to test whether the underlying hardware supports bf16. Last time I asked, the pytorch folks didn't quite have a way to do so. I suppose we could try/except autocast(bf16), which should fail if the hardware doesn't support it, and thus we could set a new flag is_bfloat16_hw_available, which can be used to skip tests. As a side-effect it'll also fail if pt < 1.10, so then it'd be is_bfloat16_available - in which case we don't know whether it's a hardware or a software limitation, but that's good enough for the test to be skipped either way.
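Something like this sketch is what I mean (all names here are hypothetical, and whether unsupported hardware actually raises is exactly what would need verifying):

```python
import torch

# Hypothetical sketch of the try/except idea above - not an actual transformers helper.
def is_bfloat16_available() -> bool:
    """True only if both the installed torch (>= 1.10) and the GPU can run bf16 autocast."""
    if not torch.cuda.is_available():
        return False
    try:
        # torch < 1.10 raises TypeError (no dtype kwarg); unsupported hardware should
        # fail once a bf16 op is attempted inside the context
        with torch.cuda.amp.autocast(dtype=torch.bfloat16):
            torch.matmul(torch.ones(2, 2, device="cuda"), torch.ones(2, 2, device="cuda"))
        return True
    except Exception:
        return False
```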
Comments to @sgugger - should we rethink the amp args naming, removing the explicit --fp16 and --bf16 options and having one arg instead, e.g. --amp_dtype? Since there might be other formats coming in the future. Or should we worry about that when we come to it?
Regardless of the above I propose that the fp16 or bf16 repetitive logic is best reduced to a single variable during TrainingArguments init.
And then there's --half_precision_backend vs. --mixed_half_precision_backend - both are way too long...
src/transformers/trainer.py
Outdated
@@ -2164,7 +2174,7 @@ def evaluation_loop(
         # if full fp16 is wanted on eval and this ``evaluation`` or ``predict`` isn't called while
         # ``train`` is running, halve it first and then put on device
-        if not self.is_in_train and self.args.fp16_full_eval:
+        if not self.is_in_train and self.args.half_precision_full_eval:
So if we name the new var "mixed_half_precision", this one won't work anymore, as it can't be "half_precision_full". If we name it "half_precision_mixed" then it works.
As we have 2 modes:
- mixed half-precision - AMP
- full half-precision - .half() / .to(bf16)
I suppose the latter sounds confusing as well - full half is a bit of an oxymoron here.
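For reference, the two modes side by side in a minimal sketch (toy model; assumes a bf16-capable GPU and torch >= 1.10):

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
x = torch.randn(4, 8, device="cuda")

# mixed half-precision (AMP): the stored weights stay fp32, supported ops run in bf16
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    out = model(x)                      # out.dtype == torch.bfloat16

# full half-precision: the weights themselves are cast, no autocast involved
model_bf16 = model.to(torch.bfloat16)
out = model_bf16(x.to(torch.bfloat16))  # everything, weights included, is bf16
```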
What about simply half_precision_eval?
Perhaps then:
- mixed_half_precision_backend (instead of the current fp16_backend)
- half_precision_eval (instead of the current fp16_full_eval)

On the other hand AMP = automatic mixed precision, so there is no "half" in there, so maybe these 2 then?
- mixed_precision_backend (instead of the current fp16_backend)
- half_precision_eval (instead of the current fp16_full_eval)

It'd be nice to find something shorter to type, but mp_ would be too confusing because of model parallel, so why not just use amp?
- amp_backend (instead of the current fp16_backend)
- half_precision_eval (instead of the current fp16_full_eval)
I agree with the last suggestion for naming.
OK, pt-nightly installed so that I could test the new functionality, so I added:

So now the tests should be expanded to actually validate that bf16 is happening, with some number checking - e.g. we could check that the numbers are indeed bf16.
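Something along these lines is what I have in mind - a hedged sketch only: the test name and the toy model are illustrative, and it assumes the require_torch_bf16 decorator added by this PR lives in testing_utils like the other require_* helpers:

```python
import unittest
import torch
from transformers.testing_utils import require_torch_bf16  # assumed location of the new decorator

class TrainerBf16Test(unittest.TestCase):
    @require_torch_bf16
    def test_bf16_full_eval_dtype(self):
        # roughly what --bf16_full_eval does: cast the model to bf16 before running eval
        model = torch.nn.Linear(16, 16).cuda().to(torch.bfloat16)
        out = model(torch.randn(2, 16, device="cuda", dtype=torch.bfloat16))
        # check that the numbers are indeed bf16
        self.assertEqual(out.dtype, torch.bfloat16)
```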
src/transformers/modeling_utils.py
Outdated
@@ -227,6 +227,7 @@ def invert_attention_mask(self, encoder_attention_mask: Tensor) -> Tensor:
         # /transformer/transformer_layers.py#L270
         # encoder_extended_attention_mask = (encoder_extended_attention_mask ==
         #                                    encoder_extended_attention_mask.transpose(-1, -2))
+        ###TODO
         encoder_extended_attention_mask = encoder_extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
@stas00 is self.dtype supposed to be 16-bit when using amp? My own print statements are saying float32.
autocast keeps everything in the original dtype (usually fp32) and only does ops that are supported in half precision when they are called. It keeps a cache of the half-precision version of the original weights for the scope of autocast.
Also, self.dtype in the context you highlighted only checks the first param.

Deepspeed does it differently: it keeps weights in fp16 and uses fp32 master weights only when needed. That's why we have those checks in the code, as we do get torch.float16 then - also when we use --fp16_full_eval.
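To illustrate both points - autocast leaves the stored weights untouched, and a check like self.dtype only looks at the first parameter - a small sketch, where first_param_dtype is a simplified stand-in for the property discussed above:

```python
import torch

def first_param_dtype(module: torch.nn.Module) -> torch.dtype:
    # simplified stand-in for the self.dtype check: report the first parameter's dtype
    return next(module.parameters()).dtype

model = torch.nn.Linear(8, 8).cuda()

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    out = model(torch.randn(2, 8, device="cuda"))
    print(first_param_dtype(model))  # torch.float32 - autocast never rewrites the weights
    print(out.dtype)                 # torch.bfloat16 - only the op outputs are half precision
```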
I'm not observing as much of a memory improvement as I expected. The memory improvements I'm seeing are 0-15%, whereas I expected around 40% (per derrick's observation here). Is there anywhere in master where autocast is disabled for some reason? For example, that was going to be the case here, but that change is not currently in master. The two questions I just commented are part of my digging into whether there is a bug somewhere.

EDIT: I found it interesting that
I saw some recent comments on the torch slack suggesting that bf16 hasn't quite been figured out performance-wise and can actually be slower depending on the hardware, one issue being that cuDNN has no

May I suggest asking this question on https://discuss.pytorch.org/ and hopefully some knowledgeable devs with experience could give us a definitive answer. Please share the link if you do.

I think on our side the main priority for providing bf16 support is to overcome the overflow issue in models pretrained in mixed bf16, performance being secondary. But of course, it'd be great to actually benefit from the new Ampere cards, which have a lot of power, yet almost a year has passed and we still can't quite harness that power.

BTW, which card are you testing it with?
RTX A6000. I'm pretty sure it's not related to this PR, per this
Thank you for posting there, @JamesDeAntonis. Let's see if Piotr has some encouraging feedback. Otherwise the whole thing is very opaque at the moment, as nobody has written any definitive answers.
@manuelciosici, FYI: we will deal with deepspeed in a separate PR #14569 - in particular since the ZeRO3 support hasn't been merged yet and we always need a new release from deepspeed to be able to update our integration side.
@sgugger, please kindly have a look. I merged 2 PRs, cleaned things up, and added a single deprecation. I also reverted the earlier attempt to use a shared

Since bf16 has a much larger dynamic range, most of the fp16 workarounds of that type aren't needed. So I grep'ed for

Note: I've updated the OP with the up-to-date list of changes, so please refer to it for an overview.

So I think we just need a couple of tests and, if everybody is happy, this is good to go. (tests added)

The CI failure is unrelated.
OK, a few tests added. @JamesDeAntonis and @manuelciosici - please have a look and let me know if anything else is needed in your opinion. Thanks.
Thanks for taking over, @stas00. I just have a few comments on the names/args added.

Re half_precision_full_eval, I still think it could be a better API if we find an intelligent default (always fp16? bfloat16 when it's supported?) but it can be done in another PR.
src/transformers/training_args.py
Outdated
bf16_full_eval (:obj:`bool`, `optional`, defaults to :obj:`False`):
    Whether to use full bfloat16 evaluation instead of 32-bit. This will be faster and save memory but can harm
    metric values.
Still not overjoyed about adding that new argument but I understand your point on this.
I'm totally open to suggestions.
As I said, the other approach is to use these 2 combinations instead:
- --half_precision_full_eval + --fp16
- --half_precision_full_eval + --bf16
> (always fp16? bfloat16 when it's supported?) but it can be done in another PR.
That's not something that should be automated - as the 2 modes are very different in many ways. The user needs to make a deliberate choice.
So let's keep the two separate flags for now, then.
+1 on @stas00 suggestion that the data format decisions are not automated.
@@ -207,18 +207,26 @@ class TrainingArguments:
         Random seed that will be set at the beginning of training. To ensure reproducibility across runs, use the
         :func:`~transformers.Trainer.model_init` function to instantiate the model if it has some randomly
         initialized parameters.
+    bf16 (:obj:`bool`, `optional`, defaults to :obj:`False`):
Might be surprising as an argument name? Won't users expect --bfloat16?
This just mimics the --fp16 style - how could the --bfloat16 style fit in here? I.e., if the other argument were --float16, then --bfloat16 would fit.
Ok
I prefer --bfloat16, but @stas00 is right, --bf16 is consistent.
@sgugger, would it help to document the bf16 API as experimental and subject to change at a moment's notice?

Yes please!
I finished reading through the PR. I have small comments: a couple of typos and a couple of places where I think code can be simplified. Otherwise, it looks good.
Thanks a lot for the review and the suggestions, @manuelciosici - all integrated, plus I added a warning that this API is experimental, so if, once we start using it, we find that we could improve it, we can.
I have been working on a guide to all these new modes including tf32. @manuelciosici et al, if you get a chance to proofread, please have a look at #14579. Thank you!
             loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulation_steps, scaler=scaler)
             return loss_mb.reduce_mean().detach().to(self.args.device)

         if self.use_amp:
-            with autocast():
+            with autocast(dtype=self.amp_dtype):
I think this broke Trainer. autocast() doesn't accept the dtype=self.amp_dtype argument on older versions, I think.
File "/home/patrick/python_bin/transformers/trainer.py", line 1860, in training_step
with autocast(dtype=self.amp_dtype):
TypeError: __init__() got an unexpected keyword argument 'dtype'
Thank you, @patrickvonplaten - pushed a workaround here: 14cc50d
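The commit itself isn't reproduced here, but the general shape of such a workaround is to gate the dtype kwarg on the torch version - a sketch, assuming packaging is available, and not necessarily what 14cc50d does:

```python
import torch
from packaging import version
from torch.cuda.amp import autocast

# Sketch only: autocast() gained the dtype kwarg in torch 1.10,
# so only pass it when the running torch supports it.
if version.parse(torch.__version__) >= version.parse("1.10"):
    autocast_ctx = autocast(dtype=torch.bfloat16)
else:
    autocast_ctx = autocast()  # fp16-only autocast on older torch

with autocast_ctx:
    pass  # forward pass goes here
```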
What does this PR do?
As seen in this PR, there is demand for bf16 compatibility in training of transformers models. The PyTorch folks just added this feature to their master branch, so we are now able to work on adding it to this repo. This PR follows from this issue.
Fixes #13170
(OP edited by @stas00)
Also merged here and adapted changes proposed by @manuelciosici at #14448
This PR:
- adds require_torch_bf16 and is_torch_bf16_available
- updates invert_attention_mask and one forward in t5 to include bf16 mode switches

HF Trainer:
- adds --bf16 and --bf16_full_eval modes - same as the fp16 equivalents (a usage sketch follows at the end of this description)
- deprecates --fp16_backend and replaces it with --half_precision_backend, since we now have more than one half precision mode

Tests:
- adds --bf16 and --bf16_full_eval tests

@sgugger, @LysandreJik,
Also tagging @patrickvonplaten, @patil-suraj since once this is merged you can start sending users that have problems with bf16 pre-trained models and have Ampere hardware to use this --bf16 mode.

Deepspeed bf16 support will follow soon.
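For illustration, a sketch of how the new options fit together on the TrainingArguments side (other arguments omitted; the values shown are just an example, not recommended defaults):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,                     # mixed bf16 training via amp, mirrors fp16=True
    bf16_full_eval=True,           # cast the model to bf16 for evaluation/prediction
    half_precision_backend="amp",  # replaces the deprecated fp16_backend
)
```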