ALiBi Implementation #101
Conversation
Looks great, thanks @ofirpress.
I left a couple of questions, mainly wondering if certain constructs are safe to use in the model parallelism setup.
@@ -660,11 +684,18 @@ def build_layer(layer_number):
        get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
        checkpoint = deepspeed.checkpointing.checkpoint

        if args.position_embedding_type == PositionEmbeddingType.alibi:
            self.alibi = self._build_alibi_tensor(args.seq_length, args.num_attention_heads, args.micro_batch_size).to(torch.cuda.current_device())
Is .to(torch.cuda.current_device()) safe in the multi-gpu model parallelism setup?
@stas00, @slippylolo do you know?
This needs to be treated with care, but on the other hand things quickly break if it's not done right, and pytorch lets you know quickly ;)
It should already allocate the tensor on the current device in __init__(). Does it break if you don't call to() explicitly, @ofirpress?
Note that matmul_result is created during forward, which indeed has the correct device set.
The other approach, to avoid needing torch.cuda.current_device() in forward, is to use .device of one of the inputs.
The key is to add a good test, and we have a multi-gpu CI, so it should be easy to validate.
> Does it break if you don't call to() explicitly, @ofirpress?

Yes, it says that the alibi tensor is on the CPU if I don't .to() it.

> The other approach, to avoid needing torch.cuda.current_device() in forward, is to use .device of one of the inputs.

But during __init__ do we have any of the inputs yet?
In __init__ you typically have other params, but it's probably OK the way you did it.
Usually the model is made of registered params and buffers, and those get automatically switched to the right device together with the sub-module. Custom tensors not attached to the model are tricky, and typically you have to update them in forward to be on the same device as the inputs or params.
I haven't looked at the full context, so let's revisit this when the tests are added if you're running into problems.
If it can be done in forward, then that's where we want it done, I believe, because only in forward do you know which device you're on. You can't rely on where it was init'ed. If that makes sense.
I guess I'm not sure what the elegant way to do it is.
I guess we can do something like:
if self.alibi.device != some_input.device:
    self.alibi.to(some_input.device)
but I'm not sure if that's OK.
There is no need for the if; it's a no-op if it's already on the right device. So just the to() call (plus assignment), as in the example I linked to.
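For illustration, a minimal sketch of that pattern; the module name and the placeholder slopes here are hypothetical, not the PR's actual code. The point is that a plain tensor attribute is not a registered buffer, so module.to(device) will not move it, and the assignment-plus-to() call in forward is free once the tensor already lives on the right device:

```python
import torch
import torch.nn as nn

class AlibiHolder(nn.Module):
    """Toy module (hypothetical) illustrating the device-sync pattern."""

    def __init__(self, num_heads: int, seq_len: int):
        super().__init__()
        # Placeholder slopes just for the example; the real ones come from get_slopes().
        slopes = 1.0 / torch.arange(1, num_heads + 1, dtype=torch.float32)
        # A plain attribute is NOT a registered buffer, so module.to(device) / .cuda()
        # will not move it automatically.
        self.alibi = slopes[:, None, None] * torch.arange(seq_len, dtype=torch.float32)[None, None, :]

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # No-op if already on the right device, otherwise a one-time copy;
        # the assignment keeps it there for subsequent calls.
        self.alibi = self.alibi.to(scores.device)
        return scores + self.alibi

# scores: (num_heads, seq_len, seq_len) attention logits on some device
m = AlibiHolder(num_heads=8, seq_len=16)
biased = m(torch.zeros(8, 16, 16))  # works on CPU or GPU without touching __init__
```

Alternatively, registering the tensor with register_buffer would let module.to(device) move it automatically, at the cost of it appearing in the state dict (unless persistent=False).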
I am happy merging the PR as is and then fixing this later if it ends up being a problem. As @stas00 said, if it is wrong, pytorch will complain.
It's always good to start simple and then improve. So you could complete this work using your expertise, write good tests, make it a solid component and merge it. Then post an issue inviting someone with CUDA knowledge to further improve upon your work. This is just a suggestion, of course.
Thanks! I have some questions concerning the current implementation:
- We usually have a --reset-position-ids flag to reset the position ids within a row when we hit an end-of-document token, i.e. in a row we could have [0,1,2,3,4,0,1,2,3] (in the absolute embedding implementation). This conflicts with the notion of a static alibi matrix though. I don't think the current implementation handles this case?
- We have two options, --reset-position-ids and --reset-attention-mask, which up until now were quite distinct, but it starts to get mixed up with alibi. I'm not sure what the behaviour of alibi should be for cases where one is True and the other one is False...
@@ -594,6 +597,27 @@ def forward(self, inputs, **kwargs):
class ParallelTransformer(MegatronModule):
    """Transformer class."""

    @staticmethod
    def _build_alibi_tensor(max_seq_len, num_attention_heads, batch_size):
Can we build a separate class for this? Something like class AlibiPositionEmbedding. There's a position_embedding.py file where there's a rotary embedding implementation.
            closest_power_of_2 = 2 ** math.floor(math.log2(n))
            return get_slopes_power_of_2(closest_power_of_2) + get_slopes(2 * closest_power_of_2)[0::2][
Why is it important to check that n is always a power of 2?
Explained in the comment here: https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py#L749
#In the paper, we only train models that have 2^a heads for some a. This function has
#some good properties that only occur when the input is a power of 2. To maintain that even
#when the number of heads is not a power of 2, we use this workaround.
            return get_slopes_power_of_2(closest_power_of_2) + get_slopes(2 * closest_power_of_2)[0::2][
                :n - closest_power_of_2]
        slopes = torch.Tensor(get_slopes(num_attention_heads))
        alibi = slopes.unsqueeze(1).unsqueeze(1) * torch.arange(max_seq_len).unsqueeze(0).unsqueeze(0).expand(num_attention_heads, -1, -1)
Suggested change:
-        alibi = slopes.unsqueeze(1).unsqueeze(1) * torch.arange(max_seq_len).unsqueeze(0).unsqueeze(0).expand(num_attention_heads, -1, -1)
+        alibi = slopes[:, None, None] * torch.arange(max_seq_len)[None, None, :].expand(num_attention_heads, -1, -1)
Also, can you bring over the same comment as in your original implementation? In particular the part explaining where the matrix here doesn't match the one in the paper.
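For context, here is a rough standalone reconstruction of how the quoted pieces fit together, following the linked fairseq implementation. The function names and the final repeat over the micro-batch are assumptions about the surrounding Megatron code, not a verbatim copy of the PR:

```python
import math
import torch

def get_slopes(n):
    # Head-specific geometric slopes; the closed form below only has the nice
    # properties when n is a power of 2, hence the workaround for other n.
    def get_slopes_power_of_2(n):
        start = 2 ** (-2 ** -(math.log2(n) - 3))
        return [start * start ** i for i in range(n)]

    if math.log2(n).is_integer():
        return get_slopes_power_of_2(n)
    closest_power_of_2 = 2 ** math.floor(math.log2(n))
    # Take every other slope from the next power of 2 to cover the remaining heads.
    return (get_slopes_power_of_2(closest_power_of_2)
            + get_slopes(2 * closest_power_of_2)[0::2][: n - closest_power_of_2])

def build_alibi_tensor(max_seq_len, num_attention_heads, batch_size):
    slopes = torch.tensor(get_slopes(num_attention_heads))
    # (num_heads, 1, max_seq_len): one linearly increasing bias row per head.
    alibi = slopes[:, None, None] * torch.arange(max_seq_len)[None, None, :].expand(
        num_attention_heads, -1, -1)
    # Repeat across the micro-batch so it lines up with the (b*np, sq, sk) scores
    # used later in baddbmm (this layout is an assumption, not taken from the PR).
    return alibi.repeat(batch_size, 1, 1)
```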
                dtype=query_layer.dtype,
                device=torch.cuda.current_device())
        else:
            matmul_result = alibi[:output_size[0]*output_size[1], :, :output_size[3]]
When using baddbmm, shouldn't you set beta to 1? Otherwise alibi would be ignored, no?
https://pytorch.org/docs/stable/generated/torch.baddbmm.html
If you end up modifying beta, then you probably have to replace empty with zeros in the if alibi is None branch.
I see, you are right. beta here:
beta=0.0, alpha=(1.0/self.norm_factor))
How does the paper handle the normalizing factor? Is it applied after the sum?
> When using baddbmm, shouldn't you set beta to 1?

Yes!!! I meant to do that and then totally forgot! Thanks so much for pointing this out!!!

> If you end up modifying beta, then you probably have to replace empty with zeros in the if alibi is None branch.

Wouldn't it be better to do beta = 1 if alibi is not None else 0?

> How does the paper handle the normalizing factor? Is it applied after the sum?

Which normalizing factor are you referring to? The softmax? If so, we apply the softmax after adding the ALiBi bias.
I mean the 1.0 / self.norm_factor part. Is it applied only to QK^T, or to the entire unnormalized attention matrix?
Ah ok, I understand. We apply it before the sum, see https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/modules/multihead_attention.py#L225
(This means alpha will remain unchanged here)
Fixed now :)
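To make the beta discussion concrete, here is a sketch of the corrected call; the shapes and variable names are illustrative, not the exact transformer.py code. torch.baddbmm computes beta * input + alpha * (batch1 @ batch2), so beta=1.0 keeps the ALiBi bias while alpha=1/norm_factor scales only the QK^T product (matching the reference implementation, where the scaling is applied before the sum); with no ALiBi, beta=0.0 lets the buffer stay uninitialised:

```python
import torch

def scaled_scores(query_layer, key_layer, alibi, norm_factor):
    # query_layer: (b*np, sq, hn), key_layer: (b*np, hn, sk) -- illustrative shapes only.
    if alibi is None:
        # Nothing is added, so beta=0.0 lets the pre-allocated buffer stay uninitialised.
        matmul_input = torch.empty(
            query_layer.size(0), query_layer.size(1), key_layer.size(2),
            dtype=query_layer.dtype, device=query_layer.device)
        beta = 0.0
    else:
        matmul_input = alibi[:query_layer.size(0), :, :key_layer.size(2)]
        beta = 1.0
    # out = beta * matmul_input + alpha * (query_layer @ key_layer):
    # the 1/norm_factor scaling only touches the QK^T product, not the ALiBi bias.
    return torch.baddbmm(matmul_input, query_layer, key_layer,
                         beta=beta, alpha=1.0 / norm_factor)
```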
We do not need to worry about this right now. We can do something like
I'm not quite sure why you would want to reset the absolute position embeddings when starting a new document, but with relative position embeddings there is no need for that, I'm pretty sure. If you have an attention mask that masks out other documents from the current one, then that will directly also work with ALiBi and any other relative position method; no modifications are needed there.
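As an illustration of that point (the function name and layout are hypothetical, not part of the PR), a per-document mask can be derived from reset position ids and intersected with the usual causal mask; ALiBi is added to the raw scores as usual and the cross-document entries are masked afterwards, so the ALiBi matrix itself never changes:

```python
import torch

def packed_document_mask(position_ids: torch.Tensor) -> torch.Tensor:
    # position_ids: (seq_len,) with a reset to 0 at each document start,
    # e.g. [0, 1, 2, 3, 4, 0, 1, 2, 3].
    seq_len = position_ids.size(0)
    # Recover a document id for every token from the resets.
    doc_ids = torch.cumsum((position_ids == 0).long(), dim=0)
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # True = allowed to attend. The disallowed entries are later set to -inf
    # after the (scaled QK^T + ALiBi) scores are computed.
    return same_doc & causal
```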
Co-authored-by: Thomas Wang <[email protected]>
A quick note to reviewers - please make sure that all new features include extensive multi-gpu testing before the feature is merged. You can of course develop the tests on a single gpu, since most people have only one. You will find the existing tests under
I just fixed the bug with the beta value in the attention computation. I've tested this by running the pretrain_gpt example twice, once with ALiBi and once without, and they both train until the end, with the ALiBi model achieving better performance, which means it's actually working (unlike before, where the ALiBi matrix was being ignored). Thanks so much @thomasw21 for pointing out that bug!
Hmm, smells like a bad merge. Any chance you merged with
I'm not great with git, so that's definitely possible. What can I do to fix this? Thanks!
Can you allow me to write on your fork? I can try to fix it.
It looks like the last change you made to this PR is the commit referenced above. If this is the case, what you should do is revert all subsequent commits. There are two ways to do this:
To use the second method, note that the first and last commit hashes to be cancelled are
[EDIT] Sorry @thomasw21, I hadn't seen your new reply when I sent my message! It's very kind of you!
This looks about right; it's just that you're going to need to force-push, which is always a bit risky.
@thomasw21, if the second option is chosen, it seems to me that there will be no need for a force push (which is perhaps the most "reassuring" thing to do) 🙂
Thanks so much @SaulLu and @thomasw21!
Can we merge this, @SaulLu?
Let's merge; the main thing was this to(device) call that we needed to check, but it seems to have worked so far.
I've implemented ALiBi efficiently, by just modifying the matmul_result matrix in transformer.py.
On my hardware, with the pretrain_gpt.py example, ALiBi runs just as fast as sinusoidal and uses 20MB of extra memory (that is 0.2% of the total memory usage).
I think this implementation could be made even more efficient by modifying the mask instead of matmul_result, but this would require modifying the CUDA code in scaled_upper_triang_masked_softmax and I don't have that knowledge.