ALiBi Implementation #101

Merged · 7 commits · Sep 24, 2021

Conversation

@ofirpress (Contributor)

I've implemented ALiBi efficiently by just modifying the matmul_result matrix in transformer.py.

On my hardware, with the pretrain_gpt.py example, ALiBi runs just as fast as sinusoidal and uses 20MB of extra memory (that is 0.2% of the total memory usage).

I think this implementation could be made even more efficient by modifying the mask instead of matmul_result, but this would require modifying the cuda code in scaled_upper_triang_masked_softmax and I don't have that knowledge.
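
For context, a minimal sketch of the idea (illustrative names only; it assumes a power-of-2 head count and omits the batching, slicing, and masking done in the actual transformer.py changes):

import math
import torch

def build_alibi_bias(seq_len, num_heads):
    # Per-head slopes form a geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    # (power-of-2 head counts only; other counts use the workaround discussed below).
    slopes = torch.tensor([2 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)])
    # Linear bias that grows with key position: shape [num_heads, 1, seq_len].
    return slopes[:, None, None] * torch.arange(seq_len)[None, None, :]

# The bias is simply added to the raw Q.K^T scores before masking and softmax.
num_heads, seq_len, head_dim = 4, 8, 16
q = torch.randn(num_heads, seq_len, head_dim)
k = torch.randn(num_heads, seq_len, head_dim)
scores = q @ k.transpose(-1, -2) / math.sqrt(head_dim) + build_alibi_bias(seq_len, num_heads)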

@ibeltagy (Member) left a comment

Looks great, thanks @ofirpress.

I left a couple of questions, mainly wondering if certain constructs are safe to use in the model parallelism setup.

megatron/model/transformer.py
@@ -660,11 +684,18 @@ def build_layer(layer_number):
get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
checkpoint = deepspeed.checkpointing.checkpoint

if args.position_embedding_type == PositionEmbeddingType.alibi:
    self.alibi = self._build_alibi_tensor(args.seq_length, args.num_attention_heads, args.micro_batch_size).to(torch.cuda.current_device())
Member

Is .to(torch.cuda.current_device()) safe in the multi-gpu model parallelism setup?
@stas00, @slippylolo do you know?

Contributor

This needs to be handled with care. But on the other hand, things quickly break if it's not done right, and pytorch lets you know quickly ;)

It should already allocate the tensor on the current device in __init__(). Does it break if you don't call to() explicitly, @ofirpress?

Note that matmul_result is created during forward, which indeed has the correct device set.

The other approach, which avoids needing torch.cuda.current_device() in forward, is to use the .device of one of the inputs.

Contributor

The key is to add a good test; we have a multi-GPU CI, so it should be easy to validate.

@ofirpress (Contributor, Author) Sep 16, 2021

Does it break if you don't call to() explicitly, @ofirpress?

Yes, it says the alibi tensor is on the CPU if I don't to() it.

The other approach, which avoids needing torch.cuda.current_device() in forward, is to use the .device of one of the inputs.

But during init do we have any of the inputs yet?

Contributor

In init you typically have other params, but it's probably OK the way you did it.

Usually the model is made of registered params and buffers, and those get switched to the right device automatically along with the sub-module, so custom tensors not attached to the model are tricky. Typically you have to update them in forward to be on the same device as the inputs or params.

I haven't looked at the full context, so let's revisit this when the tests are added if you're running into problems.

Contributor

If it can be done in forward, then that's where we want it done, I believe, because only in forward do you know which device you're on. You can't rely on where it was init'ed, if that makes sense.

@ofirpress (Contributor, Author) Sep 16, 2021

I'm not sure what the elegant way to do it is, but I guess we can do something like:

if self.alibi.device != some_input.device:
    self.alibi.to(some_input.device)

but I'm not sure if that's OK.

Contributor

There is no need for the if; it's a no-op if it's already on the right device. So just the to() call (plus assignment), as in the example I linked to.
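
A minimal sketch of that pattern (hypothetical module and names; the real transformer.py code differs):

import torch
import torch.nn as nn

class AlibiHolder(nn.Module):
    """Hypothetical module illustrating the device-handling pattern discussed above."""

    def __init__(self, alibi):
        super().__init__()
        # A plain tensor, not a registered buffer, so module.to(device) won't move it.
        self.alibi = alibi

    def forward(self, hidden_states):
        # .to() is a no-op when alibi already lives on hidden_states.device,
        # so only the reassignment is needed -- no explicit device check.
        self.alibi = self.alibi.to(hidden_states.device)
        # ... the attention computation would use self.alibi here ...
        return hidden_states

module = AlibiHolder(torch.zeros(4, 1, 16))
out = module(torch.randn(2, 16))   # works wherever the inputs live (CPU or GPU)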

Member

I am happy merging the PR as is and fixing this later if it ends up being a problem. As @stas00 said, if it is wrong, pytorch will complain.

megatron/model/transformer.py (outdated)
@stas00 (Contributor) commented Sep 16, 2021

I think this implementation could be made even more efficient by modifying the mask instead of matmul_result, but this would require modifying the cuda code in scaled_upper_triang_masked_softmax and I don't have that knowledge.

It's always good to start simple and then improve. So you could complete this work using your expertise, write good tests, make it a solid component, and merge it. Then post an issue inviting someone with CUDA knowledge to further improve upon your work.

This is just a suggestion, of course.

@thomasw21 (Member) left a comment

Thanks! I have some questions concerning the current implementation:

  • We usually have a --reset-position-ids option to reset the position ids within a row when we hit an end-of-document token, i.e. in a row we could have [0,1,2,3,4,0,1,2,3] (in the absolute embedding implementation). This does conflict with the notion of a static alibi matrix, though. I don't think the current implementation handles this case?
  • We have two options, --reset-position-ids and --reset-attention-mask, which up until now were quite distinct, but this starts to get mixed up with alibi. I'm not sure what the behaviour of alibi should be for cases where one is True and the other is False...

@@ -594,6 +597,27 @@ def forward(self, inputs, **kwargs):
class ParallelTransformer(MegatronModule):
    """Transformer class."""

    @staticmethod
    def _build_alibi_tensor(max_seq_len, num_attention_heads, batch_size):
Member

Can we build a separate class for this? Something like class AlibiPositionEmbedding. There's a position_embedding.py file with a rotary embedding implementation.

Comment on lines +613 to +614
closest_power_of_2 = 2 ** math.floor(math.log2(n))
return get_slopes_power_of_2(closest_power_of_2) + get_slopes(2 * closest_power_of_2)[0::2][
Member

Why is it important to check that n is always a power of 2?

@ofirpress (Contributor, Author)

Explained in the comment here: https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py#L749

#In the paper, we only train models that have 2^a heads for some a. This function has
#some good properties that only occur when the input is a power of 2. To maintain that even
#when the number of heads is not a power of 2, we use this workaround. 
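
For readers who don't want to follow the link, a sketch of that slope computation based on the linked fairseq file (reproduced here for readability; treat it as illustrative):

import math

def get_slopes(n):
    def get_slopes_power_of_2(n):
        # Geometric sequence starting at 2^(-8/n) with ratio 2^(-8/n),
        # i.e. slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8).
        start = 2 ** (-2 ** -(math.log2(n) - 3))
        return [start * start ** i for i in range(n)]

    if math.log2(n).is_integer():
        return get_slopes_power_of_2(n)
    # Workaround for head counts that aren't a power of 2: take the slopes for the
    # closest smaller power of 2, then fill the rest by interleaving slopes
    # computed for twice that size.
    closest_power_of_2 = 2 ** math.floor(math.log2(n))
    return (get_slopes_power_of_2(closest_power_of_2)
            + get_slopes(2 * closest_power_of_2)[0::2][:n - closest_power_of_2])

print(get_slopes(12))   # 12 heads: the 8 power-of-2 slopes plus 4 interleaved from the 16-head set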

return get_slopes_power_of_2(closest_power_of_2) + get_slopes(2 * closest_power_of_2)[0::2][
:n - closest_power_of_2]
slopes = torch.Tensor(get_slopes(num_attention_heads))
alibi = slopes.unsqueeze(1).unsqueeze(1) * torch.arange(max_seq_len).unsqueeze(0).unsqueeze(0).expand(num_attention_heads, -1, -1)
Member

Suggested change
alibi = slopes.unsqueeze(1).unsqueeze(1) * torch.arange(max_seq_len).unsqueeze(0).unsqueeze(0).expand(num_attention_heads, -1, -1)
alibi = slopes[:, None, None] * torch.arange(max_seq_len)[None, None, :].expand(num_attention_heads, -1, -1)

Also, can you carry over the same comment from your original implementation? In particular, explaining why the matrix here doesn't match the one in the paper?
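
(A quick equivalence check of the two spellings, with made-up sizes, for readers less familiar with None-indexing:)

import torch

slopes = torch.randn(8)          # one slope per attention head
positions = torch.arange(1024)   # key positions

a = slopes.unsqueeze(1).unsqueeze(1) * positions.unsqueeze(0).unsqueeze(0).expand(8, -1, -1)
b = slopes[:, None, None] * positions[None, None, :].expand(8, -1, -1)
assert torch.equal(a, b)         # identical [8, 1, 1024] tensors either way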

megatron/model/transformer.py (outdated)
    dtype=query_layer.dtype,
    device=torch.cuda.current_device())
else:
    matmul_result = alibi[:output_size[0]*output_size[1], :, :output_size[3]]
Member

https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/101/files#diff-46c4c76deb18adf1de8e0be6d4229baed5f1f0308e141479f5a993b3d83dd445R307

When using baddbmm, shouldn't you set beta to 1? Otherwise alibi is ignored, no?
https://pytorch.org/docs/stable/generated/torch.baddbmm.html

If you end up modifying beta, then you probably have to replace empty with zeros in the if alibi is None branch.

Member

I see, you are right. beta here:

beta=0.0, alpha=(1.0/self.norm_factor))

should be 1.0 instead.

Member

How does the paper handle the normalizing factor? Is it applied after the sum?

@ofirpress (Contributor, Author)

When using baddbmm, shouldn't you set beta to 1?

Yes!!! I meant to do that and then totally forgot! Thanks so much for pointing this out!!!

If you end up modifying beta, then you probably have to replace empty with zeros in the if alibi is None branch.

Wouldn't it be better to do beta = 1 if alibi is not None else 0?

How does the paper handle the normalizing factor? Is it applied after the sum?

Which normalizing factor are you referring to? The softmax? If so, we apply the softmax after adding the ALiBi bias.

@thomasw21 (Member) Sep 16, 2021

I mean the 1.0 / self.norm_factor part. Is it applied only to QK^T, or to the entire unnormalized attention matrix?

@ofirpress (Contributor, Author) Sep 16, 2021

Ah ok, I understand. We apply it before the sum, see https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/modules/multihead_attention.py#L225

(This means alpha will remain unchanged here)
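
Putting this sub-thread together, a hedged sketch of what the corrected call ends up looking like (illustrative shapes and names; the real code builds matmul_result from output_size and slices the precomputed alibi tensor):

import torch

b_times_np, sq, sk, hn = 8, 16, 16, 64        # batch*heads, query len, key len, head dim
norm_factor = hn ** 0.5

query_layer = torch.randn(b_times_np, sq, hn)
key_layer = torch.randn(b_times_np, sk, hn)
alibi = torch.randn(b_times_np, 1, sk)        # precomputed bias, broadcast over query positions

if alibi is None:
    matmul_result = torch.empty(b_times_np, sq, sk)   # contents ignored when beta=0
    beta = 0.0
else:
    matmul_result = alibi                             # bias must be kept, hence beta=1
    beta = 1.0

# alpha scales only Q.K^T; the ALiBi bias is added unscaled, matching the reference
# implementation where 1/norm_factor is applied before the sum.
attention_scores = torch.baddbmm(
    matmul_result,
    query_layer,                   # [b*np, sq, hn]
    key_layer.transpose(1, 2),     # [b*np, hn, sk]
    beta=beta,
    alpha=1.0 / norm_factor,
)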

@ofirpress (Contributor, Author)

Fixed now :)

@ibeltagy (Member)

--reset-position-ids and --reset-attention-mask

We do not need to worry about this right now. We can do something like: if alibi and (reset position ids or reset attention mask), raise an error.

@ofirpress (Contributor, Author)

Thanks! I have some questions concerning the current implementation:

  • We usually have a --reset-position-ids option to reset the position ids within a row when we hit an end-of-document token, i.e. in a row we could have [0,1,2,3,4,0,1,2,3] (in the absolute embedding implementation). This does conflict with the notion of a static alibi matrix, though. I don't think the current implementation handles this case?
  • We have two options, --reset-position-ids and --reset-attention-mask, which up until now were quite distinct, but this starts to get mixed up with alibi. I'm not sure what the behaviour of alibi should be for cases where one is True and the other is False...

I'm not quite sure why you would want to reset the absolute position embeddings when starting a new document, but with relative position embeddings I'm pretty sure there is no need for that. If you have an attention mask that masks out other documents from the current one, then that will also work directly with ALiBi and any other relative position method; no modifications are needed there.
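
A tiny sketch of that point, with hypothetical names (the additive -inf masking mirrors what a masked softmax does):

import torch

num_heads, seq_len = 2, 6
scores = torch.randn(num_heads, seq_len, seq_len)   # Q.K^T for one sequence
alibi = torch.randn(num_heads, 1, seq_len)          # relative-position bias

# Pretend positions 0-2 and 3-5 are two packed documents; mask cross-document attention.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 1])
cross_doc = doc_ids[:, None] != doc_ids[None, :]

# The mask is applied after the bias is added, so cross-document entries get zero
# probability regardless of the bias values -- no ALiBi-specific handling needed.
probs = torch.softmax((scores + alibi).masked_fill(cross_doc, float("-inf")), dim=-1)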

Co-authored-by: Thomas Wang <[email protected]>
@stas00 (Contributor) commented Sep 16, 2021

A quick note to reviewers: please make sure that all new features include extensive multi-GPU testing before the feature is merged. You can of course develop the tests on a single GPU, since most people have only one.

You will find the existing tests under tests/, which you can build upon. If you need help with anything testing-wise, or just want to validate that a test looks good, please don't hesitate to tag me. Thank you!

@ofirpress (Contributor, Author)

I just fixed the bug with the beta value in the attention computation. I've tested this by running the pretrain_gpt example twice, once with ALiBi and once without, and they both train to the end, with the ALiBi model achieving better performance, which means it's actually working (unlike before, where the ALiBi matrix was being ignored).

Thanks so much @thomasw21 for pointing out that bug!

@thomasw21 (Member)

Hmm, smells like a bad merge. Any chance you merged with master instead of main?

@ofirpress (Contributor, Author)

I'm not great with git so that's definitely possible. What can I do to fix this? Thanks!

@thomasw21 (Member)

Can you allow me to write on your fork? I can try to fix it.

@SaulLu (Collaborator) commented Sep 24, 2021

It looks like the last change you made to this PR is the "bugfix with beta value in attention computation" commit, with hash 5b5afb2.

If this is the case, what you should do is revert all subsequent commits. There are two ways to do this:

  1. by rewriting history
  2. by creating commits that undo, one by one, the commits you want to remove

To use the second method, noting that the first and last commits to be cancelled are 22a6e17 and a49461c, it seems to me that the appropriate command would be:

git revert 22a6e17...a49461c

[EDIT] Sorry @thomasw21, I hadn't seen your new reply when I sent my message! It's very kind of you!

@thomasw21 (Member)

This looks about right; it's just that you're going to need to force-push, which is always a bit risky.

@SaulLu (Collaborator) commented Sep 24, 2021

@thomasw21, if the second option is chosen, it seems to me that there will be no need for a force push (which is perhaps the most "reassuring" thing to do) 🙂

@ofirpress (Contributor, Author)

Thanks so much @SaulLu and @thomasw21!

@thomasw21 (Member)

Can we merge this @SaulLu ?

@TevenLeScao (Collaborator) left a comment

Let's merge; the main thing is the to(device) call that we needed to check, but it seems to have worked so far.

adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Jun 21, 2023