ALiBi Implementation #101
Conversation
Looks great, thanks @ofirpress.
I left a couple of questions, mainly wondering if certain constructs are safe to use in the model parallelism setup.
@@ -660,11 +684,18 @@ def build_layer(layer_number):
        get_cuda_rng_tracker = deepspeed.checkpointing.get_cuda_rng_tracker
        checkpoint = deepspeed.checkpointing.checkpoint

        if args.position_embedding_type == PositionEmbeddingType.alibi:
            self.alibi = self._build_alibi_tensor(args.seq_length, args.num_attention_heads, args.micro_batch_size).to(torch.cuda.current_device())
Is .to(torch.cuda.current_device()) safe in the multi-gpu model parallelism setup?
@stas00, @slippylolo do you know?
This needs to be treated with care, but on the other hand things quickly break if it's not done right, and pytorch lets you know quickly ;)
It should already allocate the tensor on the current device in __init__(). Does it break if you don't call to() explicitly, @ofirpress?
Note that matmul_result is created during forward, which indeed has the correct device set.
The other approach, to avoid needing torch.cuda.current_device() in forward, is to use .device of one of the inputs.
The key is to add a good test, and we have a multi-gpu CI, so it should be easy to validate.
> Does it break if you don't call to() explicitly, @ofirpress?

Yes, it says that the alibi tensor is on the CPU if I don't .to() it.

> The other approach, to avoid needing torch.cuda.current_device() in forward, is to use .device of one of the inputs.

But during __init__ do we have any of the inputs yet?
In __init__ you typically have other params, but it's probably OK the way you did it.
Usually the model is made of registered params and buffers, and those get automatically switched to the right device together with the sub-module. Custom tensors not attached to the model are tricky, and typically you have to update them in forward to be on the same device as the inputs or params.
I haven't looked at the full context, so let's revisit this when the tests are added if you're running into problems.
If it can be done in forward, then that's where we want it done, I believe, because only in forward do you know which device you're on. You can't rely on where it was init'ed. If that makes sense.
I guess I'm not sure what the elegant way to do it is.
I guess we can do something like:
if self.alibi.device != some_input.device:
    self.alibi.to(some_input.device)
but I'm not sure if that's OK.
There is no need for the if; it's a no-op if it's already on the right device. So just the to() call (plus assignment), as in the example I linked to.
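For illustration, a minimal sketch of that pattern; the module name and the placeholder slopes here are hypothetical, not the PR's actual code. The point is that a plain tensor attribute is not a registered buffer, so module.to(device) will not move it, and the assignment-plus-to() call in forward is free once the tensor already lives on the right device:

```python
import torch
import torch.nn as nn

class AlibiHolder(nn.Module):
    """Toy module (hypothetical) illustrating the device-sync pattern."""

    def __init__(self, num_heads: int, seq_len: int):
        super().__init__()
        # Placeholder slopes just for the example; the real ones come from get_slopes().
        slopes = 1.0 / torch.arange(1, num_heads + 1, dtype=torch.float32)
        # A plain attribute is NOT a registered buffer, so module.to(device) / .cuda()
        # will not move it automatically.
        self.alibi = slopes[:, None, None] * torch.arange(seq_len, dtype=torch.float32)[None, None, :]

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # No-op if already on the right device, otherwise a one-time copy;
        # the assignment keeps it there for subsequent calls.
        self.alibi = self.alibi.to(scores.device)
        return scores + self.alibi

# scores: (num_heads, seq_len, seq_len) attention logits on some device
m = AlibiHolder(num_heads=8, seq_len=16)
biased = m(torch.zeros(8, 16, 16))  # works on CPU or GPU without touching __init__
```

Alternatively, registering the tensor with register_buffer would let module.to(device) move it automatically, at the cost of it appearing in the state dict (unless persistent=False).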
I am happy merging the PR as is and then fixing this later if it ends up being a problem. As @stas00 said, if it is wrong, pytorch will complain.
It's always good to start simple and then improve. So you could complete this work using your expertise, write good tests, make it a solid component and merge it. Then post an issue inviting someone with CUDA knowledge to further improve upon your work. This is just a suggestion, of course.
Thanks! I have some questions concerning the current implementation:
- We usually have a --reset-position-ids flag to reset the position ids within a row when we hit an end-of-document token, i.e. in a row we could have [0,1,2,3,4,0,1,2,3] (in the absolute embedding implementation). This conflicts with the notion of a static alibi matrix though. I don't think the current implementation handles this case?
- We have two options, --reset-position-ids and --reset-attention-mask, which up until now were quite distinct, but it starts to get mixed up with alibi. I'm not sure what the behaviour of alibi should be for cases where one is True and the other one is False...
@@ -594,6 +597,27 @@ def forward(self, inputs, **kwargs):
class ParallelTransformer(MegatronModule):
    """Transformer class."""

    @staticmethod
    def _build_alibi_tensor(max_seq_len, num_attention_heads, batch_size):
Can we build a separate class for this? Something like class AlibiPositionEmbedding. There's a position_embedding.py file where there's a rotary embedding implementation.
            closest_power_of_2 = 2 ** math.floor(math.log2(n))
            return get_slopes_power_of_2(closest_power_of_2) + get_slopes(2 * closest_power_of_2)[0::2][
Why is it important to check that n is always a power of 2?
Explained in the comment here: https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py#L749
#In the paper, we only train models that have 2^a heads for some a. This function has
#some good properties that only occur when the input is a power of 2. To maintain that even
#when the number of heads is not a power of 2, we use this workaround.
            return get_slopes_power_of_2(closest_power_of_2) + get_slopes(2 * closest_power_of_2)[0::2][
                :n - closest_power_of_2]
        slopes = torch.Tensor(get_slopes(num_attention_heads))
        alibi = slopes.unsqueeze(1).unsqueeze(1) * torch.arange(max_seq_len).unsqueeze(0).unsqueeze(0).expand(num_attention_heads, -1, -1)
Suggested change:
-        alibi = slopes.unsqueeze(1).unsqueeze(1) * torch.arange(max_seq_len).unsqueeze(0).unsqueeze(0).expand(num_attention_heads, -1, -1)
+        alibi = slopes[:, None, None] * torch.arange(max_seq_len)[None, None, :].expand(num_attention_heads, -1, -1)
Also, can you bring over the same comment as in your original implementation? In particular the part explaining where the matrix here doesn't match the one in the paper.
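For context, here is a rough standalone reconstruction of how the quoted pieces fit together, following the linked fairseq implementation. The function names and the final repeat over the micro-batch are assumptions about the surrounding Megatron code, not a verbatim copy of the PR:

```python
import math
import torch

def get_slopes(n):
    # Head-specific geometric slopes; the closed form below only has the nice
    # properties when n is a power of 2, hence the workaround for other n.
    def get_slopes_power_of_2(n):
        start = 2 ** (-2 ** -(math.log2(n) - 3))
        return [start * start ** i for i in range(n)]

    if math.log2(n).is_integer():
        return get_slopes_power_of_2(n)
    closest_power_of_2 = 2 ** math.floor(math.log2(n))
    # Take every other slope from the next power of 2 to cover the remaining heads.
    return (get_slopes_power_of_2(closest_power_of_2)
            + get_slopes(2 * closest_power_of_2)[0::2][: n - closest_power_of_2])

def build_alibi_tensor(max_seq_len, num_attention_heads, batch_size):
    slopes = torch.tensor(get_slopes(num_attention_heads))
    # (num_heads, 1, max_seq_len): one linearly increasing bias row per head.
    alibi = slopes[:, None, None] * torch.arange(max_seq_len)[None, None, :].expand(
        num_attention_heads, -1, -1)
    # Repeat across the micro-batch so it lines up with the (b*np, sq, sk) scores
    # used later in baddbmm (this layout is an assumption, not taken from the PR).
    return alibi.repeat(batch_size, 1, 1)
```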
                dtype=query_layer.dtype,
                device=torch.cuda.current_device())
        else:
            matmul_result = alibi[:output_size[0]*output_size[1], :, :output_size[3]]
When using baddbmm, shouldn't you set beta to 1? Otherwise alibi would be ignored, no?
https://pytorch.org/docs/stable/generated/torch.baddbmm.html
If you end up modifying beta, then you probably have to replace empty with zeros in the if alibi is None branch.
I see, you are right. beta here:
beta=0.0, alpha=(1.0/self.norm_factor))
How does the paper handle the normalizing factor? Is it applied after the sum?
> When using baddbmm, shouldn't you set beta to 1?

Yes!!! I meant to do that and then totally forgot! Thanks so much for pointing this out!!!

> If you end up modifying beta, then you probably have to replace empty with zeros in the if alibi is None branch.

Wouldn't it be better to do beta = 1 if alibi is not None else 0?

> How does the paper handle the normalizing factor? Is it applied after the sum?

Which normalizing factor are you referring to? The softmax? If so, we apply the softmax after adding the ALiBi bias.
I mean the 1.0 / self.norm_factor part. Is it applied only to QK^T, or to the entire unnormalized attention matrix?
Ah ok, I understand. We apply it before the sum, see https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/modules/multihead_attention.py#L225
(This means alpha will remain unchanged here)
Fixed now :)
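To make the beta discussion concrete, here is a sketch of the corrected call; the shapes and variable names are illustrative, not the exact transformer.py code. torch.baddbmm computes beta * input + alpha * (batch1 @ batch2), so beta=1.0 keeps the ALiBi bias while alpha=1/norm_factor scales only the QK^T product (matching the reference implementation, where the scaling is applied before the sum); with no ALiBi, beta=0.0 lets the buffer stay uninitialised:

```python
import torch

def scaled_scores(query_layer, key_layer, alibi, norm_factor):
    # query_layer: (b*np, sq, hn), key_layer: (b*np, hn, sk) -- illustrative shapes only.
    if alibi is None:
        # Nothing is added, so beta=0.0 lets the pre-allocated buffer stay uninitialised.
        matmul_input = torch.empty(
            query_layer.size(0), query_layer.size(1), key_layer.size(2),
            dtype=query_layer.dtype, device=query_layer.device)
        beta = 0.0
    else:
        matmul_input = alibi[:query_layer.size(0), :, :key_layer.size(2)]
        beta = 1.0
    # out = beta * matmul_input + alpha * (query_layer @ key_layer):
    # the 1/norm_factor scaling only touches the QK^T product, not the ALiBi bias.
    return torch.baddbmm(matmul_input, query_layer, key_layer,
                         beta=beta, alpha=1.0 / norm_factor)
```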
We do not need to worry about this right now. We can do something like
I'm not quite sure why you would want to reset the absolute position embeddings when starting a new document, but with relative position embeddings there is no need for that, I'm pretty sure. If you have an attention mask that masks out other documents from the current one, then that will directly also work with ALiBi and any other relative position method; no modifications are needed there.
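As an illustration of that point (the function name and layout are hypothetical, not part of the PR), a per-document mask can be derived from reset position ids and intersected with the usual causal mask; ALiBi is added to the raw scores as usual and the cross-document entries are masked afterwards, so the ALiBi matrix itself never changes:

```python
import torch

def packed_document_mask(position_ids: torch.Tensor) -> torch.Tensor:
    # position_ids: (seq_len,) with a reset to 0 at each document start,
    # e.g. [0, 1, 2, 3, 4, 0, 1, 2, 3].
    seq_len = position_ids.size(0)
    # Recover a document id for every token from the resets.
    doc_ids = torch.cumsum((position_ids == 0).long(), dim=0)
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # True = allowed to attend. The disallowed entries are later set to -inf
    # after the (scaled QK^T + ALiBi) scores are computed.
    return same_doc & causal
```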
Co-authored-by: Thomas Wang <[email protected]>
A quick note to reviewers - please make sure that all new features include extensive multi-gpu testing before the feature is merged. You can of course develop the tests on a single gpu, since most people have only one. You will find the existing tests under
I just fixed the bug with the beta value in the attention computation. I've tested this by running the pretrain_gpt example twice, once with ALiBi and once without, and they both train until the end, with the ALiBi model achieving better performance, which means it's actually working (unlike before, where the ALiBi matrix was being ignored). Thanks so much @thomasw21 for pointing out that bug!
Hmm, smells like a bad merge. Any chance you merged with
I'm not great with git, so that's definitely possible. What can I do to fix this? Thanks!
Can you allow me to write on your fork? I can try to fix it.
It looks like the last change you made to this PR is the commit referenced above. If this is the case, what you should do is revert all subsequent commits. There are two ways to do this:
To use the second method, note that the first and last commit hashes to be cancelled are
[EDIT] Sorry @thomasw21, I hadn't seen your new reply when I sent my message! It's very kind of you!
This looks about right; it's just that you're going to need to force-push, which is always a bit risky.
@thomasw21, if the second option is chosen, it seems to me that there will be no need for a force push (which is perhaps the most "reassuring" thing to do) 🙂
Thanks so much @SaulLu and @thomasw21!
Can we merge this, @SaulLu?
Let's merge; the main thing was this to(device) call that we needed to check, but it seems to have worked so far.
I've implemented ALiBi efficiently, by just modifying the matmul_result matrix in transformer.py.
On my hardware, with the pretrain_gpt.py example, ALiBi runs just as fast as sinusoidal and uses 20MB of extra memory (that is 0.2% of the total memory usage).
I think this implementation could be made even more efficient by modifying the mask instead of matmul_result, but this would require modifying the CUDA code in scaled_upper_triang_masked_softmax and I don't have that knowledge.