Ring attention #181

zzhhjjj · 2024-05-23T13:11:52Z

Ring attention for training on long sequences, similar to Megatron's context parallelism. The idea comes from https://github.com/zhuzilin/ring-flash-attention
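For readers unfamiliar with the technique, here is a minimal single-process sketch of the idea, not the code in this PR. It assumes non-causal attention, one head, and plain Python lists. The names `full_attention` and `ring_attention` are hypothetical. Each simulated rank keeps its query shard, visits the key/value shards one ring step at a time, and merges partial results with a running softmax maximum and sum. A real implementation would pass K/V between devices and use fused kernels.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, k, v):
    """Reference: plain softmax attention over the whole sequence (non-causal)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [dot(qi, kj) * scale for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([dot(weights, [vj[d] for vj in v]) / total for d in range(len(v[0]))])
    return out


def ring_attention(q_shards, k_shards, v_shards):
    """Simulate the ring on one process: shard r plays the role of rank r."""
    world = len(q_shards)
    scale = 1.0 / math.sqrt(len(q_shards[0][0]))
    dim_v = len(v_shards[0][0])
    out = []
    for rank in range(world):
        q = q_shards[rank]
        run_max = [-math.inf] * len(q)  # running max of scores per query row
        run_sum = [0.0] * len(q)        # running softmax denominator
        acc = [[0.0] * dim_v for _ in q]  # running unnormalized output
        for step in range(world):
            # After `step` rotations, this rank holds the K/V shard of rank - step.
            src = (rank - step) % world
            k, v = k_shards[src], v_shards[src]
            for i, qi in enumerate(q):
                scores = [dot(qi, kj) * scale for kj in k]
                new_max = max(run_max[i], max(scores))
                rescale = math.exp(run_max[i] - new_max)  # 0.0 on the first step
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[i] = run_sum[i] * rescale + sum(weights)
                acc[i] = [
                    a * rescale + dot(weights, [vj[d] for vj in v])
                    for d, a in enumerate(acc[i])
                ]
                run_max[i] = new_max
        out.extend([a / run_sum[i] for a in acc[i]] for i in range(len(q)))
    return out
```

Splitting a sequence into equal contiguous shards and concatenating the per-rank outputs should reproduce `full_attention` up to floating-point error, which is the property that lets each device hold only a fraction of the sequence.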

3outeille · 2024-07-24T09:27:19Z

src/nanotron/models/llama.py


+# Copied from transformers: non-interleaved version of RoPE. Will be refactored later.
+def rotate_half(x):
+    """Rotates half the hidden dims of the input."""
+    x1 = x[..., : x.shape[-1] // 2]
+    x2 = x[..., x.shape[-1] // 2 :]
+    return torch.cat((-x2, x1), dim=-1)
+
+
+class LlamaRotaryEmbedding(nn.Module):
+    def __init__(self, dim: int, end: int, theta: float = 500000.0):
+        super().__init__()
+        self.dim = dim
+        self.end = end
+        self.theta = theta
+        self.init_rotary_embeddings()
+
+    def init_rotary_embeddings(self):
+        inv_freq = 1.0 / (self.theta ** (torch.arange(0, self.dim, 2, dtype=torch.float, device="cuda") / self.dim))
+        self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+    @torch.no_grad()
+    def forward(
+        self,
+        x: torch.Tensor,  # [batch_size, seq_length, num_heads, d_qk]
+        position_ids: Optional[torch.LongTensor],  # [batch_size, seq_length]
+    ):
+        # cos/sin depend only on position_ids; x supplies the device type and output dtype
+        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
+        position_ids_expanded = position_ids[:, None, :].float()
+        # Force float32 since bfloat16 loses precision on long contexts
+        # See https://github.com/huggingface/transformers/pull/29285
+        device_type = x.device.type
+        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
+        with torch.autocast(device_type=device_type, enabled=False):
+            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+            emb = torch.cat((freqs, freqs), dim=-1)
+            cos = emb.cos()
+            sin = emb.sin()
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+
+
+def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=2):
+    """Applies Rotary Position Embedding to the query and key tensors.
+
+    Args:
+        q (`torch.Tensor`): The query tensor.
+        k (`torch.Tensor`): The key tensor.
+        cos (`torch.Tensor`): The cosine part of the rotary embedding.
+        sin (`torch.Tensor`): The sine part of the rotary embedding.
+        unsqueeze_dim (`int`, *optional*, defaults to 2):
+            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
+            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
+            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
+            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
+            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
+            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
+    Returns:
+        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+    """
+    cos = cos.unsqueeze(unsqueeze_dim)
+    sin = sin.unsqueeze(unsqueeze_dim)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+


Can we open a separate PR first to replace the current RotaryEmbedding?
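As context for the diff above, a small pure-Python sketch of the non-interleaved ("rotate half") RoPE layout. It is an illustration under simplifying assumptions (a single vector, no batching, no torch), and `rope` is a hypothetical helper name. Element `i` is paired with element `i + dim/2`, and each pair is rotated by `position * inv_freq[i]`, mirroring `torch.cat((freqs, freqs), dim=-1)` in the diff.

```python
import math


def rotate_half(x):
    """Same layout as the diff: second half negated, then first half."""
    half = len(x) // 2
    return [-value for value in x[half:]] + x[:half]


def rope(x, position, theta=500000.0):
    """Apply non-interleaved RoPE to one vector at one position."""
    dim = len(x)
    inv_freq = [1.0 / theta ** (i / dim) for i in range(0, dim, 2)]
    # Duplicate the angles so index i and index i + dim/2 share a frequency.
    angles = [position * f for f in inv_freq] * 2
    rotated = rotate_half(x)
    return [xi * math.cos(a) + ri * math.sin(a) for xi, ri, a in zip(x, rotated, angles)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

With this layout, the dot product of a rotated query and a rotated key depends only on the difference of their positions, so shifting both positions by the same offset leaves the attention score unchanged up to floating-point error.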

3outeille · 2024-07-24T09:39:07Z

tests/test_serialize.py

-    init_distributed(tp=tp, dp=dp, pp=pp)(_test_save_zero_optimizer_and_load_optimizer)(test_context=test_context)
+    # Currently SP doesn't support zero.
+    if sp != 1:
+        return


Print a message here explaining why the test is skipped.
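One possible way to address this, as a sketch only: the helper name `should_skip_zero_test` is hypothetical and not part of the PR. It reports why the test body is skipped instead of returning silently.

```python
def should_skip_zero_test(sp: int) -> bool:
    """Return True (and say why) when the ZeRO test cannot run with this SP size."""
    if sp != 1:
        # Reason taken from the comment in the diff: SP does not support ZeRO yet.
        print(f"Skipping ZeRO optimizer test: SP (sp={sp}) doesn't support ZeRO yet.")
        return True
    return False
```

In a pytest suite, calling `pytest.skip` with a reason string would serve the same purpose and would also mark the test as skipped in the report.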

3outeille · 2024-07-24T09:39:13Z

tests/test_serialize.py

-    # We use DP=2 as we're interested in testing that one
-    init_distributed(tp=tp, dp=dp, pp=pp)(_test_save_zero_optimizer_and_load_data_parallel_optimizer)(
+    if sp != 1:
+        return


Print a message here explaining why the test is skipped.

xrsrke

Resolve merge conflicts!

first commit for ring attention

471e91b

zzhhjjj changed the title ~~first commit for ring attention~~ Ring attention May 23, 2024

zzhhjjj added 20 commits June 11, 2024 08:39

still testing

111c4fc

refactor some code

3ef191d

add SP to the tests

019d263

ddp process group for backward

07ebf2c

Merge branch 'main' into ring_attention

beefa30

clean code

3f504b3

update tests to add SP process group

3f841ba

clean code

9208928

refactor code

dbb8eb7

Merge remote-tracking branch 'upstream/main' into ring_attention

e750a8f

fix tests

16600f0

fix tests

175a076

another test

61486e7

import flash attention

d5e8332

add readme for training

2a82f5c

correct link

87649df

evaluation script

c6bbeda

change folder location

72d9488

update Readme

5907339

update README

c81601a

3outeille reviewed Jul 24, 2024

xrsrke requested changes Sep 5, 2024
