
A fused apply_rotary_pos_emb implementation for Megatron-Core #1746

Merged: 4 commits into NVIDIA:master on Nov 14, 2023

Conversation

@yaox12 (Contributor) commented on Nov 9, 2023

This is a fused apply_rotary_pos_emb implementation for Megatron-Core.

In my preliminary benchmark, it gives a 2x-4x speedup over the unfused version; batch_size=2 and head_num=64 are fixed across all configurations. (A generic timing-harness sketch follows the table.)

| dtype | seq_length | hidden_size | rotary_percent | unfused rope | fused rope |
|---|---|---|---|---|---|
| torch.float32 | 2048 | 128 | 0.5 | 0.45 ms | 0.14 ms |
| torch.float32 | 2048 | 128 | 1.0 | 0.67 ms | 0.15 ms |
| torch.float32 | 2048 | 256 | 0.5 | 0.84 ms | 0.27 ms |
| torch.float32 | 2048 | 256 | 1.0 | 1.3 ms | 0.3 ms |
| torch.float32 | 4096 | 128 | 0.5 | 0.85 ms | 0.23 ms |
| torch.float32 | 4096 | 128 | 1.0 | 1.3 ms | 0.3 ms |
| torch.float32 | 4096 | 256 | 0.5 | 1.6 ms | 0.75 ms |
| torch.float32 | 4096 | 256 | 1.0 | 2.6 ms | 0.58 ms |
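For reference, a minimal sketch of a CUDA-event timing harness of the kind that could produce numbers like these. The `benchmark` helper and the tensor shapes are illustrative assumptions, not the script used for the table above; the call at the bottom assumes the `apply_rotary_pos_emb_fused` wrapper shown in the diff further down.

```python
import torch

def benchmark(fn, *args, warmup=10, iters=100):
    """Return the average runtime of fn(*args) in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Illustrative inputs in the [seq_len, batch, head_num, hidden_size] layout used above;
# freqs covers half of the last dimension, i.e. rotary_percent=0.5.
t = torch.randn(2048, 2, 64, 128, device="cuda", dtype=torch.float32)
freqs = torch.rand(2048, 1, 1, 64, device="cuda", dtype=torch.float32)

# e.g. print(f"fused rope: {benchmark(apply_rotary_pos_emb_fused, t, freqs):.2f} ms")
```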

Signed-off-by: Xin Yao <[email protected]>
@yaox12 changed the title from "A fused apply_rotary_pos_emb implementation" to "A fused apply_rotary_pos_emb implementation for Megatron-Core" on Nov 9, 2023
Collaborator commented:

how about using .cuh instead of .h for clarity?

Contributor Author (@yaox12) replied:

I just followed the naming convention under the csrc/megatron directory, e.g., generic_scaled_masked_softmax.h, scaled_masked_softmax.h, etc.
If you feel .cuh is better, I'm OK to change it.

Comment on lines 12 to 39
```python
# (imports from earlier in the file)
from typing import Tuple, Union

import torch


class FusedRoPEFunc(torch.autograd.Function):
    """Autograd wrapper around the fused RoPE CUDA extension."""

    @staticmethod
    def forward(
        ctx, t: torch.Tensor, cos_: torch.Tensor, sin_: torch.Tensor
    ) -> torch.Tensor:
        import fused_rotary_positional_embedding

        output = fused_rotary_positional_embedding.forward(t, cos_, sin_)
        # cos/sin are reused to rotate the incoming gradient in backward.
        ctx.save_for_backward(cos_, sin_)

        return output

    @staticmethod
    def backward(
        ctx, grad_output: torch.Tensor
    ) -> Tuple[Union[torch.Tensor, None], ...]:
        import fused_rotary_positional_embedding

        cos_, sin_ = ctx.saved_tensors
        grad_q = fused_rotary_positional_embedding.backward(grad_output, cos_, sin_)

        # No gradients are needed for cos_ and sin_.
        return grad_q, None, None


def apply_rotary_pos_emb_fused(t: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # Precompute cos/sin on the Python side; the CUDA extension fuses the rotation itself.
    cos_ = torch.cos(freqs).to(t.dtype)
    sin_ = torch.sin(freqs).to(t.dtype)
    return FusedRoPEFunc.apply(t, cos_, sin_)
```
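As a sanity check, one could compare the fused result against the standard unfused RoPE formula, t * cos(freqs) + rotate_half(t) * sin(freqs). The reference below is written from that well-known formula (assuming full rotation, i.e. freqs spanning the whole last dimension), not copied from Megatron-Core:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Standard RoPE helper: negate-and-swap the two halves of the last dimension.
    x1, x2 = torch.chunk(x, 2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb_reference(t: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # Unfused reference; assumes freqs covers the full last dimension of t.
    return t * torch.cos(freqs).to(t.dtype) + rotate_half(t) * torch.sin(freqs).to(t.dtype)

# torch.testing.assert_close(apply_rotary_pos_emb_fused(t, freqs),
#                            apply_rotary_pos_emb_reference(t, freqs))
```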
Collaborator commented:

Out of curiosity, wouldn't it be useful to have this in the apex.transformer.functional namespace?

Contributor Author (@yaox12) replied:

Thanks for your review. I have added two functions to the apex.transformer.functional namespace.

  1. fused_apply_rotary_pos_emb, which is a drop-in replacement for the current apply_rotary_pos_emb in Megatron Core (a usage sketch follows this comment).
  2. fused_apply_rotary_pos_emb_cached, which would be beneficial once MCore implements caching for the rotary positional embedding.
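
A minimal usage sketch of the drop-in replacement. The import path matches the namespace discussed above, but the tensor shapes (t as [seq_len, batch, num_heads, head_dim], freqs as [seq_len, 1, 1, rot_dim]) follow Megatron-Core's RoPE convention and are assumptions for illustration, not taken from this PR:

```python
import torch
from apex.transformer.functional import fused_apply_rotary_pos_emb

# Assumed shapes: t is [seq_len, batch, num_heads, head_dim],
# freqs is [seq_len, 1, 1, rot_dim] with rot_dim <= head_dim.
t = torch.randn(2048, 2, 64, 128, device="cuda", dtype=torch.float32, requires_grad=True)
freqs = torch.rand(2048, 1, 1, 64, device="cuda", dtype=torch.float32)

out = fused_apply_rotary_pos_emb(t, freqs)  # same call pattern as the unfused helper
out.sum().backward()                        # gradient flows through FusedRoPEFunc.backward
```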

@crcrpar merged commit 08f7402 into NVIDIA:master on Nov 14, 2023
@crcrpar added this to the 23.12 milestone on Nov 14, 2023