Add RoPE positional encoding #714

gordicaleksa · 2024-07-28T10:51:47Z

Implemented RoPE - rotary position embedding from the RoFormer paper.
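
For readers unfamiliar with the technique, here is a minimal pure-Python sketch of the rotation RoPE applies to each query/key head vector. It is not the CUDA kernel from this PR: the function names, the interleaved pairing of channels (2i, 2i+1), and the base of 10000 are assumptions taken from the usual RoFormer formulation.

```python
import math

def rope_freqs(head_dim, base=10000.0):
    # One frequency per channel pair: theta_i = base^(-2i / head_dim).
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(x, pos, base=10000.0):
    """Rotate one head vector x (even length) for token position pos."""
    head_dim = len(x)
    assert head_dim % 2 == 0, "head_dim must be even"
    out = [0.0] * head_dim
    for i, theta in enumerate(rope_freqs(head_dim, base)):
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        # Standard 2D rotation of the pair (x0, x1) by angle pos * theta.
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Hypothetical toy query/key vectors with head_dim = 4.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Both pairs of positions have the same offset (3), so the scores should match.
s1 = dot(apply_rope(q, 5), apply_rope(k, 2))
s2 = dot(apply_rope(q, 13), apply_rope(k, 10))
print(s1, s2)
```

The two scores agree up to floating-point error, which is the property that makes the attention logits depend on relative rather than absolute position.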

Note:

I do not conditionally remove the allocation of our learnable position embedding buffer (wpe) as that would require touching many parts of the codebase that rely on the particular order inside the parameter buffer (e.g. wpe has index 1).
I do turn off fwd/bwd computation / grad norm computation / update for the wpe buffer.
The explicit tradeoff is: suffer a minimal memory bloat (maxT * C) but the PR has minimal impact on the readability of the codebase.
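
To put the maxT * C figure in perspective, here is a rough back-of-the-envelope calculation. It assumes the standard GPT-2 124M shapes (maxT = 1024, C = 768) and bf16 parameter storage; the precision is an assumption, and the number covers one copy of the buffer only.

```python
# Assumed GPT-2 124M shapes: maxT = 1024 positions, C = 768 channels.
max_t, channels = 1024, 768
bytes_per_param = 2  # assumption: parameters stored in bf16

wpe_elems = max_t * channels
wpe_bytes = wpe_elems * bytes_per_param
share_of_model = wpe_elems / 124e6  # rough fraction of ~124M parameters

print(f"wpe elements: {wpe_elems:,}")
print(f"wpe size: {wpe_bytes / 2**20:.2f} MiB")
print(f"share of parameters: {share_of_model:.2%}")
```

Under these assumptions the unused buffer is well under 1% of the model's parameters.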

Tests:
I ran an A/B experiment: trained a 124M GPT-2 on 10B tokens (FineWeb subset) with:
a) learnable positional embeddings (default, -er == 0)
b) RoPE (-er == 1)
c) no positional embedding at all
all other settings being the same.

Results:

Conclusions:

The validation loss is significantly better with RoPE.
The RoPE implementation slightly decreased throughput (consistent with what EleutherAI folks observed). I observed a drop from ~1_632_000 to ~1_603_000 tok/s (~1.8% perf hit).
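
The size of the hit can be checked directly from the two throughput numbers. The sketch below also estimates the extra wall-clock time over the 10B-token run, assuming throughput stays constant for the whole run (an assumption, not something measured here).

```python
baseline_tok_s = 1_632_000  # learnable position embeddings
rope_tok_s = 1_603_000      # RoPE enabled
total_tokens = 10_000_000_000  # 10B-token run

# Relative slowdown in tokens per second.
slowdown = (baseline_tok_s - rope_tok_s) / baseline_tok_s

# Extra seconds needed to process the same number of tokens.
extra_seconds = total_tokens / rope_tok_s - total_tokens / baseline_tok_s

print(f"throughput drop: {slowdown:.2%}")
print(f"extra time for 10B tokens: {extra_seconds:.0f} s")
```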

karpathy · 2024-07-30T20:22:13Z

llmc/encoder.cuh

@@ -17,7 +17,7 @@ In the backward pass, the gradients flow to both, handled by different kernels
 // CUDA kernels

 __global__ void encoder_forward_kernel3(floatX* out,
-                               const int* inp, const floatX* wte, const floatX* wpe,
+                               const int* inp, const floatX* wte, const floatX* wpe, int use_rope,


I don't like this because use_rope has nothing to do with this encoding function, which we want to be nice and modular and self-contained. Maybe something like use_positional or something like that.

agree, can refactor a few of those bits quickly
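
A sketch of the suggested direction, written in plain Python rather than CUDA to keep it short: the encoder only knows whether to add learned position embeddings, and the caller derives that from the RoPE setting. The function signature and the toy tables are hypothetical and are not the refactor that landed in the PR.

```python
def encoder_forward(inp, wte, wpe, use_positional=True):
    """Token embedding plus optional learned position embedding for one sequence."""
    out = []
    for t, tok in enumerate(inp):
        row = list(wte[tok])
        if use_positional:
            # Add the learned embedding for position t.
            row = [a + b for a, b in zip(row, wpe[t])]
        out.append(row)
    return out

# Hypothetical toy tables: 3 tokens, 2 positions, 2 channels.
wte = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
wpe = [[1.0, 1.0], [2.0, 2.0]]

use_rope = True  # the caller's setting; the encoder never sees it
out_rope = encoder_forward([2, 0], wte, wpe, use_positional=not use_rope)
out_learned = encoder_forward([2, 0], wte, wpe, use_positional=True)
print(out_rope)
print(out_learned)
```

Keeping the RoPE decision at the call site leaves the encoder self-contained, which is the modularity concern raised in the review.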

OwenSanzas · 2024-10-08T15:03:30Z

I tried this branch on HPRC, but it did not reach the lowest loss you reported.

#!/bin/bash
#SBATCH --job-name=gpt2_train
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --time=26:00:00 #Request 26 hours (2 extra hours)
#SBATCH --mem=128GB #Request 128GB per node
#SBATCH --partition=gpu #Request the GPU partition/queue
#SBATCH --gres=gpu:a100:1 #Request one A100 GPU to use

#SBATCH --output=gpt2_train.%j.log #Redirect stdout/err to file

# Run the training script

./train_gpt2cu -i "dev/data/fineweb10B/fineweb_train_.bin" -j "dev/data/fineweb10B/fineweb_val_.bin" -o log124M -e "d12" -b 32 -t 1024 -d 524288 -r 0 -z 1 -c 0.1 -l 0.0006 -q 0.0 -u 700 -n 5000 -v 250 -s 20000 -h 1
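
For context when comparing runs, the batch-related flags in this command imply a fixed number of gradient accumulation steps per optimizer update. The sketch below assumes -b is the per-GPU micro-batch size, -t the sequence length, and -d the total desired batch size in tokens, with one GPU as requested in the SLURM header.

```python
micro_batch = 32      # -b
seq_len = 1024        # -t
total_batch = 524288  # -d, in tokens
num_gpus = 1          # from --gres=gpu:a100:1

# Tokens processed by one forward/backward pass across all GPUs.
tokens_per_micro_step = micro_batch * seq_len * num_gpus
assert total_batch % tokens_per_micro_step == 0

grad_accum_steps = total_batch // tokens_per_micro_step
print(f"gradient accumulation steps: {grad_accum_steps}")
```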

gordicaleksa added 12 commits July 28, 2024 12:51

242566b Add RoPE - support on cmdline
a36bc62 Add RoPE init kernel
c08540d Use float buffer for rope freqs; tested against python ref
61a0376 Do not use WPE when RoPE enabled
46babfe Add initial RoPE kernel
e90062c Reduce freqs table 2x
8516330 Use x128 loads for RoPE fwd kernel
0f27b28 Implement rope bwd kernel
96222c6 Change default rope value
841e229 Minor refactor + fix fwd enc bug
3fda17b Remove wpe grad communication
7e0c497 Bug fix: missing /2 in freq table in the kernel

karpathy reviewed Jul 30, 2024
