Improving Performance Using a Moving Window Approach? #69
PhilBerent started this conversation in Ideas
Hi Andrej
Thanks so much for the amazing tutorials! I worked through both the "Let's build GPT from scratch" tutorial (the Tiny Shakespeare model) and the "Let's build GPT-2" tutorial.
I found that I could significantly improve the performance of the Tiny Shakespeare model by doing away with positional encoding and instead using a moving window approach (admittedly with slightly more parameters). That is to say, each presented batch has shape B x (2T-1) x C, and each of the T target positions in a block always attends to the previous T-1 elements plus the current element, so there is no need to restrict the attention weights to their lower triangular portion.
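To make the idea concrete, here is a minimal sketch (not Phil's actual code, which is linked at the bottom) of one way such a moving-window head could look in PyTorch: the input block has 2T-1 tokens, keys/values for each of the last T positions come from the full window of T tokens ending at that position, and no causal mask is applied. The learned per-offset embedding `rel_emb` is only an assumption about where the extra parameters might come from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MovingWindowHead(nn.Module):
    """One attention head over a sliding window of size T, no positional encoding, no causal mask."""

    def __init__(self, n_embd, head_size, window):  # window == T (block_size)
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Assumed source of the extra parameters: a learned embedding per relative offset in the window.
        self.rel_emb = nn.Parameter(torch.zeros(window, head_size))
        self.window = window
        self.head_size = head_size

    def forward(self, x):
        # x: (B, 2T-1, C); outputs are produced only for the last T positions.
        B, L, C = x.shape
        T = self.window
        assert L == 2 * T - 1
        k = self.key(x)    # (B, 2T-1, hs)
        q = self.query(x)  # (B, 2T-1, hs)
        v = self.value(x)  # (B, 2T-1, hs)

        # For each target position t = T-1 .. 2T-2, gather the window of T keys/values ending at t.
        # unfold gives (B, hs, T, T); permute to (B, T windows, T window positions, hs).
        k_win = k.transpose(1, 2).unfold(dimension=2, size=T, step=1).permute(0, 2, 3, 1)
        v_win = v.transpose(1, 2).unfold(dimension=2, size=T, step=1).permute(0, 2, 3, 1)
        q_tgt = q[:, T - 1:, :]  # (B, T, hs): queries for the last T tokens only

        k_win = k_win + self.rel_emb  # inject relative position within the window
        # Every query sees a full window of T keys, so no mask is needed.
        att = (q_tgt.unsqueeze(2) * k_win).sum(-1) / self.head_size ** 0.5  # (B, T, T)
        att = F.softmax(att, dim=-1)
        out = (att.unsqueeze(-1) * v_win).sum(2)  # (B, T, hs)
        return out
```

With this layout the model consumes batches of shape B x (2T-1) x C and produces outputs of shape B x T x head_size, matching the targets for the last T positions of each block; how the relative-position information is actually injected in Phil's version may differ.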
Results were as follows after 5000 steps. Both models have 4 layers, each with 4 heads (n_layers=4, n_heads=4, batch_size=16, block_size=32, n_embd=64):

| Model | Train loss | Val loss | Number of parameters |
| --- | --- | --- | --- |
| Original model | 1.66 | 1.82 | 0.209 M |
| Moving window model | 1.49 | 1.69 | 0.267 M |
As the basic structure of GPT-2 is very similar to that of the Tiny Shakespeare model, I would be very interested to test whether the same approach also improves results on a full-size GPT-2 model (or at least one of the size you reviewed in the tutorial).
Before spending the time (and money) to do this, I was hoping that you (or anyone else with thoughts on this) might tell me whether I am missing something obvious and would be embarking on a wild goose chase.
Thanks in advance
Phil
(A messy version of the code is here:)
https://github.com/PhilBerent/TinyShakespearMvWindow/blob/main/TSMvWindow.py