Improving Performance Using a Moving Window Approach? #69
PhilBerent started this conversation in Ideas
Hi Andrej
Thanks so much for the amazing tutorials! I worked through both the "Let's build GPT from scratch" tutorial (the Tiny Shakespeare model) and the "Let's build GPT-2" tutorial.
I found that I could significantly improve the performance of the Tiny Shakespeare model by doing away with positional encoding and instead using a moving window approach (admittedly with slightly more parameters). That is to say, each presented batch has shape B x (2T-1) x C, and each of the T target positions in a block always attends to the previous T-1 elements plus the current element, so there is no need to restrict the attention weights to their lower triangular portion.
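To make the idea concrete, here is a minimal sketch (not Phil's actual code, which is linked at the bottom) of one way such a moving-window head could look in PyTorch: the input block has 2T-1 tokens, keys/values for each of the last T positions come from the full window of T tokens ending at that position, and no causal mask is applied. The learned per-offset embedding `rel_emb` is only an assumption about where the extra parameters might come from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MovingWindowHead(nn.Module):
    """One attention head over a sliding window of size T, no positional encoding, no causal mask."""

    def __init__(self, n_embd, head_size, window):  # window == T (block_size)
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Assumed source of the extra parameters: a learned embedding per relative offset in the window.
        self.rel_emb = nn.Parameter(torch.zeros(window, head_size))
        self.window = window
        self.head_size = head_size

    def forward(self, x):
        # x: (B, 2T-1, C); outputs are produced only for the last T positions.
        B, L, C = x.shape
        T = self.window
        assert L == 2 * T - 1
        k = self.key(x)    # (B, 2T-1, hs)
        q = self.query(x)  # (B, 2T-1, hs)
        v = self.value(x)  # (B, 2T-1, hs)

        # For each target position t = T-1 .. 2T-2, gather the window of T keys/values ending at t.
        # unfold gives (B, hs, T, T); permute to (B, T windows, T window positions, hs).
        k_win = k.transpose(1, 2).unfold(dimension=2, size=T, step=1).permute(0, 2, 3, 1)
        v_win = v.transpose(1, 2).unfold(dimension=2, size=T, step=1).permute(0, 2, 3, 1)
        q_tgt = q[:, T - 1:, :]  # (B, T, hs): queries for the last T tokens only

        k_win = k_win + self.rel_emb  # inject relative position within the window
        # Every query sees a full window of T keys, so no mask is needed.
        att = (q_tgt.unsqueeze(2) * k_win).sum(-1) / self.head_size ** 0.5  # (B, T, T)
        att = F.softmax(att, dim=-1)
        out = (att.unsqueeze(-1) * v_win).sum(2)  # (B, T, hs)
        return out
```

With this layout the model consumes batches of shape B x (2T-1) x C and produces outputs of shape B x T x head_size, matching the targets for the last T positions of each block; how the relative-position information is actually injected in Phil's version may differ.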
Results were as follows after 5000 steps. Both models have 4 layers, each with 4 heads (n_layers=4, n_heads=4, batch_size=16, block_size=32, n_embd=64):

| Model | Train loss | Val loss | Number of parameters |
| --- | --- | --- | --- |
| Original model | 1.66 | 1.82 | 0.209 M |
| Moving window model | 1.49 | 1.69 | 0.267 M |
As the basic structure of GPT-2 is very similar to that of the Tiny Shakespeare model, I would be very interested to test whether the same approach also improves results on a full-size GPT-2 model (or at least one of the size you reviewed in the tutorial).
Before spending the time (and money) to do this, I was hoping that you (or anyone else with thoughts on this) might tell me whether I am missing something obvious and would be embarking on a wild goose chase.
Thanks in advance
Phil
(A messy version of the code is here:)
https://github.com/PhilBerent/TinyShakespearMvWindow/blob/main/TSMvWindow.py