Home
This page is mainly here to share results and more details on various experiments. The main graphs below show all of the currently running experiments in one place:
These are 'diff' plots, in which every run is compared against a single baseline run, in this case the 'GPT2ish' 12 layer model.
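To make the comparison concrete, here is a minimal sketch of how such a diff plot could be produced. The data layout, run names, and the `plot_diffs` helper are assumptions for illustration, not the actual plotting code used for these experiments:

```python
import matplotlib.pyplot as plt

def plot_diffs(runs: dict, baseline: str):
    """Plot each run's loss minus the baseline run's loss at each step.

    Assumes every run's curve is sampled at the same steps as the baseline.
    """
    base_steps, base_losses = runs[baseline]
    for name, (steps, losses) in runs.items():
        if name == baseline:
            continue
        # Negative values mean this run's loss is below the baseline's.
        diffs = [l - b for l, b in zip(losses, base_losses)]
        plt.plot(steps, diffs, label=name)
    plt.axhline(0.0, color="gray", linestyle="--", label=baseline)
    plt.xlabel("training step")
    plt.ylabel("loss difference vs baseline")
    plt.legend()
    plt.show()

# Example usage with made-up loss curves (run names are hypothetical):
runs = {
    "GPT2ish_12L": ([0, 100, 200], [4.0, 3.5, 3.2]),
    "ue_16x_6L": ([0, 100, 200], [4.0, 3.1, 2.6]),
}
plot_diffs(runs, baseline="GPT2ish_12L")
```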
The experiment name gives some useful information. It is broken into `<name>_<layers>L_<context size>_<training set>`. Sometimes an experiment name will also carry a version or other numeric information. In the case of the 'ue' (Unified Embeddings) experiments, the name is broken into `ue_<version>_<internal embedding multiple>`.
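As a concrete illustration of these conventions, here is a hypothetical parsing sketch. The example names and regular expressions are assumptions based only on the patterns described above; real names may carry extra version or numeric fields that this sketch does not handle:

```python
import re

# <name>_<layers>L_<context size>_<training set>
STANDARD = re.compile(r"^(?P<name>.+)_(?P<layers>\d+)L_(?P<context>\d+)_(?P<dataset>.+)$")
# ue_<version>_<internal embedding multiple>
UE = re.compile(r"^ue_(?P<version>\d+)_(?P<multiple>\d+x?)$")

def parse_experiment(name: str) -> dict:
    """Split an experiment name into its fields per the conventions above."""
    m = UE.match(name) or STANDARD.match(name)
    if m is None:
        raise ValueError(f"unrecognized experiment name: {name}")
    return m.groupdict()

# Hypothetical example names, made up for illustration:
for n in ["gpt2ish_12L_1024_openwebtext", "ue_2_16"]:
    print(n, "->", parse_experiment(n))
```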
These plots hopefully make it easy to see that the 16x UE is beating the 8x UE by a large margin, and that the UEs in general are massively ahead of the GPT2ish models, even though the 8x and 16x UE models have exactly the same parameter count and structure as the GPT2ish 6 layer model. The only difference between a UE 6L and a GPT2ish 6L is the UE training. These graphs show the 16x UE learning nearly 10x faster than the equivalent GPT2ish model, and that gap is rapidly increasing. It looks likely that it will learn not only faster but also much deeper; in fact, it may end up learning much faster and deeper than even the 12 layer GPT2ish model despite having half the layers. Much longer training runs are required to really prove this out, but the current results are encouraging.