Thanks for your awesome work! I noticed that there is an optimized weights config called configs/pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json, as shown in the README. Can we consider these domain weights to be the result of the first round of DoReMi?
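(For anyone who wants to inspect those weights directly, something like the following should work. This assumes the config is a flat JSON mapping of domain name to weight; I haven't verified the repo's actual schema, so treat the key layout as a guess.)

```python
import json

# Path taken from the README; the flat {domain_name: weight} layout is an
# assumption, not a verified schema.
path = "configs/pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json"
with open(path) as f:
    weights = json.load(f)

print(weights.get("Pile-CC"))  # expected: 0.13788709, per the discussion below
```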
Comparing with the results shown in the paper, these optimized weights are far from those reported there. For example, the domain weight of Pile-CC is 0.13788709, while the paper reports 0.6057. If 0.13788709 is the result of the first round, then the per-round increase in the Pile-CC domain weight is about 0.028861896, so we can estimate that it would take approximately 21 rounds (0.6057 / 0.028861896 ≈ 21) to converge to 0.6057.
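For concreteness, here is that back-of-envelope estimate as a quick script. It assumes the per-round increase stays constant, which is a strong simplification (the dynamics are almost certainly not linear across rounds):

```python
# Numbers from this thread; the linear-increase assumption is mine.
round1_weight = 0.13788709        # Pile-CC weight after round 1 (released config)
paper_weight = 0.6057             # Pile-CC weight reported in the paper
per_round_increase = 0.028861896  # increase attributed to one round

rounds_needed = paper_weight / per_round_increase
print(f"~{rounds_needed:.0f} rounds to reach {paper_weight}")  # ~21 rounds
```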
P.S. Thanks for your reply in issue #11. I also want to ask: how many rounds are needed for the domain weights to converge on RedPajama?
Thanks.
Yes, you can consider it the result of the first round, for a 50k vocab size (GPT-NeoX tokenizer) and a 120M proxy model; the script for running it is scripts/run_pile.sh. The results in the paper are for a 256k vocab size (a Google-internal tokenizer) and a 280M proxy model, and the dynamics turn out to be different, but that is also the result of one round of DoReMi. The 50k/120M results are more similar to the 1B proxy-model results in the paper.
We only tried 2 rounds starting from uniform domain weights on RedPajama. This paper (https://arxiv.org/abs/2310.06694) uses a variant of DoReMi on RedPajama as well, with a similar resulting data balance (where C4 becomes highest).
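For readers following along, here is a toy sketch of what a "round" means in this discussion, based on the paper's Algorithm 1: each round trains a reference model on the current domain weights, then trains a proxy model with the Group DRO objective, updating per-domain weights at each step from the excess loss and averaging them to produce the next round's weights. Everything below is illustrative only: the function name `doremi_round` is a hypothetical placeholder (not the repo's API), and the excess losses are faked with random numbers since there are no real models here.

```python
import numpy as np

def doremi_round(domain_weights, num_steps=1000, eta=1.0, smoothing=1e-3):
    """One toy DoReMi round (sketch only, not the repo's actual pipeline).

    A real round: (1) train a reference model on `domain_weights`,
    (2) train a proxy model with Group DRO, updating the weights each step
    from the per-domain excess loss, (3) return the per-step weights
    averaged over training as the next round's domain weights.
    """
    k = len(domain_weights)
    weights = np.asarray(domain_weights, dtype=float)
    avg = np.zeros(k)
    for _ in range(num_steps):
        # excess_loss[i] = max(proxy_loss_i - reference_loss_i, 0);
        # faked with random draws here, since there are no real models.
        excess_loss = np.maximum(np.random.randn(k), 0.0)
        # Exponentiated-gradient ascent on the domain weights.
        weights = weights * np.exp(eta * excess_loss)
        weights /= weights.sum()
        # Mix with the uniform distribution for smoothing.
        weights = (1 - smoothing) * weights + smoothing / k
        avg += weights
    return avg / num_steps  # averaged weights = output of the round

# E.g. 2 rounds starting from uniform weights over RedPajama's 7 domains,
# mirroring the setup described above.
weights = np.full(7, 1 / 7)
for _ in range(2):
    weights = doremi_round(weights)
```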