How many rounds do we need to converge domain weights on The Pile? #15

Open
ouyangliqi opened this issue Sep 27, 2023 · 1 comment

@ouyangliqi

Thanks for your awesome work! I noticed that the README lists a set of optimized weights, configs/pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json. Can we consider these domain weights to be the result of the first round of DoReMi?

Compared with the results in the paper, these optimized weights are far from the reported ones. For example, the domain weight of Pile-CC here is 0.13788709, but the paper reports 0.6057. If 0.13788709 is the result of the first round, then the increase in the Pile-CC domain weight is about 0.028861896. Assuming a similar increase in every round, we can estimate that it would take approximately 21 rounds to reach 0.6057.
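The arithmetic behind that estimate can be sketched as below. This is only an illustration using the numbers quoted in this comment: the starting weight is inferred by subtracting the quoted increase from the round-1 weight (it is not read from the repository), and the constant per-round increase is an assumption that DoReMi does not guarantee. The variable names are made up for the example.

```python
# Values quoted in this issue for the Pile-CC domain weight.
round1_weight = 0.13788709     # weight in the released round-1 config
round1_increase = 0.028861896  # quoted increase over the starting weight
paper_weight = 0.6057          # weight reported in the paper

# Starting weight implied by the two numbers above (an inference, not a repo value).
start_weight = round1_weight - round1_increase

# Assumption: every round adds the same increment as round 1.
rounds_from_zero = paper_weight / round1_increase
rounds_from_start = (paper_weight - start_weight) / round1_increase

print(f"implied starting weight: {start_weight:.6f}")
print(f"rounds if counted from zero: {rounds_from_zero:.1f}")
print(f"rounds if counted from the starting weight: {rounds_from_start:.1f}")
```

Dividing the target weight by the increment gives the figure of roughly 21 rounds, while measuring from the implied starting weight gives closer to 17. Either number is only as good as the constant-increment assumption.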

P.S. Thanks for your reply in issue #11. I would also like to ask: how many rounds are needed for the domain weights to converge on RedPajama?

Thanks.

@sangmichaelxie
Owner

Yes, you can consider it to be the result of the first round, for a 50k vocab size (GPT-NeoX tokenizer) and a 120M proxy model. The script for running it is in scripts/run_pile.sh. The results in the paper are for a 256k vocab size (a Google internal tokenizer) and a 280M proxy model, and the dynamics turn out to be different, but the weights in the paper are also the result of 1 round of DoReMi. The 50k/120M results are more similar to the 1B proxy model results in the paper.

We only tried 2 rounds starting from uniform domain weights on RedPajama. This paper (https://arxiv.org/abs/2310.06694) uses a variant of DoReMi on RedPajama as well, with a similar resulting data balance (where C4 becomes highest).
