Code and data for "Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance"
We include the code to reproduce the experiments and figures used to discover the data mixing laws in the following notebooks:
mix_2_domains.ipynb: two training domains, single validation domain
mix_3_domains.ipynb: multiple training domains, single validation domain
mix_5_domains.ipynb: multiple training domains, multiple validation domains
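As a rough illustration of what the two-domain notebook explores, the sketch below fits a validation loss curve to an exponential function of one domain's mixture proportion. It assumes the two-domain form loss(r) = c + k * exp(t * r), a restriction of the exponential law described in the paper. The names `mixing_law` and `fit_mixing_law`, the grid-scan fitting procedure, and the synthetic data are illustrative assumptions, not code or results from the notebooks.

```python
import math


def mixing_law(r, c, k, t):
    """Assumed two-domain form: loss = c + k * exp(t * r), r in [0, 1]."""
    return c + k * math.exp(t * r)


def fit_mixing_law(ratios, losses, n_grid=2000):
    """Fit (c, k, t), assuming k > 0.

    Scans candidate values of c below the smallest observed loss. For each
    candidate, log(loss - c) = log(k) + t * r is linear in r, so k and t
    follow from ordinary least squares. The candidate with the smallest
    squared error in loss space is returned.
    """
    lowest = min(losses)
    span = max(losses) - lowest
    n = len(ratios)
    mean_r = sum(ratios) / n
    var_r = sum((r - mean_r) ** 2 for r in ratios)
    best = None
    for i in range(1, n_grid + 1):
        c = lowest - 2.0 * span * i / n_grid  # strictly below every loss
        ys = [math.log(loss - c) for loss in losses]
        mean_y = sum(ys) / n
        cov = sum((r - mean_r) * (y - mean_y) for r, y in zip(ratios, ys))
        t = cov / var_r
        k = math.exp(mean_y - t * mean_r)
        err = sum((mixing_law(r, c, k, t) - loss) ** 2
                  for r, loss in zip(ratios, losses))
        if best is None or err < best[0]:
            best = (err, c, k, t)
    return best[1], best[2], best[3]


if __name__ == "__main__":
    # Synthetic, noise-free losses generated from made-up parameters.
    ratios = [0.0, 0.25, 0.5, 0.75, 1.0]
    losses = [mixing_law(r, 2.0, 1.5, -3.0) for r in ratios]
    c, k, t = fit_mixing_law(ratios, losses)
    print("fitted c, k, t:", c, k, t)
    print("predicted loss at r = 0.6:", mixing_law(0.6, c, k, t))
```

The grid scan keeps the sketch dependency-free; a gradient-based optimizer would be a more natural choice for real, noisy measurements or for more than two domains.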
Our full prediction pipeline can be reproduced with:
cd pipeline
bash run.sh
@article{ye2024datamixinglaws,
  title={Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance},
  author={Ye, Jiasheng and Liu, Peiju and Sun, Tianxiang and Zhou, Yunhua and Zhan, Jun and Qiu, Xipeng},
  journal={arXiv preprint arXiv:2403.16952},
  year={2024}
}