yegcjs/mixinglaws

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

Code and data for the paper "Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance" (arXiv:2403.16952).

Data Mixing Laws

The following notebooks contain the code to reproduce the experiments and figures used to discover the data mixing laws:

  • mix_2_domains.ipynb: two training domains, single validation domain
  • mix_3_domains.ipynb: multiple training domains, single validation domain
  • mix_5_domains.ipynb: multiple training domains, multiple validation domains
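
As a rough illustration of what fitting a mixing law can look like in the two-domain case, the sketch below assumes an exponential form, L(r) = c + k * exp(t * r), where r is the proportion of one training domain. The function names, the grid-scan fitting procedure, and the synthetic data are assumptions made for this example; they are not taken from the notebooks, which remain the reference.

```python
import math


def mixing_law(r, c, k, t):
    """Assumed law: validation loss at training-domain proportion r."""
    return c + k * math.exp(t * r)


def fit_mixing_law(props, losses, grid=2000):
    """Fit (c, k, t) by scanning c and solving log(L - c) = log(k) + t * r.

    For each candidate c, the remaining two parameters follow from ordinary
    least squares in log space. The candidate with the smallest squared error
    in loss space wins. Assumes positive losses and at least two distinct
    proportions.
    """
    n = len(props)
    mean_r = sum(props) / n
    var_r = sum((r - mean_r) ** 2 for r in props)
    floor = min(losses)  # c must stay strictly below every observed loss
    best_sse, best_params = float("inf"), None
    for i in range(1, grid + 1):
        c = floor * (1 - i / grid)
        ys = [math.log(loss - c) for loss in losses]
        mean_y = sum(ys) / n
        t = sum((r - mean_r) * (y - mean_y) for r, y in zip(props, ys)) / var_r
        k = math.exp(mean_y - t * mean_r)
        sse = sum(
            (mixing_law(r, c, k, t) - loss) ** 2 for r, loss in zip(props, losses)
        )
        if sse < best_sse:
            best_sse, best_params = sse, (c, k, t)
    return best_params


# Synthetic data generated from known parameters (not experimental results).
true_c, true_k, true_t = 1.5, 2.0, -3.0
props = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
losses = [mixing_law(r, true_c, true_k, true_t) for r in props]

c, k, t = fit_mixing_law(props, losses)
predicted = mixing_law(0.5, c, k, t)
print(f"fitted c={c:.3f}, k={k:.3f}, t={t:.3f}; predicted loss at r=0.5: {predicted:.3f}")
```

The scan over c keeps the example dependency-free; a nonlinear least-squares routine would be a natural replacement in practice.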

Prediction Pipeline

To reproduce our full prediction pipeline, run:

cd pipeline
bash run.sh
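
The script itself is the reference for what the pipeline does. As a loose illustration of the idea in the paper title (choosing a mixture by minimizing a predicted loss), the sketch below scans a grid over the mixture simplex and picks the proportions with the lowest weighted predicted loss. The law form c + k * exp(sum_j t[j] * r[j]), every parameter value, and all function names are hypothetical and chosen only for this example.

```python
import math
from itertools import product


def predicted_loss(mix, c, k, t):
    """Assumed law for one validation domain: c + k * exp(sum_j t[j] * mix[j])."""
    return c + k * math.exp(sum(tj * rj for tj, rj in zip(t, mix)))


def total_loss(mix, laws, weights):
    """Weighted sum of the predicted losses over all validation domains."""
    return sum(w * predicted_loss(mix, *law) for w, law in zip(weights, laws))


def simplex_grid(n_domains, steps):
    """Yield every mixture whose proportions are multiples of 1/steps and sum to 1."""
    for counts in product(range(steps + 1), repeat=n_domains - 1):
        rest = steps - sum(counts)
        if rest >= 0:
            yield tuple(x / steps for x in counts) + (rest / steps,)


def best_mixture(laws, weights, n_domains, steps=20):
    """Return the grid mixture with the lowest weighted predicted loss."""
    return min(
        simplex_grid(n_domains, steps),
        key=lambda mix: total_loss(mix, laws, weights),
    )


# Hypothetical fitted parameters (c, k, t) for three validation domains.
laws = [
    (1.0, 1.5, (-2.0, -0.5, 0.1)),
    (1.2, 1.0, (-0.3, -1.5, 0.0)),
    (0.8, 2.0, (0.2, -0.4, -2.5)),
]
weights = [1 / 3, 1 / 3, 1 / 3]

mix = best_mixture(laws, weights, n_domains=3)
print("selected mixture:", mix)
print("predicted weighted loss:", round(total_loss(mix, laws, weights), 4))
```

An exhaustive grid is only practical for a handful of domains; with more domains a constrained optimizer or random sampling of the simplex would be a more realistic choice.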

Citation

@article{ye2024datamixinglaws,
  title={Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance},
  author={Ye, Jiasheng and Liu, Peiju and Sun, Tianxiang and Zhou, Yunhua and Zhan, Jun and Qiu, Xipeng},
  journal={arXiv preprint arXiv:2403.16952},
  year={2024}
}
