InstructLab by IBM: https://research.ibm.com/blog/LLM-generated-data

Cosmopedia and synthetic datasets https://huggingface.co/blog/cosmopedia

Hugging Face synthetic datasets https://huggingface.co/blog/davanstrien/self-instruct

Repository with synthetic datasets https://github.com/davanstrien/awesome-synthetic-datasets

YouTube video on StarCoder and StarCoder2 https://www.youtube.com/watch?v=IyI8pXbQzbw

GitHub repo for collection of resources about synthetic data https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data

Synthetic data generation is not a new technique in the AI world. Early methods relied on statistical techniques like bootstrapping, smoothing, and imputation. With the advent of machine learning, more sophisticated methods emerged in the 2010s: Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). Now, with the advent of LLMs, synthetic data generation has advanced further with newer and more effective techniques. This GitHub repository (https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data) is a true gem. It offers a great selection of resources, such as surveys of methods, blog posts, and papers to read when working on use cases like math reasoning, code generation, and vision and language.
