InstructLab by IBM: https://research.ibm.com/blog/LLM-generated-data
Cosmopedia and synthetic datasets: https://huggingface.co/blog/cosmopedia
Hugging Face synthetic datasets: https://huggingface.co/blog/davanstrien/self-instruct
Repository with synthetic datasets: https://github.com/davanstrien/awesome-synthetic-datasets
YouTube video for StarCoder and StarCoder2: https://www.youtube.com/watch?v=IyI8pXbQzbw
GitHub repo with a collection of resources about synthetic data: https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data
Synthetic data generation is not a new technique in the AI world. Early methods relied on statistical techniques such as bootstrapping, smoothing, and imputation. With the rise of machine learning, more sophisticated methods emerged in the 2010s: Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). Now, with the advent of LLMs, synthetic data generation has advanced further with new and more effective techniques. This GitHub repository (https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data) is a true gem. It offers a great selection of resources, including surveys of methods, blog posts, and papers to read when working on use cases such as math reasoning, code generation, and vision and language.