InstructLab by IBM: https://research.ibm.com/blog/LLM-generated-data

Cosmopedia and synthetic datasets https://huggingface.co/blog/cosmopedia

Hugging Face synthetic datasets https://huggingface.co/blog/davanstrien/self-instruct

Repository with synthetic datasets https://github.com/davanstrien/awesome-synthetic-datasets

YouTube video on StarCoder and StarCoder2 https://www.youtube.com/watch?v=IyI8pXbQzbw

GitHub repo with a collection of resources about synthetic data https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data

Synthetic data generation is not a new technique in the AI world. Early methods relied on statistical techniques such as bootstrapping, smoothing, and imputation. In the 2010s, machine learning brought more sophisticated generative methods, notably Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). With the advent of LLMs, synthetic data generation has advanced further. This GitHub repository (https://github.com/wasiahmad/Awesome-LLM-Synthetic-Data) is a true gem. It offers a great selection of resources, including method surveys, blog posts, and papers relevant to use cases such as math reasoning, code generation, and vision and language.