git clone https://github.com/voidful/awesome-chatgpt-dataset.git && cd awesome-chatgpt-dataset/mixed/dataset
Pick whichever dataset you want to use, then merge and upload:
python preprocess.py your_dataset_name_to_HuggingFaceHub
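
The merge step above can be pictured roughly as follows. This is a minimal sketch, not the repository's `preprocess.py`: the unified `instruction`/`input`/`output` schema, the `question`/`answer` source format, and the helper names `normalize_alpaca`, `normalize_qa`, and `merge` are all assumptions made for illustration. It only shows the general idea of mapping differently shaped records onto one schema and concatenating them while skipping empty and exact-duplicate rows. Uploading is left out.

```python
from typing import Dict, Iterable, List

# Hypothetical unified schema; the real preprocess.py may use other field names.
UNIFIED_FIELDS = ("instruction", "input", "output")


def normalize_alpaca(record: Dict[str, str]) -> Dict[str, str]:
    """Map an Alpaca-style record onto the unified schema."""
    return {field: record.get(field, "").strip() for field in UNIFIED_FIELDS}


def normalize_qa(record: Dict[str, str]) -> Dict[str, str]:
    """Map a question/answer record (assumed format) onto the unified schema."""
    return {
        "instruction": record["question"].strip(),
        "input": "",
        "output": record["answer"].strip(),
    }


def merge(*sources: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Concatenate normalized rows, skipping empty and exact-duplicate rows."""
    seen = set()
    merged_rows = []
    for source in sources:
        for row in source:
            key = tuple(row[field] for field in UNIFIED_FIELDS)
            if not row["instruction"] or not row["output"] or key in seen:
                continue
            seen.add(key)
            merged_rows.append(row)
    return merged_rows


# Toy records standing in for two downloaded datasets.
alpaca_rows = [{"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}]
qa_rows = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the capital of France?", "answer": "Paris"},  # duplicate
]

merged = merge(map(normalize_alpaca, alpaca_rows), map(normalize_qa, qa_rows))
print(len(merged))
```

Keeping one small normalizer per source format means a new dataset only needs its own mapping function; the merge logic stays unchanged.
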
Dataset Name | Size | Languages | Source | License |
---|---|---|---|---|
TheoremQA | 1K | English | We annotated 800 QA pairs covering 350+ theorems spanning across Math, EE&CS, Physics and Finance. | mit |
lima | 1K | English | LIMA: Less Is More for Alignment | CC BY-NC-SA |
im-feeling-curious | 3K | English | This public dataset is an extract from Google's "i'm feeling curious" feature. To learn more about this feature, search for "i'm feeling curious" on Google. | - |
Puffin | 3K | English | Puffin dataset. Exactly 3,000 examples with each response created using GPT-4. | apache-2.0 |
cc_sbu_align | 4K | English | MiniGPT-4 dataset | BSD 3-Clause License |
qa_feedback | 4K | English | We reconstruct the ASQA data and collect human feedback for it. We name the resulting dataset qa-feedback. | - |
SLF5K | 5K | English | The Summarization with Language Feedback (SLF5K) dataset is an English-language dataset containing 5K unique samples that can be used for the task of abstractive summarization. | apache-2.0 |
blended_skill_talk | 7K | English | A dataset of 7k conversations explicitly designed to exhibit multiple conversation modes: displaying personality, having empathy, and demonstrating knowledge. | - |
GSM-IC | 8K | English | Grade-School Math with Irrelevant Context (GSM-IC) | - |
ChatAlpaca | 10K | English | The data currently contain a total of 10,000 conversations with 95,558 utterances. | Apache-2.0 license |
PKU-SafeRLHF-10K | 10K | English | PKU-SafeRLHF-10K, which is the first dataset of its kind and contains 10k instances with safety preferences. | - |
Dolly | 15K | English | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. | CC BY-SA 3.0 |
WebGPT | 20K | English | This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. | - |
Code Alpaca | 20K | English | Code generation task involving 20,022 samples | - |
openapi-function-invocations-25k | 25K | English | The construction of this dataset involved a systematic procedure combining manual extraction and AI-assisted synthesis. | mit |
LongForm | 28K | English | The LongForm dataset is created by leveraging English corpus examples with augmented instructions. | The LongForm project is subject to an MIT License with custom limitations for restrictions imposed by OpenAI (for the instruction generation part), as well as the license of language models (OPT, LLaMA, and T5). |
chatbot_arena_conversations | 33K | English | This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. | - |
HC3 | 37K | English, Chinese | 37,175 instructions generated by ChatGPT and human | - |
Anthropic_HH_Golden | 45K | English | This repository contains a new preference dataset extending the harmless dataset of Anthropic's Helpful and Harmless (HH) datasets. The original positive responses in HH are generated by a supervised fine-tuned model of Anthropic, where harmful and unhelpful responses are frequently encountered. In this dataset, the positive responses are replaced by rewritten responses generated by GPT-4. | - |
Mol-Instructions | 48K | English | An open, large-scale biomolecular instruction dataset for large language models. | CC BY 4.0 |
RefGPT | 50K | English, Chinese | We introduce a cost-effective method called RefGPT, which generates a vast amount of high-quality multi-turn Q&A content. | - |
arxiv-math-instruct-50k | 50K | English | Dataset consists of question-answer pairs derived from ArXiv abstracts from math categories | - |
arxiv-math-instruct-50k | 51K | English | The "ArtifactAI/arxiv-math-instruct-50k" dataset consists of question-answer pairs derived from ArXiv abstracts from the math categories. Questions are generated using the t5-base model, while the answers are generated using the GPT-3.5-turbo model. | - |
Traditional Chinese Alpaca Dataset | 52K | Traditional Chinese | Translated from Alpaca Data by ChatGPT API | Apache-2.0 license |
Cabrita Dataset | 52K | Portuguese | Translated from Alpaca Data | - |
Japanese Alpaca Dataset | 52K | Japanese | Translated from Alpaca Data by ChatGPT API | CC BY-NC 4.0; OpenAI terms of use |
Alpaca Dataset | 52K | English | 175 seed instructions by OpenAI API | CC BY-NC 4.0; OpenAI terms of use |
Alpaca Data Cleaned | 52K | English | Revised version of Alpaca Dataset | - |
Alpaca GPT-4 Data | 52K | English | Generated by GPT-4 using Alpaca prompts | - |
Alpaca GPT-4 Data (Chinese) | 52K | Chinese | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - |
Dynosaur | 66K | English | Dynosaur, a dynamic growth paradigm for instruction-tuning data curation. | Apache-2.0 license |
Finance | 69K | English | 68,912 financial related instructions | - |
evol | 70K | English | This is the training data of WizardLM. | - |
Vicuna Dataset | 75K | English | ~100k ShareGPT conversations | - |
InstructionTranslation | 80K | Multi-lingual | Translations were generated by M2M 12B and the output generations were limited at 512 tokens due to VRAM limit (40G). | MIT |
Self-Instruct | 82K | English | We release a dataset that contains 52k instructions, paired with 82K instance inputs and outputs. | - |
OASST1 | 89K | Multi-lingual | A human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. | apache-2.0 |
HH-RLHF | 91K | English | The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. | MIT |
Guanaco Dataset | 98K | English, Simplified Chinese, Traditional Chinese HK & TW, Japanese | 175 tasks from the Alpaca model | GPLv3 |
InstructionWild | 104K | English, Chinese | 429 seed instructions, expanded to 52K following the Alpaca approach | Research only; OpenAI terms of use |
Camel Dataset | 107K | Multi-lingual | Role-playing between AIs (OpenAI API) | - |
Tapir-Cleaned | 117K | English | This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning. | CC BY-NC 4.0 |
WizardLM_evol_instruct_V2_196k | 143K | English | This dataset contains a 143K mixture of evolved data from Alpaca and ShareGPT. | - |
LLaVA Visual Instruct | 150K | English | LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability. | cc-by-nc-4.0 |
Prosocial Dialog | 166K | English | 165,681 instructions produced from GPT-3-rewritten questions and human feedback | - |
COIG | 191K | Chinese | Chinese Open Instruction Generalist (COIG) project to maintain a harmless, helpful, and diverse set of Chinese instruction corpora. | apache-2.0 |
orca-chat | 198K | English | This is a cleaned, pruned, and clustered version of orca to form a conversation-style dataset. The process involves removing samples with very high similarity and also grouping instructions to form conversations. | - |
Unnatural Instructions | 241K | English | A large dataset of creative and diverse instructions, collected with virtually no human labor. | MIT |
SHP | 358K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. | Reddit non-exclusive, non-transferable, non-sublicensable, and revocable license |
dromedary | 361K | English | Dromedary-Verbose-Clone is a synthetic dataset of 360k instructions and demonstrations. | cc-by-nc-4.0 |
ultrachat | 404K | English | To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response. | cc-by-nc-4.0 |
ign_clean_instruct_dataset_500k | 509K | English | This dataset contains ~508k prompt-instruction pairs with high quality responses. It was synthetically created from a subset of Ultrachat prompts. It does not contain any alignment focused responses or NSFW content. | apache-2.0 |
ELI5 | 559K | English | The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. | - |
GPT4All Dataset | 806K | Multi-lingual | Subset of LAION OIG, Stack Overflow questions, and the BigScience/P3 dataset. Answered by OpenAI API. | - |
Instruct | 889K | English | 888,969 English instructions, augmentation using AllenAI NLP tools | MIT |
MOSS | 1M | Chinese | Generated by gpt-3.5-turbo | Apache-2.0, AGPL-3.0 licenses |
LaMini-Instruction | 3M | English | A total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing resources of prompts | cc-by-nc-4.0 |
OpenOrca | 3M | English | The OpenOrca dataset is a collection of augmented FLAN Collection data. Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions. | - |
Natural Instructions | 5M | Multi-lingual | 5,040,134 instructions collected from diverse NLP tasks | - |
BELLE | 10M | Chinese | The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields. | Research only; OpenAI terms of use |
Firefly | 1.6M | Chinese | 1,649,398 Chinese instructions in 23 NLP tasks | - |
OIG-43M Dataset | 43M | Multi-lingual | Together, LAION, and Ontocord.ai. | - |
xP3 | 79M | Multi-lingual | 78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks | - |
CodeParrot | - | Python | The database was queried for all Python files under 1MB in size, resulting in a 180GB dataset with over 20M files. | - |
Alpaca-CoT Dataset | - | Multi-lingual | Instruction Data Collection | ODC-By |
stack-exchange-paired | - | English | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | cc-by-sa-4.0 |
LangChainDatasets | - | English | This is a community-driven dataset repository for datasets that can be used to evaluate LangChain chains and agents. | - |
ParlAI | - | English | 100+ popular datasets available all in one place, dialogue models, from open-domain chitchat, to task-oriented dialogue, to visual question answering. | - |
GPTeacher | - | English | A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer | - |
silk-road/Wizard-LM-Chinese-instruct-evol | - | Chinese | Wizard-LM-Chinese | - |
MultiWOZ | - | English | Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics. | apache-2.0 |
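
When choosing a dataset from the table above, it can help to filter rows by language and license programmatically. The snippet below is a small sketch under stated assumptions: the `TABLE` string is a shortened excerpt of the table (descriptions abbreviated), and `parse_size`, `parse_rows`, and `select` are hypothetical helper names, not part of this repository. License matching here is a plain case-insensitive string comparison, so it does not replace reading the actual license terms.

```python
from typing import Any, Dict, List, Optional, Set

SIZE_SUFFIXES = {"K": 1_000, "M": 1_000_000}

# Shortened excerpt of the table; descriptions are abbreviated for the example.
TABLE = """
Dataset Name | Size | Languages | Source | License |
---|---|---|---|---|
TheoremQA | 1K | English | 800 QA pairs covering 350+ theorems | mit |
Puffin | 3K | English | Puffin dataset | apache-2.0 |
HH-RLHF | 91K | English | Helpful and harmless assistant data | MIT |
COIG | 191K | Chinese | Chinese Open Instruction Generalist | apache-2.0 |
MultiWOZ | - | English | Multi-Domain Wizard-of-Oz dataset | apache-2.0 |
"""


def parse_size(text: str) -> Optional[int]:
    """Convert '191K' or '3M' to an integer; return None for '-' or empty cells."""
    text = text.strip()
    if text in ("", "-"):
        return None
    suffix = text[-1].upper()
    if suffix in SIZE_SUFFIXES:
        return round(float(text[:-1]) * SIZE_SUFFIXES[suffix])
    return int(text)


def parse_rows(table: str) -> List[Dict[str, Any]]:
    """Parse the pipe-separated table, skipping the header and separator rows."""
    rows = []
    for line in table.strip().splitlines()[2:]:
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 5:
            continue  # ignore rows that do not have the expected five columns
        name, size, languages, source, license_name = cells
        rows.append({
            "name": name,
            "size": parse_size(size),
            "languages": languages,
            "source": source,
            "license": license_name,
        })
    return rows


def select(rows: List[Dict[str, Any]], licenses: Set[str], language: str) -> List[str]:
    """Return dataset names whose license and language match (case-insensitive)."""
    return [
        row["name"]
        for row in rows
        if row["license"].lower() in licenses
        and language.lower() in row["languages"].lower()
    ]


rows = parse_rows(TABLE)
print(select(rows, {"mit", "apache-2.0"}, "English"))
```
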