TL;DR: this repository maintains a community effort to create a large collection of tasks and their natural-language definitions/instructions. We're looking for more contributions to grow this dataset! 🙌 We invite submission of new tasks to this benchmark via GitHub pull requests through September 15, 2021. Contributors who make meaningful contributions to our tasks will be included as co-authors on a paper that will announce the benchmark as well as analysis/results on it.
While the current dominant paradigm (supervised learning with task-specific labeled examples) has been successful in building task-specific models, such models can't generalize to unseen tasks; for example, a model that is supervised to solve question answering cannot solve a classification task. We hypothesize that a model equipped with understanding and reasoning over natural language instructions should be able to generalize to any task that can be defined in terms of natural language.
In our earlier effort, we built a smaller dataset (61 tasks) and
observed that language models benefit from language instructions, i.e., their generalization to unseen tasks improves when they are provided with instructions.
Also, generalization to unseen tasks improves as the model is trained on more tasks.
We believe that our earlier work only scratches the surface, and there is probably much more that can be studied in this setup. We hope to put together a much larger dataset that covers a wider range of reasoning abilities. We believe that this expanded dataset will serve as a useful playground for the community to study and build the next generation of AI/NLP models.
Each task consists of input/output instances. For example, consider the task of sentiment classification:
- Input:
I thought the Spiderman animation was good, but the movie disappointed me.
- Output:
Mixed
Here is another example from the same task:
- Input:
The pumpkin was one of the worst that I've had in my life.
- Output:
Negative
Additionally, each task contains a task definition:
Given a tweet, classify it into one of 4 categories: Positive, Negative, Neutral, or Mixed.
Overall, each task follows this schema:
Or, if you're comfortable with json files, here is what it would look like:
{
"Contributors": [""],
"Source": [""],
"Categories": [""],
"Definition": "",
"Positive Examples": [ { "input": "", "output": "", "explanation": ""} ],
"Negative Examples": [ { "input": "", "output": "", "explanation": ""} ],
"Instances": [ { "input": "", "output": [""]} ],
}
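If it helps to see the schema in action, here is a minimal sketch (in Python) of assembling a task dictionary with these fields and writing it out with indentation. The contributor, dataset, and file names below are placeholders, not values from this repository:

```python
import json

# Minimal sketch of a task that follows the schema above.
# All concrete values (contributor, source, file name) are placeholders.
task = {
    "Contributors": ["Jane Doe"],
    "Source": ["my_dataset"],
    "Categories": ["Classification"],
    "Definition": "Given a tweet, classify it into one of 4 categories: Positive, Negative, Neutral, or Mixed.",
    "Positive Examples": [
        {
            "input": "I thought the Spiderman animation was good, but the movie disappointed me.",
            "output": "Mixed",
            "explanation": "The review praises one aspect and criticizes another, so the sentiment is mixed.",
        }
    ],
    "Negative Examples": [
        {
            "input": "The pumpkin was one of the worst that I've had in my life.",
            "output": "Positive",
            "explanation": "The review is clearly negative, so labeling it Positive is wrong.",
        }
    ],
    "Instances": [
        {"input": "The pumpkin was one of the worst that I've had in my life.", "output": ["Negative"]}
    ],
}

# indent=4 keeps the file human-readable, as requested in the contribution guidelines below.
with open("task000_my_dataset_classification.json", "w") as f:
    json.dump(task, f, indent=4)
```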
We would appreciate any external contributions! 🙏
- All tasks must be submitted via GitHub pull requests. These submissions will undergo a review before being merged.
- Each task must contain a `.json` file with the task content. You can look inside the `tasks/` directory for several examples.
- Make sure that your json is human-readable (use proper indentation; e.g., in Python: `json.dumps(your_json, indent=4)`).
- Make sure that your json file is not bigger than 50MB.
- Make sure your task has no more than 6.5k instances (input/output pairs). (A quick, unofficial way to check these size constraints locally is sketched after this list.)
- Make sure to number your task json correctly (look at the task number in the latest pull request; the task number in your submission should be the next number). Make sure to include the source dataset name and the task category name when creating the json file name. You can use this format: taskabc_
- Make sure to create a pull request after creating all possible tasks from a dataset. You should have one pull request per dataset. Name your pull request as Task <start_task_number>-<end_task_number>: e.g. Task 101-107: SQuAD Dataset.
- If you're building your tasks based on existing datasets and their crowdsourcing templates, see these guidelines.
- Add your task to our list of tasks.
- To make sure that your addition is formatted correctly, run the tests:
> python src/test_all.py
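Beyond the repository tests, you may want to sanity-check the size constraints yourself before opening a pull request. The snippet below is a rough, unofficial sketch (the file name is a placeholder) that checks the 50MB file-size limit and the 6.5k instance cap, and re-writes the file with proper indentation:

```python
import json
import os

# Unofficial pre-submission sanity check; the file name is a placeholder.
path = "task000_my_dataset_classification.json"

# 1. The json file must not be bigger than 50MB.
size_mb = os.path.getsize(path) / (1024 * 1024)
assert size_mb <= 50, f"File is {size_mb:.1f}MB; the limit is 50MB."

with open(path) as f:
    task = json.load(f)

# 2. At most 6.5k instances (input/output pairs).
assert len(task["Instances"]) <= 6500, "Too many instances; cap the file at 6500."

# 3. Re-write with indentation so the file stays human-readable.
with open(path, "w") as f:
    json.dump(task, f, indent=4)
```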
If you have any questions or suggestions, please use the issues feature.
Yes! We welcome submissions in any language.
Yes! Just make sure that the quality of the instructions is good enough for a human to understand the task based on the instructions alone. You can take a different route than the guidelines.
Anything north of 100 is a safe number; the more, the merrier! Also, you should not have more than 6500 instances. Make sure to shuffle the instances before selecting 6500 of them. In the case of classification tasks, make sure that the instances and positive examples are not skewed towards one class.
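As a rough illustration of that advice (not an official script; the file name is a placeholder), one could shuffle before capping at 6500 and then eyeball the label distribution:

```python
import json
import random
from collections import Counter

# Illustrative only; the file name is a placeholder.
with open("task000_my_dataset_classification.json") as f:
    task = json.load(f)

# Shuffle before selecting, so the kept 6500 instances are a random sample.
random.seed(0)
random.shuffle(task["Instances"])
task["Instances"] = task["Instances"][:6500]

# For classification tasks, check that labels are not skewed towards one class.
label_counts = Counter(label for inst in task["Instances"] for label in inst["output"])
print(label_counts)
```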
If you have contributed at least 25 tasks to the repository or you're among the top 20 contributors, we view that as a meaningful contribution. This also involves some lightweight responsibilities, such as reviewing pull requests.
Make sure that your email is set in your git environment and is also mentioned in your GitHub profile. See this and this.