SemEval 2016/2017 Task 3, English Subtask A unannotated datasets and English Subtask B datasets #18
Comments
Nice! :)
@menshikh-iv I pushed an updated semeval-2016_2017-task3-subtaskB-english.json.gz, which now contains the |
Can this dataset be used directly as shown below?
I am getting strange output when checking the vocab.
@AMR-KELEG The dataset is not a corpus. You will need to extract the text data you are interested in:
import gensim
import gensim.downloader as api
questions = api.load('semeval-2016-2017-task3-subtaskA-unannotated')
corpus = [question["RelQuestion"]["RelQBody"] for question in questions]
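The extracted bodies are still raw strings, while gensim vocabulary builders such as `gensim.corpora.Dictionary` expect one list of tokens per document. The following is a minimal sketch of that extra step, not the maintainers' method. The helper name `extract_question_bodies`, the regex tokenizer, the skipping of empty bodies, and the sample records are all assumptions made for illustration; only the `RelQuestion` / `RelQBody` keys come from the snippet above.

```python
import re


def extract_question_bodies(records):
    """Yield one list of lowercase tokens per record with a non-empty RelQBody.

    Hypothetical helper: `records` is any iterable of dicts shaped like the
    items returned by api.load(...) in the comment above.
    """
    for record in records:
        body = record["RelQuestion"]["RelQBody"]
        if not body:
            # Assumption: some question bodies may be empty or None; skip them.
            continue
        # Simple regex tokenizer used here as a stand-in for a real tokenizer.
        yield re.findall(r"\w+", body.lower())


# Hypothetical stand-in records so the sketch runs without downloading anything.
# In practice, pass the iterable returned by api.load(...) instead.
sample_records = [
    {"RelQuestion": {"RelQBody": "Where can I buy a good bike in Doha?"}},
    {"RelQuestion": {"RelQBody": ""}},
]

tokenized = list(extract_question_bodies(sample_records))
print(tokenized)
```

Passing lists of tokens rather than raw strings matters because a plain string is iterated character by character, which is one common cause of a vocabulary made of single letters.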
Introduction
I converted the SemEval 2016 and 2017 question answering datasets into JSON for ease of use. The original datasets are in XML and scattered across several ZIP archives. The JSON files will be used right away in the Gensim documentation for the Soft Cosine Measure (see the respective pull request).
Description
Datasets
Papers
Code
License
These are the licensing notices found in the individual ZIP files with the original XML datasets: