SemEval 2016/2017 Task 3, English Subtask A unannotated datasets and English Subtask B datasets #18

Witiko · 2018-01-30T18:10:03Z

Introduction

I converted the SemEval 2016 and 2017 question answering datasets into JSON for ease of use. The original datasets are in XML and scattered across several ZIP archives. The JSON files will be used right away in the Gensim documentation for the Soft Cosine Measure (see the respective pull request).

Description

Community Question Answering (CQA) forums are gaining popularity online. They are seldom moderated and rather open, so they place few restrictions, if any, on who can post and who can answer a question. On the positive side, this means that one can freely ask any question and expect some good, honest answers. On the negative side, it takes effort to go through all possible answers and to make sense of them. For example, it is not unusual for a question to have hundreds of answers, which makes it very time-consuming for the user to inspect and winnow them. The challenge we propose may help automate the process of finding good answers to new questions in a community-created discussion forum (e.g., by retrieving similar questions in the forum and identifying the posts in the answer threads of those questions that answer the question well).

We build on the success of the previous editions of our SemEval tasks on CQA, SemEval-2015 Task 3 and SemEval-2016 Task 3, and present an extended edition for SemEval-2017, which incorporates several novel facets.

Datasets

semeval-2016_2017-task3-subtaskA-unannotated-english.json.gz (231M) – Example:

    [
        {
            "THREAD_SEQUENCE": "Q1",
            "RelQuestion": {
                "RELQ_CATEGORY": "Politics", 
                "RELQ_DATE": "2009-12-14 11:58:33", 
                "RELQ_ID": "Q1", 
                "RELQ_USERID": "U2", 
                "RELQ_USERNAME": "anonymous", 
                "RelQBody": "The state of Internet in Thailand:IT Minsitry blocks CNN; Facebook; Yahoo; Flickr Thai Immigration website listed as dangerousFull story: http://www.thaivisa.com/forum/Thai-Govt-Blocks-Cnn-Yahoo-Financ-t321851.html", 
                "RelQSubject": "Thailand:IT Minsitry blocks CNN; Facebook;"
            }, 
            "RelComments": [
                {
                    "RELC_DATE": "2009-12-14 12:00:59", 
                    "RELC_ID": "Q1_C1", 
                    "RELC_USERID": "U210", 
                    "RELC_USERNAME": "DaRuDe", 
                    "RelCText": "have they blocked porn??? <img src=\"http://www.qatarliving.com/files/images/Da.gif\">"
                }, 
                {
                    "RELC_DATE": "2009-12-14 12:07:04", 
                    "RELC_ID": "Q1_C2", 
                    "RELC_USERID": "U2", 
                    "RELC_USERNAME": "anonymous", 
                    "RelCText": "like trying to contain a tsunami with a hand towel ************************************ I'm Jack's complete lack of surprise"
                }, 
                {
                    "RELC_DATE": "2009-12-14 12:09:23", 
                    "RELC_ID": "Q1_C3", 
                    "RELC_USERID": "U114", 
                    "RELC_USERNAME": "GodFather.", 
                    "RelCText": "oops double post.. ----------------- \"HE WHO DARES WINS\" Derek Edward Trotter"
                },
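
Each subtask A record above holds one question and its comments, and comment text can contain leftover HTML such as `<img>` tags. The following is a minimal sketch of flattening one thread into plain strings. The sample record is a shortened copy of the example above, and the helper names `strip_tags` and `thread_texts` are hypothetical, not part of the dataset or of Gensim.

```python
import json
import re

# A shortened copy of the first thread from the example above.
SAMPLE = json.loads(r'''
[{"THREAD_SEQUENCE": "Q1",
  "RelQuestion": {"RELQ_ID": "Q1",
                  "RelQSubject": "Thailand:IT Minsitry blocks CNN; Facebook;",
                  "RelQBody": "The state of Internet in Thailand"},
  "RelComments": [
      {"RELC_ID": "Q1_C1",
       "RelCText": "have they blocked porn??? <img src=\"http://www.qatarliving.com/files/images/Da.gif\">"},
      {"RELC_ID": "Q1_C3", "RelCText": "oops double post.."}
  ]}]
''')

TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Replace anything that looks like an HTML tag with a space."""
    return TAG_RE.sub(" ", text).strip()


def thread_texts(thread):
    """Return the question text followed by the text of each comment."""
    question = thread["RelQuestion"]
    # Assumption: a subject or body may be missing or empty in some records.
    subject = question.get("RelQSubject") or ""
    body = question.get("RelQBody") or ""
    texts = [strip_tags(subject + " " + body)]
    texts.extend(strip_tags(c["RelCText"]) for c in thread["RelComments"])
    return texts


texts = thread_texts(SAMPLE[0])
```

The regular expression is a crude tag stripper that is adequate for a quick look at the data; a real HTML parser would be safer for anything beyond that.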

semeval-2016_2017-task3-subtaskB-english.json.gz (6.05M) – Example:

    {
        "2016-dev": [
            {
                "ORGQ_ID": "Q268",
                "OrgQBody": "Which is a good bank as per your experience in Doha",
                "OrgQSubject": "Good Bank",
                "Threads": [
                    {
                        "THREAD_SEQUENCE": "Q268_R4",
                        "RelQuestion": {
                            "RELQ_CATEGORY": "Advice and Help",
                            "RELQ_DATE": "2013-05-02 19:43:00",
                            "RELQ_ID": "Q268_R4",
                            "RELQ_RANKING_ORDER": "4",
                            "RELQ_RELEVANCE2ORGQ": "PerfectMatch",
                            "RELQ_USERID": "U4882",
                            "RELQ_USERNAME": "ankukuma",
                            "RelQBody": "Hi Guys; I need to open a new bank accoount. Which is the best bank in Qatar ? I assume all of them will roughly be the same; but stll which has a slight edge (Money transfer; benifits etc) Thanks !!!",
                            "RelQSubject": "Best Bank"
                        },
                        "RelComments": [
                            {
                                "RELC_DATE": "2013-05-03 07:23:20",
                                "RELC_ID": "Q268_R4_C1",
                                "RELC_RELEVANCE2ORGQ": "Good",
                                "RELC_RELEVANCE2RELQ": "Good",
                                "RELC_USERID": "U594",
                                "RELC_USERNAME": "Dilgeer",
                                "RelCText": "Commercial bank/IBQ"
                            },
                            {
                                "RELC_DATE": "2013-05-03 12:58:13",
                                "RELC_ID": "Q268_R4_C2",
                                "RELC_RELEVANCE2ORGQ": "Good",
                                "RELC_RELEVANCE2RELQ": "Good",
                                "RELC_USERID": "U979",
                                "RELC_USERNAME": "Speedysid",
                                "RelCText": "The best bank in Qatar for you would be the one that fits in your requirements.I suggest you visit the major banks here; and approach the Customer Relations person there to guide you with the facilities the bank offers. They include: -Current Accounts facilities -Savings Account facilities - Money Transfer (However; I highly recommend using the bank transfer only in emergency cases. There are money transfer agents which offer better exchange rates; and lower service fees) - Tie-ups with any bank in your home country to ease transfers"
                            },
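
The subtask B file nests related questions under each original question, grouped by a top-level key such as "2016-dev". As a rough sketch of walking that structure, the code below collects the relevance label of every related question. It runs on a trimmed copy of the example above; the function name `question_pairs` is hypothetical, and only the "2016-dev" key is taken from the example.

```python
# A trimmed copy of the subtask B example above.
SAMPLE = {
    "2016-dev": [
        {
            "ORGQ_ID": "Q268",
            "OrgQSubject": "Good Bank",
            "OrgQBody": "Which is a good bank as per your experience in Doha",
            "Threads": [
                {
                    "THREAD_SEQUENCE": "Q268_R4",
                    "RelQuestion": {
                        "RELQ_ID": "Q268_R4",
                        "RELQ_RELEVANCE2ORGQ": "PerfectMatch",
                        "RelQSubject": "Best Bank",
                    },
                    "RelComments": [
                        {
                            "RELC_ID": "Q268_R4_C1",
                            "RELC_RELEVANCE2RELQ": "Good",
                            "RelCText": "Commercial bank/IBQ",
                        },
                    ],
                }
            ],
        }
    ]
}


def question_pairs(dataset, split):
    """Yield (original question ID, related question ID, relevance label)."""
    for original in dataset[split]:
        for thread in original["Threads"]:
            related = thread["RelQuestion"]
            yield (
                original["ORGQ_ID"],
                related["RELQ_ID"],
                related["RELQ_RELEVANCE2ORGQ"],
            )


pairs = list(question_pairs(SAMPLE, "2016-dev"))
```

The same loop can be extended to the "RelComments" list of each thread to read the per-comment labels "RELC_RELEVANCE2ORGQ" and "RELC_RELEVANCE2RELQ".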

Papers

Code

License

These are the licensing notices found in the individual ZIP files with the original XML datasets:

semeval2016-task3-cqa-ql-traindev-v3.2.zip

These datasets are free for general research use.

semeval2017_task3_test.zip

the scripts and all files released for the task are free for general research use

you should use the following citation in your publications whenever using these resources:

@InProceedings{SemEval-2017:task3,
   author    = {Nakov, Preslav and Hoogeveen, Doris and M\`{a}rquez, Llu\'{i}s and Moschitti, Alessandro and Mubarak, Hamdy and Baldwin, Timothy and Verspoor, Karin},
   title     = {{SemEval}-2017 Task 3: Community Question Answering},
   booktitle = {Proceedings of the 11th International Workshop on Semantic Evaluation},
   series    = {SemEval '17},
   month     = {August},
   year      = {2017},
   address   = {Vancouver, Canada},
   publisher = {Association for Computational Linguistics},
 }


piskvorky · 2018-01-30T20:24:09Z

Nice! :)

Witiko · 2018-02-06T21:13:42Z

@menshikh-iv I pushed an updated semeval-2016_2017-task3-subtaskB-english.json.gz, which now contains the RELQ_RANKING_ORDER field as an integer rather than a string. It is a minor but convenient change.
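
To illustrate why an integer RELQ_RANKING_ORDER is convenient, here is a small sketch that sorts threads by rank. The thread records are cut down to two fields, and the IDs "Q268_R10" and "Q268_R2" are made up for the illustration.

```python
# Cut-down thread records; only Q268_R4 appears in the dataset example.
threads = [
    {"RelQuestion": {"RELQ_ID": "Q268_R10", "RELQ_RANKING_ORDER": 10}},
    {"RelQuestion": {"RELQ_ID": "Q268_R4", "RELQ_RANKING_ORDER": 4}},
    {"RelQuestion": {"RELQ_ID": "Q268_R2", "RELQ_RANKING_ORDER": 2}},
]


def by_rank(thread):
    """Sort key: the rank of the related question."""
    return thread["RelQuestion"]["RELQ_RANKING_ORDER"]


# With integers, sorting gives the numeric order 2, 4, 10.
ranked_ids = [t["RelQuestion"]["RELQ_ID"] for t in sorted(threads, key=by_rank)]

# With strings, the same ranks would sort lexicographically instead.
as_strings = sorted(str(by_rank(t)) for t in threads)
```

With string ranks, "10" sorts before "2", so every consumer would have to convert the field before sorting.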

AMR-KELEG · 2019-08-05T15:20:17Z

Can this dataset be used directly as follows?

import gensim
import gensim.downloader as api

corpus = api.load('semeval-2016-2017-task3-subtaskA-unannotated')
word2vec = gensim.models.Word2Vec(corpus)

I get strange output when I check the vocabulary in word2vec.wv.vocab:

{'RelComments': <gensim.models.keyedvectors.Vocab at 0x7f1740e26a90>,
 'RelQuestion': <gensim.models.keyedvectors.Vocab at 0x7f16fad64cf8>,
 'THREAD_SEQUENCE': <gensim.models.keyedvectors.Vocab at 0x7f173ee5f128>}

Witiko · 2019-08-08T15:36:47Z

@AMR-KELEG The dataset is not a corpus. You will need to extract the text data you are interested in:

import gensim.downloader as api

# Each item is a thread dictionary, not a tokenized document.
questions = api.load('semeval-2016-2017-task3-subtaskA-unannotated')
# Keep only the body text of each question.
corpus = [question["RelQuestion"]["RelQBody"] for question in questions]
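
The vocabulary in the question consisted of dictionary keys because iterating over a Python dictionary yields its keys, so each thread was read as a three-word sentence. The extracted bodies are still plain strings, so they would likely need tokenizing before training. Below is a minimal sketch of both points using only the standard library. The `questions` list is a one-item stand-in for the loaded dataset, and `tokenize` is a hypothetical helper.

```python
import re

# One-item stand-in for the loaded dataset, shortened from the issue's example.
questions = [
    {
        "THREAD_SEQUENCE": "Q1",
        "RelQuestion": {"RelQBody": "The state of Internet in Thailand"},
        "RelComments": [],
    }
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Iterating over a dictionary yields its keys, which explains the odd vocabulary.
keys_as_tokens = sorted(questions[0])

# One token list per question body, the shape a word-embedding trainer expects.
sentences = [tokenize(q["RelQuestion"]["RelQBody"]) for q in questions]
```

A list like `sentences` is the kind of input that could then be passed to `gensim.models.Word2Vec`. Gensim's own `gensim.utils.simple_preprocess` is an alternative to the hand-written `tokenize`.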

This was referenced Feb 2, 2018

Implement Soft Cosine Measure piskvorky/gensim#1827

Merged

Add semeval datasets. Fix #18 #19

Merged

menshikh-iv closed this as completed in a2cc165 Feb 7, 2018
