Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SICK dataset #33

Open
aneesh-joshi opened this issue Aug 13, 2018 · 0 comments
Open

SICK dataset #33

aneesh-joshi opened this issue Aug 13, 2018 · 0 comments

Comments

@aneesh-joshi
Copy link

SICK(Sentences Involving Compositional Knowledge)

Dataset website: http://clic.cimec.unitn.it/composes/sick.html

Dataset download link: http://clic.cimec.unitn.it/composes/materials/SICK.zip

License:

distributed under a Creative Commons Attribution-NonCommercial-ShareAlike license.

Example:

pair_ID sentence_A sentence_B entailment_label relatedness_score entailment_AB entailment_BA sentence_A_original sentence_B_original sentence_A_dataset sentence_B_dataset SemEval_set
1 A group of kids is playing in a yard and an old man is standing in the background A group of boys in a yard is playing and a man is standing in the background NEUTRAL 4.5 A_neutral_B B_neutral_A A group of children playing in a yard, a man in the background. A group of children playing in a yard, a man in the background. FLICKR FLICKR TRAIN
2 A group of children is playing in the house and there is no man standing in the background A group of kids is playing in a yard and an old man is standing in the background NEUTRAL 3.2 A_contradicts_B B_neutral_A A group of children playing in a yard, a man in the background. A group of children playing in a yard, a man in the background. FLICKR FLICKR TRAIN

Description:
The dataset contains:
- two sentences, SentenceA and SentenceB
- thier relation to each other (entailment, contradiction, neutral)
- their relatedness score (sentA and sentB have a relatedness score of 3.4)

Dataset Reader:

import re
import os

class SickReader:
    """Reader object to provide training data from the SICK dataset
    More details can be found here : http://clic.cimec.unitn.it/composes/sick.html

    The dataset contains:
    - two sentences, SentenceA and SentenceB
    - thier relation to each other (entailment, contradiction, neutral)
    - their relatedness score (sentA and sentB have a relatedness score of 3.4)
    
    Example row in the dataset:
    pair_ID sentence_A  sentence_B  entailment_label    relatedness_score   entailment_AB   entailment_BA   sentence_A_original sentence_B_original sentence_A_dataset  sentence_B_dataset  SemEval_set
    1   A group of kids is playing in a yard and an old man is standing in the background   A group of boys in a yard is playing and a man is standing in the background    NEUTRAL 4.5 A_neutral_B B_neutral_A A group of children playing in a yard, a man in the background. A group of children playing in a yard, a man in the background. FLICKR  FLICKR  TRAIN

    You can get the entailment data with labels with `get_entailment_data`
    You can get the relatedness data with labels with `get_relatedness_data`


    Parameters
    ----------
    filepath : str
        path to folder with SICK.txt
    preprocess_fn : function
        function to preprocess sentences.
        If None, will use `self._preprocess_fn`

    """
    def __init__(self, filepath, preprocess_fn=None):
        if preprocess_fn != None:
            self.preprocess_fn = preprocess_fn
        else:
            self.preprocess_fn = self._preprocess_fn
        SENTA_INDEX, SENTB_INDEX, ENTAILMENT_INDEX, RELATEDNESS_INDEX = 1, 2, 3, 4
        SPLIT_INDEX = 11
        self.sentenceA, self.sentenceB, self.entailment_label, self.relatedness_score = {}, {}, {}, {}

        for s in [self.sentenceA, self.sentenceB, self.entailment_label, self.relatedness_score]:
            s['TRAIN'], s['TEST'], s['TRIAL'] = [], [], []


        self.entailment_label2index = {'CONTRADICTION': 0, 'ENTAILMENT': 1, 'NEUTRAL': 2}
        splits = []
        with open(os.path.join(filepath, 'SICK.txt'), 'r') as f:
            for i, line in enumerate(f):
                if i == 0:
                    continue  # to skip the header
                line = line[:-1]
                split_line = line.split('\t')
                self.sentenceA[split_line[SPLIT_INDEX]].append(self.preprocess_fn(split_line[SENTA_INDEX]))
                self.sentenceB[split_line[SPLIT_INDEX]].append(self.preprocess_fn(split_line[SENTB_INDEX]))
                self.entailment_label[split_line[SPLIT_INDEX]].append(self.entailment_label2index[split_line[ENTAILMENT_INDEX]])
                self.relatedness_score[split_line[SPLIT_INDEX]].append(float(split_line[RELATEDNESS_INDEX])) 

    def _preprocess_fn(self, sent):
        """Utility function to lower, strip and tokenize each sentence(on spaces)
        Replace this function if you want to handle preprocessing differently
        
        Parameters
        ----------
        sent : str
            The string sentence
        """
        return re.sub("[^a-zA-Z0-9]", " ", sent.strip().lower()).split()

    def get_entailment_data(self):
        """Returns data in the format: SentA, SentB, Entailment Label

        where each of them is a dict which contains the train, test and trial data in 
        'TRAIN', 'TEST' and 'TRIAL' keys respectively"""
        return self.sentenceA, self.sentenceB, self.entailment_label

    def get_relatedness_data(self):
        """Returns data in the format: SentA, SentB, RelatednessScore
        where each of them is a dict which contains the train, test and trial data in 
        'TRAIN', 'TEST' and 'TRIAL' keys respectively"""
        return self.sentenceA, self.sentenceB, self.relatedness_score

    def get_entailment_label_dict(self):
        """Returns the mapping from int to label(str)
        {'CONTRADICTION': 0, 'ENTAILMENT': 1, 'NEUTRAL': 2}"""
        return self.entailment_label2index
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant