- lxml==4.3.2
- tqdm==4.56.0
- stanza==1.1.1
Go to the MPQA 2.0 website, agree to the license, and download the corpus. Put the zipped archive in /mpqa, then run the extraction script:
bash process_mpqa.sh
Go to the Darmstadt Service Review Corpus website, agree to the license, and download the corpus. Put the zipped archive in /darmstadt_unis, then run the extraction script:
bash process_darmstadt.sh
In this track, models are trained and tested on the same language. We use the following datasets:
- norec (Norwegian professional reviews in multiple domains)
- multibooked_ca (Catalan hotel reviews)
- multibooked_eu (Basque hotel reviews)
- opener_en (English hotel reviews)
- opener_es (Spanish hotel reviews)
- darmstadt_unis (English online university reviews)
- MPQA
This track will instead train only on a high-resource language (English) and test on several languages.
For training, you can use any of the other datasets, as well as any other resource that does not contain sentiment annotations in the target language.
Test:
- opener_es
- multibooked_ca
- multibooked_eu
That means that the cross-lingual models should be able to adapt quickly to new languages.
We provide the data in json lines format.
Each line is an annotated sentence, represented as a dictionary with the following keys and values:
- 'sent_id': unique sentence identifier
- 'text': raw text
- 'opinions': list of all opinions (dictionaries) in the sentence
Additionally, each opinion in a sentence is a dictionary with the following keys and values:
- 'Source': a list of text and character offsets for the opinion holder
- 'Target': a list of text and character offsets for the opinion target
- 'Polar_expression': a list of text and character offsets for the opinion expression
- 'Polarity': sentiment label ('negative', 'positive', 'neutral')
- 'Intensity': sentiment intensity ('average', 'strong', 'weak')
{
    "sent_id": "../opener/en/kaf/hotel/english00164_c6d60bf75b0de8d72b7e1c575e04e314-6",
    "text": "Even though the price is decent for Paris , I would not recommend this hotel .",
    "opinions": [
        {
            "Source": [["I"], ["44:45"]],
            "Target": [["this hotel"], ["66:76"]],
            "Polar_expression": [["would not recommend"], ["46:65"]],
            "Polarity": "negative",
            "Intensity": "average"
        },
        {
            "Source": [[], []],
            "Target": [["the price"], ["12:21"]],
            "Polar_expression": [["decent"], ["25:31"]],
            "Polarity": "positive",
            "Intensity": "average"
        }
    ]
}
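The offsets in the example above appear to be "start:end" character positions into 'text', with an exclusive end. As a minimal sketch under that assumption, the snippet below converts each field into (text, start, end) tuples and checks them against the sentence. The helper name `parse_spans` is hypothetical and not part of the provided scripts.

```python
from typing import List, Tuple


def parse_spans(field: List[List[str]]) -> List[Tuple[str, int, int]]:
    """Turn [[texts...], ["start:end", ...]] into (text, start, end) tuples.

    Assumes each offset string has the form "start:end" (end exclusive),
    as in the example sentence. An empty field ([[], []]) yields [].
    """
    texts, offsets = field
    spans = []
    for text, offset in zip(texts, offsets):
        start, end = (int(x) for x in offset.split(":"))
        spans.append((text, start, end))
    return spans


# The example sentence from above, reduced to the keys used here.
sentence = {
    "text": "Even though the price is decent for Paris , "
            "I would not recommend this hotel .",
    "opinions": [
        {
            "Source": [["I"], ["44:45"]],
            "Target": [["this hotel"], ["66:76"]],
            "Polar_expression": [["would not recommend"], ["46:65"]],
        },
        {
            "Source": [[], []],
            "Target": [["the price"], ["12:21"]],
            "Polar_expression": [["decent"], ["25:31"]],
        },
    ],
}

for opinion in sentence["opinions"]:
    for key in ("Source", "Target", "Polar_expression"):
        for text, start, end in parse_spans(opinion[key]):
            # Slicing the raw text with the offsets recovers the span text.
            assert sentence["text"][start:end] == text
```

A check like this can be useful as a sanity test after running the extraction scripts, since misaligned offsets would otherwise go unnoticed until training.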
You can import the data by using the json library in Python:
>>> import json
>>> with open("data/norec/train.json") as infile:
...     norec_train = json.load(infile)
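Once loaded, the data is a list of sentence dictionaries, so ordinary Python tools apply. The sketch below counts polarity labels across all opinions. It uses a tiny in-memory sample instead of a real file so that it runs on its own, and `polarity_counts` is a hypothetical helper name, not something shipped with the data.

```python
import json
from collections import Counter


def polarity_counts(sentences):
    """Count 'Polarity' labels over every opinion in every sentence."""
    return Counter(
        opinion["Polarity"]
        for sentence in sentences
        for opinion in sentence["opinions"]
    )


# Stand-in for the result of json.load(infile); the sent_id values are made up.
sample = json.loads("""
[
  {"sent_id": "example-1",
   "text": "Even though the price is decent for Paris , I would not recommend this hotel .",
   "opinions": [
     {"Source": [["I"], ["44:45"]],
      "Target": [["this hotel"], ["66:76"]],
      "Polar_expression": [["would not recommend"], ["46:65"]],
      "Polarity": "negative", "Intensity": "average"},
     {"Source": [[], []],
      "Target": [["the price"], ["12:21"]],
      "Polar_expression": [["decent"], ["25:31"]],
      "Polarity": "positive", "Intensity": "average"}
   ]},
  {"sent_id": "example-2", "text": "We arrived on Monday .", "opinions": []}
]
""")

counts = polarity_counts(sample)
print(counts)
```

With real data, passing `norec_train` in place of `sample` should work the same way, assuming the file follows the format described above.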