This dataset annotates negation cues and their corresponding scopes for Norwegian. The data comprises more than 11,000 sentences across more than 400 reviews and 9 different thematic categories (literature, products, restaurants, etc.), taken from a subset of the Norwegian Review Corpus (NoReC). The annotation effort is described in the forthcoming NoDaLiDa 2021 paper *Negation in Norwegian: an annotated dataset* by Petter Mæhlum, Jeremy Barnes, Robin Kurtz, Lilja Øvrelid, and Erik Velldal.
This dataset is based on the same documents as those annotated for fine-grained sentiment in the extended version of NoReC_fine, a subset of NoReC. (Four documents were removed as they were found to contain formatting errors that interfered with the annotations.) All sentences have been annotated, regardless of whether or not they are sentiment-bearing. The data comes with a predefined train/dev/test split (inherited from NoReC), and some key statistics are summarized in the table below, including frequency counts and average token lengths.
Type | Train | Dev | Test | Total |
---|---|---|---|---|
Sentences | 8,543 | 1,531 | 1,272 | 11,346 |
– with negation | 1,768 | 301 | 263 | 2,332 |
Cues | 2,025 | 342 | 305 | 2,672 |
– min length | 1 | 1 | 1 | 1 |
– max length | 3 | 2 | 2 | 3 |
– avg length | 1.0 | 1.0 | 1.0 | 1.0 |
– discontinuous | 19 | 0 | 2 | 21 |
– multiple per sent | 228 | 39 | 37 | 304 |
– affixal | 505 | 88 | 69 | 662 |
Scopes | 1,995 | 339 | 301 | 2,635 |
– min length | 1 | 1 | 1 | 1 |
– max length | 44 | 53 | 27 | 53 |
– avg length | 6.9 | 7.1 | 6.5 | 6.9 |
– null | 30 | 3 | 4 | 37 |
– discontinuous | 1,403 | 236 | 203 | 1,842 |
– true discontinuous | 423 | 85 | 58 | 566 |
Null scopes are usually sentences such as "Nei." ("No"), which contain only a negation cue. Many of the negation annotations contain discontinuous scopes because the negation cue breaks up the scope. True discontinuous scopes, on the other hand, are less common; they are often caused by an adverbial that is embedded within a negation scope but is not itself part of it.
The full annotation guidelines are distributed with this repo. A summary can also be found in the accompanying paper (NoDaLiDa 2021; forthcoming).
NoReC_neg inherits the license of the underlying NoReC corpus, copied here for convenience:
The data is distributed under a Creative Commons Attribution-NonCommercial license (CC BY-NC 4.0); the full license text is available at https://creativecommons.org/licenses/by-nc/4.0/
The license is motivated by the need to block the possibility of third parties redistributing the original reviews for commercial purposes. Note that machine-learned models, extracted lexicons, embeddings, and similar resources created on the basis of NoReC are not considered to contain the original data, and so can be freely used also for commercial purposes despite the non-commercial condition.
Each sentence is represented as a dictionary with the following keys and values:
- 'sent_id': unique NoReC identifier for document + paragraph + sentence, which lines up with the identifiers in the document- and sentence-level NoReC data
- 'text': raw text
- 'negations': list of all negations (dictionaries) in the sentence
Additionally, each negation in a sentence is a dictionary with the following keys and values:
- 'Cue': a pair of lists: the cue strings and their corresponding character offsets into 'text'
- 'Scope': a pair of lists: the scope strings and their corresponding character offsets into 'text'
- 'Affixal': a boolean (true/false) indicating whether the cue is affixal (a prefix or suffix)
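For example, the following sentence contains a single negation whose scope is discontinuous, interrupted by the cue itself: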
{"sent_id": "400620-03-06",
"text": "Den så jeg ikke komme .",
"negations": [{"Cue": [["ikke"],
["11:15"]],
"Scope": [["Den så jeg",
"komme"],
["0:10",
"16:21"]],
"Affixal": false}]
},
Note that a single sentence may contain several annotated negations. All negations contain at least one cue, but it is possible for a negation to be without scope; these cases are mostly short comments referring back to earlier sentences. Both cues and scopes can be multiword and discontinuous, and all offsets are 'start:end' character indices into 'text' (in the example above, the cue "ikke" spans characters 11:15).
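For instance, an annotation without a scope might look like the following. This is a hypothetical entry constructed for illustration; the 'sent_id' and the exact encoding of the empty scope are assumptions, not taken from the data:

```json
{"sent_id": "000000-00-00",
 "text": "Nei .",
 "negations": [{"Cue": [["Nei"],
                        ["0:3"]],
                "Scope": [[],
                          []],
                "Affixal": false}]
}
```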
The JSON files can be loaded as follows:

```python
import json

data = {}
for name in ["train", "dev", "test"]:
    with open("data/{0}.format".format(name).replace(".format", ".json")) as infile:
        data[name] = json.load(infile)
```
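Given these offsets, the annotated substrings can be recovered by slicing the raw text. Below is a minimal sketch, assuming the data has been loaded as above and that each file is a list of sentence dictionaries in the format described:

```python
def spans(text, offsets):
    """Slice out the substrings given by a list of 'start:end' offsets."""
    pieces = []
    for offset in offsets:
        start, end = (int(i) for i in offset.split(":"))
        pieces.append(text[start:end])
    return pieces

# Print each cue and its scope for the first few training sentences.
for sentence in data["train"][:5]:
    for negation in sentence["negations"]:
        cue_strings, cue_offsets = negation["Cue"]
        scope_strings, scope_offsets = negation["Scope"]
        print(spans(sentence["text"], cue_offsets),
              spans(sentence["text"], scope_offsets))
```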
The modeling code has the following requirements:
- Python >= 3.6.1
- PyTorch 1.2.0. Install this first, choosing the GPU or CPU build as appropriate for your system; see the PyTorch installation instructions for version 1.2.0.
- sklearn
- tabulate
- h5py
- tqdm
- nltk
- stanza
- matplotlib
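The remaining dependencies can typically be installed with pip (note that sklearn is distributed on PyPI as scikit-learn); no exact versions are pinned here, so treat this as one possible setup:

```bash
pip install scikit-learn tabulate h5py tqdm nltk stanza matplotlib
```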
The scripts required to train the dependency graph model for negation prediction are found in the 'modeling' directory.
First, download the Norwegian word embeddings from the NLPL vector repository:

```bash
wget http://vectors.nlpl.eu/repository/20/58.zip
```
You will then need to set the 'EXTERNAL' variable in 'src/run_neggraph.sh' to point to the downloaded vectors. There is no need to unzip the file.
```bash
EXTERNAL=../58.zip
```
Next, you will need to convert the JSON data to the CoNLL-U negation graph format. You can use the script in the 'data' directory, which takes a single argument '--setup' with one of the values point_to_root, head_final, head_first, head_final-inside_label, or head_first-inside_label. We use point_to_root as the default, as it gives the best results:
```bash
cd data
python3 convert_to_graph.py --setup point_to_root
cd ..
```
Now you can train a model from the 'modeling' directory using the 'run_neggraph.sh' script. It takes three arguments: the graph setup (e.g. point_to_root), the run number (0-5), and the random seed.
```bash
cd modeling
./scripts/run_neggraph.sh point_to_root 0 1234
```
Alternatively, you can run the 'run_experiments.sh' script to reproduce the experiments from the paper.
```bash
cd modeling
./scripts/run_experiments.sh
```
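If you use this dataset or code, please cite the paper: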
```bibtex
@inproceedings{MaeBarKur21,
  title = "Negation in {N}orwegian: an annotated dataset",
  author = "M{\ae}hlum, Petter and Barnes, Jeremy and
            Kurtz, Robin and {\O}vrelid, Lilja and Velldal, Erik",
  booktitle = "Proceedings of the 23rd Nordic Conference on Computational Linguistics",
  year = "2021"
}
```