ERRANT v2.0.0

Giovanni-Alzetta · Dec 10, 2019 · 9901a97
1 parent 0671210
commit 9901a97

Showing 32 changed files with 1,905 additions and 1,983 deletions.
diff --git a/changelog.md b/CHANGELOG.md
@@ -1,8 +1,20 @@
 # Changelog
 
-This document contains descriptions of all the significant changes made to ERRANT since its release.
+This log describes all the significant changes made to ERRANT since its release.
 
-## 16-11-18
+## v2.0.0 (10-12-19)
+
+1. ERRANT has been significantly refactored to accommodate a new API (see README). It should now also be much easier to extend to other languages.
+
+2. Added a `setup.py` script to make ERRANT `pip` installable.
+
+3. The Damerau-Levenshtein alignment code has been rewritten in a much cleaner Python implementation. This also makes ERRANT ~20% faster. 
+
+Note: All these changes do **not** affect system output compared with the previous version. For the first `pip` release, we wanted to make sure v2.0.0 was fully compatible with the [BEA-2019 shared task](https://www.cl.cam.ac.uk/research/nl/bea2019st/) on Grammatical Error Correction.
+
+Thanks to [@sai-prasanna](https://github.com/sai-prasanna) for inspiring some of these changes!
+
+## v1.4 (16-11-18)
 
 1. The `compare_m2.py` evaluation script was refactored to make it easier to use.
 
@@ -24,7 +36,7 @@ The differences between the old and new version are summarised in the following
 | CoNLL-2014.1 |  1312 | Old<br>New | 82.50<br>84.04 | 82.73<br>82.85 | 82.61<br>**83.44** |   385<br>**50** |
 | NUCLE        | 57151 | Old<br>New | 70.14<br>73.20 | 80.27<br>81.16 | 71.95<br>**76.97** | 7565<br>**725** |
 
-## 23-08-18
+## v1.3 (23-08-18)
 
 Fix arbitrary reordering of edits with the same start and end span; e.g.  
 S I am happy .  
@@ -37,21 +49,21 @@ S I am happy .
 A 2 2|||M:ADV|||very|||REQUIRED|||-NONE-|||0  
 A 2 2|||M:ADV|||really|||REQUIRED|||-NONE-|||0  
 
-## 10-08-18
+## v1.2 (10-08-18)
 
 Added support for multiple annotators in `parallel_to_m2.py`.  
 Before: `python3 parallel_to_m2.py -orig <orig_file> -cor <cor_file> -out <out_file>`  
 After: `python3 parallel_to_m2.py -orig <orig_file> -cor <cor_file1> [<cor_file2> ...] -out <out_file>`  
 This is helpful if you have multiple annotations for the same orig file.  
 
-## 17-12-17
+## News (17-12-17)
 
 In November, spaCy changed significantly when it became version 2.0.0. Although we have not tested ERRANT with this new version, the main change seemed to be a slight increase in performance (pos tagging and parsing etc.) at a significant cost to speed. Consequently, we still recommend spaCy 1.9.0 for use with ERRANT.
 
-## 22-11-17
+## v1.1 (22-11-17)
 
 ERRANT would sometimes run into memory problems if sentences were long and very different. We hence changed the default alignment from breadth-first to depth-first. This bypassed the memory problems, made ERRANT faster and barely affected results.
 
-## 10-05-17 
+## v1.0 (10-05-17)
 
 ERRANT v1.0 released.
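
The v1.3 entry above concerns edits that share the same start and end span. As a minimal sketch (not ERRANT's own code), M2 edit lines of the form shown there can be parsed and sorted by span while keeping ties in their input order, because Python's `sorted` is stable. The names `M2Edit` and `parse_m2_edit` are hypothetical, and the third edit line below is made up for illustration.

```python
from typing import NamedTuple


class M2Edit(NamedTuple):
    """Hypothetical container for the fields of one M2 edit line."""
    start: int
    end: int
    type: str
    correction: str
    annotator: int


def parse_m2_edit(line: str) -> M2Edit:
    """Parse one 'A ...' line of an M2 file into its fields."""
    if not line.startswith("A "):
        raise ValueError("not an M2 edit line: %r" % line)
    # Fields are separated by '|||'; the first holds the two token offsets.
    fields = line[2:].strip().split("|||")
    start, end = (int(x) for x in fields[0].split())
    # fields[3] and fields[4] are the two legacy fields; the last is the annotator id.
    return M2Edit(start, end, fields[1], fields[2], int(fields[5]))


lines = [
    "A 3 4|||R:PUNCT|||!|||REQUIRED|||-NONE-|||0",  # made-up edit for illustration
    "A 2 2|||M:ADV|||very|||REQUIRED|||-NONE-|||0",
    "A 2 2|||M:ADV|||really|||REQUIRED|||-NONE-|||0",
]
edits = [parse_m2_edit(line) for line in lines]

# Sorting on (start, end) only: edits with identical spans keep their input order.
ordered = sorted(edits, key=lambda e: (e.start, e.end))
print([e.correction for e in ordered])
```

Sorting on the span alone, rather than on the whole tuple, is what keeps "very" ahead of "really" here; including the correction string in the key would reorder them alphabetically.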
diff --git a/LICENSE.md b/LICENSE.md
@@ -0,0 +1,21 @@
+# MIT License
+
+Copyright (c) 2017 Christopher Bryant, Mariano Felice
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1 @@
+include errant/en/resources/*
diff --git a/README.md b/README.md
@@ -0,0 +1,233 @@
+# ERRANT v2.0.0
+
+This repository contains the grammatical ERRor ANnotation Toolkit (ERRANT) described in:
+
+> Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. [**Automatic annotation and evaluation of error types for grammatical error correction**](https://www.aclweb.org/anthology/P17-1074/). In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada.
+
+> Mariano Felice, Christopher Bryant, and Ted Briscoe. 2016. [**Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments**](https://www.aclweb.org/anthology/C16-1079/). In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan.
+
+If you make use of this code, please cite the above papers. More information about ERRANT can be found [here](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-938.html).
+
+# Overview
+
+The main aim of ERRANT is to automatically annotate parallel English sentences with error type information. Specifically, given an original and corrected sentence pair, ERRANT will extract the edits that transform the former to the latter and classify them according to a rule-based error type framework. This can be used to standardise parallel datasets or facilitate detailed error type evaluation. Annotated output files are in M2 format and an evaluation script is provided.
+
+### Example:  
+**Original**: This are gramamtical sentence .  
+**Corrected**: This is a grammatical sentence .  
+**Output M2**:  
+S This are gramamtical sentence .  
+A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0  
+A 2 2|||M:DET|||a|||REQUIRED|||-NONE-|||0  
+A 2 3|||R:SPELL|||grammatical|||REQUIRED|||-NONE-|||0  
+A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||1
+
+In M2 format, a line preceded by S denotes an original sentence while a line preceded by A indicates an edit annotation. Each edit line consists of the start and end token offsets of the edit, the error type, and the tokenised correction string. The next two fields are included for historical reasons (see the CoNLL-2014 shared task) while the last field is the annotator id.
+
+A "noop" edit is a special kind of edit that explicitly indicates an annotator/system made no changes to the original sentence. If there is only one annotator, noop edits are optional; otherwise, a noop edit should be included whenever at least 1 out of n annotators considered the original sentence to be correct. This is something to be aware of when combining individual M2 files, as missing noops can affect evaluation.
+
+# Installation
+
+## Pip Install
+
+The easiest way to install ERRANT and its dependencies is using `pip`. We also recommend installing it in a clean virtual environment (e.g. with `venv`). ERRANT only supports Python >= 3.3.
+```
+python3 -m venv errant_env
+source errant_env/bin/activate
+pip3 install errant
+python3 -m spacy download en
+```
+This will create and activate a new python3 environment called `errant_env` in the current directory. `pip` will then install ERRANT, [spaCy v1.9.0](https://spacy.io/), [NLTK](http://www.nltk.org/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but must remember to activate it again whenever you want to use ERRANT.  
+
+**Note: ERRANT does not support spaCy 2 at this time**. spaCy 2 POS tags are slightly different from spaCy 1 POS tags and so ERRANT rules, which were designed for spaCy 1, may not always work with spaCy 2.  
+
+### BEA-2019 Shared Task
+
+ERRANT v2.0.0 was designed to be fully compatible with the [BEA-2019 Shared Task](https://www.cl.cam.ac.uk/research/nl/bea2019st/). If you want to directly compare against the results in the shared task, you should make sure to install ERRANT v2.0.0 as newer versions may produce slightly different scores.  
+```
+pip3 install errant==2.0.0
+```
+
+## Source Install
+
+If you prefer to install ERRANT from source, you can instead run the following commands:
+```
+git clone https://github.com/chrisjbryant/errant.git
+cd errant
+python3 -m venv errant_env
+source errant_env/bin/activate
+pip3 install -e .
+python3 -m spacy download en
+```
+This will clone the github ERRANT source into the current directory, build and activate a python environment inside it, and then install ERRANT and all its dependencies. If you wish to modify ERRANT code, this is the recommended way to install it.
+
+# Usage
+
+## CLI
+
+Three main commands are provided with ERRANT: `errant_parallel`, `errant_m2` and `errant_compare`. You can run them from anywhere on the command line without having to invoke a specific python script.  
+
+1. `errant_parallel`  
+
+     This is the main annotation command that takes an original text file and at least one parallel corrected text file as input, and outputs an annotated M2 file. By default, it is assumed that the original and corrected text files are word tokenised with one sentence per line.  
+	 Example:
+	 ```
+	 errant_parallel -orig <orig_file> -cor <cor_file1> [<cor_file2> ...] -out <out_m2>
+	 ```
+
+2. `errant_m2`  
+
+     This is a variant of `errant_parallel` that operates on an M2 file instead of parallel text files. This makes it easier to reprocess existing M2 files. You must also specify whether you want to use gold or auto edits; i.e. `-gold` will only classify the existing edits, while `-auto` will extract and classify automatic edits. In both settings, uncorrected edits and noops are preserved.  
+     Example:
+	 ```
+	 errant_m2 {-auto|-gold} <m2_file> -out <out_m2>
+	 ```
+
+3. `errant_compare`  
+
+     This is the evaluation command that compares a hypothesis M2 file against a reference M2 file. The default behaviour evaluates the hypothesis overall in terms of span-based correction. The `-cat {1,2,3}` flag can be used to evaluate error types at increasing levels of granularity, while the `-ds` or `-dt` flag can be used to evaluate in terms of span-based or token-based detection (i.e. ignoring the correction). All scores are presented in terms of Precision, Recall and F-score (default: F0.5), and counts for True Positives (TP), False Positives (FP) and False Negatives (FN) are also shown.  
+	 Examples:
+	 ```
+     errant_compare -hyp <hyp_m2> -ref <ref_m2> 
+     errant_compare -hyp <hyp_m2> -ref <ref_m2> -cat {1,2,3}
+     errant_compare -hyp <hyp_m2> -ref <ref_m2> -ds
+     errant_compare -hyp <hyp_m2> -ref <ref_m2> -ds -cat {1,2,3}
+	 ```	
+
+All these scripts also have additional advanced command line options which can be displayed using the `-h` flag. 
+
+#### Runtime
+
+In terms of speed, ERRANT processes ~155 sents/sec in the fully automatic edit extraction and classification setting, but ~1000 sents/sec in the classification setting alone. These figures were calculated on an Intel Core i5-6600 @ 3.30GHz machine, but results will vary depending on how different/long the original and corrected sentences are.  
+
+## API
+
+As of v2.0.0, ERRANT also comes with an API.
+
+### Quick Start
+
+```
+import errant
+
+annotator = errant.load('en')
+orig = annotator.parse('This are gramamtical sentence .')
+cor = annotator.parse('This is a grammatical sentence .')
+edits = annotator.annotate(orig, cor)
+for e in edits:
+    print(e.o_start, e.o_end, e.o_str, e.c_start, e.c_end, e.c_str, e.type)
+```
+
+### Loading
+
+`errant`.**load**(lang, nlp=None)  
+Create an ERRANT Annotator object. The `lang` parameter currently only accepts `'en'` for English, but we hope to extend it to other languages in the future. The optional `nlp` parameter can be used if you have already loaded spaCy and do not want ERRANT to load it again.
+
+```
+import errant
+import spacy
+
+nlp = spacy.load('en')
+annotator = errant.load('en', nlp)
+```
+
+### Annotator Objects
+
+An Annotator object is the main interface for ERRANT.
+
+#### Methods
+
+`annotator`.**parse**(string, tokenise=False)  
+Lemmatise, POS tag, and parse a text string with spacy. Set `tokenise` to True to also word tokenise with spacy. Returns a spacy Doc object.
+
+`annotator`.**align**(orig, cor, lev=False)  
+Align spacy-parsed original and corrected text. The default uses a linguistically-enhanced Damerau-Levenshtein alignment, but the `lev` flag can be used for a standard Levenshtein alignment. Returns an Alignment object.
+
+`annotator`.**merge**(alignment, merging='rules')  
+Extract edits from the optimum alignment in an Alignment object. Four different merging strategies are available:
+1. rules: Use a rule-based merging strategy (default)
+2. all-split: Merge nothing: MSSDI -> M, S, S, D, I
+3. all-merge: Merge adjacent non-matches: MSSDI -> M, SSDI
+4. all-equal: Merge adjacent same-type non-matches: MSSDI -> M, SS, D, I
+
+Returns a list of Edit objects.
+
+`annotator`.**classify**(edit)  
+Classify an edit. Sets the `edit.type` attribute in an Edit object and returns the same Edit object. 
+
+`annotator`.**annotate**(orig, cor, lev=False, merging='rules')  
+Run the full annotation pipeline to align two sequences and extract and classify the edits. Equivalent to running `annotator.align`, `annotator.merge` and `annotator.classify` in sequence. Returns a list of Edit objects.
+
+```
+import errant
+
+annotator = errant.load('en')
+orig = annotator.parse('This are gramamtical sentence .')
+cor = annotator.parse('This is a grammatical sentence .')
+alignment = annotator.align(orig, cor)
+edits = annotator.merge(alignment)
+for e in edits:
+    e = annotator.classify(e)
+```
+
+`annotator`.**import_edit**(orig, cor, edit, min=True, old_cat=False)  
+Load an Edit object from a list. `orig` and `cor` must be spacy-parsed Doc objects and the edit must be of the form: `[o_start, o_end, c_start, c_end(, type)]`. The values must be integers that correspond to the token start and end offsets in the original and corrected Doc objects. The `type` value is an optional string that denotes the error type of the edit (if known). When `min` is True (the default), the edit is minimised (e.g. [a b -> a c] = [b -> c]); set `old_cat` to True to preserve the old error type category (i.e. turn off the classifier).
+
+```
+import errant
+
+annotator = errant.load('en')
+orig = annotator.parse('This are gramamtical sentence .')
+cor = annotator.parse('This is a grammatical sentence .')
+edit = [1, 2, 1, 2, 'SVA'] # are -> is
+edit = annotator.import_edit(orig, cor, edit)
+print(edit.to_m2())
+```
+
+### Alignment Objects
+
+An Alignment object is created from two spacy-parsed text sequences.
+
+#### Attributes
+
+`alignment`.**orig**  
+`alignment`.**cor**  
+The spacy-parsed original and corrected text sequences.
+
+`alignment`.**cost_matrix**   
+`alignment`.**op_matrix**  
+The cost matrix and operation matrix produced by the alignment.
+
+`alignment`.**align_seq**  
+The first cheapest alignment between the two sequences.
+
+### Edit Objects
+
+An Edit object represents a transformation between two text sequences.
+
+#### Attributes
+
+`edit`.**o_start**  
+`edit`.**o_end**  
+`edit`.**o_toks**  
+`edit`.**o_str**  
+The start and end offsets, the spacy tokens, and the string for the edit in the *original* text.
+
+`edit`.**c_start**  
+`edit`.**c_end**  
+`edit`.**c_toks**  
+`edit`.**c_str**  
+The start and end offsets, the spacy tokens, and the string for the edit in the *corrected* text.
+
+`edit`.**type**  
+The error type string.
+
+#### Methods
+
+`edit`.**to_m2**(id=0)  
+Format the edit for an output M2 file. `id` is the annotator id.	
+
+# Contact
+
+If you have any questions, suggestions or bug reports, you can contact the authors at:  
+christopher d0t bryant at cl.cam.ac.uk  
+mariano d0t felice at cl.cam.ac.uk
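
The three non-rule merging strategies listed under `annotator.merge` in the README can be illustrated on a plain string of alignment operations. The following is a minimal sketch, not ERRANT's implementation: it assumes one character per operation (M match, S substitution, D deletion, I insertion) and that each match always stays in its own group. The function `merge_ops` is hypothetical and works on strings rather than Alignment objects.

```python
from itertools import groupby


def merge_ops(ops: str, merging: str) -> list:
    """Group an alignment op string such as 'MSSDI' under a merging strategy."""
    if merging == "all-split":
        # Merge nothing: every operation is its own group.
        return list(ops)
    if merging == "all-merge":
        # Merge runs of adjacent non-matches; matches stay single (assumption).
        groups = []
        for is_match, run in groupby(ops, key=lambda op: op == "M"):
            run = "".join(run)
            groups.extend(list(run) if is_match else [run])
        return groups
    if merging == "all-equal":
        # Merge runs of adjacent non-matches of the same type; matches stay single.
        groups = []
        for op, run in groupby(ops):
            run = "".join(run)
            groups.extend(list(run) if op == "M" else [run])
        return groups
    raise ValueError("unknown merging strategy: %r" % merging)


for strategy in ("all-split", "all-merge", "all-equal"):
    print(strategy, merge_ops("MSSDI", strategy))
```

On the README's `MSSDI` example, this reproduces the three groupings given there: `M, S, S, D, I`, then `M, SSDI`, then `M, SS, D, I`. The default `rules` strategy depends on linguistic information and is not covered by this sketch.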