ERRANT v2.1.0
Christopher Bryant committed Jan 9, 2020
1 parent 9901a97 commit e1e6066
Showing 11 changed files with 452 additions and 114 deletions.
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,19 @@

This log describes all the significant changes made to ERRANT since its release.

## v2.1.0 (09-01-20)

1. The character-level cost in the sentence alignment function is now computed by the much faster [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) library instead of Python's built-in `difflib.SequenceMatcher`. This makes ERRANT 3x faster!

2. Various minor updates:
* Updated the English wordlist.
* Fixed a broken rule for classifying contraction errors.
* Changed a condition in the calculation of transposition errors to be more intuitive.
* Partially updated the ERRANT POS tag map to match the updated [Universal POS tag map](https://universaldependencies.org/tagset-conversion/en-penn-uposf.html). Specifically, EX now maps to PRON rather than ADV, LS maps to X rather than PUNCT, and CONJ has been renamed CCONJ. I did not change the mapping of RP from PART to ADP yet because this breaks several rules involving phrasal verbs.
* Added an `errant.__version__` attribute.
* Added a warning about using ERRANT with spaCy 2.
* Tidied some code in the classifier.

## v2.0.0 (10-12-19)

1. ERRANT has been significantly refactored to accommodate a new API (see README). It should now also be much easier to extend to other languages.
20 changes: 14 additions & 6 deletions README.md
@@ -1,4 +1,4 @@
# ERRANT v2.0.0
# ERRANT v2.1.0

This repository contains the grammatical ERRor ANnotation Toolkit (ERRANT) described in:

@@ -37,17 +37,21 @@ source errant_env/bin/activate
pip3 install errant
python3 -m spacy download en
```
This will create and activate a new python3 environment called `errant_env` in the current directory. `pip` will then install ERRANT, [spaCy v1.9.0](https://spacy.io/), [NLTK](http://www.nltk.org/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but must remember to activate it again whenever you want to use ERRANT.
This will create and activate a new python3 environment called `errant_env` in the current directory. `pip` will then install ERRANT, [spaCy v1.9.0](https://spacy.io/), [NLTK](http://www.nltk.org/), [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but must remember to activate it again whenever you want to use ERRANT.

**Note: ERRANT does not support spaCy 2 at this time**. spaCy 2 POS tags are slightly different from spaCy 1 POS tags and so ERRANT rules, which were designed for spaCy 1, may not always work with spaCy 2.

### BEA-2019 Shared Task
#### BEA-2019 Shared Task

ERRANT v2.0.0 was designed to be fully compatible with the [BEA-2019 Shared Task](https://www.cl.cam.ac.uk/research/nl/bea2019st/). If you want to directly compare against the results in the shared task, you should make sure to install ERRANT v2.0.0 as newer versions may produce slightly different scores.
```
pip3 install errant==2.0.0
```

#### ERRANT and spaCy 2

ERRANT was originally designed to work with spaCy v1.9.0 and so only officially supports this version. We nevertheless tested ERRANT v2.1.0 with spaCy v2.2.3 and found it to be **over 4x slower and ~2% less accurate**.

This is mainly because spaCy 2 uses a neural system to trade speed for accuracy (see the [official spaCy benchmarks](https://spacy.io/usage/facts-figures#spacy-models)), but also because some Universal POS tag mappings changed, and so certain ERRANT rules no longer worked as intended. Although we could offset the accuracy loss by modifying ERRANT rules for the new POS mappings, there is nothing we can do about the significant speed loss, and so we do not recommend using spaCy 2 with ERRANT at this time.

## Source Install

If you prefer to install ERRANT from source, you can instead run the following commands:
@@ -98,7 +102,7 @@ All these scripts also have additional advanced command line options which can b

#### Runtime

In terms of speed, ERRANT processes ~155 sents/sec in the fully automatic edit extraction and classification setting, but ~1000 sents/sec in the classification setting alone. These figures were calculated on an Intel Core i5-6600 @ 3.30GHz machine, but results will vary depending on how different/long the original and corrected sentences are.
In terms of speed, ERRANT processes ~500 sents/sec in the fully automatic edit extraction and classification setting, but ~1000 sents/sec in the classification setting alone. These figures were calculated on an Intel Core i5-6600 @ 3.30GHz machine, but results will vary depending on how different/long the original and corrected sentences are.

## API

@@ -226,6 +230,10 @@ The error type string.
`edit`.**to_m2**(id=0)
Format the edit for an output M2 file. `id` is the annotator id.
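
As a rough illustration of what an M2 edit line looks like, here is a minimal sketch. The function name `format_m2_edit` and its parameters are hypothetical and are not part of the ERRANT API; the sketch only assumes the standard M2 layout of `A start end|||type|||correction|||REQUIRED|||-NONE-|||annotator_id`.

```python
def format_m2_edit(o_start, o_end, error_type, correction, annotator_id=0):
    """Build one M2 edit line (hypothetical helper, not ERRANT's own code).

    o_start and o_end are token offsets into the original sentence,
    correction is the corrected token string, and annotator_id plays
    the role of the `id` argument described above.
    """
    span = " ".join(["A", str(o_start), str(o_end)])
    fields = [span, error_type, correction, "REQUIRED", "-NONE-", str(annotator_id)]
    return "|||".join(fields)


if __name__ == "__main__":
    # A replacement of the token at position 1 by "went", annotator 0.
    print(format_m2_edit(1, 2, "R:VERB:TENSE", "went"))
```

The annotator id is the last field, which is why a single M2 file can hold edits from several annotators for the same sentence.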

## Development for Other Languages

If you want to develop ERRANT for other languages, you should mimic the `errant/en` directory structure. For example, ERRANT for French should import a merger from `errant.fr.merger` and a classifier from `errant.fr.classifier` that respectively have equivalent `get_rule_edits` and `classify` methods. You will also need to add `'fr'` to the list of supported languages in `errant/__init__.py`.
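
The directory convention above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `SUPPORTED` set, `component_paths` and `load_components` are hypothetical names, and `'fr'` is included purely as an example of a language that someone has already added.

```python
from importlib import import_module

# Hypothetical list of supported languages; in ERRANT this list lives in
# errant/__init__.py and would need 'fr' added to it.
SUPPORTED = {"en", "fr"}


def component_paths(lang):
    """Return the dotted module paths expected for a given language."""
    if lang not in SUPPORTED:
        raise ValueError("%s is an unsupported or unknown language" % lang)
    return {
        "merger": "errant.%s.merger" % lang,          # should expose get_rule_edits
        "classifier": "errant.%s.classifier" % lang,  # should expose classify
    }


def load_components(lang):
    """Import the merger and classifier modules (requires them to exist)."""
    return {name: import_module(path) for name, path in component_paths(lang).items()}
```

Calling `component_paths("fr")` only builds the module names, so it works even before the French modules exist; `load_components("fr")` would fail with an `ImportError` until `errant/fr/merger.py` and `errant/fr/classifier.py` have been written.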

# Contact

If you have any questions, suggestions or bug reports, you can contact the authors at:
8 changes: 8 additions & 0 deletions errant/__init__.py
@@ -1,7 +1,11 @@
from importlib import import_module
import logging
import spacy
from errant.annotator import Annotator

# ERRANT version
__version__ = '2.1.0'

# Load an ERRANT Annotator object for a given language
def load(lang, nlp=None):
    # Make sure the language is supported
@@ -11,6 +15,10 @@ def load(lang, nlp=None):

    # Load spacy
    nlp = nlp or spacy.load(lang, disable=["ner"])
    # Warning for spacy 2
    if spacy.__version__[0] == "2":
        logging.warning("ERRANT is 4x slower and 2% less accurate with spaCy 2. "
            "We strongly recommend spaCy 1.9.0!")

    # Load language edit merger
    merger = import_module("errant.%s.merger" % lang)
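
The version check in this hunk can be tried out in isolation with the sketch below. It takes the version as a plain string so that it runs without spaCy installed; `is_spacy2` and `warn_if_spacy2` are hypothetical helper names, not functions from ERRANT. Unlike the commit, which compares only the first character of `spacy.__version__`, this variant compares the whole major version component, which is a design choice of the sketch rather than the author's method.

```python
import logging


def is_spacy2(version):
    """Return True if a version string such as '2.2.3' has major version 2."""
    return version.split(".")[0] == "2"


def warn_if_spacy2(version):
    """Log a warning for spaCy 2 and report whether one was issued."""
    if is_spacy2(version):
        logging.warning(
            "ERRANT is 4x slower and 2% less accurate with spaCy 2. "
            "We strongly recommend spaCy 1.9.0!")
        return True
    return False


if __name__ == "__main__":
    warn_if_spacy2("2.2.3")  # logs the warning
    warn_if_spacy2("1.9.0")  # stays silent
```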
6 changes: 3 additions & 3 deletions errant/alignment.py
@@ -1,5 +1,5 @@
from difflib import SequenceMatcher
from itertools import groupby
import Levenshtein
import spacy.parts_of_speech as POS
from errant.edit import Edit

@@ -62,7 +62,7 @@ def align(self, lev):
# Traverse the diagonal while there is not a Match.
k = 1
while i-k >= 0 and j-k >= 0 and \
cost_matrix[i-k+1][j-k+1]-cost_matrix[i-k][j-k] > 0:
cost_matrix[i-k+1][j-k+1] != cost_matrix[i-k][j-k]:
if sorted(o_low[i-k:i+1]) == sorted(c_low[j-k:j+1]):
trans_cost = cost_matrix[i-k][j-k] + k
break
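
The `sorted(...) == sorted(...)` test in this hunk checks whether two token spans contain the same tokens in a different order, which is what makes a span a transposition candidate. A minimal standalone sketch of that check follows; `is_reordering` is a hypothetical name, and the surrounding cost-matrix traversal is deliberately left out.

```python
def is_reordering(orig_tokens, cor_tokens):
    """Return True if both spans hold the same lowercased tokens,
    ignoring order (the multiset comparison used in the hunk above)."""
    o_low = [tok.lower() for tok in orig_tokens]
    c_low = [tok.lower() for tok in cor_tokens]
    return sorted(o_low) == sorted(c_low)


if __name__ == "__main__":
    print(is_reordering(["only", "can"], ["can", "only"]))  # same tokens, swapped
    print(is_reordering(["the", "cat"], ["a", "cat"]))      # different tokens
```

Note that identical spans also pass this check, so in the alignment code it is only meaningful while the loop is walking back over non-matching cells of the diagonal.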
@@ -94,7 +94,7 @@ def get_sub_cost(self, o, c):
elif o.pos in self._open_pos and c.pos in self._open_pos: pos_cost = 0.25
else: pos_cost = 0.5
# Char cost
char_cost = 1-SequenceMatcher(None, o.text, c.text).ratio()
char_cost = 1-Levenshtein.ratio(o.text, c.text)
# Combine the costs
return lemma_cost + pos_cost + char_cost
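
To make the character cost change concrete, here is a pure-Python sketch that runs without python-Levenshtein installed. It assumes that `Levenshtein.ratio` is defined as `(len(a) + len(b) - dist) / (len(a) + len(b))`, where `dist` is an edit distance in which substitutions cost 2; that is my understanding of the library, but the functions `indel_distance` and `indel_ratio` below are stand-ins, not the library's C implementation, and will be far slower.

```python
from difflib import SequenceMatcher


def indel_distance(a, b):
    """Edit distance with insert/delete cost 1 and substitution cost 2."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            sub = prev[j - 1] + (0 if ca == cb else 2)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, sub))
        prev = curr
    return prev[-1]


def indel_ratio(a, b):
    """Similarity in [0, 1]; assumed to mirror Levenshtein.ratio."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - indel_distance(a, b)) / total


def char_cost_old(o_text, c_text):
    """Character cost as computed before this commit."""
    return 1 - SequenceMatcher(None, o_text, c_text).ratio()


def char_cost_new(o_text, c_text):
    """Character cost after this commit, using the stand-in ratio."""
    return 1 - indel_ratio(o_text, c_text)


if __name__ == "__main__":
    for o, c in [("cat", "cats"), ("cat", "dog"), ("their", "there")]:
        print(o, c, round(char_cost_old(o, c), 3), round(char_cost_new(o, c), 3))
```

The two costs agree on simple cases such as `cat` versus `cats`, but `SequenceMatcher` uses a matching-blocks heuristic rather than an optimal alignment, so they are not guaranteed to be identical on every pair of strings.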

3 changes: 3 additions & 0 deletions errant/commands/parallel_to_m2.py
@@ -45,6 +45,9 @@ def main():
out_m2.write(edit.to_m2(cor_id)+"\n")
# Write a newline when we have processed all corrections for each line
out_m2.write("\n")

# pr.disable()
# pr.print_stats(sort="time")

# Parse command line args
def parse_args():