Error-repair Dependency Parsing for Ungrammatical Texts

Instructions

N.B. Because of license restrictions, the original Penn Treebank (PTB) is not provided in this repository.

Prerequisites

The code depends on Python 2.7 (compiled with unicode=ucs2).

Check if your python is compatible with the code.

$ python --version
Python 2.7.17
$ python -c "import sys; print(sys.maxunicode)"
65535 (If this is 1114111, then your python uses unicode=ucs4)
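
The same two checks can be scripted. The following is a minimal sketch, not part of the repository; the function name `check_interpreter` is hypothetical, and the only facts it relies on are the ones stated above (Python 2.7, `sys.maxunicode` equal to 65535 for a ucs2 build).

```python
import sys


def check_interpreter(version_info=None, maxunicode=None):
    """Return a list of compatibility problems (empty if none were found).

    Hypothetical helper: both arguments default to the running interpreter,
    but can be passed explicitly to check values copied from another machine.
    """
    if version_info is None:
        version_info = sys.version_info
    if maxunicode is None:
        maxunicode = sys.maxunicode

    problems = []
    # The README asks for Python 2.7.
    if tuple(version_info[:2]) != (2, 7):
        problems.append("expected Python 2.7, found %d.%d" % tuple(version_info[:2]))
    # 65535 (0xFFFF) means a ucs2 build; 1114111 (0x10FFFF) means ucs4.
    if maxunicode != 0xFFFF:
        problems.append("expected a ucs2 build (sys.maxunicode == 65535), found %d" % maxunicode)
    return problems


if __name__ == "__main__":
    for problem in check_interpreter():
        print(problem)
```

An empty list means the interpreter matches both requirements.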

If your python is not compatible, you might want to build python from source.

(for example)
cd $HOME
mkdir local
mkdir temp
cd ./temp
wget https://www.python.org/ftp/python/2.7.17/Python-2.7.17.tgz
tar zxvf Python-2.7.17.tgz
cd Python-2.7.17
./configure --prefix=$HOME/local --enable-unicode=ucs2 --enable-loadable-sqlite-extensions
make && make install
export PATH=$HOME/local/bin:$PATH
cd $HOME/temp
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

Once you have a compatible python, install pre-requisite modules.
```
pip install -r requirements.txt
```
(You may need to install libmysqlclient-dev and libsqlite3-dev first, e.g., sudo apt-get install libmysqlclient-dev libsqlite3-dev.)

Download the KenLM model to ./data.

cd ./data
wget http://cs.jhu.edu/~keisuke/shared/gigaword.kenlm

If you use a parser with pre-trained models, download the model weights and put them in ./easyfirst/models/ so that the directory looks as follows:

easyfirst/models/
├── E05.model
├── E05.weights.FINAL
├── E10.model
├── E10.weights.FINAL
├── E15.model
├── E15.weights.FINAL
├── E20.model
└── E20.weights.FINAL
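
Before parsing, it can help to confirm that the directory really matches this layout. The sketch below is an illustration only; `missing_model_files` is a hypothetical helper, and the file names it checks are exactly the eight listed above.

```python
import os

# Error-rate tags and file suffixes taken from the directory listing above.
EXPECTED_RATES = ("E05", "E10", "E15", "E20")
EXPECTED_SUFFIXES = (".model", ".weights.FINAL")


def missing_model_files(model_dir, rates=EXPECTED_RATES):
    """Return the names of expected model files that are absent from model_dir."""
    missing = []
    for rate in rates:
        for suffix in EXPECTED_SUFFIXES:
            name = rate + suffix
            if not os.path.isfile(os.path.join(model_dir, name)):
                missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_model_files(os.path.join("easyfirst", "models")):
        print("missing: " + name)
```

Run from the repository root, it prints one line per missing file and nothing when the layout is complete.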

Place the Penn Treebank under the data directory. If you only use a parser with pre-trained models, skip ahead to "Parsing sentences with the trained model" (step 6).
```
 cd ./data
 ln -s PATH_TO_YOUR_PTB treebank_3
```

Download and install CRFsuite for preprocessing.

 [example for linux]
 cd ./data
 wget https://github.com/downloads/chokkan/crfsuite/crfsuite-0.12-x86_64.tar.gz
 wget https://github.com/downloads/chokkan/crfsuite/crfsuite-0.12.tar.gz
 tar zxvf crfsuite-0.12-x86_64.tar.gz
 tar zxvf crfsuite-0.12.tar.gz

Set CRFSUITE_UTIL and crfsuite paths in preproc.sh and run the script.
```
 sh ./preproc.sh
```
This creates ./data/[train|dev|test].E00 (i.e., error rate = 0%).

Add noise by running errgent. See the readme file in the directory.

 cd ./errgent
 sh ./generate_train_dev_test.sh (for generating all the files needed)

The remaining steps assume that the files are named ./data/[train|dev|test].[E00|E05|E10|E15|E20]. Each file should look like the following.

     1       Ms.     B-NP    NNP     _       _       2       TITLE   _       _
     2       Haag    I-NP    NNP     _       _       3       SBJ     _       _
     3       plays   B-VP    VBZ     _       _       0       ROOT    _       _
     4       Elianti B-NP    NNP     _       _       3       OBJ     _       _
     5       .       O       .       _       _       3       P       _       _
     
     1       The     B-NP    DT      _       _       4       NMOD    _       _
     2       luxury  I-NP    NN      _       _       4       NMOD    _       _
     3       auto    I-NP    NN      _       _       4       NMOD    _       _
     4       maker   I-NP    NN      _       _       7       SBJ     _       _
     5       last    B-NP    JJ      _       _       6       NMOD    _       _
     6       year    I-NP    NN      _       _       7       TMP     _       _
     7       sold    B-VP    VBD     _       _       0       ROOT    _       _
     8       1,214   B-NP    CD      _       _       9       NMOD    _       _
     9       cars    I-NP    NNS     _       _       7       OBJ     _       _
     10      in      B-PP    IN      _       _       7       LOC     _       _
     11      the     B-NP    DT      _       _       12      NMOD    _       _
     12      U.S.    I-NP    NNP     _       _       10      PMOD    _       _
     
     ...
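
To inspect such a file programmatically, a small reader is enough. The sketch below is not the repository's own loader. It assumes the columns are whitespace-separated, that sentences are separated by blank lines, and that the ten columns are: token index, word form, chunk tag, POS tag, two unused fields, head index, dependency label, and two more unused fields. The field names in `Token` are hypothetical, inferred from the sample above.

```python
from collections import namedtuple

# Field names are assumptions inferred from the sample; columns 5, 6, 9 and 10
# hold "_" in the sample and are ignored here.
Token = namedtuple("Token", ["index", "form", "chunk", "pos", "head", "deprel"])


def read_sentences(lines):
    """Group 10-column token lines into sentences (lists of Token)."""
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if len(fields) != 10:
            raise ValueError("expected 10 columns, got %d: %r" % (len(fields), line))
        current.append(Token(
            index=int(fields[0]),
            form=fields[1],
            chunk=fields[2],
            pos=fields[3],
            head=int(fields[6]),   # 0 marks the root in the sample
            deprel=fields[7],
        ))
    if current:
        sentences.append(current)
    return sentences
```

For example, `read_sentences(open("data/dev.E00"))` returns one list of tokens per sentence, and `[t for t in sentence if t.head == 0]` picks out the root token of a sentence.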

Training an error-repair parser

 cd easyfirst
 (e.g.,) sh sample_train.sh E05 (training a model with 5% error-injected corpus)

Parsing sentences with the trained model

 (e.g.,) sh sample_parse.sh dev E05 E10 (parse 10% error-injected dev set with a model trained on 5% error corpus)
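
When several model and data error rates need to be combined, the command lines can be generated rather than typed by hand. This is a sketch under the assumption that the argument order is the one shown in the example above (split, then the error rate of the training corpus, then the error rate of the data to parse); `parse_commands` is a hypothetical helper and only builds the commands, it does not run them.

```python
import itertools


def parse_commands(split, model_rates, data_rates):
    """Build one sample_parse.sh command per (model rate, data rate) pair."""
    return [
        ["sh", "sample_parse.sh", split, model_rate, data_rate]
        for model_rate, data_rate in itertools.product(model_rates, data_rates)
    ]


if __name__ == "__main__":
    for command in parse_commands("dev", ["E05", "E10"], ["E05", "E10"]):
        print(" ".join(command))
```

Each command can then be passed to `subprocess.call(command, cwd="easyfirst")`, assuming the script is run from the easyfirst directory as in the training step.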

Evaluation on parsing performance

 cd ./eval
 wget https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/srleval/source-archive.zip -O srleval.zip
 unzip srleval.zip
 cd ./srleval/trunk/align
 make
 
 Modify line 231 in ./eval/srleval/trunk/eval.py:
 (from) for item in alignment.align(ref_words, hyp_words, command=os.path.dirname(__file__) + "/align/align"):
 (to)   for item in alignment.align(ref_words, hyp_words):
 
 Run the evaluation script:
 cd ./eval
 (e.g.,) sh evaluate.sh dev E05 E10 (evaluate 10% error-injected dev set with a model trained on 5% error corpus)
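
The one-line change to eval.py described above can also be applied by matching the text of the call instead of relying on the line number. The following is a minimal sketch, not part of the repository; `patch_align_call` is a hypothetical helper, and the two strings are copied from the (from)/(to) lines above.

```python
# Strings copied from the (from)/(to) lines in the instructions above.
OLD_CALL = 'alignment.align(ref_words, hyp_words, command=os.path.dirname(__file__) + "/align/align")'
NEW_CALL = 'alignment.align(ref_words, hyp_words)'


def patch_align_call(source):
    """Return source with the single alignment.align(...) call rewritten.

    Raises ValueError unless the old call occurs exactly once, so an
    already-patched or unexpected file is never modified silently.
    """
    count = source.count(OLD_CALL)
    if count != 1:
        raise ValueError("expected exactly one matching call, found %d" % count)
    return source.replace(OLD_CALL, NEW_CALL)


# Usage (paths relative to the repository root):
#     path = "eval/srleval/trunk/eval.py"
#     with open(path) as f:
#         patched = patch_align_call(f.read())
#     with open(path, "w") as f:
#         f.write(patched)
```

Matching on the call text keeps the patch working even if the line number differs in another copy of srleval.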

Evaluation on grammaticality improvement

See Predicting Grammaticality on an Ordinal Scale

Questions

Please e-mail Keisuke Sakaguchi (keisuke[at]cs.jhu.edu).
