NdtTop
Internal (for the time being at least) notes on working with the Norwegian Dependency Treebank at LTG.
svn co http://svn.emmtee.net/ltg/ndt
The source code of the Regular Expression–Based Pre-Processor (REPP) and the toolkit for Robust Evaluation of Syntactic Analysis (RESA) is included as an external SVN dependency. As a one-time preparatory step, both tools need to be compiled. In tokenization/src/repp/ and tokenization/src/resa/, run:
autoreconf -i
./configure
make
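Since the same three commands are run in both directories, the build step can be wrapped in a small loop. The following is only a sketch: it assumes the current directory is tokenization/, and both the function name `build_tools` and the `RUN` variable are made up here for illustration (setting `RUN=echo` prints the commands instead of executing them).

```shell
# Sketch: build REPP and RESA in turn; assumes the current directory is tokenization/.
# build_tools and RUN are hypothetical names, not part of either tool.
build_tools() {
  for tool in repp resa; do
    (
      # The subshell keeps the directory change local to one iteration.
      cd "src/$tool" || exit 1
      ${RUN:-} autoreconf -i && ${RUN:-} ./configure && ${RUN:-} make
    ) || return 1   # stop at the first tool that fails to build
  done
}
```

Calling `RUN=echo build_tools` first shows what would be run; plain `build_tools` performs the actual build.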
cat ../data/txt/nob/ap001.txt \
| ./src/sentence-split_no.perl \
| while IFS= read -r line; do \
echo "$line" | ./src/repp/src/repp -c repp/nob.set --format line; \
done \
> ap001.t
RESA compares two views on syntactic analysis (gold and test, in our case) and aligns sentence end points and tokens to the original raw text (thus recovering character start and end points). ‘-b’ requests the use of sentence end boundaries (rather than spans); ‘-v’ means to output any mismatched triples; and ‘--interim’ will create additional files (in the current directory, but named after the gold and test inputs) containing all triples. Finally, the file descriptor magic serves to swap standard output and standard error, such that we can filter for sentence and token errors and record mismatches in yet another file.
./src/resa/src/resa \
-r ../data/txt/nob/ap001.txt \
-g ../data/conll/nob/ap001.conll -G CONLLX \
-t ap001.t -T TAB \
-b -v --interim 3>&1 1>&2 2>&3 3>&- | grep -E 'SENT|TOK' > ap001.e
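The descriptor swap is easier to follow in isolation. In the sketch below, `emit` is a made-up stand-in for resa (not its actual output format) that writes one line to standard output and two to standard error. After the swap, only the standard error lines travel down the pipe to the filter.

```shell
# Hypothetical stand-in for resa: one line on stdout, two on stderr.
emit() {
  echo "summary on stdout"
  echo "SENT mismatch on stderr" >&2
  echo "other diagnostic on stderr" >&2
}

swap_and_filter() {
  # 3>&1  fd 3 becomes a copy of stdout (here: the pipe to grep)
  # 1>&2  stdout now points at the original stderr
  # 2>&3  stderr now points at the pipe
  # 3>&-  close the temporary fd 3
  emit 3>&1 1>&2 2>&3 3>&- | grep -E 'SENT|TOK'
}

swap_and_filter
```

The filter sees only what `emit` wrote to standard error, while the regular output ends up on the standard error of the calling shell, where it can be watched or discarded separately.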
For comparison to the CIS tokenizer:
cat ../data/txt/nob/ap001.txt \
| ~/src/logon/cis/bin/linux.x86.32/tokenizer -L german-utf8 -S -p -P -E '' \
| while IFS= read -r line; do \
echo "$line" | ./src/repp/src/repp -c repp/nob.set --format line; \
done \
> ap001.t