Metrics are calculated with sacreBLEU.
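As a minimal sketch, the scores can be reproduced with the sacreBLEU Python API (assuming sacrebleu >= 2.0; the example sentences are made up):

```python
from sacrebleu.metrics import BLEU, CHRF, TER

# Hypothetical system output and references, one sentence per entry;
# references is a list of reference streams.
hypotheses = ["mi tavla do"]
references = [["mi tavla do"]]

print(BLEU().corpus_score(hypotheses, references))  # BLEU
print(CHRF().corpus_score(hypotheses, references))  # chrF2 (beta=2 is the default)
print(TER().corpus_score(hypotheses, references))   # TER
```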
English to Lojban
Version | test BLEU | test chrF2 | test TER | train BLEU | train chrF2 | train TER | val BLEU | val chrF2 | val TER |
---|---|---|---|---|---|---|---|---|---|
1.0.0 | 80.06 | 77.83 | 142.03 | 91.28 | 94.55 | 15.22 | 93.06 | 99.00 | 14.29 |
1.1.0 | 64.12 | 73.29 | 170.44 | 91.57 | 96.14 | 15.22 | 49.53 | 76.75 | 128.61 |
1.1.0 lc | 68.35 | 78.05 | 113.63 | 89.01 | 94.50 | 30.43 | 68.34 | 86.94 | 14.29 |
Suffix "lc" means that everything in evaluation was lower-cased. The result of "1.0.0 lc" is much worse than just "1.0.0" (without lowercase conversion), therefore not shown here.
TODO: investigate why the 1.1.0 scores are worse, even though 1.1.0 performs better in interactive experiments.
Lojban to English
Version | test BLEU | test chrF2 | test TER | train BLEU | train chrF2 | train TER | val BLEU | val chrF2 | val TER |
---|---|---|---|---|---|---|---|---|---|
1.0.0 | 25.20 | 21.53 | 276.62 | 45.56 | 43.02 | 252.73 | 6.74 | 16.73 | 92.69 |
1.1.0 | 39.15 | 62.47 | 237.11 | 49.01 | 47.56 | 224.65 | 17.24 | 23.99 | 119.18 |
1.1.0 lc | 45.85 | 63.27 | 237.11 | 49.58 | 48.09 | 224.65 | 17.24 | 23.99 | 119.18 |
The train/validation/test split above is intended for future versions; versions 1.0.0 and 1.1.0 each used their own splits.
High-level description:
- Execute `make copy-ds` to fetch the dataset from Hugging Face and store it in the local cache
- tokenize
- translate
- evaluate
- Execute `make clean` afterwards

Details are in the Makefile.
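A minimal sketch of what the evaluate step boils down to (the file names are hypothetical; the actual targets and paths are defined in the Makefile):

```python
from sacrebleu.metrics import BLEU, CHRF, TER

# Hypothetical file names: one sentence per line, hypothesis and
# reference aligned line by line.
with open("test.hyp.jbo") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("test.ref.jbo") as f:
    references = [[line.rstrip("\n") for line in f]]

for metric in (BLEU(), CHRF(), TER()):
    print(metric.corpus_score(hypotheses, references))
```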
Code versions to use:
- 1.0.0, 1.1.0: 9 August 2022, commit 0a117049a03ff92119783599cbcac318c79ec06b
- 1.1.0 lc: current