Character-based language models and an application: dehyphenation.
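This section explains how to reproduce the results presented in the MSZNY 2023 paper titled "Korpusztisztítás és sorvégi kötőjelek kezelése karakteralapú neurális nyelvmodellel" (Corpus cleaning and handling of line-final hyphens with a character-based neural language model).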
Please use this commit: 9de4f92.
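A minimal sketch of checking that the working tree is on the commit named above. The helper name matches_commit is hypothetical and not part of the repository; the sketch only assumes that git is installed and that it is run from inside a clone of this repo.

```shell
# Short hash taken from the text above.
expected=9de4f92

# Hypothetical helper: succeeds if the full hash in $1 starts with $expected.
matches_commit() {
    case "$1" in
        "$expected"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Only inspect HEAD when we are inside a git work tree.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    head=$(git rev-parse HEAD 2>/dev/null || true)
    if matches_commit "$head"; then
        echo "OK: HEAD is at $expected"
    else
        echo "HEAD is at '$head'; to switch, run: git checkout $expected"
    fi
fi
```

The sketch only reports the state and prints the checkout command instead of running it, so uncommitted local changes are never touched.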
- preparation
    python3 -m venv venv-clm-dehyph
    source venv-clm-dehyph/bin/activate
    venv-clm-dehyph/bin/python3 -m pip install --upgrade pip
    pip install -r requirements.txt
    make prepare
  7z is needed for this step. Installing the requirements can take 2-3 minutes.
- evaluation
  make eval-small runs the evaluation on a tiny dataset in a few minutes. You will get dehyphenation/eval/*h50_*/eval.txt files matching those in this repo.
  make eval runs the evaluation on the 100,000-line dataset used in the paper. This takes a long time to run, especially on CPU. You will get dehyphenation/eval/*h100000_*/eval.txt files matching those in this repo. They contain the same results as those presented in the paper.
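After an evaluation run, the generated files can be listed and compared with the committed versions. The following is a sketch only: the helper list_eval_files is hypothetical, and the comparison assumes that the eval.txt files are tracked by git in this repo, as the text above suggests.

```shell
# Hypothetical helper: print every eval.txt under a results directory
# whose parent directory name contains "<tag>_".
#   $1: results directory (dehyphenation/eval in this repo)
#   $2: dataset tag (h50 for eval-small, h100000 for eval)
list_eval_files() {
    for f in "$1"/*"$2"_*/eval.txt; do
        # An unmatched glob stays literal, so keep only real files.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# List the files produced by make eval-small.
list_eval_files dehyphenation/eval h50

# Inside the clone, show files under dehyphenation/eval that differ from
# the committed versions or are untracked; no output means no differences.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    git status --porcelain -- dehyphenation/eval || true
fi
```

For the full run, pass h100000 instead of h50 as the second argument.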