In this work, we present the first transformer-based pre-trained models (PTMs) for the Khmer language. We evaluate our models on two downstream tasks: part-of-speech tagging and news categorization; we construct the dataset for the latter task ourselves. In addition, we find that current Khmer word segmentation tools do not improve downstream performance. For more details on our dataset and models, please see our paper "Pre-trained Models and Evaluation Data for the Khmer Language".
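Below is a minimal sketch of how one of our PTMs could be loaded with the Hugging Face transformers library to obtain contextual embeddings for a Khmer sentence. The model identifier "username/khmer-ptm" is a placeholder, not the actual checkpoint name; replace it with the model you downloaded from this repository.

# Minimal usage sketch (assumes the transformers and torch packages are installed).
# "username/khmer-ptm" is a hypothetical model ID used only for illustration.
from transformers import AutoTokenizer, AutoModel

model_name = "username/khmer-ptm"  # placeholder: substitute the real checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a Khmer sentence ("Hello") and run it through the encoder.
inputs = tokenizer("សួស្តី", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)

The same checkpoint can then be fine-tuned for the downstream tasks reported in the paper, e.g. with a token-classification head for part-of-speech tagging or a sequence-classification head for news categorization.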
If you use our models or our dataset, please consider citing our paper:
@article{jiang2021khmer,
  author="Jiang, Shengyi and Fu, Sihui and Lin, Nankai and Fu, Yingwen",
  title="Pre-trained Models and Evaluation Data for the Khmer Language",
  journal="Tsinghua Science and Technology",
  year="2021"
}