This dataset was constructed for the Indonesian grammatical error correction (GEC) task. It contains 13,709 sentences covering 10 part-of-speech (POS) tags. Because few datasets are available for Indonesian GEC, we hope it will help researchers working in this field. For more details about the dataset, please see our paper, “A Framework for Indonesian Grammar Error Correction”.
We revised the original corpus, removing two entries from the "preposition" category and three entries from the "indefinite pronoun" category. After retesting, the results, reported to three decimal places, are consistent with those in the original paper.
If you use our corpus, please consider citing our paper:
@article{10.1145/3440993,
author = {Lin, Nankai and Chen, Boyu and Lin, Xiaotian and Wattanachote, Kanoksak and Jiang, Shengyi},
title = {A Framework for Indonesian Grammar Error Correction},
year = {2021},
issue_date = {June 2021},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {20},
number = {4},
issn = {2375-4699},
url = {https://doi.org/10.1145/3440993},
doi = {10.1145/3440993},
journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
month = may,
articleno = {57},
numpages = {12},
keywords = {Grammatical error correction, word-embedding, Indonesian language processing, low-resource language}
}