GKLMIP/Indonesian-GEC-Corpus

Indonesian-GEC-Corpus

Introduction

This dataset was constructed for the Indonesian grammatical error correction (GEC) task. It contains 13,709 sentences covering 10 POS tags. Because few datasets are available for Indonesian GEC, we hope it will help researchers working in this field. For more details about the dataset, please see our paper “A Framework for Indonesian Grammar Error Correction”.

Version 1.1

We refined the original corpus, removing two entries from the "preposition" category and three entries from the "indefinite pronoun" category. After retesting, the results, rounded to three decimal places, are consistent with those reported in the original paper.

Citation

If you use our corpus, please consider citing our paper:

@article{10.1145/3440993,
author = {Lin, Nankai and Chen, Boyu and Lin, Xiaotian and Wattanachote, Kanoksak and Jiang, Shengyi},
title = {A Framework for Indonesian Grammar Error Correction},
year = {2021},
issue_date = {June 2021},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {20},
number = {4},
issn = {2375-4699},
url = {https://doi.org/10.1145/3440993},
doi = {10.1145/3440993},
journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
month = may,
articleno = {57},
numpages = {12},
keywords = {Grammatical error correction, word-embedding, indonesian language processing, low-resource language}
}
