Indonesian-GEC-Corpus

Introduction

This dataset is constructed for the Indonesian grammatical error correction (GEC) task. It contains 13,709 sentences covering 10 POS tags. Because few datasets are available for Indonesian GEC, we hope this dataset helps researchers working in this field. For more details about the dataset, please see our paper “A Framework for Indonesian Grammar Error Correction”.

v1.1 version

We cleaned the original corpus, removing two entries from the "preposition" category and three entries from the "indefinite pronoun" category. After retesting, the results, rounded to three decimal places, are consistent with those reported in the original paper.
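
The corpus is distributed as archives in this repository (`corpus v1.1.zip` and `corpus.tar.gz`). This README does not describe the files inside them, so the sketch below only lists an archive's members without assuming any particular layout. The helper name `list_archive_members` is hypothetical and is not part of the corpus or the paper.

```python
import tarfile
import zipfile


def list_archive_members(archive_path):
    """Return the sorted member names of a .zip or .tar(.gz) archive.

    Hypothetical helper for inspecting a downloaded corpus archive;
    it makes no assumption about how the corpus files are organized.
    """
    path = str(archive_path)
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return sorted(zf.namelist())
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as tf:
            return sorted(tf.getnames())
    raise ValueError(f"Unsupported archive format: {path}")
```

For example, calling `list_archive_members("corpus v1.1.zip")` should return the file names inside the v1.1 archive, assuming it has been downloaded to the working directory under that name.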

Citation

If you use our corpus, please consider citing our paper:

@article{10.1145/3440993,
author = {Lin, Nankai and Chen, Boyu and Lin, Xiaotian and Wattanachote, Kanoksak and Jiang, Shengyi},
title = {A Framework for Indonesian Grammar Error Correction},
year = {2021},
issue_date = {June 2021},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {20},
number = {4},
issn = {2375-4699},
url = {https://doi.org/10.1145/3440993},
doi = {10.1145/3440993},
journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
month = may,
articleno = {57},
numpages = {12},
keywords = {Grammatical error correction, word-embedding, indonesian language processing, low-resource language}
}

