EVBCorpus - English-Vietnamese Parallel corpus

for Comparative Linguistics, Machine Translation, and Vietnamese NLP tasks

The EVBCorpus contains over 20 million words from 15 bilingual books, 100 parallel English-Vietnamese / Vietnamese-English texts, 250 parallel law and ordinance texts, 5,000 news articles, and 2,000 film subtitles. The composition, annotation, encoding, and availability of the corpus are meant to facilitate the development of language technology and studies in bilingual terminology extraction, primarily for the English-Vietnamese and Vietnamese-English language pairs.

Building the EVBCorpus involves four main steps:

  1. Collect data and align the bitext at the paragraph level;
  2. Align the bitext at the sentence level;
  3. Perform linguistic analysis and tagging;
  4. Annotate and correct the corpus with toolkits.

As a result, the EVBCorpus is aligned at the sentence level, and a part of the corpus containing 5,000 news articles is aligned at the word level by a tool and by annotators.

Release EVBNews v.1.0 with 1,000 parallel documents, download at:

  • https://github.com/qhungngo/EVBCorpus/blob/master/EVBCorpus_EVBNews_v1.0.rar
  • https://sites.google.com/a/uit.edu.vn/hungnq/evbcorpus/EVBCorpus_EVBNews_v1.0.rar?attredirects=0&d=1

**Release EVBNews v.2.0 with 1,000 word-aligned parallel documents, download at:**

  • https://github.com/qhungngo/EVBCorpus/blob/master/EVBCorpus_EVBNews_v2.0.rar
  • https://sites.google.com/a/uit.edu.vn/hungnq/evbcorpus/EVBCorpus_EVBNews_v2.0.rar?attredirects=0&d=1

If you are interested in the corpus, please email hungnq(at)uit.edu.vn for more details.

Details of the upgraded EVBCorpus v.2.0 (2018):

| Source | Documents | Paragraphs | Sentences | Words |
|---|---:|---:|---:|---:|
| Books | 15 | 14,195 | 61,167 | 1,335,180 |
| Fictions | 100 | 192,898 | 489,787 | 6,129,161 |
| Laws | 250 | 86,848 | 98,064 | 1,981,932 |
| ETests | 500 | 20,288 | 21,575 | 411,093 |
| News | 5,000 | 94,933 | 173,903 | 2,965,590 |
| Subtitles | 2,000 | 1,302,839 | 1,447,581 | 8,150,080 |
| Total | 7,865 | 1,712,001 | 2,292,077 | 20,973,036 |
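
As a quick sanity check on the v.2.0 figures, the per-source counts can be summed and compared with the Total row. The snippet below is a minimal Python sketch; the dictionary layout and the names `V2_COUNTS`, `V2_TOTAL`, `column_sums`, and `words_per_sentence` are illustrative only and are not part of the corpus distribution.

```python
# Per-source counts copied from the EVBCorpus v.2.0 table:
# (documents, paragraphs, sentences, words)
V2_COUNTS = {
    "Books":     (15, 14_195, 61_167, 1_335_180),
    "Fictions":  (100, 192_898, 489_787, 6_129_161),
    "Laws":      (250, 86_848, 98_064, 1_981_932),
    "ETests":    (500, 20_288, 21_575, 411_093),
    "News":      (5_000, 94_933, 173_903, 2_965_590),
    "Subtitles": (2_000, 1_302_839, 1_447_581, 8_150_080),
}
# Total row as stated in the table.
V2_TOTAL = (7_865, 1_712_001, 2_292_077, 20_973_036)


def column_sums(counts):
    """Sum each column (documents, paragraphs, ...) across all sources."""
    return tuple(sum(column) for column in zip(*counts.values()))


def words_per_sentence(counts):
    """Average number of words per sentence for each source, as counted in the table."""
    return {name: row[3] / row[2] for name, row in counts.items()}


if __name__ == "__main__":
    # The listed rows should add up to the stated totals.
    assert column_sums(V2_COUNTS) == V2_TOTAL
    for name, average in words_per_sentence(V2_COUNTS).items():
        print(f"{name:<10} {average:5.1f} words per sentence")
```

The same pattern can be reused for any of the other tables by swapping in their rows.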

Details of data sources of EVBCorpus v.1.0 (2012):

| Source | Documents | Paragraphs | Sentences | Words |
|---|---:|---:|---:|---:|
| Books | 15 | 13,980 | 80,323 | 1,375,492 |
| Fictions | 100 | 192,723 | 491,703 | 6,307,613 |
| Laws | 250 | 86,803 | 98,102 | 1,912,055 |
| News | 1,000 | 24,523 | 45,531 | 740,534 |
| Total | 1,365 | 318,029 | 715,659 | 10,431,592 |

English-Vietnamese Word Alignment Corpus (EVWACorpus)

The EVWACorpus contains 1,000 news articles with 45,531 sentence pairs and 740,534 English words, manually aligned at the word level between the English and Vietnamese sentences. Details of the EVWACorpus:

| | English | Vietnamese |
|---|---:|---:|
| Files | 1,000 | 1,000 |
| Sentences | 45,531 | 45,531 |
| Words | 740,534 | 832,441 |
| Sure Alignments | 447,906 | 447,906 |
| Possible Alignments | 560,215 | 560,215 |
| Words in Alignments | 654,060 | 768,031 |
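
Sure and possible alignment links are commonly used to score automatic word aligners with the alignment error rate (AER) of Och and Ney, where A is the set of links proposed by the aligner, S the sure links, and P the possible links (conventionally including S): AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The following Python sketch shows that computation on a toy sentence pair. The representation of links as (English index, Vietnamese index) tuples and the function name `alignment_error_rate` are assumptions made for illustration, not the file format or tooling of the EVWACorpus.

```python
def alignment_error_rate(predicted, sure, possible):
    """Alignment error rate (AER) for one sentence pair or a whole test set.

    Each argument is a set of (english_index, vietnamese_index) links.
    Sure links are treated as a subset of the possible links.
    """
    possible = possible | sure  # make sure every sure link is also possible
    if not predicted and not sure:
        return 0.0  # nothing to align and nothing proposed
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))


# Toy example with hypothetical links (not taken from the corpus).
sure = {(0, 0), (1, 1)}
possible = {(0, 0), (1, 1), (2, 1)}
predicted = {(0, 0), (2, 1), (2, 2)}

print(f"AER = {alignment_error_rate(predicted, sure, possible):.2f}")
```

An aligner that proposes exactly the sure links scores 0, and proposing additional links that are only possible does not increase the error.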

English-Vietnamese Chunker Corpus (EVChkCorpus)

The EVChkCorpus contains 1,000 news articles with 45,531 sentence pairs. It is annotated with five basic chunk tags in both the English and Vietnamese text. Details of the EVChkCorpus:

| Tag | Name | English | Vietnamese |
|---|---|---:|---:|
| NP | Noun Phrase | 212,500 | 209,824 |
| VP | Verb Phrase | 90,784 | 123,600 |
| PP | Prepositional Phrase | 79,853 | 70,457 |
| ADVP | Adverb Phrase | 18,318 | |
| ADJP | Adjective Phrase | 8,367 | 15,104 |

English-Vietnamese Named Entities Corpus (EVNECorpus)

The EVNECorpus contains 1,000 news articles with 45,531 sentence pairs. It is annotated with named entities in both the English and Vietnamese text. Details of the EVNECorpus:

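| Label | Name | English | Vietnamese |
|---|---|---:|---:|
| LOC | Location | 10,115 | 10,006 |
| PER | Person | 6,869 | 6,741 |
| ORG | Organization | 7,837 | 7,549 |
| PCT | Percentage | 1,107 | 921 |
| MON | Money | 898 | 823 |
| TIM | Time | 4,244 | 4,100 |
| - | Total | 35,879 | 34,732 |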

The canonical publications for EVBNews and EVBCorpus are:

Quoc Hung Ngo, Werner Winiwarter, and Bartholomaus Wloka (2013). "EVBCorpus - A Multi-Layer English-Vietnamese Bilingual Corpus for Studying Tasks in Comparative Linguistics", In Proceedings of the 11th Workshop on Asian Language Resources (11th ALR within the IJCNLP2013), pp. 1-9. Asian Federation of Natural Language Processing, 2013.

Quoc-Hung Ngo, Werner Winiwarter (2012). "Building an English-Vietnamese Bilingual Corpus for Machine Translation", International Conference on Asian Language Processing 2012 (IALP 2012), pp. 157-160. IEEE Computer Society, 2012.

The canonical publication for the EVNECorpus is:

Quoc Hung Ngo, Dinh Dien, and Werner Winiwarter (2014). "Building English-Vietnamese Named Entity Corpus with Aligned Bilingual News Articles", The 5th Workshop on South and Southeast Asian Natural Languages Processing (5th SSANLP within the COLING2014). Association for Computational Linguistics, 2014.

The canonical publication for the Annotation Tool is:

Quoc-Hung Ngo, Werner Winiwarter (2012). "A Visualizing Annotation Tool for Semi-Automatically Building a Bilingual Corpus", In Proceedings of the 5th Workshop on Building and Using Comparable Corpora, LREC2012 Workshop, pages 67-74. Association for Computational Linguistics, 2012.

The canonical publication for the GetWebContent tool is:

Quoc-Hung Ngo, Dinh Dien, Werner Winiwarter (2012). "Automatic Searching for English-Vietnamese Documents on the Internet", The 3rd Workshop on South and Southeast Asian Natural Languages Processing (3rd SSANLP within the COLING2012), pp. 211-220. Association for Computational Linguistics, 2012.

Used for academic purposes in:

  • Trieu, Hai Long, Vu Tran, and Nguyen Le Minh. "Investigating phrase-based and neural-based machine translation on low-resource settings." Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation. 2017.
  • Trieu, Long Hai. "A Study On Machine Translation For Low-Resource Languages". Thesis of Doctor of Philosophy, JAIST, 2017.
  • Phuoc, Nguyen Quang, Yingxiu Quan, and Cheol-Young Ock. "Building a bidirectional English-Vietnamese statistical machine translation system by using MOSES." International Journal of Computer and Electrical Engineering 8.2 (2016): 161.
  • Song Cong Nguyen Duc; Q.Hung Ngo; JIAMTHAPTHAKSIN, Rachsuda. State-of-the-art Vietnamese word segmentation. In: Science in Information Technology (ICSITech), 2016 2nd International Conference on. IEEE, 2016. p. 119-124.
  • Nguyen, L. H., Dinh, D., & Tran, P. (2016). An Approach to Construct a Named Entity Annotated English-Vietnamese Bilingual Corpus. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 16(2), 9.
  • Dawborn, Timothy James. "DOCREP: Document Representation for Natural Language Processing." Thesis of Doctor of Philosophy, The University of Sydney, 2015.
  • Lam, Khang Nhut. "Automatically creating multilingual lexical resources." Proceedings of the Nineteenth AAAI/SIGAI Doctoral Consortium. 2014.
  • Huy, Dang Ngoc, and Pusadee Seresangtakul. "Vietnamese-Thai Lexicon for Machine Translation." The Tenth Symposium on Natural Language Processing (SNLP2013), Phuket, Thailand. 2013.
  • GIANG, Lam Tung; HUNG, Vo Trung; PHAP, Huynh Cong. Experiments with query translation and re-ranking methods in Vietnamese-English bilingual information retrieval. In: Proceedings of the Fourth Symposium on Information and Communication Technology. ACM, 2013. p. 118-122.

If you are interested in the corpus, please email hungnq(at)uit.edu.vn for more details.