- Sentence Segmentation
- Word Segmentation
- Part-of-Speech Tagging
- Named Entity Recognition
- Text classification
pip install khmer-nltk
For evaluation results of khmer-nltk's functionality, please refer to each sub-module's README.
>>> from khmernltk import sentence_tokenize
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(sentence_tokenize(raw_text))
['ខួបឆ្នាំទី២៨!', '២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី']
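Sentence and word segmentation are often chained, so that each sentence ends up with its own token list. The following is a minimal sketch of that composition; `tokenize_document` is a hypothetical helper and not part of khmer-nltk. The two tokenizers are passed in as arguments, so the helper itself has no dependency on the library.

```python
from typing import Callable, List


def tokenize_document(
    text: str,
    split_sentences: Callable[[str], List[str]],
    split_words: Callable[[str], List[str]],
) -> List[List[str]]:
    """Split text into sentences, then split each sentence into tokens."""
    return [split_words(sentence) for sentence in split_sentences(text)]


# Intended use with khmer-nltk (assumes `pip install khmer-nltk`):
#   from khmernltk import sentence_tokenize, word_tokenize
#   tokenize_document(
#       raw_text,
#       sentence_tokenize,
#       lambda s: word_tokenize(s, return_tokens=True),
#   )
```

Passing the tokenizers as callables also makes it easy to swap in a different segmenter for comparison.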
>>> from khmernltk import word_tokenize
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(word_tokenize(raw_text, return_tokens=True))
['ខួប', 'ឆ្នាំ', 'ទី', '២៨', '!', ' ', '២៣', ' ', 'តុលា', ' ', 'ស្មារតី', 'ផ្សះផ្សា', 'ជាតិ', 'រវាង', 'ខ្មែរ', 'និង', 'ខ្មែរ', ' ', 'ឈាន', 'ទៅ', 'បញ្ចប់', 'សង្រ្គាម', ' ', 'នាំ', 'ពន្លឺ', 'សន្តិភាព', ' ', 'និង', 'ការរួបរួម', 'ជាថ្មី']
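The token list in this example keeps whitespace as separate `' '` tokens. If downstream code only needs the words and punctuation, those can be filtered out. This is a small sketch operating on a hard-coded excerpt of the output; `drop_whitespace_tokens` is a hypothetical helper, not a khmer-nltk function.

```python
from typing import List


def drop_whitespace_tokens(tokens: List[str]) -> List[str]:
    """Return the tokens that contain at least one non-whitespace character."""
    return [token for token in tokens if token.strip()]


# First few tokens of the example sentence, whitespace tokens included.
tokens = ['ខួប', 'ឆ្នាំ', 'ទី', '២៨', '!', ' ', '២៣', ' ', 'តុលា']
words = drop_whitespace_tokens(tokens)
print(words)
```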
>>> from khmernltk import pos_tag
>>> raw_text = "ខួបឆ្នាំទី២៨! ២៣ តុលា ស្មារតីផ្សះផ្សាជាតិរវាងខ្មែរនិងខ្មែរ ឈានទៅបញ្ចប់សង្រ្គាម នាំពន្លឺសន្តិភាព និងការរួបរួមជាថ្មី"
>>> print(pos_tag(raw_text))
[('ខួប', 'n'), ('ឆ្នាំ', 'n'), ('ទី', 'n'), ('២៨', '1'), ('!', '.'), (' ', 'n'), ('២៣', '1'), (' ', 'n'), ('តុលា', 'n'), (' ', 'n'), ('ស្មារតី', 'n'), ('ផ្សះផ្សា', 'n'), ('ជាតិ', 'n'), ('រវាង', 'o'), ('ខ្មែរ', 'n'), ('និង', 'o'), ('ខ្មែរ', 'n'), (' ', 'n'), ('ឈាន', 'v'), ('ទៅ', 'v'), ('បញ្ចប់', 'v'), ('សង្រ្គាម', 'n'), (' ', 'n'), ('នាំ', 'v'), ('ពន្លឺ', 'n'), ('សន្តិភាព', 'n'), (' ', 'n'), ('និង', 'o'), ('ការរួបរួម', 'n'), ('ជាថ្មី', 'o')]
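In the sample output, whitespace tokens are tagged `'n'` like ordinary nouns, which would inflate any tag statistics. The sketch below counts tags while skipping whitespace tokens. It runs on a hard-coded excerpt of the tagged output, and `tag_frequencies` is a hypothetical helper rather than part of khmer-nltk.

```python
from collections import Counter
from typing import List, Tuple


def tag_frequencies(tagged: List[Tuple[str, str]]) -> Counter:
    """Count POS tags, ignoring tokens that consist only of whitespace."""
    return Counter(tag for token, tag in tagged if token.strip())


# First few (token, tag) pairs of the example sentence.
tagged = [('ខួប', 'n'), ('ឆ្នាំ', 'n'), ('ទី', 'n'), ('២៨', '1'),
          ('!', '.'), (' ', 'n'), ('២៣', '1')]
counts = tag_frequencies(tagged)
print(counts.most_common())
```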
@misc{hoang-khmer-nltk,
author = {Phan Viet Hoang},
title = {Khmer Natural Language Processing Toolkit},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/VietHoang1512/khmer-nltk}}
}
- stopes: A library for preparing data for machine translation research
- LASER: Language-Agnostic SEntence Representations
- Pretrained Models and Evaluation Data for the Khmer Language
- Multilingual Open Text 1.0: Public Domain News in 44 Languages
- ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System
- Shared Task on Cross-lingual Open-Retrieval QA
- No Language Left Behind: Scaling Human-Centered Machine Translation
- Wordless
- A Simple and Fast Strategy for Handling Rare Words in Neural Machine Translation
- NLP: Text Segmentation Using Conditional Random Fields
- Khmer Word Segmentation Using Conditional Random Fields
- Word Segmentation of Khmer Text Using Conditional Random Fields
- Prof. Huong Le Thanh