MudYom is a module for pre- and post-processing text. It ties together (มัด, Thai for "to bind") words that belong together, merging them into a single token according to a user-defined dictionary.
Because it is still in beta, it has to be installed directly from GitHub:
$ pip install git+https://github.com/pythainlp/mudyom.git
Once installed, run it from the command line:
$ mudyom-cli --input "..." --dictionary "..." --output "..."
Remark: Entries in the dictionary should be sorted from longest to shortest.
If yours are not, you can sort the dictionary with the command below:
$ cat dictionary.txt | awk '{ print length, $0 }' | sort -g -r | cut -d" " -f2- > sorted_dictionary.txt
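If you would rather sort the dictionary from Python, the same longest-first ordering can be sketched as below. This is only an illustration and not part of MudYom: the function names `sort_longest_first` and `sort_dictionary_file` are hypothetical, and the sketch assumes a UTF-8 dictionary file with one entry per line. Depending on the awk implementation and locale, `length` may count bytes rather than characters, whereas Python's `len` counts Unicode code points.

```python
from pathlib import Path


def sort_longest_first(entries):
    """Return non-empty entries sorted from longest to shortest.

    Length is measured in Unicode code points (Python's len), so Thai
    vowel and tone marks each count as one unit.
    """
    cleaned = [entry.strip() for entry in entries if entry.strip()]
    # sorted() is stable, so entries of equal length keep their original order.
    return sorted(cleaned, key=len, reverse=True)


def sort_dictionary_file(src, dst):
    """Hypothetical helper: read one entry per line from src, write sorted to dst."""
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    Path(dst).write_text("\n".join(sort_longest_first(lines)) + "\n", encoding="utf-8")
```

For example, `sort_dictionary_file("dictionary.txt", "sorted_dictionary.txt")` would write the sorted entries to a new file and leave the original untouched.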
# input.txt
ฉัน|ขวัญ|หนี|ตี|ฝ่อ|ใจ|สลาย
# dictionary.txt
หลบลี้
คิดถึง
ตีฝ่อ
# output.txt
ฉัน|ขวัญ|หนี|ตีฝ่อ|ใจ|สลาย
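To make the example above concrete, here is a minimal sketch of one way such merging could work. It is an assumption for illustration, not MudYom's actual implementation: the function `merge_tokens` and its `max_span` parameter are hypothetical. At each position it tries the longest run of adjacent tokens first and merges the run if the joined string is a dictionary entry.

```python
def merge_tokens(tokens, vocab, max_span=8):
    """Greedily merge adjacent tokens whose concatenation is in vocab.

    max_span (hypothetical parameter) caps how many tokens are joined at once.
    """
    vocab = set(vocab)
    merged = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate span first, down to a span of two tokens.
        for j in range(min(len(tokens), i + max_span), i + 1, -1):
            if "".join(tokens[i:j]) in vocab:
                match_end = j
                break
        if match_end is None:
            # No dictionary entry starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
        else:
            merged.append("".join(tokens[i:match_end]))
            i = match_end
    return merged


tokens = "ฉัน|ขวัญ|หนี|ตี|ฝ่อ|ใจ|สลาย".split("|")
dictionary = ["หลบลี้", "คิดถึง", "ตีฝ่อ"]
print("|".join(merge_tokens(tokens, dictionary)))
```

With the sample input and dictionary, this sketch joins ตี and ฝ่อ into ตีฝ่อ and leaves the other tokens as they are, matching the output shown above.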
Name | Vocabulary Size | Author |
---|---|---|
Food and Restaurant Menus | ~400k | Wongnai |
Names and Acronyms | ~2k | Thachaparn Bunditlurdruk |
Named Entities in BEST | .. | .. |
- The implementation of this module is largely drawn from Wongnai's post, written by Ekkalak Thongthanomkul.