https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2607
Using the Wikipedia portion of the Vietnamese data linked above.
Train Size: 819,918 sentences
Dev Size: 14,884 sentences
Test Size: 30,000 sentences
The data comes lightly preprocessed (1 sentence per line, all lowercased).
To get syllables/words from raw sentences, split on the regex r'[\s\W]+' (runs of whitespace and non-word characters).
Replace any syllable containing a digit (0-9) with the NUMERIC token ('#').
When treating sequences as syllable/word sequences, prepend SENT_START token ('@') as first token in all sentences. E.g. ['@', 'hom', 'nay', ...].
When treating sequences as char sequences, prepend each word's sequence of characters with WORD_START token ('^'). E.g. ['^', 'h', 'o', 'm', '^', 'n', 'a', 'y', ...].
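The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the original code: the function names (tokenize, as_syllables, as_chars) are hypothetical, and it assumes the input is NFC-normalized, since Python's \W would otherwise split on standalone combining marks.

```python
import re

NUMERIC = "#"      # replaces any syllable containing a digit
SENT_START = "@"   # first token of every syllable/word sequence
WORD_START = "^"   # precedes each word's characters in char sequences

_SPLIT = re.compile(r"[\s\W]+")
_DIGIT = re.compile(r"[0-9]")


def tokenize(sentence):
    """Split a raw sentence into syllables and map numeric ones to '#'."""
    # re.split can yield empty strings at the edges; drop them.
    tokens = [t for t in _SPLIT.split(sentence) if t]
    return [NUMERIC if _DIGIT.search(t) else t for t in tokens]


def as_syllables(sentence):
    """Syllable/word view: SENT_START followed by the tokens."""
    return [SENT_START] + tokenize(sentence)


def as_chars(sentence):
    """Char view: each word's characters preceded by WORD_START."""
    chars = []
    for tok in tokenize(sentence):
        chars.append(WORD_START)
        chars.extend(tok)
    return chars
```

For example, as_syllables("hom nay") gives ['@', 'hom', 'nay'] and as_chars("hom nay") gives ['^', 'h', 'o', 'm', '^', 'n', 'a', 'y'].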
- Word-word / syllable-syllable: observed sequences are asciified syllables, target tags are diacritized syllables
- Char-char: observed sequences are asciified characters, target tags are diacritized characters
- Char-diac: observed sequences are asciified characters, target tags are diacritic marks for those characters
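One possible way to derive the observed/target pairs for these setups is sketched below. It is an assumption-laden illustration: the notes do not say how asciification was done or how diacritic tags are encoded. Here diacritics are stripped via Unicode NFD decomposition, the tag is the string of combining marks, and "STROKE" is a made-up tag for the letter đ (U+0111), which has no canonical decomposition. The names split_char, asciify, and char_diac_pair are hypothetical.

```python
import unicodedata


def split_char(ch):
    """Return (ascii_base, diacritic_tag) for a single NFC character."""
    if ch == "\u0111":  # 'đ' does not decompose under NFD
        return "d", "STROKE"  # hypothetical tag name
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = "".join(c for c in decomposed if unicodedata.combining(c))
    return base, marks  # marks is "" when there is no diacritic


def asciify(token):
    """Observed form of a diacritized syllable (word-word, char-char)."""
    token = unicodedata.normalize("NFC", token)
    return "".join(split_char(c)[0] for c in token)


def char_diac_pair(token):
    """Char-diac view: (asciified characters, diacritic tag per character)."""
    token = unicodedata.normalize("NFC", token)
    pairs = [split_char(c) for c in token]
    return [b for b, _ in pairs], [m for _, m in pairs]
```

For the word-word and char-char setups, the targets are simply the original diacritized tokens or characters, and asciify supplies the observed side.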
- Most Frequent: assign each input token its most frequently observed tag. Unseen input tokens are kept as-is.
- HMM: choose most probable path through tags based on tag emission probabilities and transition probabilities. Unseen input tokens are kept as-is.
- BERT: BERT + classifier architecture. Only used for the char-diac setup.
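As an illustration of the simplest model above, the Most Frequent baseline can be sketched as a lookup table built from training counts. This is a sketch under assumptions, not the original implementation: the function names are hypothetical, training data is assumed to be an iterable of (observed_sequence, tag_sequence) pairs of equal length, and ties are broken by whichever tag was seen first.

```python
from collections import Counter, defaultdict


def train_most_frequent(pairs):
    """Map each observed token to the tag it was most often paired with."""
    counts = defaultdict(Counter)
    for observed, tags in pairs:
        for obs, tag in zip(observed, tags):
            counts[obs][tag] += 1
    return {obs: c.most_common(1)[0][0] for obs, c in counts.items()}


def predict_most_frequent(model, observed):
    """Tag a sequence; unseen tokens are kept as-is."""
    return [model.get(obs, obs) for obs in observed]
```

The same two functions apply to any of the three setups, since they only see opaque observed tokens and tags. Note that in the char-diac setup, "kept as-is" for an unseen character would need to mean an empty diacritic tag rather than the character itself.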