Adding Thai rules for CV Sentence Extractor #137
Conversation
What is the general recommendation for numbers (0-9), btw? I see languages like en and de allow them, but a language like ka doesn't.
Thanks for your efforts here. This shows perfectly well how broken the sentence segmentation is for some languages :( There's #11 already on file for this issue. I've also created a discussion/proposal at https://discourse.mozilla.org/t/future-of-the-sentence-extractor-your-input-is-required/78139 .
Reviewed 184 samples from the currently extracted sentences and got "OK" for 88%. The rest of the errors are mostly due to a "dangling word": a word that was meant to be the first/last word of the next/previous sentence but got incorrectly included in the sentence in question (probably due to a space). I updated the first comment with an error table.
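A minimal sketch of how such a review sample can be drawn (the file name is an assumption; the verdicts themselves were assigned by hand):

```python
# Draw a random sample of extracted sentences for manual review.
# "wiki.th.txt" is assumed to be the rule-filtered output discussed here.
import random

with open("wiki.th.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# 184 samples were reviewed in the comment above.
for sentence in random.sample(sentences, k=min(184, len(sentences))):
    print(sentence)
```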
… pasted from some text editors (like MS Word and iOS Notes) - Simplify rules to reflect the fact that `replacements` will run before other rules
Continuing from the discussion in #139 (comment), I'm thinking of one possible way to extract Thai sentences and guarantee the 3-sentence limit: a sentence splitter that works with the JSON files inside the WikiExtractor output. The sentence splitter will read those files and split their text into sentences. I will try to have a prototype of this. If successful, this will work on top of the current pipeline and expand it with a pre-splitting step (see the sketch below).
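A hypothetical sketch of that pre-splitting prototype, assuming WikiExtractor's `--json` output (one JSON object per line with a `text` field) and PyThaiNLP's `sent_tokenize` as the splitting algorithm; paths, file layout, and the engine choice are all assumptions:

```python
# Pre-split the "text" field of WikiExtractor JSON output into one
# sentence per line, so the extractor only ever sees short lines.
import json
from pathlib import Path

from pythainlp.tokenize import sent_tokenize  # crfcut is one available engine


def split_sentences(text: str) -> list[str]:
    # Placeholder for whichever Thai-capable splitter ends up working.
    return sent_tokenize(text, engine="crfcut")


def presplit_file(path: Path) -> None:
    out_lines = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            sentences: list[str] = []
            for paragraph in doc["text"].splitlines():
                sentences.extend(split_sentences(paragraph))
            doc["text"] = "\n".join(sentences)
            out_lines.append(json.dumps(doc, ensure_ascii=False))
    path.write_text("\n".join(out_lines) + "\n", encoding="utf-8")


for wiki_file in Path("../wikiextractor/text").glob("*/wiki_*"):
    presplit_file(wiki_file)
```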
@bact I've created a proof of concept to use a Python-based sentence splitting algorithm, to make sure that the Sentence Extractor can also be used for languages that …
The segmenter PR has now been merged; check out https://github.com/common-voice/cv-sentence-extractor#using-a-different-segmenter-to-split-sentences for more info. Looking forward to hearing if that helps with Thai :)
Thank you @MichaelKohler . The new segmenter option is welcome. I think this will make the pipeline more standardized, even with different language-specific processors. Will take a closer look at this.
I initially thought that crfcut might work for this, but after several tries and inspections of the split text, some of the output starts or ends with an ill-formed word, very likely because the text got segmented at an invalid point (like before a following vowel: ก|า ). Currently trying to see if I can have a wrapper to post-process the output from crfcut (a rough sketch below), or whether there is another alternative.
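A rough sketch of what such a wrapper could look like, assuming PyThaiNLP's crfcut engine; the set of characters treated as "cannot start a Thai word" is a simplified assumption, not an exhaustive list:

```python
# Post-process crfcut output: if a segment starts with a character that
# cannot begin a Thai word (a following vowel or a combining mark), the
# cut most likely landed inside a word (e.g. ก|า), so glue the segment
# back onto the previous one.
from pythainlp.tokenize import sent_tokenize

# Following vowels (ะ า ำ), above/below vowels, tone marks, thanthakhat.
INVALID_START = set(
    "ะาำ\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39\u0e47\u0e48\u0e49\u0e4a\u0e4b\u0e4c"
)


def split_thai(text: str) -> list[str]:
    merged: list[str] = []
    for segment in sent_tokenize(text, engine="crfcut"):
        if merged and segment and segment[0] in INVALID_START:
            merged[-1] += segment  # ill-formed start: re-attach
        else:
            merged.append(segment)
    return merged
```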
th.toml
- other_patterns: borrowed from https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/languages/th.js (BEGIN_REGEX, END_REGEX, STRUCTURE_REGEX, and ABBREVIATION_REGEX, with a few adjustments)
- replacements: borrowed from https://github.com/common-voice/sentence-collector/blob/main/server/lib/cleanup/languages/th.js (with some adjustments)
- min_word_count and max_word_count are set on the basis of treating a "word" as a group of characters between two whitespaces/punctuation marks, since there is currently no Thai word tokenization in the extractor (see the illustration below).

This will close #133.
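For illustration, that "word" definition amounts to something like the following (a sketch only; the extractor's actual counting is done in Rust):

```python
# Count "words" as runs of characters between whitespace/punctuation.
# For unsegmented Thai, a whole phrase still counts as a single "word".
import re

def count_words(sentence: str) -> int:
    return len([w for w in re.split(r"[\s.,;:!?\"'()\[\]«»]+", sentence) if w])

print(count_words("สวัสดีครับ"))            # 1, despite being several Thai words
print(count_words("Hello world, again"))   # 3
```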
How many sentences did you get at the end?
478
How did you create the blocklist file?
Since the current tokenizer does not work well with a language that uses no spaces as word delimiters, cvtools doesn't seem to work, so I haven't created one. (A possible workaround is sketched below.)
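One possible workaround, sketched under assumptions: PyThaiNLP's newmm tokenizer stands in for the whitespace-based counting that cvtools does, the less-than-80-occurrences threshold is only an example, and the file names are made up:

```python
# Build a frequency-based blocklist using a real Thai word tokenizer
# instead of whitespace splitting.
from collections import Counter

from pythainlp.tokenize import word_tokenize

counts = Counter()
with open("wiki.th.all.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(
            w for w in word_tokenize(line.strip(), engine="newmm") if w.strip()
        )

# Block rare words; 80 here is only an example threshold.
with open("th_blocklist.txt", "w", encoding="utf-8") as out:
    for word, n in counts.items():
        if n < 80:
            out.write(word + "\n")
```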
Review / Error ratio
(from 184 samples)
"D" are mostly sentences with a "dangling word" in the beginning (it is meant to be a last word in the previous sentence).
Since the total number of sentences I have is just below 500, and the suggested number of random samples is "100-500", I'm not sure whether the number of sentences I have is unexpectedly low or not.
I would like to clarify this before I ask more people for review.
(I may have to "relax" the rules, but I'm still not sure whether this is related to the way the punkt sentence tokenizer works.)
The extracted sentences are here: https://docs.google.com/spreadsheets/d/1pKBH_YQiO9ZdXIduvrb37HvCLlKBt8mGeDCpX8e8dT4/edit?usp=sharing
Questions
Does the original number of articles in Wikipedia also affect the number of extracted sentences?
Tried to extract all the articles, without applying any rules, using this command:

```
cargo run -- extract -l th -d ../wikiextractor/text/ --no_check >> wiki.th.all.txt
```

Got this:

```
$ wc -l wiki.th.*
 1985699 wiki.th.all.txt
     478 wiki.th.txt
```
We actually have a lot of lines extracted in wiki.th.all.txt (1,314,274 lines after blank lines are removed), but it looks like these "sentences" tend to be very long. In fact, a lot of lines contain more than one sentence (sometimes a whole paragraph). And the longer the line/sentence is, the more likely it is to get hit by one of the disallowing rules.
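A quick way to back up the "very long lines" observation (a sketch; assumes the wiki.th.all.txt produced by the command above):

```python
# Print simple length statistics for the non-blank unfiltered lines.
lengths = sorted(
    len(line.strip())
    for line in open("wiki.th.all.txt", encoding="utf-8")
    if line.strip()
)

print("lines:", len(lengths))
print("median chars per line:", lengths[len(lengths) // 2])
print("90th percentile:", lengths[int(len(lengths) * 0.9)])
```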
A few sample lines from wiki.th.all.txt (with no rules applied): […]

I guess if we can make the lines shorter, we can get more extracted sentences in wiki.th.txt.
Need some suggestions here. Thank you.