add korean sentence validation & cleanup #630
Conversation
Thanks for contributing these. I have added a few comments.
}, {
  regex: /[a-zA-Z]/,
  error: 'Sentence should not contain latin alphabet characters',
}, {
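For context, here is a minimal sketch of how a regex rule list like the one in the diff is typically applied. The rule mirrors the diff above, but the surrounding function is an assumption, not the project's actual code:

```javascript
// Sketch of a rule-list validator; the rule mirrors the diff above,
// but this wrapper function is hypothetical.
const rules = [
  {
    regex: /[a-zA-Z]/,
    error: 'Sentence should not contain latin alphabet characters',
  },
];

function validateSentence(sentence) {
  // Collect the error message of every rule whose regex matches.
  return rules
    .filter((rule) => rule.regex.test(sentence))
    .map((rule) => rule.error);
}

console.log(validateSentence('안녕하세요')); // []
console.log(validateSentence('hello 안녕')); // [ 'Sentence should not contain latin alphabet characters' ]
```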
I think the three rules here can be removed, as the last rule in this list would catch those anyway. What do you think?
Thank you for the feedback. About normalization: normalization was intended for contributors who use a certain kind of keyboard layout, called "Sebeolsik". Korean letters (Hangul) in Unicode have two kinds of codepoints.
The issue goes away if the sentence is normalized with composition normalization (NFC) before it is submitted to the database, but I couldn't find an existing way to normalize the sentence before submission. The cleanup script was added to work around this while supporting minor cases (and I don't think that is a clean way to do it either). So it might be handled in one of two ways:
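The two kinds of Hangul codepoints can be illustrated with a small sketch (the example syllable is mine, not from the PR):

```javascript
// The same syllable 한 in its two Unicode representations.
const decomposed = '\u1112\u1161\u11AB'; // conjoining jamo: ᄒ + ᅡ + ᆫ (3 code points)
const composed = '\uD55C';               // precomposed syllable 한 (1 code point)

console.log(decomposed === composed);                  // false: different code points
console.log(decomposed.normalize('NFC') === composed); // true: NFC composes the jamo
```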
About character length: I checked one of the "Public for Korean" datasets from aihub.or.kr (the validation set, 10 random sentences; without actually listening to them, trusting the metadata to be correct):
605 characters / 72.58 seconds ≈ 8.34 characters / second
Thanks for the explanation. I'm now checking if we could instead add NFC somewhere else (even before validation) to make this simpler. I will report back.
Wouldn't it be better to increase the max sentence length? Something like 70?
I have checked with others, and per-language NFC normalization is something we want to support. Therefore I implemented that capability here: 5a86a81. I've already enabled it for Korean; however, it's not on the live website yet. I will create a new release once we have the validation here as well. This means:
Would be great if you could update your branch to be based on the latest. Happy to provide any more context if needed, and thanks for the suggestion!
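A rough sketch of what per-language NFC normalization before validation could look like. The config shape and names here are assumptions; the actual implementation is in the commit linked above:

```javascript
// Hypothetical per-locale normalization map; 'ko' -> NFC as discussed above.
const NORMALIZATIONS = { ko: 'NFC' };

function normalizeSentence(sentence, locale) {
  const form = NORMALIZATIONS[locale];
  // Apply the configured normalization form, or pass the sentence through.
  return form ? sentence.normalize(form) : sentence;
}

// Decomposed jamo for 한 become the precomposed syllable for Korean,
// while locales without an entry are left untouched.
console.log(normalizeSentence('\u1112\u1161\u11AB', 'ko') === '\uD55C'); // true
console.log(normalizeSentence('\u1112\u1161\u11AB', 'en') === '\u1112\u1161\u11AB'); // true
```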
Not by too much though. 70 would mean roughly 8.5 seconds on average. If somebody reads/speaks 20% slower than average, then it would already hit the 10 seconds hard limit.
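The arithmetic behind that estimate, using the rate measured above (a back-of-the-envelope check, not project code):

```javascript
// Back-of-the-envelope check of the 70-character limit discussed above.
const charsPerSecond = 605 / 72.58;              // ~8.34, from the dataset sample
const maxLength = 70;

const avgDuration = maxLength / charsPerSecond;  // ~8.40 s at average speed
const slowDuration = avgDuration / 0.8;          // reading 20% slower: ~10.50 s

console.log(avgDuration.toFixed(2));  // "8.40"
console.log(slowDuration.toFixed(2)); // "10.50" (over the 10 s hard limit)
```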
Marking this as "request changes", see my previous comment.
Sorry, closing this PR was a mistake; I'm not familiar with forked GitHub repositories and pressed the "sync" with discard button. I'll reopen after some modification.
Thanks!
🎉 This PR is included in version 2.18.0 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
This adds Korean validation and cleanup.
[.,?!]