[7.16] [ML] Fix language identification bug when multi-languages are present (#80675) (#80707)

* [ML] Fix language identification bug when multi-languages are present (#80675)

Language identification works fairly well when only one language and script type is present. But when multiple are present, it can return unexpected results.

Example: "행 레이블 this is english text obviously and 생성 tom said to test it"

To a human, this is English text (Latin script) with some Korean (Hangul script), yet it is erroneously categorized as Japanese. It should be categorized as English, as that is the dominant language and script type.

This commit fixes this bug by doing the following:

- Input text is partitioned into common, continuous, unicode script sections
- Each section's individual language scores are gathered
- Each score is then weighted according to the number of characters in its section
- The resulting weighted scores are transformed into probabilities
- The final probabilities are the ones returned to the user.

* fixing compilation
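
The steps listed in the commit message can be sketched roughly as follows. This is an illustrative sketch only, not the code from the commit: the class name `ScriptWeightedLangId`, the `scorer` callback, the choice to fold script-neutral characters (spaces, punctuation) into the surrounding section, the use of each section's share of characters as its weight, and the softmax used to turn weighted scores into probabilities are all assumptions made for the example.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper; names and details are illustrative, not taken from the commit.
final class ScriptWeightedLangId {

    // Step 1: split the text into continuous runs that share one Unicode script.
    // COMMON/INHERITED code points (spaces, punctuation) stay in the current run.
    static List<String> partition(String text) {
        List<String> sections = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        UnicodeScript currentScript = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript script = UnicodeScript.of(cp);
            boolean neutral = script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
            if (!neutral) {
                if (currentScript != null && script != currentScript) {
                    sections.add(current.toString());
                    current.setLength(0);
                }
                currentScript = script;
            }
            current.appendCodePoint(cp);
        }
        if (current.length() > 0) {
            sections.add(current.toString());
        }
        return sections;
    }

    // Steps 2-5: score each section, weight by its share of characters,
    // then convert the weighted scores into probabilities.
    static Map<String, Double> identify(String text, Function<String, Map<String, Double>> scorer) {
        List<String> sections = partition(text);
        double total = 0;
        for (String s : sections) {
            total += s.codePointCount(0, s.length());
        }
        Map<String, Double> weighted = new LinkedHashMap<>();
        for (String s : sections) {
            double weight = s.codePointCount(0, s.length()) / total;
            for (Map.Entry<String, Double> e : scorer.apply(s).entrySet()) {
                weighted.merge(e.getKey(), weight * e.getValue(), Double::sum);
            }
        }
        return softmax(weighted);
    }

    // Assumed transformation: a numerically stable softmax over the weighted scores.
    static Map<String, Double> softmax(Map<String, Double> scores) {
        Map<String, Double> probs = new LinkedHashMap<>();
        if (scores.isEmpty()) {
            return probs;
        }
        double max = Collections.max(scores.values());
        double sum = 0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            double exp = Math.exp(e.getValue() - max);
            probs.put(e.getKey(), exp);
            sum += exp;
        }
        final double norm = sum;
        probs.replaceAll((lang, v) -> v / norm);
        return probs;
    }
}
```

With the example sentence from the commit message, `partition` yields four sections (Hangul, Latin, Hangul, Latin). Because the two Latin sections hold most of the characters, their scores dominate the weighted result, which is the behavior the fix aims for.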
Showing 4 changed files with 235 additions and 36 deletions.