[ML] fix LangIdent model when multiple unicode scripts are present #81876

benwtrent · 2021-12-17T15:30:34Z

LangIdent was recently updated to handle multiple Unicode scripts (#80675), but that change introduced some bugs, which this commit fixes.

1. Sections with the same script were weighted by Java string length (the number of UTF-16 code units). This is not accurate, as certain languages (like Chinese and Korean) convey much more information with fewer UTF-16 code units. FIX: weight by UTF-8 length.
2. The weighting of the different language scores used the raw scores from the neural network. This caused languages with a high raw score (but a low score compared to the most likely language) to be inaccurately weighted. FIX: we now weight the probabilities of the sections of the text instead.
3. To split the input across the multiple scripts, we split on the "pared down" CLD3 script types. Java has superior support for Unicode script blocks. FIX: split by Java Unicode script blocks, not by the pared-down CLD3 scripts.
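
The script-based split described in the last fix can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the class `ScriptRunSplitter` and its `split` method are hypothetical names, and it assumes the split is driven by `Character.UnicodeScript`, with `COMMON` and `INHERITED` code points (spaces, digits, punctuation, combining marks) kept in the current run rather than starting a new one.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the Elasticsearch implementation.
final class ScriptRunSplitter {

    // Splits text into runs of code points that share a Unicode script.
    // COMMON and INHERITED code points never trigger a split on their own.
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        UnicodeScript currentScript = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            boolean neutral = script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
            if (!neutral) {
                // A change of script closes the current run and starts a new one.
                if (currentScript != null && script != currentScript) {
                    runs.add(current.toString());
                    current.setLength(0);
                }
                currentScript = script;
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Latin text followed by Cyrillic text ("Привет мир"), written with escapes
        // so the source compiles regardless of file encoding.
        String mixed = "Hello world \u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440";
        for (String run : split(mixed)) {
            System.out.println("[" + run + "]");
        }
    }
}
```

One consequence of splitting strictly by script is that languages written in more than one script (Japanese mixes Han, Hiragana and Katakana, for example) would be broken into several runs by this sketch; how the real code treats such cases is not stated in this description.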

elasticmachine · 2021-12-17T15:30:37Z

Pinging @elastic/ml-core (Team:ML)

tveasey

LGTM. I think it would be nice to capture something user visible regarding the multilingual behaviour. Not sure if this belongs in API docs or overall docs though.

tveasey · 2021-12-17T15:35:53Z

...org/elasticsearch/xpack/core/ml/inference/trainedmodel/langident/LangIdentNeuralNetwork.java

        }
        if (totalLen != 0) {
-            divMut(scores, totalLen);
+            divMut(probabilities, totalLen);


I feel like it is worth noting somewhere (maybe in the docs) that in multilingual cases the probabilities we report are related to the fraction of the document which is classified as each language. (We can probably just gloss over the fact that we give short fragments less weight, though.)
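
To make the point about reported probabilities concrete, here is a small sketch of length-weighted averaging of per-section probabilities. It is an illustration under assumptions, not the code in `LangIdentNeuralNetwork`: the class `SectionProbabilityAverager`, the `softmax` and `combine` helpers, and the use of a softmax to turn raw scores into probabilities are all hypothetical choices made for the example.

```java
import java.util.List;

// Hypothetical helper for illustration only; not the Elasticsearch implementation.
final class SectionProbabilityAverager {

    // Numerically stable softmax: converts raw scores into probabilities that sum to 1.
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] out = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    // Averages per-section probabilities, weighting each section by its length.
    // Assumes every section has the same number of classes.
    static double[] combine(List<double[]> sectionScores, int[] sectionLengths) {
        if (sectionScores.isEmpty()) {
            return new double[0];
        }
        double[] probabilities = new double[sectionScores.get(0).length];
        long totalLen = 0;
        for (int s = 0; s < sectionScores.size(); s++) {
            // Normalise within the section first, so raw score magnitudes cannot dominate.
            double[] p = softmax(sectionScores.get(s));
            for (int c = 0; c < probabilities.length; c++) {
                probabilities[c] += p[c] * sectionLengths[s];
            }
            totalLen += sectionLengths[s];
        }
        if (totalLen != 0) {
            for (int c = 0; c < probabilities.length; c++) {
                probabilities[c] /= totalLen;
            }
        }
        return probabilities;
    }

    public static void main(String[] args) {
        // Two sections: the first strongly favours class 0 and is three times as long.
        double[] result = combine(List.of(new double[] { 10, 0 }, new double[] { 0, 10 }), new int[] { 3, 1 });
        System.out.println(result[0] + " " + result[1]);
    }
}
```

With this scheme, a document whose sections are each classified confidently ends up with probabilities close to the length fraction of each language, which is the behaviour described in the comment above.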

droberts195

LGTM

droberts195 · 2021-12-17T16:04:49Z

...c/main/java/org/elasticsearch/xpack/core/ml/inference/preprocessing/CustomWordEmbedding.java

@@ -290,7 +290,7 @@ public void process(Map<String, Object> fields) {
            embeddings.add(
                new StringLengthAndEmbedding(
                    // Don't count white spaces as bytes for the prediction
-                    str.trim().length(),
+                    str.trim().getBytes(StandardCharsets.UTF_8).length,


Please add a comment here to say that using the number of UTF-8 bytes:

- Matches what the equivalent Python code did
- Acts as a heuristic to account for the fact that languages like Chinese embed more information in each character, so using the number of UTF-8 bytes gives them a boost to compensate for shorter words
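
For reference, the difference between the two length measures can be seen with a short, self-contained sketch. The class `Utf8Weight` and its two helper methods are hypothetical names used only for this illustration; the calls inside them are the same `String` methods shown in the diff above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Utf8Weight {

    // Old weighting: number of UTF-16 code units after trimming whitespace.
    static int utf16Length(String str) {
        return str.trim().length();
    }

    // New weighting: number of UTF-8 bytes after trimming whitespace.
    static int utf8Length(String str) {
        return str.trim().getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String english = "hello";
        String chinese = "\u4f60\u597d"; // "你好", two Han characters
        String korean = "\uc548\ub155";  // "안녕", two Hangul syllables
        for (String s : new String[] { english, chinese, korean }) {
            System.out.println(utf16Length(s) + " UTF-16 code units, " + utf8Length(s) + " UTF-8 bytes");
        }
    }
}
```

ASCII text has the same length under both measures, while each of the Chinese and Korean characters above takes one UTF-16 code unit but three UTF-8 bytes, so those sections get roughly three times the weight they had before.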

benwtrent added >bug :ml Machine learning v8.0.0 auto-backport-and-merge v8.1.0 v7.16.2 labels Dec 17, 2021

elasticmachine added the Team:ML Meta label for the ML team label Dec 17, 2021

tveasey approved these changes Dec 17, 2021

benwtrent force-pushed the bugfix/ml-lang-ident-with-multi-languages branch from 55e13bb to 38c3aa7 Compare December 17, 2021 15:46

[ML] fix LangIdent model when multiple unicode scripts are present

2ee148a

benwtrent force-pushed the bugfix/ml-lang-ident-with-multi-languages branch from 38c3aa7 to 2ee148a Compare December 17, 2021 15:46

droberts195 added the v7.17.0 label Dec 17, 2021

droberts195 approved these changes Dec 17, 2021

adding code comments around probability weights

505055b

benwtrent merged commit 4b0864d into elastic:master Dec 17, 2021

benwtrent deleted the bugfix/ml-lang-ident-with-multi-languages branch December 17, 2021 20:08

benwtrent mentioned this pull request Dec 17, 2021

[8.0] [ML] fix LangIdent model when multiple unicode scripts are present (#81876) #81888

Merged

benwtrent mentioned this pull request Dec 17, 2021

[7.17] [ML] fix LangIdent model when multiple unicode scripts are present (#81876) #81889

Merged

benwtrent mentioned this pull request Dec 17, 2021

[7.16] [ML] fix LangIdent model when multiple unicode scripts are present (#81876) #81890

Merged

mark-vieira added v8.0.0-rc1 and removed v8.0.0 labels Jan 12, 2022
