
[ML] Fix language identification bug when multi-languages are present #80675

Merged

Conversation


@benwtrent benwtrent commented Nov 11, 2021

Language identification works fairly well when only one language
and script type is present.

But when multiple are present, it can return some unexpected results.

Example:

"행 레이블 this is english text obviously and 생성 tom said to test it"

To a human this is obviously English text (Latin script) with some Korean (Hangul script), yet it is erroneously categorized as Japanese.

It should be categorized as English, as that is the dominant language and script type.

This commit fixes this bug by doing the following:

  • Input text is partitioned into contiguous sections that share a common unicode script
  • Each section's individual language scores are gathered
  • Each score is then weighted according to the number of characters in its section
  • The resulting weighted scores are transformed into probabilities
  • The final probabilities are the ones returned to the user (a sketch of this scheme follows below).
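
A minimal sketch of this weighting scheme, for illustration only: the class and method names below are hypothetical and do not come from the Elasticsearch code base, the per-section scores are assumed to be non-negative, and the weighting here is linear in character count as in the list above (note that the reviewed diff elsewhere in this conversation squares the section size instead).

import java.util.List;

// Hypothetical types for illustration; not the actual ML plugin classes.
final class SectionScore {
    final int charCount;        // characters in this contiguous unicode-script section
    final double[] langScores;  // one non-negative score per candidate language

    SectionScore(int charCount, double[] langScores) {
        this.charCount = charCount;
        this.langScores = langScores;
    }
}

final class WeightedLangScores {

    // Weight each section's scores by its character count, sum per language,
    // then normalize so the results form a probability distribution.
    static double[] combine(List<SectionScore> sections, int numLanguages) {
        double[] weighted = new double[numLanguages];
        for (SectionScore s : sections) {
            for (int lang = 0; lang < numLanguages; lang++) {
                weighted[lang] += s.charCount * s.langScores[lang];
            }
        }
        double total = 0.0;
        for (double w : weighted) {
            total += w;
        }
        if (total == 0.0) {
            return weighted; // all scores were zero; nothing to normalize
        }
        for (int lang = 0; lang < numLanguages; lang++) {
            weighted[lang] /= total; // values now sum to 1.0
        }
        return weighted;
    }
}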

@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

Language identification works fairly well when only one language
and script type is present.

But when multiple are present, it can return some unexpected results.

Example:

"행 레이블 this is english text obviously and 생성 tom said to test it"

To a human this is obviously English text (Latin script) with some Korean (Hangul script), yet it is erroneously categorized as Japanese.

It should be categorized as English, as that is the dominant language and script type.

This commit fixes this bug by doing the following:

 - Input text is partitioned into contiguous sections that share a common unicode script
 - Each section's individual language scores are gathered
 - Each score is then weighted according to the number of utf-8 bytes in its section
 - The resulting weighted scores are transformed into probabilities
 - The final probabilities are the ones returned to the user.
@benwtrent benwtrent force-pushed the feature/improve-lang-ident branch from 81f5888 to f339cde Compare November 11, 2021 18:26
Contributor

@droberts195 droberts195 left a comment

  • Each score is then weighted according to the number of utf-8 bytes in
    each section

I'd like to know why it's the number of UTF-8 bytes and not the number of characters. The number of characters seems more natural to me. If I have 60 characters of English and 40 characters of Russian why would I want to give the Russian weight 80 and the English weight 60?

@@ -214,10 +234,75 @@ public void process(Map<String, Object> fields) {
text = FeatureUtils.cleanAndLowerText(text);
text = FeatureUtils.truncateToNumValidBytes(text, MAX_STRING_SIZE_IN_BYTES);
String finalText = text;
Contributor

Might be clearer if it's explicitly final

Suggested change
String finalText = text;
final String finalText = text;

.map((featureExtractor) -> featureExtractor.extractFeatures(finalText))
.collect(Collectors.toList());
fields.put(destField, concatEmbeddings(processedFeatures));
if (text.isEmpty() || text.isBlank()) {
Contributor

It seems potentially confusing to mix text and finalText in the main algorithm. Since finalText needs to be used in lambdas, I'd just use it everywhere to avoid making the reader double-check whether there's a difference.

Suggested change
if (text.isEmpty() || text.isBlank()) {
if (finalText.isEmpty() || finalText.isBlank()) {

if (text.isEmpty() || text.isBlank()) {
fields.put(
destField,
Arrays.asList(
Contributor

Suggested change
Arrays.asList(
Collections.singletonList(

(because Arrays.asList with 1 item causes an IntelliJ warning)

Arrays.asList(
new ByteSizeAndEmbedding(
// Don't count white spaces as bytes for the prediction
finalText.trim().getBytes(StandardCharsets.UTF_8).length,
Contributor

Suggested change
finalText.trim().getBytes(StandardCharsets.UTF_8).length,
0,

If this is wrong, please add a comment saying how the trimmed length of a blank or empty string can be > 0.

continue;
}
CustomWordEmbedding.ByteSizeAndEmbedding byteSizeAndEmbedding = (CustomWordEmbedding.ByteSizeAndEmbedding) vec;
int square = (int) Math.pow(byteSizeAndEmbedding.getUtf8ByteSize(), 2);
Contributor

I strongly suspect multiplying two integers is much faster than using some generic x^y algorithm that works on arbitrary floating point numbers.

Suggested change
int square = (int) Math.pow(byteSizeAndEmbedding.getUtf8ByteSize(), 2);
int square = byteSizeAndEmbedding.getUtf8ByteSize() * byteSizeAndEmbedding.getUtf8ByteSize();

@@ -43,6 +45,24 @@
*/
public class CustomWordEmbedding implements LenientlyParsedPreProcessor, StrictlyParsedPreProcessor {

public static class ByteSizeAndEmbedding {
final int utf8ByteSize;
Contributor

I find it very strange that the weighting is the number of UTF-8 bytes, not the number of characters.

That means that if I have some text that's 100 characters of Roman alphabet and 100 Chinese characters then the Chinese could get a weighting of 300 while the western language gets a weighting of 100. Is the byte count a sneaky heuristic for saying each Chinese character embeds more information than a Roman alphabet character? It would be good to add a comment with the justification.

@benwtrent
Member Author

I'd like to know why it's the number of UTF-8 bytes and not the number of characters. The number of characters seems more natural to me. If I have 60 characters of English and 40 characters of Russian why would I want to give the Russian weight 80 and the English weight 60?

@droberts195

Mainly because that is how prior art handles this.

I could switch it to character count pretty simply and make sure all the examples continue to pass.

@droberts195
Contributor

Mainly because that is how prior art handles this.

OK, fair enough. We can copy that then, since we copied the rest of the algorithm. Are there any comments in the code we ported that say why it's bytes not characters?

@benwtrent
Member Author

benwtrent commented Nov 11, 2021

@droberts195 there are zero comments. I am guessing because the rest of the code works according to UTF-8 bytes. In Java, we have more robust text manipulation tools.

I switched it to string length and the tests continued to pass. It makes sense to use string length as languages with their own special unicode class usually have higher confidence than those without. Artificially increasing that confidence by weighing them according to byte length is unintuitive.
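
For illustration, a small sketch (not from this PR) showing how character-count and UTF-8 byte-count weights diverge on the mixed-script example from the description; the string literals are taken from that example, and the point is simply that each Hangul syllable occupies three UTF-8 bytes while a Latin letter occupies one.

import java.nio.charset.StandardCharsets;

// Illustration only: compares character-count and UTF-8 byte-count weights
// for the Latin and Hangul sections of the example text.
public class WeightComparison {
    public static void main(String[] args) {
        String latinSection = "this is english text obviously";  // Latin script
        String hangulSection = "행 레이블 생성";                     // Hangul script

        System.out.println("Latin:  chars=" + latinSection.length()
            + " utf8Bytes=" + latinSection.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("Hangul: chars=" + hangulSection.length()
            + " utf8Bytes=" + hangulSection.getBytes(StandardCharsets.UTF_8).length);
        // Each Hangul syllable is 3 UTF-8 bytes, so byte weighting roughly triples
        // the Hangul section's influence relative to character weighting.
    }
}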

Contributor

@droberts195 droberts195 left a comment

LGTM

@benwtrent benwtrent added the auto-merge-without-approval label (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) Nov 12, 2021
@benwtrent
Member Author

@elasticmachine update branch

@hendrikmuhs

Some conceptual comments. They don't target this PR, but are something to think about longer term.

It should be categorized as English, as that is the dominant language and script type.

This depends on the use case; it's not that easy:

  • English is a common language, and the text you identify as English is often just proper names. You find lots of mixed CJK/English texts like this; I would not classify them as English.
  • English is a simple language. If language identification is used for choosing the tokenizer, it is better to choose the language-specific tokenizer. Usually all of them are able to space-tokenize the English parts, but conversely the English tokenizer cannot decompound CJK, German, Finnish, etc.
    • note, this might just be something for the user to take care of; all probabilities are returned, so you can choose the tokenizer e.g. from the 2nd language. Fortunately we return more than just one result.
* Each score is then weighted according to the number of characters in
  each section

This assumption does not work well for CJK, e.g. '棪' means 'tree'. CJK tends to be shorter in characters (but has a larger alphabet).

@droberts195
Contributor

This assumption does not work well for CJK, e.g. '棪' means 'tree'. CJK tends to be shorter in characters (but has a larger alphabet).

This is what I was getting at in:

Is the byte count a sneaky heuristic for saying each Chinese character embeds more information than a Roman alphabet character?

But if that's what we end up doing then we should have a comment saying why we're doing it, because it would be very much a crude heuristic rather than a scientific algorithm.

@benwtrent
Member Author

benwtrent commented Nov 15, 2021

English is a common language, and the text you identify as English is often just proper names.

Shorter contiguous sequences of the same unicode script are weighted lower than longer ones.

This assumption does not work well for CJK, e.g. '棪' means 'tree'. CJK tends to be shorter in characters (but has a larger alphabet).

Right now character count seems OK. But we can switch it to byte length again in the future after more testing.

@benwtrent
Member Author

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit 49517da into elastic:master Nov 15, 2021
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Nov 15, 2021
…elastic#80675)

@elasticsearchmachine
Collaborator

💚 Backport successful

Branches: 8.0, 7.16

@benwtrent benwtrent deleted the feature/improve-lang-ident branch November 15, 2021 13:06
elasticsearchmachine pushed a commit that referenced this pull request Nov 15, 2021
…present (#80675) (#80707)

* [ML] Fix language identification bug when multi-languages are present (#80675)

* fixing compilation
elasticsearchmachine pushed a commit that referenced this pull request Nov 15, 2021
…#80675) (#80706)

benwtrent added a commit that referenced this pull request Dec 17, 2021
…81876)

LangIdent was recently updated to handle multiple unicode scripts (#80675). But this introduced some bugs, which are fixed with this commit.

1. Sections with the same script were weighted by Java string length (utf-16 encoding). This is not accurate, as certain languages (like Chinese and Korean) convey much more information with fewer utf-16 characters. FIX: weight by utf-8 length.
2. The weighting of different language scores was done via the raw score from the neural network. This caused languages with a high score (but low compared to the most likely language) to be inaccurately weighted. FIX: we now instead weight the probabilities of the sections of the text.
3. To split the input across the multiple scripts, we split on the "pared down" CLD3 script types. Java has superior support for unicode script blocks. FIX: split by Java unicode script blocks, not by the pared down CLD3 scripts (a sketch of this split follows below).
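
A minimal sketch, not the actual Elasticsearch implementation, of splitting text into contiguous runs that share a unicode script using Java's Character.UnicodeScript; the treatment of COMMON characters (spaces, punctuation) here is a deliberate simplification.

import java.util.ArrayList;
import java.util.List;

// Illustration only: naive split into contiguous same-script runs.
// COMMON characters (spaces, punctuation) are folded into the current run.
public class ScriptSplitter {

    public static List<String> splitByScript(String text) {
        List<String> sections = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Character.UnicodeScript currentScript = null;

        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
            boolean isCommon = script == Character.UnicodeScript.COMMON;
            if (currentScript != null && isCommon == false && script != currentScript) {
                sections.add(current.toString());
                current.setLength(0);
            }
            if (isCommon == false) {
                currentScript = script;
            }
            current.appendCodePoint(codePoint);
            i += Character.charCount(codePoint);
        }
        if (current.length() > 0) {
            sections.add(current.toString());
        }
        return sections;
    }

    public static void main(String[] args) {
        // Splits the example into a Hangul run, a Latin run, a Hangul run, and a Latin run.
        System.out.println(splitByScript("행 레이블 this is english text obviously and 생성 tom said to test it"));
    }
}
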
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Dec 17, 2021
…lastic#81876)

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Dec 17, 2021
…lastic#81876)

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Dec 17, 2021
…lastic#81876)

elasticsearchmachine pushed a commit that referenced this pull request Dec 17, 2021
…81876) (#81890)

elasticsearchmachine pushed a commit that referenced this pull request Dec 17, 2021
…81876) (#81889)

elasticsearchmachine pushed a commit that referenced this pull request Dec 17, 2021
…81876) (#81888)

Labels
auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!), >bug, :ml (Machine learning), Team:ML (Meta label for the ML team), v7.16.0, v8.0.0-rc1, v8.1.0