Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] fix LangIdent model when multiple unicode scripts are present #81876

Merged

Conversation

benwtrent
Copy link
Member

LangIdent was recently updated to handle multiple unicode scripts (#80675). But this introduced some bugs fixed with this commit.

  1. Sections with the same scripted were weighted by Java string length (utf-16) encoding. This is not accurate as certain languages (like Chinese and Korean) convey much more information with fewer utf-16 characters. FIX weight by utf-8 length.
  2. The weighing of different language scores was done via the raw score from the neural network. This caused languages with a high score (but low compared to most likely language) from the network to be inaccurately weighted. FIX We are now instead weighing the probabilities of the sections of the text.
  3. To split the input across the multiple scripts, we split on the "paired down" CDL3 script types. Java has superior support for unicode script blocks. FIX split by Java unicode script blocks not by the paired down CDL3 scripts

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think it would be nice to capture something user visible regarding the multilingual behaviour. Not sure if this belongs in API docs or overall docs though.

}
if (totalLen != 0) {
divMut(scores, totalLen);
divMut(probabilities, totalLen);
Copy link
Contributor

@tveasey tveasey Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it is worth noting somewhere (maybe in docs) in multilingual cases the probabilities we report are related to the fraction of the document which is classified with the language type. (We can probably just gloss over the fact we give short fragments less weight though.)

@benwtrent benwtrent force-pushed the bugfix/ml-lang-ident-with-multi-languages branch from 55e13bb to 38c3aa7 Compare December 17, 2021 15:46
@benwtrent benwtrent force-pushed the bugfix/ml-lang-ident-with-multi-languages branch from 38c3aa7 to 2ee148a Compare December 17, 2021 15:46
Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -290,7 +290,7 @@ public void process(Map<String, Object> fields) {
embeddings.add(
new StringLengthAndEmbedding(
// Don't count white spaces as bytes for the prediction
str.trim().length(),
str.trim().getBytes(StandardCharsets.UTF_8).length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment here to say that using the number of UTF-8 bytes:

  • Matches what the equivalent Python code did
  • Acts as a heuristic to account for the fact that languages like Chinese embed more information in each character so using the number of UTF-8 bytes gives them a boost to compensate for shorter words

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will!

@benwtrent benwtrent merged commit 4b0864d into elastic:master Dec 17, 2021
@benwtrent benwtrent deleted the bugfix/ml-lang-ident-with-multi-languages branch December 17, 2021 20:08
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Dec 17, 2021
…lastic#81876)

LangIdent was recently updated to handle multiple unicode scripts (elastic#80675). But this introduced some bugs fixed with this commit.

1. Sections with the same scripted were weighted by Java string length (utf-16) encoding. This is not accurate as certain languages (like Chinese and Korean) convey much more information with fewer utf-16 characters. FIX weight by utf-8 length.
2. The weighing of different language scores was done via the raw score from the neural network. This caused languages with a high score (but low compared to most likely language) from the network to be inaccurately weighted. FIX We are now instead weighing the probabilities of the sections of the text.
3. To split the input across the multiple scripts, we split on the "paired down" CDL3 script types. Java has superior support for unicode script blocks. FIX split by Java unicode script blocks not by the paired down CDL3 scripts
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Dec 17, 2021
…lastic#81876)

LangIdent was recently updated to handle multiple unicode scripts (elastic#80675). But this introduced some bugs fixed with this commit.

1. Sections with the same scripted were weighted by Java string length (utf-16) encoding. This is not accurate as certain languages (like Chinese and Korean) convey much more information with fewer utf-16 characters. FIX weight by utf-8 length.
2. The weighing of different language scores was done via the raw score from the neural network. This caused languages with a high score (but low compared to most likely language) from the network to be inaccurately weighted. FIX We are now instead weighing the probabilities of the sections of the text.
3. To split the input across the multiple scripts, we split on the "paired down" CDL3 script types. Java has superior support for unicode script blocks. FIX split by Java unicode script blocks not by the paired down CDL3 scripts
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Dec 17, 2021
…lastic#81876)

LangIdent was recently updated to handle multiple unicode scripts (elastic#80675). But this introduced some bugs fixed with this commit.

1. Sections with the same scripted were weighted by Java string length (utf-16) encoding. This is not accurate as certain languages (like Chinese and Korean) convey much more information with fewer utf-16 characters. FIX weight by utf-8 length.
2. The weighing of different language scores was done via the raw score from the neural network. This caused languages with a high score (but low compared to most likely language) from the network to be inaccurately weighted. FIX We are now instead weighing the probabilities of the sections of the text.
3. To split the input across the multiple scripts, we split on the "paired down" CDL3 script types. Java has superior support for unicode script blocks. FIX split by Java unicode script blocks not by the paired down CDL3 scripts
elasticsearchmachine pushed a commit that referenced this pull request Dec 17, 2021
…81876) (#81890)

LangIdent was recently updated to handle multiple unicode scripts (#80675). But this introduced some bugs fixed with this commit.

1. Sections with the same scripted were weighted by Java string length (utf-16) encoding. This is not accurate as certain languages (like Chinese and Korean) convey much more information with fewer utf-16 characters. FIX weight by utf-8 length.
2. The weighing of different language scores was done via the raw score from the neural network. This caused languages with a high score (but low compared to most likely language) from the network to be inaccurately weighted. FIX We are now instead weighing the probabilities of the sections of the text.
3. To split the input across the multiple scripts, we split on the "paired down" CDL3 script types. Java has superior support for unicode script blocks. FIX split by Java unicode script blocks not by the paired down CDL3 scripts
elasticsearchmachine pushed a commit that referenced this pull request Dec 17, 2021
…81876) (#81889)

LangIdent was recently updated to handle multiple unicode scripts (#80675). But this introduced some bugs fixed with this commit.

1. Sections with the same scripted were weighted by Java string length (utf-16) encoding. This is not accurate as certain languages (like Chinese and Korean) convey much more information with fewer utf-16 characters. FIX weight by utf-8 length.
2. The weighing of different language scores was done via the raw score from the neural network. This caused languages with a high score (but low compared to most likely language) from the network to be inaccurately weighted. FIX We are now instead weighing the probabilities of the sections of the text.
3. To split the input across the multiple scripts, we split on the "paired down" CDL3 script types. Java has superior support for unicode script blocks. FIX split by Java unicode script blocks not by the paired down CDL3 scripts
elasticsearchmachine pushed a commit that referenced this pull request Dec 17, 2021
…81876) (#81888)

LangIdent was recently updated to handle multiple unicode scripts (#80675). But this introduced some bugs fixed with this commit.

1. Sections with the same scripted were weighted by Java string length (utf-16) encoding. This is not accurate as certain languages (like Chinese and Korean) convey much more information with fewer utf-16 characters. FIX weight by utf-8 length.
2. The weighing of different language scores was done via the raw score from the neural network. This caused languages with a high score (but low compared to most likely language) from the network to be inaccurately weighted. FIX We are now instead weighing the probabilities of the sections of the text.
3. To split the input across the multiple scripts, we split on the "paired down" CDL3 script types. Java has superior support for unicode script blocks. FIX split by Java unicode script blocks not by the paired down CDL3 scripts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants