[ML] fix LangIdent model when multiple unicode scripts are present #81876
Conversation
Pinging @elastic/ml-core (Team:ML)
LGTM. I think it would be nice to capture something user-visible regarding the multilingual behaviour. Not sure if this belongs in the API docs or the overall docs, though.
    }
    if (totalLen != 0) {
        divMut(scores, totalLen);
        divMut(probabilities, totalLen);
I feel like it is worth noting somewhere (maybe in the docs) that in multilingual cases the probabilities we report relate to the fraction of the document that is classified as each language. (We can probably just gloss over the fact that we give short fragments less weight, though.)
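To make that weighting concrete, here is a minimal sketch (illustrative only; `weightedAverage` and `sectionProbs` are hypothetical names, not from this PR) of how per-section probability vectors can be scaled by each section's UTF-8 byte length and then normalized by the total length, which is what the `divMut` calls above do:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class WeightedProbabilities {
    // Average per-section probability vectors, weighting each section by
    // its UTF-8 byte length, then normalizing by the total length.
    static double[] weightedAverage(List<String> sections, List<double[]> sectionProbs) {
        int numLangs = sectionProbs.get(0).length;
        double[] probabilities = new double[numLangs];
        long totalLen = 0;
        for (int i = 0; i < sections.size(); i++) {
            int len = sections.get(i).trim().getBytes(StandardCharsets.UTF_8).length;
            totalLen += len;
            for (int j = 0; j < numLangs; j++) {
                probabilities[j] += len * sectionProbs.get(i)[j];
            }
        }
        if (totalLen != 0) {
            for (int j = 0; j < numLangs; j++) {
                probabilities[j] /= totalLen; // the divMut step in the snippet above
            }
        }
        return probabilities;
    }

    public static void main(String[] args) {
        // Two sections: 5 UTF-8 bytes of English, 12 UTF-8 bytes of Chinese
        double[] result = weightedAverage(
            List.of("hello", "你好世界"),
            List.of(new double[] { 0.9, 0.1 }, new double[] { 0.05, 0.95 })
        );
        System.out.println(Arrays.toString(result)); // [0.3, 0.7] — the longer section dominates
    }
}
```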
LGTM
    @@ -290,7 +290,7 @@ public void process(Map<String, Object> fields) {
            embeddings.add(
                new StringLengthAndEmbedding(
                    // Don't count white spaces as bytes for the prediction
    -               str.trim().length(),
    +               str.trim().getBytes(StandardCharsets.UTF_8).length,
Please add a comment here to say that using the number of UTF-8 bytes:
- Matches what the equivalent Python code did
- Acts as a heuristic to account for the fact that languages like Chinese embed more information in each character, so counting UTF-8 bytes gives them a boost to compensate for shorter words (see the sketch after this list)
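To see why UTF-8 byte counts act as that boost, here is a standalone sketch (not code from this PR) comparing UTF-16 code-unit counts, which `String.length()` returns, against UTF-8 byte counts for ASCII and CJK text:

```java
import java.nio.charset.StandardCharsets;

public class Utf8WeightDemo {
    public static void main(String[] args) {
        String english = "hello";
        String chinese = "你好世界"; // "hello world" in Chinese
        // UTF-16 code units (what String.length() counts)
        System.out.println(english.length()); // 5
        System.out.println(chinese.length()); // 4
        // UTF-8 bytes (what the fix counts): these CJK characters take 3 bytes each
        System.out.println(english.getBytes(StandardCharsets.UTF_8).length); // 5
        System.out.println(chinese.getBytes(StandardCharsets.UTF_8).length); // 12
    }
}
```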
I will!
[ML] fix LangIdent model when multiple unicode scripts are present (#81876)

LangIdent was recently updated to handle multiple unicode scripts (#80675), but this introduced some bugs, fixed with this commit:

1. Sections with the same script were weighted by Java string length (UTF-16 encoding). This is not accurate, as certain languages (like Chinese and Korean) convey much more information with fewer UTF-16 characters. FIX: weight by UTF-8 byte length.
2. The weighting of the different language scores was done via the raw score from the neural network. This caused languages with a high raw score (but low compared to the most likely language) to be weighted inaccurately. FIX: we now weight the probabilities of the sections of the text instead.
3. To split the input across the multiple scripts, we split on the pared-down CLD3 script types. Java has superior support for unicode script blocks. FIX: split by Java unicode script blocks, not by the pared-down CLD3 scripts.

(The same commit message appears on the backports: #81890, #81889, #81888.)
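As a rough illustration of the third fix (a sketch only; `splitByScript` is a hypothetical helper, not the PR's actual implementation), Java's `Character.UnicodeScript` can split mixed-script input into contiguous same-script sections:

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

public class ScriptSplitter {
    // Split text into runs of code points sharing the same Unicode script.
    // COMMON/INHERITED characters (spaces, punctuation) are treated as
    // neutral and kept with the current run (a simplification).
    static List<String> splitByScript(String text) {
        List<String> sections = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        UnicodeScript currentScript = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            boolean neutral = script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
            if (!neutral && currentScript != null && script != currentScript) {
                // Script changed: close the current section and start a new one
                sections.add(current.toString());
                current.setLength(0);
            }
            if (!neutral) {
                currentScript = script;
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            sections.add(current.toString());
        }
        return sections;
    }

    public static void main(String[] args) {
        // Mixed Latin and Han text splits into two sections: [hello , 你好]
        System.out.println(splitByScript("hello 你好"));
    }
}
```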