Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ZH Number] Fixed when the number string is long the extraction take a long time (#2747) #2796

Merged
merged 3 commits into from
Dec 9, 2021

Conversation

aitelint
Copy link
Contributor

@aitelint aitelint commented Dec 9, 2021

Fix to issue #2747.

The slowdown was caused by complex lookarounds used to avoid extracting numbers from percentage expressions in certain patterns. But since in English this kind of filtering is not implemented, they do not seem to be necessary and have been removed.

Test cases added to Chinese NumberModel and CurrencyModel.

Another issue observed is that long numbers in Chinese characters do not seem to be resolved correctly.

@tellarin
Copy link
Collaborator

tellarin commented Dec 9, 2021

Hmmm. This is a bit tricky. In other languages a number is returned too if "inside" a percentage, like "17" in "17%". But here the whole number and percentage are recognized together.

Copy link
Collaborator

@tellarin tellarin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking as not to merge directly. We need to consider if there's a side effect first.

Copy link
Collaborator

@tellarin tellarin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, upon a closer check these seem OK. Not sure why the previous extra processing was there.

@tellarin
Copy link
Collaborator

tellarin commented Dec 9, 2021

@aitelint Regarding the long number resolution issue, any insights on what is the type of failure?
Should we open an issue for it?

@aitelint
Copy link
Contributor Author

aitelint commented Dec 9, 2021

@aitelint Regarding the long number resolution issue, any insights on what is the type of failure? Should we open an issue for it?

I have not investigated the problem for now, only observed that the resolution is not correct, e.g.
"四千五百六十九万八千三百一十一亿四千八百七十二万五千三百八十二" -> 831194415382 instead of 4569831148725382
"二千零二十一万零七百二十一亿一千三百二十一万三千六百三十五" -> 72133423635 instead of 2021072113213635

Yes, the problem could be addressed in a separate issue (not sure however if these long expressions are used in practice, numbers up to 13 digits appear to be parsed correctly)

@tellarin tellarin merged commit a2fa4db into microsoft:master Dec 9, 2021
@aitelint aitelint deleted the #2747 branch December 9, 2021 08:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants