Whitespace Tokenizer incorrectly removes vowel signs for text in Hindi #5998

Closed
dakshvar22 opened this issue Jun 11, 2020 · 3 comments · Fixed by #6074

@dakshvar22
Contributor

Rasa version: master

Python version: 3.6.5

Operating system (windows, osx, ...): osx

Issue:
WhitespaceTokenizer.tokenize tries to remove punctuation and symbols using the following regex:

words = re.sub(
            # there is a space or an end of a string after it
            r"[^\w#@&]+(?=\s|$)|"
            # there is a space or beginning of a string before it
            # not followed by a number
            r"(\s|^)[^\w#@&]+(?=[^0-9\s])|"
            # not in between numbers and not . or @ or & or - or #
            # e.g. 10'000.00 or an email address
            # and not url characters
            r"(?<=[^0-9\s])[^\w._~:/?#\[\]()@!$&*+,;=-]+(?=[^0-9\s])",
            " ",
            text,
        ).split()

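This regex also strips every vowel sign, which is wrong. Devanagari vowel signs (and other combining marks such as the virama) are not matched by `\w` in Python's `re` module, so the pattern treats them as punctuation and replaces them with spaces.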
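To make the cause concrete, here is a small diagnostic sketch. It is not part of the Rasa code base, and the helper name `classify` is made up for illustration. It reports, for each code point of a word, its Unicode general category and whether the standard library `re` module treats it as a `\w` character.

```python
import re
import unicodedata


def classify(text):
    """Return (code point, Unicode category, matches \\w) for each character.

    Hypothetical helper for illustration only; not part of Rasa.
    """
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            re.fullmatch(r"\w", ch) is not None,
        )
        for ch in text
    ]


# First word of the example sentence from this issue.
for code_point, category, is_word_char in classify("क्या"):
    print(code_point, category, is_word_char)
```

The consonants come out as category `Lo` (letter) and match `\w`, while the virama (`Mn`) and the vowel sign (`Mc`) do not, which is why the negated classes `[^\w...]` in the pattern above consume them.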

Example input: 50 क्या आपके पास डेरी मिल्क 10 वाले बॉक्स मिल सकते है
Ideally, the whitespace tokenizer should give the following list of tokens:

['50', 'क्या', 'आपके', 'पास', 'डेरी', 'मिल्क', '10', 'वाले', 'बॉक्स', 'मिल', 'सकते', 'है']

Instead, the tokens returned are:

['50', 'क', 'य', 'आपक', 'प', 'स', 'ड', 'र', 'म', 'ल', 'क', '10', 'व', 'ल', 'ब', 'क', 'स', 'म', 'ल', 'सकत', 'ह']

This also leads to misaligned entity annotations as described here.
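One possible workaround, sketched here under stated assumptions, is to treat combining marks as word characters when building the pattern. This is not necessarily the fix that was merged for this issue. The function name `mark_aware_tokens` is hypothetical, and the approach of collecting the marks that occur in the input and adding them to each negated character class is just one way to do it with the standard library `re` module.

```python
import re
import unicodedata


def mark_aware_tokens(text):
    """Tokenize like the regex in this issue, but keep combining marks.

    Hypothetical sketch, not Rasa's implementation: every character in
    `text` whose Unicode category starts with "M" (Mn, Mc, Me) is added
    to the set of characters the pattern is not allowed to strip.
    """
    # Combining marks are never regex metacharacters, so they can be
    # placed inside a character class without escaping.
    marks = "".join(
        sorted({ch for ch in text if unicodedata.category(ch).startswith("M")})
    )
    word = r"\w" + marks
    pattern = (
        # there is a space or an end of a string after it
        rf"[^{word}#@&]+(?=\s|$)|"
        # there is a space or beginning of a string before it,
        # not followed by a number
        rf"(\s|^)[^{word}#@&]+(?=[^0-9\s])|"
        # not in between numbers, and not a URL character
        rf"(?<=[^0-9\s])[^{word}._~:/?#\[\]()@!$&*+,;=-]+(?=[^0-9\s])"
    )
    return re.sub(pattern, " ", text).split()


text = "50 क्या आपके पास डेरी मिल्क 10 वाले बॉक्स मिल सकते है"
print(mark_aware_tokens(text))
```

With no marks in the input the pattern is identical to the original one, so plain ASCII text is tokenized as before. Because token boundaries then line up with the original text again, entity annotations should no longer be shifted, although that part is not checked in this sketch.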

@dakshvar22 added the type:bug and area:rasa-oss labels on Jun 11, 2020
@SreenijaK

Hi,
Any update on this issue?

@dakshvar22
Contributor Author

Hi @SreenijaK, unfortunately not, we haven't had the bandwidth to get to this yet. I should hopefully be able to tackle it by the end of next week. In the meantime, if you want to take a stab at it, please feel free to do so. :)

@dakshvar22
Contributor Author

Hi @SreenijaK, this is now fixed in the latest version of Rasa Open Source. Let us know if it works for you.
