Whitespace Tokenizer incorrectly removes vowel signs for text in Hindi #5998

dakshvar22 · 2020-06-11T13:01:10Z

Rasa version: master

Python version: 3.6.5

Operating system (windows, osx, ...): osx

Issue:
WhitespaceTokenizer.tokenize tries to remove punctuation and symbols with the following regex:

words = re.sub(
    # there is a space or an end of a string after it
    r"[^\w#@&]+(?=\s|$)|"
    # there is a space or beginning of a string before it
    # not followed by a number
    r"(\s|^)[^\w#@&]+(?=[^0-9\s])|"
    # not in between numbers and not . or @ or & or - or #
    # e.g. 10'000.00 or [email protected]
    # and not url characters
    r"(?<=[^0-9\s])[^\w._~:/?#\[\]()@!$&*+,;=-]+(?=[^0-9\s])",
    " ",
    text,
).split()

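This regex also strips every vowel sign, which is wrong. The negated classes such as [^\w#@&] match any character that \w does not, and Python's \w does not match Devanagari vowel signs or the virama, so they are treated like punctuation and replaced with a space.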
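To make the cause concrete, here is a minimal sketch that lists the code points of one word from the example input together with their Unicode category and whether Python's `re` treats them as `\w`. The helper name `describe` is made up for illustration and is not part of Rasa.

```python
import re
import unicodedata


def describe(word):
    """Return (code point, Unicode category, matches \\w) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), re.match(r"\w", ch) is not None)
        for ch in word
    ]


# Hypothetical helper applied to the first Hindi word of the example input.
for row in describe("क्या"):
    print(row)
```

The consonants are in category Lo and match `\w`, while the virama (Mn) and the vowel sign (Mc) do not, so any negated class built on `\w` will consume them.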

Example input: 50 क्या आपके पास डेरी मिल्क 10 वाले बॉक्स मिल सकते है
Ideally, the whitespace tokenizer should give the following list of tokens:

['50', 'क्या', 'आपके', 'पास', 'डेरी', 'मिल्क', '10', 'वाले', 'बॉक्स', 'मिल', 'सकते', 'है']

Instead, the tokens returned are:

['50', 'क', 'य', 'आपक', 'प', 'स', 'ड', 'र', 'म', 'ल', 'क', '10', 'व', 'ल', 'ब', 'क', 'स', 'म', 'ल', 'सकत', 'ह']
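The behaviour can be reproduced outside Rasa with the regex quoted above. The sketch below wraps it in a standalone function and adds an optional, purely illustrative workaround: any combining marks (Unicode categories starting with "M") found in the input are added to the character classes so they count as word characters. The function names and the `keep_marks` flag are assumptions made for this example; this is not necessarily how the issue was fixed in Rasa.

```python
import re
import unicodedata


def _mark_chars(text):
    """Collect the combining marks present in text, escaped for use in a character class."""
    marks = sorted({ch for ch in text if unicodedata.category(ch).startswith("M")})
    return "".join(re.escape(ch) for ch in marks)


def tokenize(text, keep_marks=False):
    """Apply the regex from the issue; optionally treat combining marks as word characters."""
    extra = _mark_chars(text) if keep_marks else ""
    pattern = (
        # there is a space or an end of a string after it
        rf"[^\w#@&{extra}]+(?=\s|$)|"
        # there is a space or beginning of a string before it, not followed by a number
        rf"(\s|^)[^\w#@&{extra}]+(?=[^0-9\s])|"
        # not in between numbers, and not a URL character
        rf"(?<=[^0-9\s])[^\w._~:/?#\[\]()@!$&*+,;={extra}-]+(?=[^0-9\s])"
    )
    return re.sub(pattern, " ", text).split()


text = "50 क्या आपके पास डेरी मिल्क 10 वाले बॉक्स मिल सकते है"
print(tokenize(text))                   # original behaviour: vowel signs are stripped
print(tokenize(text, keep_marks=True))  # sketch of a workaround: vowel signs are kept
```

With `keep_marks=True` the words stay intact because no character in the input falls into the negated classes any more. Building the class from the input keeps the sketch within the standard library; a real fix might instead rely on a regex engine with Unicode property support.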

This also leads to misaligned entity annotations as described here.

SreenijaK · 2020-06-16T19:37:55Z

Hi,
Any update on this issue?

dakshvar22 · 2020-06-17T07:01:29Z

Hi @SreenijaK, unfortunately not, we haven't had the bandwidth to get to this. Hopefully I will be able to tackle it by the end of next week. In the meantime, if you want to take a stab at it, please feel free to do so. :)

dakshvar22 · 2020-07-03T14:10:32Z

Hi @SreenijaK, this is now fixed in the latest version of Rasa Open Source. Let us know if this works for you.

dakshvar22 added the type:bug 🐛 and area:rasa-oss 🎡 labels Jun 11, 2020

tabergma self-assigned this Jun 26, 2020

tabergma mentioned this issue Jun 26, 2020 in "Whitespace tokenizer vowel signs" #6074 (merged)

rasabot closed this as completed in #6074 Jun 29, 2020
