Unicode/UTF-8 support #11

Open
marty1885 opened this issue Oct 18, 2019 · 8 comments
Comments

@marty1885
Contributor

Hi, I want to contribute Unicode support to the library. Do you know how much work, and what exactly, has to be done for this feature?

@tmplt
Owner

tmplt commented Oct 18, 2019

None, I'm afraid. I'm unsure if the underlying Levenshtein implementation supports it.

@marty1885
Contributor Author

Hmm... You said that you use the same Levenshtein library as the original fuzzywuzzy does. How does fuzzywuzzy handle Unicode?

@tmplt
Owner

tmplt commented Oct 19, 2019

I recall seatgeek/fuzzywuzzy supporting two implementations; it uses python-Levenshtein if that is installed. Judging from their past issues, the library does seem to support Unicode.

As I read it, the Levenshtein implementation could work with any integral type, judging from
https://github.com/Tmplt/fuzzywuzzy/blob/a4f8b717b3f30208436f82054413660a8d2f7613/include/levenshtein.h#L23-L32

So I figure the Python interop maps Unicode characters to integral values? I cannot say for sure. I would personally start there: see how Unicode is treated when Python calls C code, and figure out whether the Levenshtein implementation has to be changed.
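
To make the "any integral type" idea concrete, here is a minimal sketch of an edit distance that is generic over the symbol type. It is not the code in levenshtein.h; the name `edit_distance` is hypothetical, and it assumes unit costs for insertion, deletion and substitution. The point is only that nothing in the algorithm cares whether a symbol is a `char` byte or a `char32_t` code point.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical helper (not part of the library): Levenshtein distance over
// any random-access sequence whose elements are integral symbols, e.g.
// std::string (bytes), std::u32string (code points) or std::vector<int>.
template <typename Seq>
std::size_t edit_distance(const Seq& a, const Seq& b) {
    static_assert(std::is_integral<typename Seq::value_type>::value,
                  "symbols must be of integral type");

    // Single-row dynamic programming: row[j] holds D[i][j] for the current i.
    std::vector<std::size_t> row(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) row[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        std::size_t diag = row[0];  // D[i-1][j-1]
        row[0] = i;                 // D[i][0]
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t up = row[j];  // D[i-1][j]
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            row[j] = std::min({row[j - 1] + 1, up + 1, diag + cost});
            diag = up;
        }
    }
    return row[b.size()];
}
```

Called with two `std::string` arguments this compares bytes; called with two `std::u32string` arguments it compares code points, with no change to the function itself.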

@tmplt tmplt pinned this issue Oct 19, 2019
@tmplt tmplt unpinned this issue Oct 19, 2019
@marty1885
Contributor Author

I guess there's no easy way to get Unicode support (without switching std::string to a Unicode-aware one for the entire library). Thanks.
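
A small sketch of why a byte-oriented std::string falls short here. It assumes the input is UTF-8; `decode_utf8` and `edit_distance` are hypothetical helpers, not library functions, and the decoder is deliberately minimal (it rejects truncated sequences and bad continuation bytes, but does not check for overlong encodings or surrogates).

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into Unicode code points.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t cp = 0;
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Hypothetical helper: unit-cost Levenshtein distance over any sequence.
template <typename Seq>
std::size_t edit_distance(const Seq& a, const Seq& b) {
    std::vector<std::size_t> row(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) row[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        std::size_t diag = row[0];  // D[i-1][j-1]
        row[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t up = row[j];  // D[i-1][j]
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            row[j] = std::min({row[j - 1] + 1, up + 1, diag + cost});
            diag = up;
        }
    }
    return row[b.size()];
}
```

With these, `edit_distance(std::string("caf\xC3\xA9"), std::string("cafe"))` is 2, because "é" occupies two bytes in UTF-8, while `edit_distance(decode_utf8("caf\xC3\xA9"), decode_utf8("cafe"))` is 1. Decoding at the boundary like this avoids touching std::string elsewhere, at the cost of an extra copy per comparison; it still treats combining sequences as separate symbols.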

@tmplt
Owner

tmplt commented Oct 20, 2019 via email

@marty1885
Contributor Author

I think the proper solution would be to use something like Qt's QString and re-implement levenshtein.c to support it. No hacks, and Qt is popular enough on *nix systems. (Windows will be a problem.)

@tmplt
Owner

tmplt commented Oct 21, 2019 via email

@marty1885
Contributor Author

> Levenshtein could instead be reimplemented

I'll look into that.

@tmplt tmplt reopened this Oct 22, 2019