This repository has been archived by the owner on Jun 15, 2024. It is now read-only.

Python binding forks and different fixes #15

Closed
bact opened this issue Dec 22, 2018 · 7 comments

Comments

@bact

bact commented Dec 22, 2018

Update: CLD3 now has official Python binding code from Google itself: gcld3

PyPI: https://pypi.org/project/gcld3/

GitHub: https://github.com/google/cld3/tree/master/gcld3


This issue documents the Python binding forks, in the hope that as many fixes as possible can be merged into the upstream repositories:

Official CLD3: https://github.com/google/cld3
--> [based on google] First Python binding: https://github.com/jbaiter/cld3 by @jbaiter
----> [based on @jbaiter] Remove Chromium repo dependency (see #11) + PyPI: https://github.com/Elizafox/cld3 by @Elizafox
------> [based on @Elizafox] Fix res.language casting error (in Cython): https://github.com/RNogales94/cld3, https://github.com/PythonNut/cld3, https://github.com/houp/cld3 by @RNogales94 @PythonNut @houp
------> [based on @Elizafox] Include protobuf headers and bodies (to get around #13): https://github.com/houp/cld3 by @houp
------> [based on @Elizafox] Fix memory leak; Introduce reuse of language model for faster performance https://github.com/iamthebot/cld3 by @iamthebot
--------> [based on @iamthebot] Fix res.language comparison; Provide easy pip install under pycld3 name https://github.com/bsolomon1124/pycld3 by @bsolomon1124

Note:

Python Binding Documentation

(based on the documentation from https://github.com/Elizafox/cld3 )

Usage:

Here are some examples:

>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)

>>> cld3.get_frequent_languages("This piece of text is in English. Този текст е на Български.", 5)
[LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592), LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)]

In short:

  • get_language returns the most likely language as the named tuple LanguagePrediction. proportion is always 1.0 when called in this way.
  • get_frequent_languages returns the top guesses, up to the specified maximum (5 in the example). The maximum is mandatory. For each entry in the list, proportion is the proportion of bytes found to be in that language.

In the normal cld3 library, "und" may be returned as a language for unknown languages (with no other stats given). This library filters that result out as extraneous; if the language couldn't be detected, nothing will be returned. This also means, as a consequence, get_frequent_languages may return fewer results than what you asked for, or none at all.
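The filtering described above can be sketched in plain Python. This is only an illustration under stated assumptions: LanguagePrediction mirrors the fields shown in the examples, while filter_unknown and raw_results are hypothetical names and not part of any of the bindings.

```python
from collections import namedtuple

# Same fields as the named tuple shown in the examples above.
LanguagePrediction = namedtuple(
    "LanguagePrediction", ["language", "probability", "is_reliable", "proportion"]
)

UNKNOWN = "und"  # code CLD3 reports when no language could be detected


def filter_unknown(raw_results, max_results):
    """Hypothetical helper: drop 'und' entries, keep at most max_results."""
    kept = [r for r in raw_results if r.language != UNKNOWN]
    return kept[:max_results]


# Made-up raw output: one real guess plus one "und" placeholder.
raw_results = [
    LanguagePrediction("en", 0.99, True, 1.0),
    LanguagePrediction("und", 0.0, False, 0.0),
]
predictions = filter_unknown(raw_results, 5)
```

Because the placeholder is dropped, the caller asked for up to 5 results but gets only one back, and an input made up entirely of "und" entries would yield an empty list.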

@bact bact changed the title Python binding forks Python binding forks and different fixes Dec 22, 2018
@lpla

lpla commented Aug 1, 2019

I have been testing the Elizafox/cld3 Python binding and I had severe memory issues. The more sentences I detect, the more memory is used. I don't know if this is an issue in cld3 or in the Python binding specifically.

Since I cannot open an issue in any of the Python binding forks, I thought I would report it here.

@iamthebot

@lpla I've fixed these memory leaks in my fork of CLD3. Basically, the Elizafox version creates a new model object on each call to get_language and, on top of that, never cleans it up. My fork has both the original functions (which now clean up their objects) and a class called LanguageIdentifier, which lets you reuse the model for faster performance.

The fork is iamthebot/cld3
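
The difference between the two patterns can be sketched in pure Python. Everything here is a stand-in: FakeModel is a hypothetical placeholder for the native model, and the LanguageIdentifier below only borrows the class name from the fork; its actual interface may differ.

```python
class FakeModel:
    """Hypothetical stand-in for the native CLD3 model; counts constructions."""

    created = 0

    def __init__(self):
        FakeModel.created += 1

    def find_language(self, text):
        # Placeholder result; a real model would run its network here.
        return "en" if text else "und"


def get_language_per_call(text):
    """Per-call pattern: a fresh model is built on every call."""
    return FakeModel().find_language(text)


class LanguageIdentifier:
    """Reuse pattern: build the model once, then query it many times."""

    def __init__(self):
        self._model = FakeModel()

    def get_language(self, text):
        return self._model.find_language(text)


sentences = ["first sentence", "second sentence", "third sentence"]

for s in sentences:
    get_language_per_call(s)
per_call_models = FakeModel.created  # one model per sentence

FakeModel.created = 0
identifier = LanguageIdentifier()
for s in sentences:
    identifier.get_language(s)
reused_models = FakeModel.created  # a single model for all sentences
```

In Python the per-call objects here are garbage collected, so the sketch only shows the construction cost; the leak described above comes from the native objects not being freed.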

@bsolomon1124

bsolomon1124 commented Oct 5, 2019

Hi @jasonriesa and @akihiroota87: do the maintainers of google/cld3 have any interest in incorporating Python bindings within this repo, by reviewing and combining the various forks mentioned above?

As a tangentially related change, as a part of those forks, the Chromium dependency was removed. If that wasn't the case, the logical solution might be a git submodule, but since the C source itself has changed in the forks, that becomes difficult.

@bsolomon1124

bsolomon1124 commented Oct 5, 2019

@iamthebot

I believe there's still a small error in your fork.

You use the comparison:

str(res.language) != ident.kUnknown:

This is not doing what you think it is.

Originally, res.language is a C++ string, while ident.kUnknown is a const char array (with value "und").

However, str(res.language) does not do the correct coercion, in the same way that str(b"hello") does not decode the bytes; it just produces a str representation of that bytes object.

>>> str(b"hello")
"b'hello'"
>>> str(b"hello") == "hello"  # No!
False

What is needed here is:

if <bytes> res.language != <bytes> ident.kUnknown:

You can prove this for yourself by throwing this into get_language():

cdef string tst = b"und" 
print(tst)
print(str(tst) == ident.kUnknown)
print(tst.decode("utf-8") == ident.kUnknown)

Then

python3 setup.py build_ext --inplace --quiet && python3 -c 'import cld3; cld3.get_language("hello there!")'

Will produce False, False.
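
The same pitfall can be reproduced without Cython. A minimal sketch, assuming the C++ string reaches Python as a bytes object (Cython's default coercion for std::string); the variable names are only for illustration.

```python
raw = b"und"  # what the C++ string looks like once it reaches Python

as_repr = str(raw)             # "b'und'": the repr of the bytes object
as_text = raw.decode("utf-8")  # "und": the decoded text

repr_matches = as_repr == "und"   # False: compares "b'und'" with "und"
text_matches = as_text == "und"   # True: str compared with str
bytes_match = raw == b"und"       # True: bytes compared with bytes
str_vs_bytes = as_text == b"und"  # False: str never equals bytes in Python 3
```

The comparison only works when both sides have the same type, which is why casting both operands to bytes fixes the check.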

@bsolomon1124

bsolomon1124 commented Oct 8, 2019

Using the work of everyone here (thank you everyone!) I've tried to combine the change sets into one clean set of commits and put a shiny new wrapper on things, which also sits on PyPI as pycld3.

https://github.com/bsolomon1124/pycld3

Reviews appreciated. Again, I've made my best effort to make sure the incremental changes across different forks are picked up and put together.

@iamthebot

Thanks @bsolomon1124! I actually just copied that part from the elizafox cld3 fork so I guess many of us had been using this in its broken form for a while lol. The new wrapper looks great and we'll switch to using it soon.

@bact
Author

bact commented Jan 29, 2024

gcld3 - a Python binding for CLD3 from Google

PyPI: https://pypi.org/project/gcld3/

GitHub: https://github.com/google/cld3/tree/master/gcld3

@bact bact closed this as completed Jan 29, 2024