Python binding forks and different fixes #15
Comments
I have been testing the Elizafox/cld3 Python binding and I had severe memory issues. The more sentences I detect, the more memory is used. I don't know if this is an issue in cld3 or in the Python binding specifically. And given that I cannot open an issue in any of the Python binding forks, I thought I would report it here. |
@Ipla I've fixed these memory leaks in my fork of CLD3. Basically, the elizafox version creates a new model object on each call to The fork is iamthebot/cld3 |
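A minimal sketch of the construction pattern described in this comment, contrasting a model built on every call with one built once and reused. `FakeLanguageModel`, `detect_per_call`, and `detect_shared` are hypothetical names made up for illustration; none of them come from cld3 or any of the forks. The sketch only shows how often the model is constructed. It does not reproduce the leak itself.

```python
class FakeLanguageModel:
    """Hypothetical stand-in for an expensive-to-build language identifier."""

    instances_created = 0  # counts constructions so the two patterns can be compared

    def __init__(self):
        FakeLanguageModel.instances_created += 1

    def find_language(self, text):
        # Placeholder logic only; a real model would run inference here.
        return "en" if text.isascii() else "und"


def detect_per_call(text):
    # Pattern described as problematic: a fresh model object on every call.
    return FakeLanguageModel().find_language(text)


_shared_model = None  # module-level cache for the reused model


def detect_shared(text):
    # Alternative pattern: build the model once, then reuse it for every call.
    global _shared_model
    if _shared_model is None:
        _shared_model = FakeLanguageModel()
    return _shared_model.find_language(text)
```

Calling `detect_per_call` N times constructs N models, while calling `detect_shared` N times constructs one.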
Hi @jasonriesa and @akihiroota87: do the maintainers of google/cld3 have any interest in incorporating Python bindings within this repo, by reviewing and combining the various forks mentioned above? As a tangentially related change, as a part of those forks, the Chromium dependency was removed. If that wasn't the case, the logical solution might be a git submodule, but since the C source itself has changed in the forks, that becomes difficult. |
I believe there's still a small error in your fork. You use the comparison: str(res.language) != ident.kUnknown: This is not doing what you think it is. Originally, However,
What is needed here is: if <bytes> res.language != <bytes> ident.kUnknown: You can prove this for yourself by throwing this into
Then
Will produce |
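A minimal sketch of the same pitfall in plain Python 3 (not Cython), assuming the detected language arrives as a bytes object. The variable names are hypothetical; `k_unknown` merely stands in for `ident.kUnknown`.

```python
language = b"und"    # assumption: the detected language arrives as bytes
k_unknown = b"und"   # hypothetical stand-in for ident.kUnknown

# str() on a bytes object returns its repr, including the b prefix and quotes.
as_text = str(language)  # "b'und'"

# A str never compares equal to a bytes object, so this check is always True
# and an unknown result would be treated as a detected language.
str_check_says_known = as_text != k_unknown     # True

# Comparing bytes with bytes gives the intended answer.
bytes_check_says_known = language != k_unknown  # False
```

With the `str(...)` form, the "is not unknown" test passes even when the language is "und", which is why a bytes-to-bytes comparison is needed.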
Using the work of everyone here (thank you everyone!) I've tried to combine the change sets into one clean set of commits and put a shiny new wrapper on things, which also sits on PyPI as pycld3. https://github.com/bsolomon1124/pycld3 Reviews appreciated. Again, I've made my best effort to make sure the incremental changes across different forks are picked up and put together. |
Thanks @bsolomon1124! I actually just copied that part from the elizafox cld3 fork so I guess many of us had been using this in its broken form for a while lol. The new wrapper looks great and we'll switch to using it soon. |
PyPI: https://pypi.org/project/gcld3/ GitHub: https://github.com/google/cld3/tree/master/gcld3 |
Update: CLD3 now has Python binding code from Google themselves:
gcld3
PyPI: https://pypi.org/project/gcld3/
GitHub: https://github.com/google/cld3/tree/master/gcld3
This issue documents some Python binding forks, in the hope that fixes can be merged as far upstream as possible:
Official CLD3: https://github.com/google/cld3
--> [based on google] First Python binding: https://github.com/jbaiter/cld3 by @jbaiter
----> [based on @jbaiter] Remove Chromium repo dependency (see #11) + PyPI: https://github.com/Elizafox/cld3 by @Elizafox
------> [based on @Elizafox] Fix res.language casting error (in Cython): https://github.com/RNogales94/cld3, https://github.com/PythonNut/cld3, https://github.com/houp/cld3 by @RNogales94 @PythonNut @houp
------> [based on @Elizafox] Include protobuf headers and bodies (to get around #13): https://github.com/houp/cld3 by @houp
------> [based on @Elizafox] Fix memory leak; Introduce reuse of language model for faster performance: https://github.com/iamthebot/cld3 by @iamthebot
--------> [based on @iamthebot] Fix res.language comparison; Provide easy pip install under pycld3 name: https://github.com/bsolomon1124/pycld3 by @bsolomon1124
Note:
pip install cld3 (from PyPI) installs https://github.com/Elizafox/cld3 by @Elizafox
pip install pycld3 installs an updated version, https://github.com/bsolomon1124/pycld3 by @bsolomon1124, with all the fixes and improvements listed above
Python Binding Documentation
(based on the documentation from https://github.com/Elizafox/cld3 )
Usage:
Here's some examples:
In short:
get_language returns the most likely language as the named tuple LanguagePrediction. Proportion is always 1.0 when called in this way.
get_frequent_languages will return the top number of guesses, up to a maximum specified (in the example, 5). The maximum is mandatory. Proportion will be set to the proportion of bytes found to be the target language in the list.
In the normal cld3 library, "und" may be returned as a language for unknown languages (with no other stats given). This library filters that result out as extraneous; if the language couldn't be detected, nothing will be returned. This also means, as a consequence, get_frequent_languages may return fewer results than what you asked for, or none at all.
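The "und" filtering described above can be sketched roughly as follows. This is an illustration, not the binding's actual code. The name `LanguagePrediction` and the `proportion` field come from the text above; the other field names and the helper `filter_unknown` are assumptions made for this example.

```python
from collections import namedtuple

# Only `language` and `proportion` are implied by the documentation above;
# `probability` and `is_reliable` are assumed field names.
LanguagePrediction = namedtuple(
    "LanguagePrediction",
    ["language", "probability", "is_reliable", "proportion"],
)

UNKNOWN = "und"  # code the underlying library may report for unknown languages


def filter_unknown(raw_results, num_langs):
    """Keep at most num_langs results, dropping any 'und' entries (hypothetical helper)."""
    return [r for r in raw_results[:num_langs] if r.language != UNKNOWN]


# Made-up raw results, purely for demonstration.
raw = [
    LanguagePrediction("en", 0.98, True, 0.6),
    LanguagePrediction("und", 0.0, False, 0.0),
    LanguagePrediction("de", 0.91, True, 0.4),
]
kept = filter_unknown(raw, 5)  # two entries remain, fewer than the 5 requested
```

Because unknown entries are dropped rather than replaced, the returned list can be shorter than the requested maximum, or empty when every entry is "und".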