enhancement: evaluate language detection packages to see which is best for detecting short text #1660

Closed
Coniferish opened this issue Oct 6, 2023 · 2 comments

Comments

@Coniferish
Collaborator

Stemming from conversations here and here, it would be worthwhile to run our own comparison of language detection packages to see which is best at detecting the language of short text. Package speed and size should also be considered. Packages of interest include langdetect (which we currently use), fasttext (if it is compatible with Python 3.11), langid, and lingua. We also currently use a regex pattern and an arbitrary text-length limit to default to "eng", so that behavior should be reconsidered as well.

See detect_languages in lang.py
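
A rough sketch of what such a comparison could look like, using only the standard library. Everything named here is hypothetical and for illustration only: `benchmark`, `ascii_default_detector`, `SHORT_TEXT_LIMIT`, and the sample texts are not taken from `lang.py`, and the ASCII check is only a stand-in for the regex-plus-length fallback mentioned above, not the actual implementation. Each detector is just a callable that takes text and returns a language code, so real packages could be wrapped and plugged in.

```python
import re
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a "default to eng" fallback:
# short, ASCII-only text is assumed to be English.
ASCII_ONLY = re.compile(r"[\x00-\x7F]+")
SHORT_TEXT_LIMIT = 20  # arbitrary cutoff, assumed for illustration


def ascii_default_detector(text: str) -> str:
    """Return "eng" for short ASCII-only text, otherwise "unknown"."""
    if len(text) < SHORT_TEXT_LIMIT and ASCII_ONLY.fullmatch(text):
        return "eng"
    return "unknown"


def benchmark(
    detectors: Dict[str, Callable[[str], str]],
    samples: List[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Measure accuracy and total wall-clock time for each detector."""
    if not samples:
        raise ValueError("samples must not be empty")
    results: Dict[str, Dict[str, float]] = {}
    for name, detect in detectors.items():
        correct = 0
        start = time.perf_counter()
        for text, expected in samples:
            try:
                predicted = detect(text)
            except Exception:
                # Count a detector failure on short text as a miss.
                predicted = None
            if predicted == expected:
                correct += 1
        elapsed = time.perf_counter() - start
        results[name] = {
            "accuracy": correct / len(samples),
            "seconds": elapsed,
        }
    return results


# Made-up short samples with expected three-letter codes.
SAMPLES = [
    ("Hello there", "eng"),
    ("Good morning", "eng"),
    ("¿Dónde está?", "spa"),
    ("Guten Tag", "deu"),
]

if __name__ == "__main__":
    report = benchmark({"ascii_default": ascii_default_detector}, SAMPLES)
    for name, stats in report.items():
        print(name, stats)
```

With these samples the stand-in labels "Guten Tag" as "eng", which is the kind of short-text miss the comparison is meant to surface. A wrapper around a real package would need to normalize its output to the same code scheme as the expected labels (for example, langdetect's `detect` returns two-letter codes such as "en").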

@Coniferish
Collaborator Author

Coniferish commented Oct 16, 2023

Notes/research from 10/2023

| Package | URL | Stars | Last updated | License | Python compatibility | Detects multiple languages in the same text | Deterministic? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Langdetect | https://github.com/Mimino666/langdetect | 1.5k | 2021 | | | | |
| Polyglot | https://github.com/aboSamoor/polyglot | 2.2k | 2020 | GPLv3 | | | |
| python-polyglot | https://github.com/lainq/polyglot | | | | | | |
| cld3 | https://github.com/google/cld3 | 706 | 2022 | Apache 2.0 | | | |
| pycld3 | https://github.com/bsolomon1124/pycld3 | 135 | 2021 | Apache 2.0 | | | |
| pycld2 | https://github.com/aboSamoor/pycld2 | 147 | 2022 | Apache 2.0 | | | |
| Fasttext | https://github.com/facebookresearch/fastText | 25.1k | 2023 | MIT | | | |
| spaCy | https://github.com/explosion/spaCy | | | | | | |
| lingua | https://github.com/wichert/lingua | 44 | 2022 | | | | |
| lingua-py | https://github.com/pemistahl/lingua-py | 635 | 2023 | Apache 2.0 | | | |
| googletrans | https://github.com/ssut/py-googletrans | | | | | | |
| textblob | https://github.com/sloria/TextBlob | 8.7k | 2023 | | | | |
| langid | https://github.com/saffsd/langid.py | 2.2k | 2017 | BSD-2-Clause | | | |
| py3langid | https://github.com/adbar/py3langid | 26 | 2022 | BSD 3-Clause | Python >= 3.6 | | not documented |

@jbne

jbne commented May 9, 2024

Thanks for that table. Seems like Polyglot is actually GPLv3, not sure if that was changed recently.
