Use Lingua instead of pycld3 for language detection #615

Closed
osma wants to merge 5 commits

@osma osma commented Aug 26, 2022

This draft PR fixes #593 by switching from the pycld3 language detection library to Lingua (by @pemistahl).

Lingua is used in the low accuracy mode, because it is much faster than the high accuracy mode and needs a lot less memory. I tested the high accuracy mode very briefly but just the startup overhead was so high (tens of seconds) that I considered it a non-starter.

I did a little benchmarking using the Annif tutorial yso-nlf data set and two project configurations used in the tutorial with two backend algorithms, MLLM and Omikuji Parabel. I compared current master (which uses pycld3) to this PR branch which uses Lingua 1.1.1. As a baseline, I also used project configurations with no language filtering. Here are the project configurations:

[yso-mllm-en-filter]
name=YSO MLLM project
language=en
backend=mllm
vocab=yso-en
analyzer=snowball(english)
transform=limit(10000),filter_lang,limit(5000)

[yso-omikuji-parabel-en-filter]
name=Omikuji Parabel English
language=en
backend=omikuji
analyzer=snowball(english)
vocab=yso-en
transform=limit(10000),filter_lang,limit(5000)

For the unfiltered baseline I used transform=limit(5000) instead.
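
For readers unfamiliar with the `transform` setting: the steps run left to right, so the text is first cut to 10000 characters, then language-filtered, then cut to 5000 characters. Here is a minimal sketch of how such a chain could be parsed and applied. It is not Annif's actual implementation: `parse_transform`, `apply_chain`, `REGISTRY` and the stand-in filter are hypothetical names, and the stand-in simply drops lines containing non-ASCII characters instead of doing real language detection.

```python
import re
from typing import Callable, Dict, List

Transform = Callable[[str], str]


def make_limit(n: str) -> Transform:
    """Return a step that truncates text to at most n characters."""
    limit = int(n)
    return lambda text: text[:limit]


def make_fake_filter_lang() -> Transform:
    """Placeholder for filter_lang: drops lines with non-ASCII characters."""
    return lambda text: "\n".join(
        line for line in text.splitlines() if line.isascii()
    )


# Hypothetical registry mapping step names to factories.
REGISTRY: Dict[str, Callable[..., Transform]] = {
    "limit": make_limit,
    "filter_lang": make_fake_filter_lang,
}


def parse_transform(spec: str, registry: Dict[str, Callable[..., Transform]]) -> List[Transform]:
    """Turn 'limit(10000),filter_lang,limit(5000)' into a list of callables.

    Splitting on commas is enough for this spec but would break if a
    step took several comma-separated arguments.
    """
    steps = []
    for part in spec.split(","):
        match = re.fullmatch(r"\s*(\w+)(?:\((.*)\))?\s*", part)
        if match is None:
            raise ValueError(f"cannot parse transform step: {part!r}")
        name, arg = match.groups()
        factory = registry[name]
        steps.append(factory(arg) if arg is not None else factory())
    return steps


def apply_chain(text: str, steps: List[Transform]) -> str:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        text = step(text)
    return text
```

In this reading, the first `limit` bounds how much text the (expensive) filter has to look at, and the second `limit` bounds what reaches the backend.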

Here are some performance stats (total user time over all CPU cores and maximum resident set size) that I measured using /usr/bin/time -v:
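
For anyone who wants to repeat this kind of measurement from a script instead of `/usr/bin/time -v`, here is a rough sketch using only the Python standard library. The helper name `measure` is made up for this example, and it only works on Unix-like systems because it relies on the `resource` module.

```python
import resource
import subprocess
import sys
from typing import List, Tuple


def measure(cmd: List[str]) -> Tuple[float, int]:
    """Run cmd and return (user CPU seconds, peak RSS) for child processes.

    User time is summed over all CPU cores. ru_maxrss is reported in
    kilobytes on Linux (bytes on macOS) and is a high-water mark over
    all children waited for so far, so measure one command per process
    if the memory figure matters.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user_seconds = after.ru_utime - before.ru_utime
    return user_seconds, after.ru_maxrss


if __name__ == "__main__":
    # Trivial child process, just to show the call shape.
    user, peak = measure([sys.executable, "-c", "sum(range(10**6))"])
    print(f"user time: {user:.2f} s, max RSS: {peak}")
```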

| operation | notes | no filter time (s) | no filter mem (kB) | pycld3 time (s) | pycld3 mem (kB) | lingua-low time (s) | lingua-low mem (kB) |
|---|---|---|---|---|---|---|---|
| pytest | optionals: dev,omikuji,pycld3/lingua | - | - | 76 | 1302904 | 77 | 1299056 |
| loadvoc yso | yso-skos.ttl | 121 | 2468144 | - | - | 120 | 2466232 |
| train mllm | -d 2000 -j 8 | 646 | 1876336 | 608 | 1879700 | 2176 | 1893320 |
| suggest mllm | 2017-D-52518.txt | 7 | 283548 | 7 | 278576 | 14 | 291960 |
| eval mllm | -j 8 | 128 | 360836 | 131 | 361688 | 624 | 359360 |
| train omikuji | -j 8 yso-finna-small.tsv.gz | 124 | 663368 | 125 | 682708 | 131 | 684188 |
| suggest omikuji | 2017-D-52518.txt | 5 | 400784 | 6 | 396836 | 14 | 410744 |
| eval omikuji | -j 8 | 22 | 483988 | 29 | 483704 | 513 | 481508 |

Here are the evaluation results (running annif eval on the 300 documents in the test set and measuring F1@5 and nDCG scores - higher is better):

| Project type | no filter f1@5 | no filter ndcg | pycld3 f1@5 | pycld3 ndcg | lingua-low f1@5 | lingua-low ndcg |
|---|---|---|---|---|---|---|
| mllm | 0.3276 | 0.4334 | 0.3236 | 0.4282 | 0.3183 | 0.4228 |
| omikuji | 0.2562 | 0.3543 | 0.2435 | 0.3287 | 0.2524 | 0.3385 |
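
As a reminder of what these two metrics measure, here is a small per-document sketch with binary relevance. This is a textbook formulation and may differ in details from what `annif eval` computes (for example in how the ideal ranking is defined); the function names are made up for this example. The reported figures would then be averages over the 300 test documents.

```python
import math
from typing import Sequence, Set


def f1_at_k(ranked: Sequence[str], gold: Set[str], k: int = 5) -> float:
    """F1 score of the top-k suggestions against the gold-standard subjects."""
    top = ranked[:k]
    hits = sum(1 for subject in top if subject in gold)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def ndcg(ranked: Sequence[str], gold: Set[str]) -> float:
    """Normalized discounted cumulative gain with binary relevance.

    Each hit at 0-based rank i contributes 1 / log2(i + 2). The ideal
    ranking is assumed to place all gold subjects at the top.
    """
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, subject in enumerate(ranked)
        if subject in gold
    )
    idcg = sum(1.0 / math.log2(i + 2) for i in range(len(gold)))
    return dcg / idcg if idcg > 0 else 0.0
```

Both scores reward correct subjects, but nDCG also rewards placing them near the top of the ranking, which is why the two columns can move differently between configurations.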

The good news:

  • Lingua starts up quickly (in low-accuracy mode)
  • Lingua doesn't use any more memory than pycld3 (in low-accuracy mode)

The bad news:

  • Lingua is still a lot slower than pycld3 in the grunt work of filtering long documents sentence by sentence. For example, when training MLLM with 2000 documents (truncated to max 10000 characters each by the limit filter), the user time increased from ~600 to ~2100 seconds. Likewise, evaluation time on 300 documents increased by ~500 seconds. For suggest operations on a single document, the increase was ~7 seconds (but this most likely includes some initialization overhead, so the next document would have been processed faster).
  • In general, this experiment didn't show any benefit of language filtering. The evaluation results were actually best for the baseline experiment with no filtering; using either pycld3 or Lingua just made the results worse. For other data sets and languages, the situation could be different.
  • Maybe this is nitpicking, but it was surprisingly hard to get a standard lowercase ISO 639-1 language code out of Lingua. Eventually I found out that using an expression like result.iso_code_639_1.name.lower() does the trick. pycld3 returns language codes directly, which makes the API easier to use.
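
To illustrate the last point: the expression works because the ISO code is an enum member whose name is the upper-case code. The sketch below uses stand-in classes (`FakeIsoCode`, `FakeLanguage`) so that it runs without Lingua installed; they are not Lingua's real classes, and `to_lang_code` is a hypothetical helper name.

```python
from enum import Enum, auto
from typing import Optional


class FakeIsoCode(Enum):
    """Stand-in for an enum of ISO 639-1 codes with upper-case member names."""
    EN = auto()
    FI = auto()
    SW = auto()


class FakeLanguage:
    """Stand-in for a detection result exposing an iso_code_639_1 attribute."""

    def __init__(self, code: FakeIsoCode) -> None:
        self.iso_code_639_1 = code


def to_lang_code(result: Optional[FakeLanguage]) -> Optional[str]:
    """Normalize a detection result to a lowercase two-letter code, or None."""
    if result is None:
        return None
    # Same expression as in the comment above: enum member -> "EN" -> "en"
    return result.iso_code_639_1.name.lower()
```

Wrapping the conversion in one helper like this keeps the rest of the filter code working with plain strings such as `"en"`, the same shape that pycld3 returns.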

I think the take-home message is that if Lingua could be made faster still for the detection process, then we could consider switching to it. Right now it seems that the performance cost is quite high. It would also be nice to identify a data set where the language filtering actually improves results; we could then measure whether Lingua does this better than pycld3 or not. This data set was not a good choice in that respect.

@codecov

codecov bot commented Aug 26, 2022

Codecov Report

Base: 99.61% // Head: 99.59% // Decreases project coverage by -0.02% ⚠️

Coverage data is based on head (3bbe813) compared to base (ec10014).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #615      +/-   ##
==========================================
- Coverage   99.61%   99.59%   -0.03%     
==========================================
  Files          87       87              
  Lines        6038     5946      -92     
==========================================
- Hits         6015     5922      -93     
- Misses         23       24       +1     

| Impacted Files | Coverage Δ |
|---|---|
| annif/transform/__init__.py | 100.00% <ø> (ø) |
| tests/test_transform_langfilter.py | 100.00% <ø> (ø) |
| annif/transform/langfilter.py | 96.42% <100.00%> (-3.58%) ⬇️ |
| annif/cli.py | 99.67% <0.00%> (-0.02%) ⬇️ |
| tests/test_cli.py | 100.00% <0.00%> (ø) |

@osma osma mentioned this pull request Aug 26, 2022
@@ -32,7 +32,6 @@ def test_lang_filter(project):
Kansalliskirjasto on kaikille avoin kulttuuriperintöorganisaatio, joka
palvelee valtakunnallisesti kansalaisia, tiedeyhteisöjä ja muita
yhteiskunnan toimijoita.
Abc defghij klmnopqr stuwxyz abc defghij klmnopqr stuwxyz.
Member Author

I had to change this test. This is a nonsensical sentence that pycld3 is unsure about (.is_reliable == False), so the language filter gives it the benefit of the doubt and retains it. Lingua simply identifies it as Swahili, so it gets stripped.

In the Lingua documentation I can't see a direct way of telling whether Lingua is unsure about some input; there is the minimum relative distance parameter which could be tuned, but I'm not sure that it would help with nonsensical input. For example, converting PDF documents to text is quite likely to produce garbage "sentences" that aren't in any real language.
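
To make the difference concrete, here is a sketch of the "benefit of the doubt" rule described above. The two detectors are hard-coded stand-ins that mimic the behaviour observed in this test (one flags the nonsense sentence as unreliable, the other is always confident); they do not call pycld3 or Lingua, and `Detection` and `filter_sentences` are names invented for this example.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional


class Detection(NamedTuple):
    lang: Optional[str]  # lowercase ISO 639-1 code, or None if unknown
    is_reliable: bool    # whether the detector trusts its own answer


def filter_sentences(
    sentences: Iterable[str],
    detect: Callable[[str], Detection],
    target_lang: str,
) -> List[str]:
    """Keep sentences in target_lang, plus any the detector is unsure about."""
    kept = []
    for sentence in sentences:
        result = detect(sentence)
        if not result.is_reliable or result.lang == target_lang:
            kept.append(sentence)
    return kept


def cautious_detect(sentence: str) -> Detection:
    """Stand-in that reports the nonsense sentence as an unreliable guess."""
    if sentence.startswith("Abc"):
        return Detection("sw", False)
    lang = "fi" if "Kansalliskirjasto" in sentence else "en"
    return Detection(lang, True)


def confident_detect(sentence: str) -> Detection:
    """Stand-in that makes the same guesses but never admits uncertainty."""
    return cautious_detect(sentence)._replace(is_reliable=True)
```

With the cautious stand-in the nonsense sentence survives an English filter; with the confident one it is classified as Swahili and stripped, which is the behaviour change that forced the test update.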

@pemistahl

Thank you @osma for adding my library to your evaluation. :)

It is not surprising that CLD3 is faster than Lingua. CLD3 has been implemented in C++ whereas my library is pure Python only (with the exception of the internally used NumPy arrays). You said that you would favor a pure Python library for language detection. Such a library will always be slower than one implemented in a low-level language. So there will always be compromises you have to make. As soon as PyO3 supports exporting Rust enums as Python enums, I will create Python bindings for my Rust implementation of Lingua. This will be significantly faster than the pure Python version.

It seems that you mainly want to classify large documents consisting of multiple sentences. For this kind of textual input, the high accuracy mode does not provide much benefit. It's better suited for short texts such as tweets, for instance. So the advantages of Lingua compared to other language detectors do not pay off for you. That's ok. I think it's better, then, if you stick with CLD3 to benefit from the better detection speed.

@osma
Member Author

osma commented Aug 26, 2022

> It is not surprising that CLD3 is faster than Lingua. CLD3 has been implemented in C++ whereas my library is pure Python only (with the exception of the internally used NumPy arrays). You said that you would favor a pure Python library for language detection. Such a library will always be slower than one implemented in a low-level language. So there will always be compromises you have to make. As soon as PyO3 supports PyO3/pyo3#417, I will create Python bindings for my Rust implementation of Lingua. This will be significantly faster than the pure Python version.

Understood. But I think the current Lingua implementation (with NumPy vectors) is slower than it needs to be because of the O(log(n)) lookups - having to do binary searches in big sorted arrays. This is not a question of implementation language but of algorithmic efficiency. Even pure Python (or in this case, helped along by NumPy) can be quite fast. I wrote some ideas about further optimization of Lingua in this discussion.
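
To illustrate the algorithmic point (this is not Lingua's actual data layout, just a toy model of the two lookup strategies, with class names invented here): a sorted array needs a binary search per n-gram, while a hash table answers in expected constant time, typically at the cost of more memory.

```python
from bisect import bisect_left
from typing import Dict, List, Optional


class SortedArrayModel:
    """N-gram probabilities in two parallel sorted lists; O(log n) lookup."""

    def __init__(self, probs: Dict[str, float]) -> None:
        self.keys: List[str] = sorted(probs)
        self.values: List[float] = [probs[key] for key in self.keys]

    def lookup(self, ngram: str) -> Optional[float]:
        # Binary search for the insertion point, then check for an exact match.
        i = bisect_left(self.keys, ngram)
        if i < len(self.keys) and self.keys[i] == ngram:
            return self.values[i]
        return None


class HashModel:
    """The same data in a dict; expected O(1) lookup."""

    def __init__(self, probs: Dict[str, float]) -> None:
        self.table = dict(probs)

    def lookup(self, ngram: str) -> Optional[float]:
        return self.table.get(ngram)
```

Both classes return identical results; the difference only shows up in time per lookup and memory footprint, and a filter that runs sentence by sentence performs a very large number of such lookups.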

> It seems that you mainly want to classify large documents consisting of multiple sentences. For this kind of textual input, the high accuracy mode does not provide much benefit. It's better suited for short texts such as tweets, for instance. So the advantages of Lingua compared to other language detectors do not pay off for you. That's ok. I think it's better, then, if you stick with CLD3 to benefit from the better detection speed.

The problem here is that sticking to CLD3 is not a good option, as explained in the OP of #593 - its most active Python binding library (pycld3) appears to not be actively maintained anymore, and the other ones (cld3, gcld3) are even older. pycld3 doesn't work with Python 3.10. So unless someone starts maintaining it again, we will need to switch to something else.

@pemistahl

Hi @osma,

I have just released Lingua 1.1.2 which removes the most significant performance problems of the previous version. The language models are now stored on disk as serialized NumPy arrays instead of JSON. This reduces the preloading time of the language models significantly (between 1 and 2 seconds for all models on my machine). I have also removed a bottleneck in the language detection code which makes language detection approximately 40% faster.

Can you please do your evaluation again with the new version? Would you now consider switching to my library?

Thanks. :)

@osma
Member Author

osma commented Sep 23, 2022

Thanks @pemistahl for the update, that is great news!

I will try to do a new round of experiments soon, comparing language filtering with either pycld3, Lingua or the recently added language detection functionality in Simplemma. This time I will use a dataset that actually should benefit from the filtering - the tutorial data set I used above was a bit disappointing in this respect.

@osma
Member Author

osma commented Sep 23, 2022

Rebased this PR branch on current master and force-pushed. Also upgraded to Lingua 1.1.2.

@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

0 Bugs (A)
0 Vulnerabilities (A)
0 Security Hotspots (A)
0 Code Smells (A)

No Coverage information
0.0% Duplication

@osma
Member Author

osma commented Nov 11, 2022

This didn't work well according to the benchmarks, and now the PR branch is also in a conflict with the master branch due to the Black reformatting. It doesn't make sense to spend time salvaging this, so I'll just close the PR.

@osma osma closed this Nov 11, 2022

Successfully merging this pull request may close these issues.

Replace pycld3 dependency?
2 participants