-
Naive (potential) user question here. I'm looking for a good, up-to-date language detection library for Annif - see this issue. Lingua seems promising, but it seems to require quite a lot of memory, especially when all supported languages are considered - this is pointed out in the README. I tested detecting the language of the example sentence with Lingua, and then did the same with pycld3 and langdetect; their memory usage was much, much lower - too little to bother measuring accurately. I don't see anything in the README that would justify using such huge amounts of RAM compared to other implementations. Having the rules is certainly good, but I don't think they use lots of RAM.

I'm wondering if there's some trick that other language detection libraries are performing to reduce their memory requirements? Could Lingua do that too? Or is this just a tradeoff that you have to accept if you want to achieve the high accuracy? For my purposes, good accuracy is nice to have but not a top priority. It would also help to be able to choose smaller and faster models with slightly reduced accuracy.
-
Hi Osma, thank you for your question and your interest in my library.
Well, the README does mention in section 5 that Lingua uses much larger language models than other libraries. These larger models are what provide the high detection accuracy, but they are also the cause of the high memory consumption. You are right that the rule engine is not responsible for the high RAM usage - on the contrary, it helps to reduce RAM usage a little by filtering out languages which are considered unlikely for the given input text.
I have not studied the implementation details of the other libraries, but the main reason for their lower memory requirements is their much smaller language models. For long texts this does not make a significant difference, but for short texts Lingua is much more accurate.
If you want to achieve the highest accuracy possible, then yes, this is a tradeoff you have to accept.
You have probably discovered my other implementations of Lingua which are written in Go, Rust and Kotlin. The Kotlin one has a new feature which I call low accuracy mode. When this mode is enabled, only a small subset of the language models is used, which is enough for reliably detecting the language of long texts. For short texts, however, accuracy will drop significantly. You can see the differences by looking at the plots. So the low accuracy mode uses much less RAM.

The low accuracy mode is not yet available in the other implementations but I will add it in the near future. So the Python version of Lingua will get this feature, too, but I cannot tell you when exactly. I do all of this in my free time which is limited, unfortunately. :-( But I can prioritize the Python implementation, now knowing that there is serious interest in it for a cool project of yours. :-)

I don't know whether you have tried to build the language detector from a custom subset of languages. I'm sure that for most use cases, there are always languages which never occur in your input data. So just ignore them during language detection. This is the most reasonable thing you can do to reduce memory consumption. Besides, I plan to add further languages to Lingua in the future. Building the detector from all supported languages will not be feasible anymore at some point, should I decide to support 200 languages or more. :-D
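In the Python implementation, building such a restricted detector looks roughly like this (a sketch; the chosen language set is just an example, see the README for the exact API):

```python
from lingua import Language, LanguageDetectorBuilder

# Build the detector from only the languages you actually expect;
# only the models for those languages need to be kept in memory.
detector = LanguageDetectorBuilder.from_languages(
    Language.FINNISH, Language.ENGLISH, Language.SWEDISH
).build()

print(detector.detect_language_of("Tämä lause on kirjoitettu suomeksi."))
# expected: Language.FINNISH
```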
-
Thanks a lot for the quick and helpful response!
Ah, right, I had overlooked that. The README does say that n-grams of up to length 5 are used, and of course that means many more n-grams and a much larger model. I think a clarification would help here, for example a final sentence in the first paragraph that says something along the lines of "Due to this design choice, Lingua's models are much larger than those of most other language detection libraries."
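Just to illustrate why the model size grows so quickly with the n-gram length, here is a rough upper bound for a hypothetical 30-character alphabet (real models only store the n-grams actually observed in the training data, so these are not Lingua's figures):

```python
# Upper bound on the number of distinct n-grams for a 30-character alphabet.
# Each additional n-gram length multiplies the space of possible n-grams.
alphabet = 30
for max_len in (3, 5):
    total = sum(alphabet ** n for n in range(1, max_len + 1))
    print(f"up to {max_len}-grams: {total:,} possible n-grams")
# up to 3-grams: 27,930 possible n-grams
# up to 5-grams: 25,137,930 possible n-grams
```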
Ah, that is exactly what I was looking for! Good to know you have this on your radar.
The challenge here is that I'm building a generic toolkit that tries to support many languages. It's not tailor-made for a small number of data sets, but could be applied in many different scenarios by people other than myself and my team, anywhere in the world. While it's true that in most settings there will be a lot of languages that never appear, the details differ between usage scenarios. If it's necessary to restrict the number of languages, that would have to be done via a configuration setting so that it can be adjusted according to needs.

Actually, the current main use case for a language detector (pycld3 at the moment) in Annif is language filtering - discovering and dropping sentences within a long document that are in a language other than the main, expected one. For example, academic theses written in Finnish often have an abstract in English and sometimes Swedish as well. The English and Swedish abstracts, as well as quoted sections which could be in any other language, should be dropped from the text before further NLP steps because they would just cause noise further down the line. Maybe it would be possible to build a single language model in Lingua that only looks for (say) Finnish and can detect whether a sentence is Finnish or not-Finnish, without loading the models for all other possible languages?
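To make that filtering use case concrete, the step I have in mind looks roughly like this (a sketch using a small set of expected languages; keep_finnish_sentences is just an illustrative helper, not actual Annif code):

```python
from lingua import Language, LanguageDetectorBuilder

# Detector restricted to the languages we expect to encounter in Finnish theses.
detector = LanguageDetectorBuilder.from_languages(
    Language.FINNISH, Language.ENGLISH, Language.SWEDISH
).build()

def keep_finnish_sentences(sentences: list[str]) -> list[str]:
    # Drop sentences whose detected language is not Finnish
    # (e.g. an English or Swedish abstract, or quoted passages).
    return [s for s in sentences
            if detector.detect_language_of(s) == Language.FINNISH]
```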
-
Thanks a lot! Likewise, I cannot promise that even if you do implement this, we will be able to switch to using Lingua in Annif (instead of pycld3). It would have to be tested carefully, not only for memory consumption but also for speed and practicality. Language filtering is just one of many steps in the Annif text processing pipeline, and not actually very crucial, so it doesn't make sense to spend tons of CPU and/or RAM on just that step. Those resources are better spent in the later stages, i.e. the text classification / subject indexing algorithms. They often involve similar tradeoffs, where larger models produce slightly better results but need more resources.

We just integrated Annif with the Simplemma lemmatization library in the most recent 0.58 release. It was a good fit because it's very fast and efficient, even though its lemmatization accuracy isn't quite as high as that of other libraries (e.g. spaCy). That doesn't matter because the results are good enough. In that case too, a library that does what it promises and is well maintained is what we are looking for, not necessarily top-notch accuracy. Being pure Python is a plus as well, because it generally makes deployments easier and is more future-proof when it comes to platform evolution (e.g. new Python releases).
-
This is certainly a big improvement in terms of memory usage, with some additional cost in lookups and initialization, because querying sorted NumPy arrays for a specific value is O(log n) instead of O(1) for dictionaries. The initial sorting also takes some time (you could avoid this by serializing the NumPy arrays instead of always loading from JSON).

Now if you're looking to speed this up further, and possibly also reduce memory consumption, you could take a look at using LMDB for storing the models instead. That way they would be stored on disk, in memory-mapped files, with very fast lookups. I don't know how that compares with the current NumPy storage, but I suspect it could be faster. Note that measuring the memory usage of a process that uses LMDB can be tricky; often the mmap'ed files show up as process memory even though that's not the "normal" kind of memory usage, and much of that memory can be freed by the OS when necessary.

Annif uses LMDB to store training data for a Keras/TF neural network backend. It has worked very well in this role; the only headache is that you need to tell LMDB up front how much space to allocate for the memory mapping. If you give it too little, it will fail when you exceed the limit.
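A minimal sketch of what I mean, using the lmdb Python binding (the file name, map_size and key/value layout are just illustrative, not a concrete proposal for Lingua's model format):

```python
import lmdb
import numpy as np

# The map size must be declared up front; writes fail once it is exceeded.
env = lmdb.open("language-models.lmdb", map_size=2 * 1024**3)  # 2 GiB, made up

# Store one float32 weight per n-gram; keys and values are raw bytes.
with env.begin(write=True) as txn:
    txn.put("abcde".encode("utf-8"), np.float32(-3.7).tobytes())

# Look a weight up later; the OS pages the file in lazily via the memory mapping.
with env.begin() as txn:
    raw = txn.get("abcde".encode("utf-8"))
    weight = np.frombuffer(raw, dtype=np.float32)[0] if raw is not None else None
```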
-
I have another idea for optimization, in case LMDB turns out to be a bad idea or you simply don't want to pursue it. The basic problem, AFAICT, is storing a very large number of weights associated with n-grams of different lengths. Have you considered the hashing trick? fastText uses it, for example. You could store all the weights in a single 1D NumPy array of some fixed size N (all n-grams, regardless of length, would go into the same array). The index of each weight would be a hash value calculated from the n-gram string, modulo N. Like this:
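Something along these lines (a rough sketch; the hash function, the array size N and the example weight are arbitrary choices, not lingua-py internals):

```python
import zlib
import numpy as np

N = 2_000_000  # size of the weight table; a larger N means fewer collisions

weights = np.zeros(N, dtype=np.float32)

def ngram_index(ngram: str) -> int:
    # Any cheap, stable string hash works; CRC32 is just one option.
    return zlib.crc32(ngram.encode("utf-8")) % N

# Storing a weight for an n-gram:
weights[ngram_index("abcde")] = -3.7

# Looking it up later is a single O(1) array access:
value = weights[ngram_index("abcde")]
```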
Inevitably there will be some collisions - the larger you choose N, the fewer collisions there will be. For the colliding indices you need to calculate e.g. the average of the weights and store that. This will lead to reduced accuracy. If you want to avoid that, you can use a special marker value for collisions (say -1) and have a separate fallback dictionary where you store the weights for colliding n-grams. With some luck and a well chosen value of N, you won't need a very big dictionary for this. This should be very fast as the lookups are O(1) even with a fallback dictionary. And although some of the memory is wasted for index values that are never needed, it should still use overall less memory than the current 2D solution which needs to store the n-gram strings as well.
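And a rough sketch of that fallback variant, reusing the hypothetical weights array and ngram_index function from the sketch above:

```python
import numpy as np

COLLISION = np.float32(-1.0)      # sentinel stored in slots where n-grams collided
fallback: dict[str, float] = {}   # exact weights for the (hopefully few) colliding n-grams

def lookup(ngram: str) -> float:
    # weights and ngram_index come from the previous sketch
    value = weights[ngram_index(ngram)]
    if value == COLLISION:
        return fallback[ngram]
    return float(value)
```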