-
Naive (potential) user question here. I'm looking for a good, up-to-date language detection library for Annif - see this issue. Lingua seems promising, but it seems to require quite a lot of memory, especially when all supported languages are considered - this is pointed out in the README. I tested detecting the language of the example sentence with Lingua, and then did the same with pycld3 and langdetect; their memory usage was much, much lower - too little to bother measuring accurately. I don't see anything in the README that would justify using such huge amounts of RAM compared to other implementations. Having the rules is certainly good, but I don't think they use lots of RAM.

I'm wondering if there's some trick that other language detection libraries are performing to reduce their memory requirements? Could Lingua do that too? Or is this just a tradeoff that you have to accept if you want to achieve the high accuracy? For my purposes, good accuracy is nice to have but not a top priority. It would also help to be able to choose smaller and faster models with slightly reduced accuracy.
-
Hi Osma, thank you for your question and your interest in my library.
Well, the README does mention in section 5 that Lingua uses much larger language models than other libraries. These larger models are what provide the high detection accuracy, but they are also the cause of the high memory consumption. You are right that the rule engine is not responsible for the high RAM usage - on the contrary, it helps to reduce RAM usage a little by filtering out languages which are considered unlikely for the given input text.
I have not studied the implementation details of the other libraries, but the main reason for their lower memory requirements is their much smaller language models. For long texts this does not make a significant difference, but for short texts Lingua is much more accurate.
If you want to achieve the highest accuracy possible, then yes, this is a tradeoff you have to accept.
You have probably discovered my other implementations of Lingua which are written in Go, Rust and Kotlin. The Kotlin one has a new feature which I call low accuracy mode. When this mode is enabled, only a small subset of the language models is used, which is enough for reliably detecting the language of long texts. For short texts, however, accuracy will drop significantly. You can see the differences by looking at the plots. So the low accuracy mode uses much less RAM.

The low accuracy mode is not yet available in the other implementations but I will add it in the near future. So the Python version of Lingua will get this feature, too, but I cannot tell you when exactly. I do all of this in my free time which is limited, unfortunately. :-( But I can prioritize the Python implementation, now knowing that there is serious interest in it for a cool project of yours. :-)

I don't know whether you have tried to build the language detector from a custom subset of languages. I'm sure that for most use cases, there are always languages which never occur in your input data. So just ignore them during language detection. This is the most reasonable thing you can do to reduce memory consumption. Besides, I plan to add further languages to Lingua in the future. Building the detector from all supported languages will not be feasible anymore at some point, should I decide to support 200 languages or more. :-D
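In the Python implementation, building such a restricted detector looks roughly like this (a sketch; the chosen language set is just an example, see the README for the exact API):

```python
from lingua import Language, LanguageDetectorBuilder

# Build the detector from only the languages you actually expect;
# only the models for those languages need to be kept in memory.
detector = LanguageDetectorBuilder.from_languages(
    Language.FINNISH, Language.ENGLISH, Language.SWEDISH
).build()

print(detector.detect_language_of("Tämä lause on kirjoitettu suomeksi."))
# expected: Language.FINNISH
```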
-
Thanks a lot for the quick and helpful response!
Ah, right, I had overlooked that. The README does say that n-grams of up to length 5 are used, and of course that means many more n-grams and a much larger model. I think a clarification would help here, for example a final sentence in the first paragraph that says something along the lines of "Due to this design choice, Lingua's models are much larger than those of most other language detection libraries."
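Just to illustrate why the model size grows so quickly with the n-gram length, here is a rough upper bound for a hypothetical 30-character alphabet (real models only store the n-grams actually observed in the training data, so these are not Lingua's figures):

```python
# Upper bound on the number of distinct n-grams for a 30-character alphabet.
# Each additional n-gram length multiplies the space of possible n-grams.
alphabet = 30
for max_len in (3, 5):
    total = sum(alphabet ** n for n in range(1, max_len + 1))
    print(f"up to {max_len}-grams: {total:,} possible n-grams")
# up to 3-grams: 27,930 possible n-grams
# up to 5-grams: 25,137,930 possible n-grams
```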
Ah, that is exactly what I was looking for! Good to know you have this on your radar.
The challenge here is that I'm building a generic toolkit that tries to support many languages. It's not tailor-made for a small number of data sets, but could be applied in many different scenarios by people other than myself and my team, anywhere in the world. While it's true that in most settings there will be a lot of languages that never appear, the details differ between usage scenarios. If it's necessary to restrict the number of languages, that would have to be done via a configuration setting so that it can be adjusted according to needs.

Actually, the current main use case for a language detector (pycld3 at the moment) in Annif is language filtering - discovering and dropping sentences within a long document that are in a language other than the main, expected one. For example, academic theses written in Finnish often have an abstract in English and sometimes Swedish as well. The English and Swedish abstracts, as well as quoted sections which could be in any other language, should be dropped from the text before further NLP steps because they would just cause noise further down the line. Maybe it would be possible to build a single language model in Lingua that only looks for (say) Finnish and can detect whether a sentence is Finnish or not-Finnish, without loading the models for all other possible languages?
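To make that filtering use case concrete, the step I have in mind looks roughly like this (a sketch using a small set of expected languages; keep_finnish_sentences is just an illustrative helper, not actual Annif code):

```python
from lingua import Language, LanguageDetectorBuilder

# Detector restricted to the languages we expect to encounter in Finnish theses.
detector = LanguageDetectorBuilder.from_languages(
    Language.FINNISH, Language.ENGLISH, Language.SWEDISH
).build()

def keep_finnish_sentences(sentences: list[str]) -> list[str]:
    # Drop sentences whose detected language is not Finnish
    # (e.g. an English or Swedish abstract, or quoted passages).
    return [s for s in sentences
            if detector.detect_language_of(s) == Language.FINNISH]
```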
-
Thanks a lot! Likewise, I cannot promise that even if you do implement this, we will be able to switch to using Lingua in Annif (instead of pycld3). It would have to be tested carefully, not only for memory consumption but also for speed and practicality. Language filtering is just one of many steps in the Annif text processing pipeline, and not actually very crucial, so it doesn't make sense to spend tons of CPU and/or RAM on just that step. Those resources are better spent in the later stages, i.e. the text classification / subject indexing algorithms. They often involve similar tradeoffs, where larger models produce slightly better results but need more resources.

We just integrated Annif with the Simplemma lemmatization library in the most recent 0.58 release. It was a good fit because it's very fast and efficient, even though its lemmatization accuracy isn't quite as high as that of other libraries (e.g. spaCy). That doesn't matter because the results are good enough. In that case too, a library that does what it promises and is well maintained is what we are looking for, not necessarily top-notch accuracy. Being pure Python is a plus as well, because it generally makes deployments easier and is more future-proof when it comes to platform evolution (e.g. new Python releases).
-
This is certainly a big improvement in terms of memory usage, with some additional cost in lookups and initialization, because querying sorted NumPy arrays for a specific value is O(log n) instead of O(1) for dictionaries. The initial sorting also takes some time (you could avoid this by serializing the NumPy arrays instead of always loading from JSON).

Now if you're looking to speed this up further, and possibly also reduce memory consumption, you could take a look at using LMDB for storing the models instead. That way they would be stored on disk, in memory-mapped files, with very fast lookups. I don't know how that compares with the current NumPy storage, but I suspect it could be faster. Note that measuring the memory usage of a process that uses LMDB can be tricky; often the mmap'ed files show up as process memory even though that's not the "normal" kind of memory usage, and much of that memory can be freed by the OS when necessary.

Annif uses LMDB to store training data for a Keras/TF neural network backend. It has worked very well in this role; the only headache is that you need to tell LMDB up front how much space to allocate for the memory mapping. If you give it too little, it will fail when you exceed the limit.
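A minimal sketch of what I mean, using the lmdb Python binding (the file name, map_size and key/value layout are just illustrative, not a concrete proposal for Lingua's model format):

```python
import lmdb
import numpy as np

# The map size must be declared up front; writes fail once it is exceeded.
env = lmdb.open("language-models.lmdb", map_size=2 * 1024**3)  # 2 GiB, made up

# Store one float32 weight per n-gram; keys and values are raw bytes.
with env.begin(write=True) as txn:
    txn.put("abcde".encode("utf-8"), np.float32(-3.7).tobytes())

# Look a weight up later; the OS pages the file in lazily via the memory mapping.
with env.begin() as txn:
    raw = txn.get("abcde".encode("utf-8"))
    weight = np.frombuffer(raw, dtype=np.float32)[0] if raw is not None else None
```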
-
I have another idea for optimization, in case LMDB turns out to be a bad idea or you simply don't want to pursue it. The basic problem, AFAICT, is storing a very large number of weights associated with n-grams of different lengths. Have you considered the hashing trick? fastText uses it, for example. You could store all the weights in a single 1D NumPy array of some fixed size N (all n-grams, regardless of length, would go into the same array). The index of each weight would be a hash value calculated from the n-gram string, modulo N. Like this:
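Something along these lines (a rough sketch; the hash function, the array size N and the example weight are arbitrary choices, not lingua-py internals):

```python
import zlib
import numpy as np

N = 2_000_000  # size of the weight table; a larger N means fewer collisions

weights = np.zeros(N, dtype=np.float32)

def ngram_index(ngram: str) -> int:
    # Any cheap, stable string hash works; CRC32 is just one option.
    return zlib.crc32(ngram.encode("utf-8")) % N

# Storing a weight for an n-gram:
weights[ngram_index("abcde")] = -3.7

# Looking it up later is a single O(1) array access:
value = weights[ngram_index("abcde")]
```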
Inevitably there will be some collisions - the larger you choose N, the fewer collisions there will be. For the colliding indices you need to calculate e.g. the average of the weights and store that. This will lead to reduced accuracy. If you want to avoid that, you can use a special marker value for collisions (say -1) and have a separate fallback dictionary where you store the weights for colliding n-grams. With some luck and a well chosen value of N, you won't need a very big dictionary for this. This should be very fast as the lookups are O(1) even with a fallback dictionary. And although some of the memory is wasted for index values that are never needed, it should still use overall less memory than the current 2D solution which needs to store the n-gram strings as well.
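And a rough sketch of that fallback variant, reusing the hypothetical weights array and ngram_index function from the sketch above:

```python
import numpy as np

COLLISION = np.float32(-1.0)      # sentinel stored in slots where n-grams collided
fallback: dict[str, float] = {}   # exact weights for the (hopefully few) colliding n-grams

def lookup(ngram: str) -> float:
    # weights and ngram_index come from the previous sketch
    value = weights[ngram_index(ngram)]
    if value == COLLISION:
        return fallback[ngram]
    return float(value)
```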