Continuously increasing memory usage when pylint is run via its API #792
Is there any workaround for this?
perhaps this: https://stackoverflow.com/a/50699209

```python
import functools
import gc

gc.collect()
wrappers = [
    a for a in gc.get_objects()
    if isinstance(a, functools._lru_cache_wrapper)]
for wrapper in wrappers:
    wrapper.cache_clear()
```
I wonder if it would make sense for astroid to implement public APIs for controlling caches? For example, astroid could have its own `lru_cache` wrapper. As an aside, pylint has similar issues, and any solution we implement here could probably be applied to pylint as well.
Right now the "cache" is simply a mutable default argument value in a function, so some change needs to be made to make that happen. What do you have in mind for this API @superbobry?
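For context, a minimal sketch of the pattern being described, with a cache stored in a mutable default argument (illustrative only, not astroid's actual code; `expensive_infer` is a hypothetical helper):

```python
def cached_infer(node, _cache={}):
    # The default dict is created once, at function definition time, and is
    # shared by every subsequent call for the lifetime of the process,
    # so entries accumulate and are never released.
    if node not in _cache:
        _cache[node] = expensive_infer(node)  # hypothetical helper
    return _cache[node]
```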
Sorry if I was vague. I was thinking of something like

```python
import functools
import weakref

LRU_CACHES = weakref.WeakSet()

def lru_cache(maxsize=128, typed=False):
    def wrapper(func):
        cached_func = functools.lru_cache(maxsize=maxsize, typed=typed)(func)
        LRU_CACHES.add(cached_func)
        return cached_func
    return wrapper

# ...

def clear_caches():
    for c in LRU_CACHES:
        c.cache_clear()
```

so that any internal astroid/pylint API which needs to be cached is explicitly tracked and invalidated via `clear_caches`.
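For illustration, a hedged sketch of how an internal helper might opt in to the wrapper above, and how an embedder would flush (`lookup` is a hypothetical function, not an existing astroid API):

```python
@lru_cache(maxsize=256)
def lookup(qualified_name):
    ...  # some expensive astroid-internal computation

# An application embedding astroid could then reclaim memory between runs:
clear_caches()
```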
This seems better than what we have right now. But I think we need some auto-clearing mechanism, because what would be the trigger to clear the cache in pylint otherwise? I'm absolutely not a caching expert, but maybe we can count the number of times a cache is used and keep only the "useful" entries?
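For reference, a bounded `functools.lru_cache` already does roughly this: it tracks usage and evicts the least recently used entries first. A tiny self-contained demonstration:

```python
import functools

@functools.lru_cache(maxsize=2)
def square(x):
    return x * x

square(1)
square(2)
square(3)  # evicts square(1), the least recently used entry
print(square.cache_info())  # CacheInfo(hits=0, misses=3, maxsize=2, currsize=2)
```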
I was assuming …
Last time I checked, the problem was that the cache (or is it a variable?) is a tree of module dependencies. That means, for example, the os module could be cached twice if we are linting two different files which import the os module. It would be more efficient, both in memory and execution, to use a key-based cache.
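A minimal sketch of what a key-based module cache could look like, so that a dependency such as os is parsed and stored once regardless of how many linted files import it (`parse_module` is a hypothetical helper):

```python
_module_cache = {}

def get_module_ast(qualified_name):
    # One entry per module name, shared by every file that imports it.
    if qualified_name not in _module_cache:
        _module_cache[qualified_name] = parse_module(qualified_name)
    return _module_cache[qualified_name]
```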
Hey @char101, this sounds interesting. Could you link me to the relevant bits of the code where this is happening?
Sorry, it was more than a year ago. Looking at pylint now, memory usage seems to be much better.

module.py

```python
import os
```

main.py

```python
import gc

import astroid
import psutil
from pylint import lint
from pylint.reporters.base_reporter import BaseReporter


class Reporter(BaseReporter):
    def _display(self, layout):
        pass


def main():
    prev = psutil.Process().memory_info().rss
    print(prev)
    for _ in range(0, 30):
        lint.Run(['--reports', 'n', 'module.py'], reporter=Reporter(), exit=False)
        gc.collect()
        rss = psutil.Process().memory_info().rss
        print(rss, rss - prev)
        prev = rss


if __name__ == '__main__':
    main()
```
Unfortunately, a recent pylint installed from PyPI throws an exception when linting pyparsing.py.

With the 245 KB test_tables.py from PyTables: …
@char101 are you using the latest pylint? This is supposed to be fixed by pylint-dev/pylint#4439 (i.e. astroid 2.6.5).
I was using the version on PyPI.
* Removed mutable default value in _inference_tip_cache

  The mutable default is problematic when astroid is used as a library, because it effectively becomes a memory leak, see #792. This commit moves the cache to the global namespace and adds a public API entry point to clear it.

* Removed the itertools.tee call from _inference_tip_cached

  This commit is not expected to affect the behavior and, if anything, should improve memory usage, because the result is only materialized once (before, it was also stored in full inside itertools.tee).
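A hedged sketch of the shape of that change (names simplified; not the exact astroid diff):

```python
_inference_tip_cache = {}  # module-level dict instead of a mutable default

def clear_inference_tip_cache():
    # Public entry point so embedders can release cached inference results.
    _inference_tip_cache.clear()

def _inference_tip_cached(func, node, context):
    # Cache keyed on (func, node); the generator is materialized exactly once
    # as a list instead of being buffered through itertools.tee.
    try:
        result = _inference_tip_cache[func, node]
    except KeyError:
        result = list(func(node, context))
        _inference_tip_cache[func, node] = result
    return iter(result)
```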
I've dug into this issue, as it is still present in the latest release (2.8.4). The following tweaks, combined, fixed the memory leak for me (RSS is <100 MB after parsing ~100K files): …
In summary, I think we need to implement a common custom caching mechanism as suggested by @superbobry, and expose an API to clear all the caches. What do you think?
Thank you for digging into this @keichi, maybe we can handle 1) by specifying the capacity in …
Another option is to use SQLite or a key-value database to store the cache (possibly compressed); then memory wouldn't be a problem. It might also help speed up execution of the CLI command.
@char101 switching to SQLite would hide the problem, but it wouldn't fix it. Any unbounded cache indexed on an astroid AST node is leaky when pylint is used as a library. On top of that, bounded caches could also be expensive if they retain a reference to the linter (indirectly, of course).

@keichi are you willing to give the common caching mechanism a go? Happy to review the PR if @Pierre-Sassoulas agrees it is worth exploring.

@Pierre-Sassoulas my suggestion was to add an API for flushing all caches. I don't have a use-case for changing cache capacity on the fly, but perhaps others do? I would say, though, that all caches should be bounded, and even then it might still be desirable to flush them periodically/in between API calls to reduce memory usage.
Yes, a continuously increasing cache without any way to clear it is bad :)
Not on the fly, but I guess astroid runs on a wide variety of architectures. If the bound is hard-coded, the question of exactly what bound should be applied will arise. The default value we could use is probably in the range of 2 GB–8 GB, but the need to change the value if you have 256 GB of RAM will arise at some point, I assume.
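One possible shape for that knob, sketched as an environment variable override (the variable name is hypothetical, not an existing astroid setting, and the bound here is an entry count rather than bytes, for simplicity):

```python
import functools
import os

# Default bound, with an escape hatch for machines with far more RAM.
CACHE_MAXSIZE = int(os.environ.get("ASTROID_CACHE_MAXSIZE", "512"))

@functools.lru_cache(maxsize=CACHE_MAXSIZE)
def infer_qualified_name(name):  # hypothetical cached helper
    ...
```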
The module ASTs reference each other, right? Trimming the cache might not necessarily free the objects unless the whole cache is cleared. Removing a module AST from the cache while it is still being referenced by other modules could result in a memory leak. So, to be safe, it might be necessary to loop over the cache, use the gc module to find the objects whose referrer count is only 1, and remove only those objects from the cache. And this process might need to be run several times until the target memory usage is reached.
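A rough sketch of that idea, using reference counts as the heuristic (the threshold is fragile because the loop itself holds temporary references; illustrative only):

```python
import sys

def trim_cache(cache):
    # Keep only entries that something outside the cache still references.
    # sys.getrefcount sees: the cache entry, the items() snapshot tuple,
    # the loop variable, and its own argument -- hence the threshold of 4.
    for name, module_ast in list(cache.items()):
        if sys.getrefcount(module_ast) <= 4:
            del cache[name]
```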
I'd be happy to draft a PR to introduce a new caching mechanism with help from you guys. I agree the capacity should be configurable, but a flushing API is also necessary in certain use cases. For example, I am writing a tool that parses every file in a repository for every git commit, where I need to empty the cache when checking out a new commit. My suggestion would be to introduce a cache with a user-configurable capacity and a function to flush all caches.
All the module ASTs reference each other in what becomes a large graph. To trim a graph we need to start from the leaf nodes, so IMO the focus should not be on the cache mechanism only, since the cache is only part of the graph. Alternatively, the module ASTs could refer to each other using weakrefs, and when a weakref is dead, trigger a module reparse. I think the cache is simply a key index (where the key = module name) into an in-memory graph database.
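A hedged sketch of the weakref variant: modules are held in a WeakValueDictionary, so an AST survives only while something else still references it, and a dead entry simply triggers a reparse (`parse_module` is a hypothetical helper):

```python
import weakref

_modules = weakref.WeakValueDictionary()

def module_ast(name):
    try:
        return _modules[name]  # still alive somewhere
    except KeyError:
        ast = parse_module(name)  # reparse on a dead or missing entry
        _modules[name] = ast
        return ast
```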
@char101 Are you talking about …? The problems here are … Any circular references between the graph nodes are fine, because the nodes will eventually be collected by the GC once they become unreachable.
I've found another suspicious bit: … This mutates … I was able to reproduce a leak even with all caches explicitly flushed by linting just … @keichi, as a slight aside, maybe a better way to solve the caching issue once and for all is to introduce the concept of a session and only allow caching things in the session object?
Changing …
Hmm, strange. I knew there was a small leak even after clearing the cache, but it was small enough that it was practically not a problem (I'm able to parse 100K files, as I wrote earlier). The session approach would be ideal in the long term. It would involve a lot of refactoring and breaking changes to the API, though. Maybe we can base it off of …
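A minimal sketch of the session idea, assuming a hypothetical `Session` class that owns every cache, so dropping or clearing the session releases all cached state at once:

```python
class Session:
    def __init__(self):
        self.module_cache = {}     # module name -> AST
        self.inference_cache = {}  # (func, node) -> results

    def clear(self):
        self.module_cache.clear()
        self.inference_cache.clear()

# Embedders would create one session per unit of work:
session = Session()
# ... parse and lint using `session` ...
session.clear()  # or simply drop the session object
```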
Steps to reproduce
When using pylint in-process to lint a file (in this case in vim), the memory usage of the editor continually increases. The increase is even higher when the cache is cleared. I believe this is caused by the module still being referenced by `lru_cache` even though it has been deleted from the manager cache, so each new parse adds a new cache entry. The linted file is the 253 KB `pyparsing.py`, renamed to `pyparsing1.py`.
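To illustrate the suspected mechanism (a hedged sketch, not astroid's actual code): an unbounded `lru_cache` keyed on AST nodes keeps every node, and transitively its whole module, alive even after the manager's own cache drops the module:

```python
import functools

@functools.lru_cache(maxsize=None)  # unbounded: entries are never evicted
def infer(node):
    # Each cached key holds a strong reference to `node` and, through the
    # node's parent links, to the entire module AST it belongs to.
    ...
```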
Current behavior

Increasing memory usage

Expected behavior

Stable memory usage

Output of `python -c "from astroid import __pkginfo__; print(__pkginfo__.version)"`:

2.4.1