Memory leak? #1181
There might well be, but unfortunately tracing a memory leak down takes a lot of time, so I likely won't get to this until December or maybe 2019. If someone else wants to have a look, please do. |
I've found memory leaks in the past with a package called mem_top. You make a view like this:
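Something along these lines (a minimal sketch assuming the mem_top package is installed; the view name and URL wiring are placeholders):

```python
# views.py -- a throwaway debugging view that dumps mem_top's report.
from django.http import HttpResponse
from mem_top import mem_top


def memory_report(request):
    # mem_top() returns a plain-text summary of the objects holding the most
    # references and the most bytes in the current process.
    return HttpResponse(mem_top(), content_type="text/plain")
```

Hitting a view like that periodically gives a rough picture of what is accumulating in the process.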
I'm going to put this in production and see if it comes up with anything because I'm definitely having this issue as well. Hopefully after a few days in prod whatever it is will show up in there. |
I've been running it for a few days. Two things stand out in the report.

My second biggest reference count, with 5000 references, is a collections.deque holding data passed to channel_layer.group_send. My biggest object by bytes used is another Channels object: it has 1184 references and uses 73832 bytes (not including the objects inside of it). This appears to be a receive_buffer in Channels Redis. Either of these might be normal, or they might not. Hopefully this helps.

After some more searching I believe both these objects are related. The object with 5000 references is probably a member of the receive_buffer; an asyncio.Queue has a deque on the inside. My guess is that something was missed in the logic for RedisChannelLayer.receive() and the del command is not being called.

Another note: these captures are from one process. I have 8 processes running, each with 17-20 connections as of now. Only 2 pages in the fairly large application use the Redis channel layer. I am pretty confident that those queues are not active connections but old connections that are not being cleaned up. |
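The deque connection is easy to check in isolation (a quick standalone illustration, not taken from the report above):

```python
import asyncio
from collections import deque

# asyncio.Queue buffers its items in a collections.deque internally (a CPython
# implementation detail), which is why a leaked Queue can show up in mem_top
# as a deque full of message dicts.
q = asyncio.Queue()
q.put_nowait({"type": "notify", "text": "hello"})
print(isinstance(q._queue, deque))  # True
```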
So both of those are in |
I'm pretty sure that is where the leak is. mem_top only provides the
My day is pretty booked today. Hopefully I can get some time tomorrow. |
I ran the test. I added the following code to consumers.py in @devxplorer's test project.
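Roughly this (a reconstruction of the idea, not the exact snippet): a background task that logs the size of the channel layer's receive_buffer once per second.

```python
import asyncio

from channels.layers import get_channel_layer


async def watch_receive_buffer():
    channel_layer = get_channel_layer()
    while True:
        # In channels_redis 2.x, RedisChannelLayer.receive_buffer is a
        # defaultdict mapping channel names to asyncio.Queue instances.
        print("receive_buffer size:", len(channel_layer.receive_buffer))
        await asyncio.sleep(1)
```

It can be kicked off with asyncio.ensure_future(watch_receive_buffer()) from, say, a consumer's connect().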
Apparently queues are never deleted from the receive_buffer. The size of the receive_buffer is 2500 - the same number of connections I gave it. (I lowered the range from 500 to 100, so it ran 25 * 100.) I even tried running another group of 25 requests to see if it would cause the receive_buffer to get cleared out. It just added 25 entries to the receive_buffer.

It is my opinion that this is a major issue and should be addressed as priority 1, before any other bug fixes or new features. This makes every Channels installation using Channels Redis unstable and subject to being crashed by the OOM killer.

Here is one line of the output being printed out every second:
|
I've hooked up a debugger. I don't 100% understand how this code works but I can see how the del command is never being called.
Update: So I changed my debug print line to:

It prints out:

So it may be the case that the buffers never get deleted only if they never get a message. I haven't confirmed that yet.

More notes: I modified @devxplorer's code so every user gets added to and removed from a group, gets sent a message through the channel layer, and gets sent another message without going through the channel layer. It made no difference. The receive_buffer still contains an empty queue for every connection it had.
More investigation: del does get called if you send a message through the channel layer. It does not get called if you do not send a message through the channel layer. The problem is that when the entry does get removed, it is added back to the receive buffer on the next empty() check (since receive_buffer is a defaultdict), and it remains there and follows the code flow I outlined at the top of this post.

I added the same debug code to a production app. I'm noticing all the same issues, but sometimes the queue doesn't go down to 0. With all connections closed, after running some jobs that use group_send:

Another update: I added the following code in the last finally block:
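Presumably something along these lines (a reconstruction of the idea only, shown here as a standalone monkey patch rather than an in-place edit to channels_redis): once receive() finishes with a channel, drop that channel's queue from receive_buffer if it is empty.

```python
from channels_redis.core import RedisChannelLayer

_original_receive = RedisChannelLayer.receive


async def _receive_then_cleanup(self, channel):
    try:
        return await _original_receive(self, channel)
    finally:
        # Use .get() rather than indexing: receive_buffer is a defaultdict,
        # so indexing a missing key would re-create the very entry we want gone.
        queue = self.receive_buffer.get(channel)
        if queue is not None and queue.empty():
            del self.receive_buffer[channel]


RedisChannelLayer.receive = _receive_then_cleanup
```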
This appears to fix the issue, albeit not in the most elegant way. I'm going to try running this hacky monkey patch in production and see what happens.
Update: It doesn't work. Messages eventually come through, but they get stuck. Also, refreshing a page while it is consuming messages causes the receive_buffer to get stuck with messages in its queue. This is printing out over and over again in my console even though all pages are closed. I refreshed three times while messages were coming through.
|
I think I have a fix. @andrewgodwin, I don't 100% understand this code, so please let me know if this makes sense. With these changes the receive_buffer is being cleared as soon as the connections terminate. Below is the receive() method with my additions on the lines I changed:
|
I'm not sure that the @agronick changes will not break anything, but I can confirm that the memory leak has stopped. Added a new graphic to the test project ( |
Those changes look sensible to me, @agronick - could you open them as a pull request? That way the CI system will test them and ensure all the tests pass too. |
I put these changes in prod and ran it over the long weekend. mem_top now shows my highest number of refs and bytes as thousands of objects that look like:
It seems to have solved one memory leak while creating another. Or perhaps this is a separate leak. I will need to investigate more. On the threads that have those objects I'm noticing an abnormally large number of ThreadPools (179).

Update: It appears this is a separate memory leak. It happens even when my changes to channels_redis are reverted. If you make a view that blocks indefinitely (test with time.sleep(9999999)), then in the method check_timeouts() the following is added to the application queue in an infinite loop:

Update:
This is overwriting the disconnected time, causing the application instance to never be cleaned up in application_checker():
I'm going to change it to:
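The shape of that change is presumably something like the following (a sketch with assumed names, not the actual daphne code): record the disconnect timestamp only once instead of overwriting it on every pass.

```python
import time

# Hypothetical stand-in for the per-connection bookkeeping dict; the real
# structure lives in daphne and its fields may be named differently.
connection = {}

# Before: overwriting on every pass keeps resetting the clock that
# application_checker() compares against, so the instance never looks old
# enough to be cleaned up:
#   connection["disconnected"] = time.time()

# After: only set it the first time the disconnect is observed.
connection.setdefault("disconnected", time.time())
```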
and see how prod handles the change. |
Great, thanks - that does look like a big improvement. |
Great. |
@agronick, thanks for the big work! But there seems to be another problem when sending huge messages through websockets. I updated the test project in a new branch (https://github.com/devxplorer/ws_test/tree/big_payload). @andrewgodwin, should I create a new ticket for this or will you reopen this one? |
If it's a different bug, then open a new ticket with new steps to reproduce. Note that sending huge messages through sockets is always going to use up (size of message * 3) amount of RAM, so if this is a true leak you need to show that it's constantly leaking rather than always jumping up to the size of the largest message. |
Thank you for this conversation here! For all the googlers that find this issue: I've also encountered this memory leak, and upgrading to channels_redis 2.3.2 (2.3.3 is incompatible with Python 3.5...) solved my memory leak! |
I'm still seeing instances of this (or something like it) in https://github.com/ansible/awx/blob/devel/awx/main/consumers.py#L114

Perhaps this is similar to what @devxplorer noticed late last year. After noticing a slow memory leak in Daphne at high volumes of (large) messages, I decided to add some instrumentation to my routing code:

```diff
diff --git a/awx/main/routing.py b/awx/main/routing.py
index 2866d46ed0..71e40c693f 100644
--- a/awx/main/routing.py
+++ b/awx/main/routing.py
@@ -1,3 +1,6 @@
+from datetime import datetime
+import tracemalloc
+
 import redis
 import logging
@@ -10,11 +13,16 @@ from channels.routing import ProtocolTypeRouter, URLRouter
 from . import consumers
+TRACE_INTERVAL = 60 * 5  # take a tracemalloc snapshot every 5 minutes
+
+
 logger = logging.getLogger('awx.main.routing')
 class AWXProtocolTypeRouter(ProtocolTypeRouter):
     def __init__(self, *args, **kwargs):
+        tracemalloc.start()
+class SnapshottedEventConsumer(consumers.EventConsumer):
+
+    last_snapshot = datetime.min
+
+    async def connect(self):
+        if (datetime.now() - SnapshottedEventConsumer.last_snapshot).total_seconds() > TRACE_INTERVAL:
+            snapshot = tracemalloc.take_snapshot()
+            top_stats = snapshot.statistics('lineno')
+            top_stats = [
+                stat for stat in top_stats[:25]
+                if 'importlib._bootstrap_external' not in str(stat)
+            ]
+            for stat in top_stats[:25]:
+                logger.error('[TRACE] ' + str(stat))
+            SnapshottedEventConsumer.last_snapshot = datetime.now()
+        await super().connect()
+
+
 websocket_urlpatterns = [
-    url(r'websocket/$', consumers.EventConsumer),
+    url(r'websocket/$', SnapshottedEventConsumer),
```
...and what I found was that we were leaking the actual messages somewhere:
I spent some time reading over https://github.com/django/channels_redis/blob/master/channels_redis/core.py and was immediately suspicious of
I think the root of this issue is that the

In practice, having only a send / receive interface might make this a hard/expensive problem to solve. One thing that stands out to me about the Redis channel layer is that it does do expiration in Redis, but once messages are loaded into memory, there doesn't seem to be any sort of expiration logic for the |
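As an illustration of the kind of in-memory housekeeping being suggested (purely a sketch, not something channels_redis provides), per-channel buffers could carry a last-used timestamp and be reaped when idle:

```python
import asyncio
import time


class ExpiringReceiveBuffers:
    """Toy model of per-channel queues that expire after sitting idle."""

    def __init__(self, max_idle=60):
        self.max_idle = max_idle
        self._queues = {}     # channel name -> asyncio.Queue
        self._last_used = {}  # channel name -> last access timestamp

    def get(self, channel):
        # Touch the timestamp on every access so active channels never expire.
        self._last_used[channel] = time.time()
        return self._queues.setdefault(channel, asyncio.Queue())

    async def reap_forever(self, interval=10):
        while True:
            now = time.time()
            for channel in list(self._queues):
                if now - self._last_used.get(channel, 0) > self.max_idle:
                    # Any messages still queued here are dropped, which is the
                    # trade-off that expiration implies.
                    del self._queues[channel]
                    del self._last_used[channel]
            await asyncio.sleep(interval)
```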
I'm attempting to work around this issue in the AWX project itself at the moment by adding some code which tracks disconnections and prevents the Redis channel layer from continuing to handle messages for that specific channel:

It's worth mentioning, however, that this workaround probably isn't perfect, and I'm guessing it still sometimes leaks a little bit of data. In my opinion, what's really needed is some mechanism for

Aside from a change in the Redis channel layer to add housekeeping to this data structure, can you think of anything else I could do here? |
Also, @andrewgodwin: as a fellow open source maintainer, I understand the struggles of dealing with reports like this on closed issues 😄. Would you prefer that I continue this conversation here, or file a new, distinct issue to track this? |
I currently don't maintain the channels projects, but I would suggest opening a new issue against |
Thanks @andrewgodwin! |
Hi @ryanpetrello. Good report! Let's see if we can sort this out on channels_redis. 👍 |
Hello! On our production service we ran into the problem of constant, seemingly unbounded growth in memory consumption.

To reproduce the problem, a simple project was created (https://github.com/devxplorer/ws_test). We test it with the script scripts/ws_test.py, which opens a connection, sends a message, and closes the connection. For diagnostics we used the memory_profiler package; examples of the resulting graphs from running the test script can be found in the plots/ folder.

Local environment (where the tests were made):

requirements.txt:

The conclusions we have been able to draw: perhaps when creating/closing connections there is some kind of memory leak in channels? Or there is some mistake in our code that we are missing. Maybe someone has faced a similar problem and managed to solve it? Any help would be appreciated.
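For reference, the test script described above does roughly the following (an approximation of scripts/ws_test.py, not the actual file; the URL, message, and connection count are placeholders):

```python
import asyncio

import websockets  # third-party "websockets" client library


async def one_connection(url):
    # Open a connection, send a single message, and close it again.
    async with websockets.connect(url) as ws:
        await ws.send("ping")


async def main(url="ws://localhost:8000/ws/test/", total=500):
    for _ in range(total):
        await one_connection(url)


if __name__ == "__main__":
    asyncio.run(main())
```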