Newly connected channels may fail to receive messages from group_send() #1244
Comments
Same issue here, please help, thanks!
I noticed the same behaviour a couple of weeks ago. My setup was […] It did not seem to appear when the software is running via […]
We experience this issue as well. At first we thought it was an issue with the channels-redis backend, so we moved over to a rabbitmq backend, but this has not solved the problem. It seems that daphne workers time out against their backend and stop receiving subscribed group_send messages.
So, can we pin that down to a specific issue with Daphne? (If so, a ticket there with a reproduce would be wonderful.) (If not, some kind of reproduce would be helpful...)
So is it an nginx config issue?
This isn't really actionable as is: there's nothing pointing to channels. Likely the issue is elsewhere in the stack. I'm going to close for now. If someone can put together a reproduce showing channels (or Daphne) is at fault, I'm happy to re-open.
@carltongibson I'm not sure why you'd close it. Surely the fact that there's nothing actionable means that it warrants further investigation. It's not an NGINX issue, because simply issuing a reload (kill -HUP) to daphne immediately fixes the problem. Also, anything that isn't group_send based (i.e. application-level ping-pong) still works just fine.
It's not an addressable issue as it stands. Is it DigitalOcean? Is it Nginx? Is it Daphne? Is it Channels? If you read the comments here, there is nothing in common... so pending a reproduce there's nothing to do. We don't have capacity to chase down phantom issues with no details.
Channels/Daphne/group_send are the common denominator.
As I said, I'm very happy to investigate if anyone can provide a reproduce.
So it would seem I owe @carltongibson an apology. After further debugging, this appears to be an odd behavior with nginx buffering. @DrChai @xuehaoze try adding "proxy_buffering off;" to the nginx conf that handles your websocket proxying? EDIT: see below, this does not fix the problem.
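For reference, the directive mentioned in this comment belongs in the nginx location block that proxies the websocket traffic. A minimal sketch (the upstream name and /ws/ path are placeholders, not taken from anyone's actual config, and per the EDIT in the comment this did not ultimately fix the problem):

```nginx
# Hypothetical websocket proxy block; "daphne_upstream" and /ws/ are placeholders.
location /ws/ {
    proxy_pass http://daphne_upstream;
    proxy_http_version 1.1;                  # required for the WebSocket upgrade
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_buffering off;                     # the directive suggested above
    proxy_read_timeout 86400;                # keep long-lived sockets open
}
```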
@qeternity Thanks for the follow-up and the pointer. Glad you made progress!
@qeternity @carltongibson Thanks for your reply, we will give it a try!
After running this for a few days in production, this does NOT resolve the issue. group_send is still broken, but everything else works. Restarting daphne is the only solution. I have tried both redis and rabbitmq backends, so this suggests it's something internal to daphne/channels.
@qeternity I have to upgrade an old project to this right now, and this issue is going to be a problem for me, as I'm using this same setup. Now, here's something from layers.py that concerns me, in async def group_send(self, group, message): […]
Neither is there any logging in self.send if the channel is full. It just raises the exception, which is then ignored by this code. If this condition happens, you'll never know it. So my first thought is that the group messages are never completely cleaned up from the channel; possibly only in some specific use case, or possibly always. They'd get handled by the worker, but not removed from the channel. This would explain why it takes time for the scenario to happen: it takes time for the channel to fill up.
So, if you already have a setup where you can reproduce this bug, could you add a debug line in place of the final "pass", or perhaps just re-raise the exception? If we knew the problem is this, the search for the bug would be much easier. Also, if this is the reason, I'd expect raising the frequency of group_sends in your test code to make it happen faster, unless the problem is caused by a group_send happening simultaneously with some other method call, and only then leaving the channel with one slot less space in it. This is all pure guesswork; this is the first time I have ever even looked at this project's code.
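To make the failure mode described above concrete, here is a minimal, self-contained model of the pattern (my own sketch, not the actual channels code): a bounded per-channel queue, and a group_send that swallows the capacity error, so drops are invisible to the caller.

```python
import asyncio

class ChannelFull(Exception):
    """Raised when a channel's queue is at capacity."""

class ToyLayer:
    """Toy in-memory layer modelling the silent-drop pattern discussed above."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.channels = {}   # channel name -> list of queued messages
        self.groups = {}     # group name -> set of channel names

    async def send(self, channel, message):
        queue = self.channels.setdefault(channel, [])
        if len(queue) >= self.capacity:
            raise ChannelFull(channel)
        queue.append(message)

    async def group_send(self, group, message):
        for channel in self.groups.get(group, set()):
            try:
                await self.send(channel, message)
            except ChannelFull:
                pass   # the drop is invisible to the caller

layer = ToyLayer(capacity=2)
layer.groups["demo"] = {"chan-a"}
for i in range(5):
    asyncio.run(layer.group_send("demo", {"seq": i}))

# Only the first 2 of 5 messages survive; the rest were dropped silently.
print(len(layer.channels["chan-a"]))  # → 2
```

If nothing ever drains the queue, every later group_send silently loses the message for that channel, which matches the "it takes time for the channel to fill up" observation.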
Hi @mkoponen. That's interesting. (Gives an avenue to work on at least...) I have some time to work on Channels at DjangoCon Europe next week, so I'll re-open to investigate. (If anyone wants to jump in before then, please do! 🙂)
@mkoponen @carltongibson I'll debug this week, but we've used two separate backends (redis and rabbitmq) and neither has an implementation like that. That's the InMemory layer, which isn't what we're using.
Ah, so it is. However, my next question is: does this condition ever return true in the problem case? if redis.call('LLEN', KEYS[i]) < tonumber(ARGV[i + #KEYS]) then — I don't understand all the concepts well enough yet to see what this line means, but I do notice that, again, the method would fail silently if it didn't. The first goal should be to figure out whether, when the problem starts, the failure is in pushing the information into Redis or in receiving it back.
EDIT: Another suggestion: could there be a scenario in the two awaits at the top, which seek to remove expired channels, such that they would simply await forever?
EDIT2: I see. This is from the "simple" send method. Capacity is also given as an argument to the Lua code, so it is clearly trying to do this same check for exceeded capacity; it just does it in Lua, whereas send() does it in Python. But I'm back to my original hypothesis: the channel gets full, and send starts failing silently.
EDIT3: Further hypothesis: group messages get cleaned up only through expiry, and never through successful reception and handling of the message. Users who send them very infrequently never notice the problem, because the slow expiry leaves room for each new message. Users who send them frequently hit capacity.
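The EDIT3 hypothesis can be sketched numerically. Assuming a bounded queue that is drained only by TTL expiry (an assumption for illustration, not verified channels-redis behaviour), an infrequent sender never hits capacity while a frequent one quickly does:

```python
def simulate(send_interval, ttl=60.0, capacity=100, duration=3600.0):
    """Count sends rejected for capacity when messages leave only by expiry."""
    queue = []        # timestamps of messages still sitting in the channel
    rejected = 0
    t = 0.0
    while t < duration:
        queue = [ts for ts in queue if t - ts < ttl]   # expiry is the only cleanup
        if len(queue) >= capacity:
            rejected += 1                              # this send fails silently
        else:
            queue.append(t)
        t += send_interval
    return rejected

# A sender every 5s stays well under capacity (steady state ~12 queued messages);
# a sender every 0.1s would need ~600 slots and starts losing messages.
print(simulate(send_interval=5.0), simulate(send_interval=0.1))
```

Under these assumptions, the slow sender loses nothing over an hour while the fast sender rejects the bulk of its messages, which would explain why only high-frequency group_send users see the problem.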
@qeternity
This problem is still not resolved. We are using the rabbitmq backend. Our server locked up again, and I went into the rabbitmq admin panel and saw that the queues all had 100 messages waiting. This is definitely the problem. It would appear that the asgi websocket workers disconnect from the queue and stop receiving messages, which then pile up to the channel limit.
So it's not redis... still looking for a decent reproduce. I need to get DRF v3.10 out, and then I'll have some time to dig into this and other channels issues.
Indeed, as I wrote above, we switched backends attempting to mitigate this, but it had no impact.
Was anyone able to make any progress on this? Or find a way to reproduce it reliably? I'm having the same issue.
I'm having the same issue. I've tried creating a different test app with the same environment specs and everything works, but my current app fails to send the correct messages to redis via channel layers. Any update?
This still needs digging into, I'm afraid. I'm currently looking into issues around […] If someone could pin it down to an example project that reliably reproduces the issue, that would be more than half way to a solution.
Hi guys, I've managed to fix my issue with this problem. The cause was rather silly: I needed to open my WebSocket connection from the frontend app. If I don't open the WS connection, Django never receives the connection and completes the handshake, so channels and Redis cannot send any messages via channel layers (tested and confirmed in my case). I don't quite see the logic behind this, but it removed the problem in my case. I hope this helps someone. Cheers!
Seems like I've been looking for the culprit in the wrong place as well, at least for some cases. For the Googlers coming here: the issue was that there wasn't actually a websocket connection anymore. We did listen to […]
@maurice-g Sorry, I am a bit new to this whole websockets thing. How do you implement a manual heartbeat from frontend to backend? I am using the following: […]
I faced this issue and found two culprits so far: […]
Here is an always-reproducible example of a similar issue: https://stackoverflow.com/questions/59195607/django-channels-daphne-and-docker-clients-stop-receiving-messages-when-the-fir
Any solution? I am still facing the same issue; I've been stuck for almost three days. What I think is that async_to_sync creates a new connection each time it is called. Is there an alternative?
I encountered a similar problem once: the client side receives nothing from the websocket after a period of time. After debugging, my problem turned out to be caused by a python3 bug: https://github.com/python/cpython/blob/v3.6.1/Lib/asyncio/locks.py#L163 asyncio does not deal with exceptions properly when acquiring a lock, which leaves other waiters in a deadlocked state. After upgrading my local python, it was fixed.
We encountered this problem while developing a real-time game. As far as I can remember, sometimes a client may disconnect from the websocket without the disconnect event ever firing. I don't think the problem is specific to django channels, since we saw it while using flask as well; it's more about routers and switches, because intermediate routers and switches may drop a connection for being idle, and the websocket will never see a disconnect event. So we deployed a PING/PONG messaging system, which solved the problem for us. It's worth noting that we used a random data generator, because plain PING or PONG messages were too light. That said, the problem @mkoponen mentioned is a serious one.
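The ping/pong idea above can be modelled without real sockets: send a ping on a schedule and declare the peer dead once no pong has been seen within a timeout, independently of any TCP close event. A toy asyncio sketch (all names and intervals are my own, not from anyone's code in this thread):

```python
import asyncio
import time

async def heartbeat(send_ping, last_pong, interval=0.1, timeout=0.25):
    """Ping forever; return False once no pong has arrived within `timeout`."""
    while True:
        await send_ping()
        await asyncio.sleep(interval)
        if time.monotonic() - last_pong() > timeout:
            return False   # peer is gone even though no close event ever fired

async def demo():
    pong_at = time.monotonic()
    answered = 0

    async def send_ping():
        # Simulated peer: replies to the first two pings, then silently dies,
        # as an idle router/switch dropping the connection would cause.
        nonlocal pong_at, answered
        if answered < 2:
            answered += 1
            pong_at = time.monotonic()

    return await heartbeat(send_ping, lambda: pong_at)

print(asyncio.run(demo()))  # → False: the dead peer was detected
```

In a real deployment the same loop would run on either side of the websocket, with the pong timestamp updated by the message handler; the key point is that liveness comes from application-level traffic, not from the transport's close notification.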
I'm seeing this issue in my app. I can confirm in my case: […]
There's a common situation in my app where a client sends two messages to the server. One message is a broadcast (thus the server does […]). I'm using redis, although others here have stated they see this problem no matter what channel layer they use; I haven't tried other channel layers myself. I'm just digging into this right now... Channels 3.0.1
I'm using uvicorn to serve my app and still having this issue. |
It looks like occasionally some channels won't receive messages sent via group_send(), although they all successfully connected to our server. Disconnecting and re-connecting may resolve the issue. Here is the sample: two clients connected via websocket and joined the same group, partner_orders_1531, then we called async_to_sync(channel_layer.group_send), but only one channel, with channel name qUSiCEQD!TGlphabkczyd, was able to receive the message and triggered […]
Related redis log: […]
Pip & Environment:
Python 3.6.6
channels==2.1.5
channels-redis==2.3.1
daphne==2.2.3
Django==1.11.10
The server is hosted on Ubuntu 16.04.5 LTS with 2 vCPUs / 4 GB memory on DigitalOcean. We have two daphne instances and use nginx for load balancing.