Newly connected channels may fail to receive messages from group_send() #1244
Comments
Same issue here, please help, thanks!
I noticed the same behaviour a couple of weeks ago. My setup was […] It did not seem to appear when the software is running via […]
We experience this issue as well. At first we thought it was an issue with the channels-redis backend, so we moved over to a rabbitmq backend, but this has not solved the problem. It seems that daphne workers time out against their backend and stop receiving subscribed group_send messages.
So, can we pin that down to a specific issue with Daphne? (If so, a ticket there with a reproduce would be wonderful.) (If not, some kind of reproduce would be helpful...)
So is it an nginx config issue?
This isn't really actionable as is: there's nothing pointing to channels. Likely the issue is elsewhere in the stack. I'm going to close for now. If someone can put together a reproduce showing channels (or Daphne) is at fault, I'm happy to re-open.
@carltongibson I'm not sure why you'd close it. Surely the fact that there's nothing actionable means that it warrants further investigation. It's not an NGINX issue, because simply issuing a reload (kill -HUP) to daphne immediately fixes the problem. Also, anything that isn't group_send based (i.e. application-level ping-pong) still works just fine.
It's not an addressable issue as it stands. Is it DigitalOcean? Is it Nginx? Is it Daphne? Is it Channels? If you read the comments here, there is nothing in common... so pending a reproduce there's nothing to do. We don't have capacity to chase down phantom issues with no details.
Channels/Daphne/group_send are the common denominator.
As I said, I'm very happy to investigate if anyone can provide a reproduce.
So it would seem I owe @carltongibson an apology. After further debugging, this appears to be an odd behavior with nginx buffering. @DrChai @xuehaoze try adding "proxy_buffering off;" to the nginx conf that handles your websocket proxying? EDIT: see below, this does not fix the problem.
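For reference, the directive mentioned in this comment belongs in the nginx location block that proxies the websocket traffic. A minimal sketch (the upstream name and /ws/ path are placeholders, not taken from anyone's actual config, and per the EDIT in the comment this did not ultimately fix the problem):

```nginx
# Hypothetical websocket proxy block; "daphne_upstream" and /ws/ are placeholders.
location /ws/ {
    proxy_pass http://daphne_upstream;
    proxy_http_version 1.1;                  # required for the WebSocket upgrade
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_buffering off;                     # the directive suggested above
    proxy_read_timeout 86400;                # keep long-lived sockets open
}
```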
@qeternity Thanks for the follow-up and the pointer. Glad you made progress!
@qeternity @carltongibson Thanks for your reply, we will give it a try!
After running this for a few days in production, this does NOT resolve the issue. group_send is still broken, but everything else works. Restarting daphne is the only solution. I have tried both redis and rabbitmq backends, so this suggests it's something internal to daphne/channels.
@qeternity I have to upgrade an old project to this right now, and this issue is going to be a problem for me, as I'm using this same setup. Now, here's something from layers.py that concerns me, in async def group_send(self, group, message): […]
Neither is there any logging in self.send if the channel is full. It just raises the exception, which is then ignored by this code. If this condition happens, you'll never know it. So my first thought is that the group messages are never completely cleaned up from the channel; possibly only in some specific use case, or possibly always. They'd get handled by the worker, but not removed from the channel. This would explain why it takes time for the scenario to happen: it takes time for the channel to fill up.
So, if you already have a setup where you can reproduce this bug, could you add a debug line in place of the final "pass", or perhaps just re-raise the exception? If we knew the problem is this, the search for the bug would be much easier. Also, if this is the reason, I'd expect raising the frequency of group_sends in your test code to make it happen faster, unless the problem is caused by a group_send happening simultaneously with some other method call, and only then leaving the channel with one slot less space in it. This is all pure guesswork; this is the first time I have ever even looked at this project's code.
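To make the failure mode described above concrete, here is a minimal, self-contained model of the pattern (my own sketch, not the actual channels code): a bounded per-channel queue, and a group_send that swallows the capacity error, so drops are invisible to the caller.

```python
import asyncio

class ChannelFull(Exception):
    """Raised when a channel's queue is at capacity."""

class ToyLayer:
    """Toy in-memory layer modelling the silent-drop pattern discussed above."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.channels = {}   # channel name -> list of queued messages
        self.groups = {}     # group name -> set of channel names

    async def send(self, channel, message):
        queue = self.channels.setdefault(channel, [])
        if len(queue) >= self.capacity:
            raise ChannelFull(channel)
        queue.append(message)

    async def group_send(self, group, message):
        for channel in self.groups.get(group, set()):
            try:
                await self.send(channel, message)
            except ChannelFull:
                pass   # the drop is invisible to the caller

layer = ToyLayer(capacity=2)
layer.groups["demo"] = {"chan-a"}
for i in range(5):
    asyncio.run(layer.group_send("demo", {"seq": i}))

# Only the first 2 of 5 messages survive; the rest were dropped silently.
print(len(layer.channels["chan-a"]))  # → 2
```

If nothing ever drains the queue, every later group_send silently loses the message for that channel, which matches the "it takes time for the channel to fill up" observation.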
Hi @mkoponen. That's interesting. (Gives an avenue to work on at least...) I have some time to work on Channels at DjangoCon Europe next week, so I'll re-open to investigate. (If anyone wants to jump in before then, please do! 🙂)
@mkoponen @carltongibson I'll debug this week, but we've used two separate backends (redis and rabbitmq) and neither has an implementation like that. That's the InMemory layer, which isn't what we're using.
Ah, so it is. However, my next question is: does this condition ever return true in the problem case? if redis.call('LLEN', KEYS[i]) < tonumber(ARGV[i + #KEYS]) then — I don't understand all the concepts well enough yet to see what this line means, but I do notice that, again, the method would fail silently if it didn't. The first goal should be to figure out whether, when the problem starts, the failure is in pushing the information into Redis or in receiving it back.
EDIT: Another suggestion: could there be a scenario in the two awaits at the top, which seek to remove expired channels, such that they would simply await forever?
EDIT2: I see. This is from the "simple" send method. Capacity is also given as an argument to the Lua code, so it is clearly trying to do this same check for exceeded capacity; it just does it in Lua, whereas send() does it in Python. But I'm back to my original hypothesis: the channel gets full, and send starts failing silently.
EDIT3: Further hypothesis: group messages get cleaned up only through expiry, and never through successful reception and handling of the message. Users who send them very infrequently never notice the problem, because the slow expiry leaves room for each new message. Users who send them frequently hit capacity.
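The EDIT3 hypothesis can be sketched numerically. Assuming a bounded queue that is drained only by TTL expiry (an assumption for illustration, not verified channels-redis behaviour), an infrequent sender never hits capacity while a frequent one quickly does:

```python
def simulate(send_interval, ttl=60.0, capacity=100, duration=3600.0):
    """Count sends rejected for capacity when messages leave only by expiry."""
    queue = []        # timestamps of messages still sitting in the channel
    rejected = 0
    t = 0.0
    while t < duration:
        queue = [ts for ts in queue if t - ts < ttl]   # expiry is the only cleanup
        if len(queue) >= capacity:
            rejected += 1                              # this send fails silently
        else:
            queue.append(t)
        t += send_interval
    return rejected

# A sender every 5s stays well under capacity (steady state ~12 queued messages);
# a sender every 0.1s would need ~600 slots and starts losing messages.
print(simulate(send_interval=5.0), simulate(send_interval=0.1))
```

Under these assumptions, the slow sender loses nothing over an hour while the fast sender rejects the bulk of its messages, which would explain why only high-frequency group_send users see the problem.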
@qeternity
This problem is still not resolved. We are using the rabbitmq backend. Our server locked up again, and I went into the rabbitmq admin panel and saw that the queues all had 100 messages waiting. This is definitely the problem. It would appear that the asgi websocket workers disconnect from the queue and stop receiving messages, which then pile up to the channel limit.
So it's not redis... still looking for a decent reproduce. I need to get DRF v3.10 out, and then I'll have some time to dig into this and other channels issues.
Indeed, as I wrote above, we switched backends attempting to mitigate this, but it had no impact.
Was anyone able to make any progress on this? Or find a way to reproduce it reliably? I'm having the same issue.
I'm having the same issue. I've tried creating a different test app with the same environment specs and everything works, but my current app fails to send the correct messages to redis via channel layers. Any update?
This still needs digging into, I'm afraid. I'm currently looking into issues around […] If someone could pin it down to an example project that reliably reproduces the issue, that would be more than half way to a solution.
Hi guys, I've managed to fix my issue with this problem. The cause was rather silly: I needed to open my WebSocket connection from the frontend app. If I don't open the WS connection, Django never receives the connection and completes the handshake, so channels and Redis cannot send any messages via channel layers (tested and confirmed in my case). I don't quite see the logic behind this, but it removed the problem in my case. I hope this helps someone. Cheers!
Seems like I've been looking for the culprit in the wrong place as well, at least for some cases. For the Googlers coming here: the issue was that there wasn't actually a websocket connection anymore. We did listen to […]
@maurice-g Sorry, I am a bit new to this whole websockets thing. How do you implement a manual heartbeat from frontend to backend? I am using the following: […]
I faced this issue and found two culprits so far: […]
Here is an always-reproducible example of a similar issue: https://stackoverflow.com/questions/59195607/django-channels-daphne-and-docker-clients-stop-receiving-messages-when-the-fir
Any solution? I am still facing the same issue; I've been stuck for almost three days. What I think is that async_to_sync creates a new connection each time it is called. Is there an alternative?
I encountered a similar problem once: the client side receives nothing from the websocket after a period of time. After debugging, my problem turned out to be caused by a python3 bug: https://github.com/python/cpython/blob/v3.6.1/Lib/asyncio/locks.py#L163 asyncio does not deal with exceptions properly when acquiring a lock, which leaves other waiters in a deadlocked state. After upgrading my local python, it was fixed.
We encountered this problem while developing a real-time game. As far as I can remember, sometimes a client may disconnect from the websocket without the disconnect event ever firing. I don't think the problem is specific to django channels, since we saw it while using flask as well; it's more about routers and switches, because intermediate routers and switches may drop a connection for being idle, and the websocket will never see a disconnect event. So we deployed a PING/PONG messaging system, which solved the problem for us. It's worth noting that we used a random data generator, because plain PING or PONG messages were too light. That said, the problem @mkoponen mentioned is a serious one.
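The ping/pong idea above can be modelled without real sockets: send a ping on a schedule and declare the peer dead once no pong has been seen within a timeout, independently of any TCP close event. A toy asyncio sketch (all names and intervals are my own, not from anyone's code in this thread):

```python
import asyncio
import time

async def heartbeat(send_ping, last_pong, interval=0.1, timeout=0.25):
    """Ping forever; return False once no pong has arrived within `timeout`."""
    while True:
        await send_ping()
        await asyncio.sleep(interval)
        if time.monotonic() - last_pong() > timeout:
            return False   # peer is gone even though no close event ever fired

async def demo():
    pong_at = time.monotonic()
    answered = 0

    async def send_ping():
        # Simulated peer: replies to the first two pings, then silently dies,
        # as an idle router/switch dropping the connection would cause.
        nonlocal pong_at, answered
        if answered < 2:
            answered += 1
            pong_at = time.monotonic()

    return await heartbeat(send_ping, lambda: pong_at)

print(asyncio.run(demo()))  # → False: the dead peer was detected
```

In a real deployment the same loop would run on either side of the websocket, with the pong timestamp updated by the message handler; the key point is that liveness comes from application-level traffic, not from the transport's close notification.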
I'm seeing this issue in my app. I can confirm in my case: […]
There's a common situation in my app where a client sends two messages to the server. One message is a broadcast (thus the server does […]). I'm using redis, although others here have stated they see this problem no matter what channel layer they use; I haven't tried other channel layers myself. I'm just digging into this right now... Channels 3.0.1
I'm using uvicorn to serve my app and still having this issue. |
It looks like occasionally some channels won't receive messages sent via group_send(), although they all successfully connected to our server. Disconnecting and re-connecting may resolve the issue. Here is the sample: two clients connected via websocket and joined the same group, partner_orders_1531, then we called async_to_sync(channel_layer.group_send), but only one channel, with channel name qUSiCEQD!TGlphabkczyd, was able to receive the message and triggered […]
Related redis log: […]
Pip & Environment:
Python 3.6.6
channels==2.1.5
channels-redis==2.3.1
daphne==2.2.3
Django==1.11.10
The server is hosted on Ubuntu 16.04.5 LTS with 2 vCPUs / 4 GB memory on DigitalOcean. We have two daphne instances and use nginx for load balancing.