
Rare (but reproducible!) dropped messages when using group_send() #1683

Closed
acu192 opened this issue Apr 21, 2021 · 15 comments

@acu192

acu192 commented Apr 21, 2021

I have a repository where you can reproduce this bug.

https://github.com/acu192/channels-bug

The repo is a chat-room-sort-of-thing (just as a demo, to see the bug). It's as minimal as I can make it. It uses the Redis channel layer.

If you connect 7 clients to the chat room (3 in one room; 4 in another), you can reproduce the bug in a few minutes. Eventually, the server will drop a message and the receiving clients will notice the "gap" in the message sequence; those clients will print an error and exit.

Possibly related to #1244. I'm filing a new issue here, because I feel there is a lot of contradictory speculation in #1244 and I hope this new issue (with the reproducible code) can provide a fresh, more concrete, path toward fixing.
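For reference, the gap check on the client side amounts to something like the sketch below (not the repo's actual code; the URL, room path, and message format are assumptions):

```python
# Minimal gap-detecting client (hypothetical endpoint and payload shape).
import asyncio
import json

import websockets  # pip install websockets


async def watch_room(room: str) -> None:
    expected = None
    async with websockets.connect(f"ws://localhost:8000/ws/chat/{room}/") as ws:
        async for raw in ws:
            seq = json.loads(raw)["seq"]  # assumes the server stamps each message with a sequence number
            if expected is not None and seq != expected:
                raise SystemExit(f"gap detected: expected {expected}, got {seq}")
            expected = seq + 1


asyncio.run(watch_room("room-a"))
```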

@carltongibson
Member

@acu192 Super, yes dead right. A clean slate really helps. 👍

@acu192
Author

acu192 commented Apr 22, 2021

Two more observations (repo is also updated with the same info):

  1. I cannot reproduce the bug unless there are clients that connect-disconnect-reconnect in a loop (see the sketch after this list). Before anyone says "oh, well of course messages are lost when a client disconnects": yes, but no. I'm looking at gaps in the sequence numbers of messages delivered to the other, still-connected peers. If there is a gap, the server must have dropped a message, and that is exactly what happens here. Why does it take clients that connect-disconnect-reconnect in a loop? I don't know, but that seems to be what it takes for gaps to appear in the sequence numbers.

  2. I cannot reproduce it when using InMemoryChannelLayer... at all, no matter what I try, suggesting (but not proving) that the issue is with the Redis Channel Layer. @carltongibson Do you think this is enough evidence that this issue should be moved to channels_redis?
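The churn clients from point 1 are essentially the loop below (a minimal sketch; the endpoint and timings are assumptions, not the repo's actual code):

```python
# Hypothetical "churn" client: connect, stay briefly, disconnect, repeat.
import asyncio

import websockets  # pip install websockets


async def churn(room: str) -> None:
    while True:
        async with websockets.connect(f"ws://localhost:8000/ws/chat/{room}/"):
            await asyncio.sleep(1.0)   # stay connected for a moment
        await asyncio.sleep(0.5)       # then reconnect after a short pause


asyncio.run(churn("room-a"))
```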

@carltongibson
Member

carltongibson commented Apr 22, 2021

@acu192 — I can't say yet. Personally, I need a session to sit down with your example and investigate. I suspect it's not simple.

If you want to keep trying to narrow it down, that's all super! (It's not vital which repo it lives in at this stage — for me, the key progress here is that you say it reproduces, which is what we've been lacking so far...)

@acu192
Author

acu192 commented Apr 22, 2021

I suspect it's not simple.

Yeah, this is the hardest type of bug. 😔

If you want to keep trying to narrow it down, that's all super!

I've narrowed it down a bit more:

  1. It happens when a client disconnects. It never seems to happen when a client connects, nor when you have stable clients.

  2. A disconnecting client does not just affect its own groups. Rather, a disconnecting client can cause different groups (groups that the client does not belong to) to drop a message.

If I'm correct (of course I think I am; I've been running this code for a while now studying when it fails), then it means the bug is due to the bail-out code path (when asyncio.CancelledError is thrown) inside channels_redis.RedisChannelLayer.receive. Something about that task being canceled is causing a hiccup in the Redis Channel Layer.
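To make the suspected failure mode concrete, here is a self-contained toy illustration (plain asyncio, not channels_redis code): if a receive task is cancelled after it has popped a message but before it hands that message over, the message is simply gone.

```python
import asyncio


async def receiver(queue: asyncio.Queue) -> None:
    msg = await queue.get()   # the message is now out of the queue...
    await asyncio.sleep(1)    # ...and if we get cancelled here, nobody ever sees it
    print("delivered:", msg)


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(receiver(queue))
    await queue.put("hello")
    await asyncio.sleep(0.1)  # give the receiver time to pop the message
    task.cancel()             # simulate the consumer being torn down on disconnect
    await asyncio.gather(task, return_exceptions=True)
    print("messages left in queue:", queue.qsize())  # prints 0 -- the popped message is lost


asyncio.run(main())
```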

@qeternity

Very interested in this, and happy to help.

@acu192 we have been working to upstream sentinel support, but I would much prefer a clean slate pub/sub and implementing vanilla/sentinel/cluster support all in one go.

@yedpodtrzitko
Contributor

yedpodtrzitko commented May 8, 2021

Hi,
thanks to @acu192's repo I managed to narrow this down a bit.

What I figured out is that the message gets lost when this bzpopmin times out:
https://github.com/django/channels_redis/blob/38ffdeb610ce175a01698ca267a6cccc171043c9/channels_redis/core.py#L361

I added traceback.print_exc() into the following except block, and this exception shows up every time a message is not delivered (and only then):
https://github.com/django/channels_redis/blob/38ffdeb610ce175a01698ca267a6cccc171043c9/channels_redis/core.py#L545

Traceback (most recent call last):
  File "/Users/yed/.pyenv/versions/foobar39/lib/python3.9/site-packages/channels_redis/core.py", line 543, in receive_single
    content = await self._brpop_with_clean(
  File "/Users/yed/.pyenv/versions/foobar39/lib/python3.9/site-packages/channels_redis/core.py", line 375, in _brpop_with_clean
    result = await connection.bzpopmin(channel, timeout=timeout)
asyncio.exceptions.CancelledError

The question is why the bzpopmin times out. I'm not exactly sure, but I added a bunch of logs everywhere and noticed that the channels are not cleaned up properly. For example, printing args in group_send() shows the list keeps growing over time, which might be why the bug only shows up after a few minutes, once the list gets very long.

So my assumption would be that the cleanup Lua script executed just above the bzpopmin (link below) takes too long to complete (or something along those lines), which is why the bzpopmin then times out.
https://github.com/django/channels_redis/blob/38ffdeb610ce175a01698ca267a6cccc171043c9/channels_redis/core.py#L359
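If anyone wants to watch that growth from outside the app, something like the sketch below works (it uses redis-py's asyncio client; the asgi:group:<name> key layout and the group name are assumptions about the defaults, so adjust as needed):

```python
# Periodically report how many channel entries a group's sorted set holds.
import asyncio

import redis.asyncio as redis  # pip install redis


async def watch_group(group: str = "chat_room-a") -> None:
    r = redis.Redis(host="localhost", port=6379)
    key = f"asgi:group:{group}"    # assumed default layout: <prefix>:group:<group>
    while True:
        size = await r.zcard(key)  # group membership is kept in a sorted set
        print(f"{key} currently has {size} channel entries")
        await asyncio.sleep(5)


asyncio.run(watch_group())
```

If the count keeps climbing even though the number of connected clients is stable, that matches the "list keeps growing" behaviour described above.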

@acu192
Author

acu192 commented May 8, 2021

@yedpodtrzitko glad you were able to use my repo!

... and this exception [CancelledError] shows up every time (and only then) when a message is not delivered...

That observation is consistent with what I have seen as well. The CancelledError that you are seeing comes from here (see also here) when a client disconnects and the consumer is cleaning up.

... and I noticed that the channels are not cleaned up properly. For example printing args in group_send() (...) shows the list keeps growing over time ...

Wow, that's a big find!!!


Please feel free to weigh in on the alternative implementation (this PR), which uses Redis' Pub/Sub. I've had it in production for a few weeks now and it is doing a spectacular job: it fixes this bug, and our production servers' CPU usage is down as well. As a bonus, it probably also scales well to large groups (although I haven't tested that yet). An easy way to play with it is to change this line (be sure you have the most recent code from that repo, of course).
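For reference, the switch is roughly the snippet below in settings.py (a sketch; it assumes the Pub/Sub layer from the PR is importable as channels_redis.pubsub.RedisPubSubChannelLayer):

```python
CHANNEL_LAYERS = {
    "default": {
        # "BACKEND": "channels_redis.core.RedisChannelLayer",       # layer exhibiting this bug
        "BACKEND": "channels_redis.pubsub.RedisPubSubChannelLayer",  # Pub/Sub-based alternative
        "CONFIG": {
            "hosts": [("localhost", 6379)],
        },
    },
}
```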

@yedpodtrzitko
Contributor

yedpodtrzitko commented May 11, 2021

I'm not able to reproduce this with the latest commit on the main branch made by @qeternity, so if someone can confirm it, I would consider this fixed.
edit: error still present

@qeternity

@yedpodtrzitko interesting find; this would make sense, given that my last commit pipelined the cleanup instead of awaiting each cleanup command.

@carltongibson it might be worth pushing a release with the latest changes if this is now resolved in main.
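For anyone following along, the difference is roughly the sketch below (a generic illustration using redis-py's asyncio client, not the actual channels_redis code; zrem on a group key stands in for the real cleanup commands):

```python
import redis.asyncio as redis  # pip install redis


async def cleanup(r: redis.Redis, group_key: str, stale_channels: list[str]) -> None:
    # Before: one network round trip per stale channel.
    # for channel in stale_channels:
    #     await r.zrem(group_key, channel)

    # After: queue the commands and send them in a single round trip.
    async with r.pipeline(transaction=False) as pipe:
        for channel in stale_channels:
            pipe.zrem(group_key, channel)
        await pipe.execute()
```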

@acu192
Author

acu192 commented May 11, 2021

@yedpodtrzitko I'm still able to reproduce this bug using the bleeding-edge code from the main branch (your link).

Maybe double-check which implementation you are using? If you recently did git pull on my repo, I might have messed up your test because I changed that setting at some point.

@yedpodtrzitko
Contributor

@acu192 oh no, you are right. I'm not sure what changed on my end since my last test, but previously I was observing the message drop when the counter reached ~3k; now I had to wait until ~13k before the error showed up (no matter which commit of channels-redis I used). Sorry for the misleading statement, and thanks for double-checking it.

@carltongibson
Member

Hmmm. This is curious. I've been running this for some time locally without seeing the error. Why do these things never reproduce when you want them to? 😀

@acu192
Author

acu192 commented Jun 25, 2021

@carltongibson Make sure to change this line!

https://github.com/acu192/channels-bug/blob/f02918b97bc54904482bd1c28fa899d45a7e18fe/mysite/settings.py#L43

I might have thrown you off because I have it set up to use my pub/sub implementation. So yeah, you should not see the bug in that case.

@carltongibson
Member

😄 Genius.

@carltongibson
Member

Closing now that the new PubSub layer is in play. We can address further issues over there.
