
Rare (but reproducible!) dropped messages when using group_send() #1683

Closed
acu192 opened this issue Apr 21, 2021 · 15 comments

@acu192

acu192 commented Apr 21, 2021

I have a repository where you can reproduce this bug.

https://github.com/acu192/channels-bug

The repo is a chat-room-sort-of-thing (just as a demo, to see the bug). It's as minimal as I can make it. It uses the Redis channel layer.

If you connect 7 clients to the chat room (3 in one room; 4 in another), you can reproduce the bug in a few minutes. Eventually, the server will drop a message and the receiving clients will notice the "gap" in the message sequence; those clients will print an error and exit.

Possibly related to #1244. I'm filing a new issue here, because I feel there is a lot of contradictory speculation in #1244 and I hope this new issue (with the reproducible code) can provide a fresh, more concrete, path toward fixing.
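For reference, the gap check on the client side amounts to something like the sketch below (not the repo's actual code; the URL, room path, and message format are assumptions):

```python
# Minimal gap-detecting client (hypothetical endpoint and payload shape).
import asyncio
import json

import websockets  # pip install websockets


async def watch_room(room: str) -> None:
    expected = None
    async with websockets.connect(f"ws://localhost:8000/ws/chat/{room}/") as ws:
        async for raw in ws:
            seq = json.loads(raw)["seq"]  # assumes the server stamps each message with a sequence number
            if expected is not None and seq != expected:
                raise SystemExit(f"gap detected: expected {expected}, got {seq}")
            expected = seq + 1


asyncio.run(watch_room("room-a"))
```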

@carltongibson
Member

@acu192 Super, yes dead right. A clean slate really helps. 👍

@acu192
Author

acu192 commented Apr 22, 2021

Two more observations (repo is also updated with the same info):

  1. I cannot reproduce the bug unless there are clients that connect-disconnect-reconnect in a loop (see the sketch after this list). Before anyone says "oh, well of course messages are lost when a client disconnects": yes, but no. I'm looking at gaps in the sequence numbers of messages delivered to the other, still-connected peers. If there is a gap, the server must have dropped a message, and that is exactly what happens here. Why does it take clients that connect-disconnect-reconnect in a loop? I don't know, but that seems to be what it takes for gaps to appear in the sequence numbers.

  2. I cannot reproduce it when using InMemoryChannelLayer... at all, no matter what I try, suggesting (but not proving) that the issue is with the Redis Channel Layer. @carltongibson Do you think this is enough evidence that this issue should be moved to channels_redis?
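The churn clients from point 1 are essentially the loop below (a minimal sketch; the endpoint and timings are assumptions, not the repo's actual code):

```python
# Hypothetical "churn" client: connect, stay briefly, disconnect, repeat.
import asyncio

import websockets  # pip install websockets


async def churn(room: str) -> None:
    while True:
        async with websockets.connect(f"ws://localhost:8000/ws/chat/{room}/"):
            await asyncio.sleep(1.0)   # stay connected for a moment
        await asyncio.sleep(0.5)       # then reconnect after a short pause


asyncio.run(churn("room-a"))
```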

@carltongibson
Member

carltongibson commented Apr 22, 2021

@acu192 — I can't say yet. Personally, I need a session to sit down with your example and investigate. I suspect it's not simple.

If you want to keep trying to narrow it down, that's all super! (It's not vital which repo it lives in at this stage — for me, the key progress here is that you say it reproduces, which is what we've been lacking so far...)

@acu192
Author

acu192 commented Apr 22, 2021

I suspect it's not simple.

Yeah, this is the hardest type of bug. 😔

If you want to keep trying to narrow it down, that's all super!

I've narrowed it down a bit more:

  1. It happens when a client disconnects. It never seems to happen when a client connects, nor when you have stable clients.

  2. A disconnecting client does not just affect its own groups. Rather, a disconnecting client can cause different groups (groups that the client does not belong to) to drop a message.

If I'm correct (of course I think I am; I've been running this code for a while now studying when it fails), then it means the bug is due to the bail-out code path (when asyncio.CancelledError is thrown) inside channels_redis.RedisChannelLayer.receive. Something about that task being canceled is causing a hiccup in the Redis Channel Layer.
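To make the suspected failure mode concrete, here is a self-contained toy illustration (plain asyncio, not channels_redis code): if a receive task is cancelled after it has popped a message but before it hands that message over, the message is simply gone.

```python
import asyncio


async def receiver(queue: asyncio.Queue) -> None:
    msg = await queue.get()   # the message is now out of the queue...
    await asyncio.sleep(1)    # ...and if we get cancelled here, nobody ever sees it
    print("delivered:", msg)


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(receiver(queue))
    await queue.put("hello")
    await asyncio.sleep(0.1)  # give the receiver time to pop the message
    task.cancel()             # simulate the consumer being torn down on disconnect
    await asyncio.gather(task, return_exceptions=True)
    print("messages left in queue:", queue.qsize())  # prints 0 -- the popped message is lost


asyncio.run(main())
```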

@qeternity

Very interested in this, and happy to help.

@acu192 we have been working to upstream sentinel support, but I would much prefer a clean slate pub/sub and implementing vanilla/sentinel/cluster support all in one go.

@yedpodtrzitko
Contributor

yedpodtrzitko commented May 8, 2021

Hi,
thanks to @acu192's repo I managed to narrow this down a bit.

What I figured out is that the message gets lost when this bzpopmin times out:
https://github.com/django/channels_redis/blob/38ffdeb610ce175a01698ca267a6cccc171043c9/channels_redis/core.py#L361

I added traceback.print_exc() into the following except block, and this exception shows up every time a message is not delivered (and only then):
https://github.com/django/channels_redis/blob/38ffdeb610ce175a01698ca267a6cccc171043c9/channels_redis/core.py#L545

Traceback (most recent call last):
  File "/Users/yed/.pyenv/versions/foobar39/lib/python3.9/site-packages/channels_redis/core.py", line 543, in receive_single
    content = await self._brpop_with_clean(
  File "/Users/yed/.pyenv/versions/foobar39/lib/python3.9/site-packages/channels_redis/core.py", line 375, in _brpop_with_clean
    result = await connection.bzpopmin(channel, timeout=timeout)
asyncio.exceptions.CancelledError

The question is why the bzpopmin times out. I'm not exactly sure, but I added a bunch of logs everywhere and noticed that the channels are not cleaned up properly. For example, printing args in group_send() shows the list keeps growing over time, which might be why the bug only shows up after a few minutes, once the list gets very long.

So my assumption would be that the cleanup Lua script executed just above the bzpopmin (link below) takes too long to complete (or something along those lines), which is why the bzpopmin then times out.
https://github.com/django/channels_redis/blob/38ffdeb610ce175a01698ca267a6cccc171043c9/channels_redis/core.py#L359
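If anyone wants to watch that growth from outside the app, something like the sketch below works (it uses redis-py's asyncio client; the asgi:group:<name> key layout and the group name are assumptions about the defaults, so adjust as needed):

```python
# Periodically report how many channel entries a group's sorted set holds.
import asyncio

import redis.asyncio as redis  # pip install redis


async def watch_group(group: str = "chat_room-a") -> None:
    r = redis.Redis(host="localhost", port=6379)
    key = f"asgi:group:{group}"    # assumed default layout: <prefix>:group:<group>
    while True:
        size = await r.zcard(key)  # group membership is kept in a sorted set
        print(f"{key} currently has {size} channel entries")
        await asyncio.sleep(5)


asyncio.run(watch_group())
```

If the count keeps climbing even though the number of connected clients is stable, that matches the "list keeps growing" behaviour described above.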

@acu192
Author

acu192 commented May 8, 2021

@yedpodtrzitko glad you were able to use my repo!

... and this exception [CancelledError] shows up every time (and only then) when a message is not delivered...

That observation is consistent with what I have seen as well. The CancelledError that you are seeing comes from here (see also here) when a client disconnects and the consumer is cleaning up.

... and I noticed that the channels are not cleaned up properly. For example printing args in group_send() (...) shows the list keeps growing over time ...

Wow, that's a big find!!!


Please feel free to weigh in on the alternative implementation (this PR), which uses Redis' Pub/Sub. I've had it in production for a few weeks now and it is doing a spectacular job: it fixes this bug, and our production servers' CPU usage is down as well. As a bonus, it probably also scales well to large groups (although I haven't tested that yet). An easy way to play with it is to change this line (be sure you have the most recent code from that repo, of course).
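For reference, the switch is roughly the snippet below in settings.py (a sketch; it assumes the Pub/Sub layer from the PR is importable as channels_redis.pubsub.RedisPubSubChannelLayer):

```python
CHANNEL_LAYERS = {
    "default": {
        # "BACKEND": "channels_redis.core.RedisChannelLayer",       # layer exhibiting this bug
        "BACKEND": "channels_redis.pubsub.RedisPubSubChannelLayer",  # Pub/Sub-based alternative
        "CONFIG": {
            "hosts": [("localhost", 6379)],
        },
    },
}
```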

@yedpodtrzitko
Contributor

yedpodtrzitko commented May 11, 2021

I'm not able to reproduce this with the latest commit on the main branch made by @qeternity, so if someone can confirm it, I would consider this fixed.
edit: error still present

@qeternity

@yedpodtrzitko interesting find; this would make sense, given that my last commit pipelined the cleanup instead of awaiting each cleanup command.

@carltongibson it might be worth pushing a release with the latest changes if this is now resolved in main.
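For anyone following along, the difference is roughly the sketch below (a generic illustration using redis-py's asyncio client, not the actual channels_redis code; zrem on a group key stands in for the real cleanup commands):

```python
import redis.asyncio as redis  # pip install redis


async def cleanup(r: redis.Redis, group_key: str, stale_channels: list[str]) -> None:
    # Before: one network round trip per stale channel.
    # for channel in stale_channels:
    #     await r.zrem(group_key, channel)

    # After: queue the commands and send them in a single round trip.
    async with r.pipeline(transaction=False) as pipe:
        for channel in stale_channels:
            pipe.zrem(group_key, channel)
        await pipe.execute()
```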

@acu192
Author

acu192 commented May 11, 2021

@yedpodtrzitko I'm still able to reproduce this bug using the bleeding-edge code from the main branch (your link).

Maybe double-check which implementation you are using? If you recently did git pull on my repo, I might have messed up your test because I changed that setting at some point.

@yedpodtrzitko
Contributor

@acu192 oh no, you are right. I'm not sure what changed on my end since my last test, but previously I was observing the message drop when the counter reached ~3k; now I had to wait until ~13k before the error showed up (no matter which commit of channels-redis I used). Sorry for the misleading statement, and thanks for double-checking it.

@carltongibson
Member

Hmmm. This is curious. I've been running this for some time locally without seeing the error. Why do these things never reproduce when you want them to? 😀

@acu192
Author

acu192 commented Jun 25, 2021

@carltongibson Make sure to change this line!

https://github.com/acu192/channels-bug/blob/f02918b97bc54904482bd1c28fa899d45a7e18fe/mysite/settings.py#L43

I might have thrown you off because I have it set up to use my pub/sub implementation. So yeah, you should not see the bug in that case.

@carltongibson
Member

😄 Genius.

@carltongibson
Member

Closing now that the new PubSub layer is in play. We can address further issues over there.
