Rare (but reproducible!) dropped messages when using group_send()
#1683
Comments
@acu192 Super, yes dead right. A clean slate really helps. 👍
Two more observations (the repo is also updated with the same info): […]
@acu192: I can't say yet. Personally, I need a session to sit down with your example and investigate. I suspect it's not simple. If you want to keep trying to narrow it down, that's all super! (It's not vital which repo it lives on at this stage; for me, the key progress here is that you say it reproduces, which we've been lacking so far...)
Yeah, this is the hardest type of bug. 😔
I've narrowed it down a bit more:
If I'm correct (of course I think I am; I've been running this code for a while now, studying when it fails), then the bug is in the bail-out code path (when […]).
Very interested in this, and happy to help. @acu192 we have been working to upstream sentinel support, but I would much prefer a clean-slate pub/sub, implementing vanilla/sentinel/cluster support all in one go.
Hi, what I figured out is that the messages get lost when this […] happens. I added […]
The question is why the […] So my assumption would be that the cleanup Lua script executed just above the […] is what removes the message.
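To make that suspected failure mode concrete, here is a minimal, hypothetical sketch. It is not the channels_redis code path itself; the key name and timing are invented. It only illustrates how a cleanup step that deletes a channel's Redis list can swallow a message that was pushed but not yet popped.

```python
# Illustrative only -- NOT the channels_redis implementation.
# Shape of the suspected race: a message is queued, a cleanup deletes the
# list before the reader's blocking pop runs, and the message is lost.
import redis  # pip install redis

r = redis.Redis()
channel_key = "asgi:demo-channel"      # hypothetical key name

r.rpush(channel_key, b"message-42")    # sender queues a message
r.delete(channel_key)                  # "cleanup" fires before the reader gets to it

# The reader's blocking pop now times out: the queued message is gone.
print(r.blpop(channel_key, timeout=1))  # -> None
```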
@yedpodtrzitko glad you were able to use my repo!
That observation is consistent with what I have seen as well. The […]
Wow, that's a big find!!! Please feel free to weigh in on the alternative implementation (this PR), which uses Redis' Pub/Sub. I've had it in production for a few weeks now and it is doing a spectacular job. It fixes this bug, and our production servers' CPU usage is down as well. As a bonus, it probably also scales well to large groups (although I haven't tested that yet). An easy way to play with it is by changing this line (be sure you have the most recent code from that repo, of course).
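For anyone who wants to try the Pub/Sub layer without digging through the repo, the swap usually amounts to one line in the Channels settings. A sketch follows, assuming the layer is exposed as `channels_redis.pubsub.RedisPubSubChannelLayer` (the path it later shipped under in channels_redis); the exact line in the linked demo repo may differ.

```python
# Hypothetical settings.py snippet showing the kind of one-line swap
# the comment above refers to.
CHANNEL_LAYERS = {
    "default": {
        # Original list/BRPOP-based layer:
        # "BACKEND": "channels_redis.core.RedisChannelLayer",
        # Pub/Sub-based layer:
        "BACKEND": "channels_redis.pubsub.RedisPubSubChannelLayer",
        "CONFIG": {
            "hosts": [("localhost", 6379)],
        },
    },
}
```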
@yedpodtrzitko interesting find; this would make sense given that my last commit pipelined the cleanup instead of awaiting each cleanup command. @carltongibson it might be worth pushing a release with the latest if this is now resolved in main.
@yedpodtrzitko I'm still able to reproduce this bug using the bleeding edge from the main branch (your link). Maybe double-check which implementation you are using? If you recently did […]
@acu192 oh no, you are right. I'm not sure what changed on my end since my last test, but previously I was observing the message drop when the counter reached ~3k; now I had to wait until ~13k before the error showed up (no matter which commit of […]).
Hmmm. This is curious. I've been running this for some time locally without seeing the error. Why do these things never reproduce how you want them to? 😀
@carltongibson Make sure to change this line! I might have messed you up because I have it set up to use my pubsub impl. So yeah, you should not see the bug in that case.
😄 Genius.
Closing now that the new PubSub layer is in play. We can address further issues over there.
I have a repository where you can reproduce this bug.
https://github.com/acu192/channels-bug
The repo is a chat-room-sort-of-thing (just as a demo, to see the bug). It's as minimal as I can make it. It uses the Redis channel layer.
If you connect 7 clients to the chat room (3 in one room; 4 in another), you can reproduce the bug in a few minutes. Eventually, the server will drop a message and the receiving clients will notice the "gap" in the message sequence; those clients will print an error and exit.
Possibly related to #1244. I'm filing a new issue here because I feel there is a lot of contradictory speculation in #1244, and I hope this new issue (with reproducible code) can provide a fresh, more concrete path toward a fix.
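For context, the pattern the demo exercises is roughly the following (a hypothetical sketch, not the repo's actual code): each broadcast through `group_send()` carries an incrementing sequence number, and a client treats any gap in that sequence as a dropped message.

```python
# A minimal, hypothetical consumer showing the group_send() pattern under
# discussion. Names ("room-1", "chat.message") are invented for illustration.
from channels.generic.websocket import AsyncJsonWebsocketConsumer


class RoomConsumer(AsyncJsonWebsocketConsumer):
    group_name = "room-1"
    seq = 0  # per-process counter; good enough for a single-server demo

    async def connect(self):
        await self.channel_layer.group_add(self.group_name, self.channel_name)
        await self.accept()

    async def disconnect(self, code):
        await self.channel_layer.group_discard(self.group_name, self.channel_name)

    async def receive_json(self, content, **kwargs):
        type(self).seq += 1
        # If the channel layer drops this message, some group member will
        # see a gap in "seq" -- that is the error the demo clients report.
        await self.channel_layer.group_send(
            self.group_name,
            {"type": "chat.message", "seq": type(self).seq, "text": content["text"]},
        )

    async def chat_message(self, event):
        await self.send_json({"seq": event["seq"], "text": event["text"]})
```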