Panic during reconnect to malfunctioning Redis Cluster #55
Oh, I might know the cause of this. This could also take me a couple of days.
I was trying to reproduce the panic, but cannot now. Maybe I changed the code somehow so that it no longer happens. Were you able to reproduce it, btw? One more thing I came across while trying to find a minimal reproducer is that a program like this:

```go
package main

import (
	"log"
	"time"

	"github.com/rueian/rueidis"
)

func run() {
	client, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress: []string{"127.0.0.1:7000", "127.0.0.1:7001"},
	})
	if err != nil {
		panic(err)
	}
	defer client.Close()
	for {
		log.Println("start pub/sub")
		// Dedicate a connection and register PUB/SUB hooks on it.
		conn, _ := client.Dedicate()
		wait := conn.SetPubSubHooks(rueidis.PubSubHooks{
			OnMessage: func(m rueidis.PubSubMessage) {},
		})
		log.Println("pub/sub is ready")
		// Block until the hooks stop being effective or the connection is lost.
		err := <-wait
		log.Println("pub/sub error", err)
		conn.Close()
		time.Sleep(1 * time.Millisecond)
	}
}

func main() {
	run()
}
```

does not detect cluster shutdowns – i.e., with a single Redis instance, PUB/SUB is interrupted when Redis is closed and the program goes into a reconnect loop, but when using a Redis Cluster and shutting it down, the PUB/SUB connection close is not detected.
Hi @FZambia, for the dedicated cluster client, the underlying connection is not actually established until you send a command, because which node to connect to is decided by the key slot of the command. That is why the above program has no reaction to the cluster shutdown: the underlying connection was never established in the first place.
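A minimal sketch of what that means in practice, assuming the `client` from the program above (the channel name is made up): issuing a subscribe command after registering the hooks forces the dedicated connection to be established, so a later node or cluster shutdown is reported on the `wait` channel.

```go
conn, _ := client.Dedicate()
wait := conn.SetPubSubHooks(rueidis.PubSubHooks{
	OnMessage: func(m rueidis.PubSubMessage) {
		log.Println(m.Channel, m.Message)
	},
})
// Sending a command picks a node by the key slot and establishes the connection.
if err := conn.Do(context.Background(), conn.B().Subscribe().Channel("test").Build()).Error(); err != nil {
	log.Println("subscribe failed:", err)
}
log.Println("pub/sub closed:", <-wait)
conn.Close()
```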
I see, makes sense!
I think I was able to reproduce a panic by moving a cluster slot to another node while there was an active PUB/SUB connection (using SSUBSCRIBE). It's reproduced with 2eba08d.
So, steps to reproduce (a sketch of the typical slot-migration commands follows after the steps). In my case slot 6918 originally belonged to node 7001.
On Node 7000:
On Node 7001:
On Node 7000:
At this moment the panic happens:
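For context, a manual slot migration between cluster nodes is normally done with CLUSTER SETSLOT; the sketch below shows the typical command sequence for moving slot 6918 from node 7001 to node 7000 (the node IDs are placeholders, and any keys in the slot would also have to be moved with MIGRATE):

```
# On node 7000 (destination): start importing the slot from 7001
CLUSTER SETSLOT 6918 IMPORTING <node-id-of-7001>
# On node 7001 (source): mark the slot as migrating to 7000
CLUSTER SETSLOT 6918 MIGRATING <node-id-of-7000>
# On node 7000: finalize ownership of the slot
CLUSTER SETSLOT 6918 NODE <node-id-of-7000>
```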
Oh, thank you very much. Redis sends a
Hi @FZambia, the above issues are fixed in the latest commit of PR #56.
Just checked – no panic, thanks! 🎉 Though one last thing – after migrating a slot I am successfully getting
Seems like the information about slots is not updated in the rueidis cache? Or a slot refresh is not triggered? If I restart my app, then all subscriptions are established correctly, picking the correct node for the slot subscription.
Hi @FZambia, yes, you are right. I had missed triggering the slot refresh in the dedicated client. I have reworked that in the latest commit. Now it should work with the following program:

```go
package main

import (
	"context"
	"fmt"

	"github.com/rueian/rueidis"
)

func main() {
	client, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress: []string{"127.0.0.1:7000", "127.0.0.1:7001"},
	})
	if err != nil {
		panic(err)
	}
	defer client.Close()
	// Re-dedicate a connection and re-subscribe whenever the subscription is dropped.
	for {
		conn, _ := client.Dedicate()
		sig := make(chan struct{})
		conn.SetPubSubHooks(rueidis.PubSubHooks{
			OnMessage: func(m rueidis.PubSubMessage) {
				fmt.Println(m)
			},
			OnSubscription: func(s rueidis.PubSubSubscription) {
				fmt.Println(s)
				if s.Kind == "sunsubscribe" {
					// The subscription was dropped (e.g. the slot was migrated); trigger a re-subscribe.
					close(sig)
				}
			},
		})
		if err := conn.Do(context.Background(), conn.B().Ssubscribe().Channel("test").Build()).Error(); err == nil {
			<-sig
		}
		conn.Close()
	}
}
```

Note that the
I have tried to implement handling
Yep, I already do this with the dedicated connection; I think re-establishing is pretty reasonable in this case, and it should not happen often. Actually, I'd prefer Redis to close the PUB/SUB connection automatically upon slot migration, possibly with an error saying where the slot was moved to, instead of sending
I just tried the latest commit – and everything works as expected. Such great progress, many thanks – I will continue experimenting; the feeling is that it's very close :) Also I want to test the Sentinel scenario – it seems to be the last part of the migration.
Hi @FZambia,
Thank you for your assistance, v0.0.59 is released. Please let me know if you find any issues. I am happy to help.
Agreed. It seems that they had discussed just closing the connection from the Redis side once the slot was moved: redis/redis#8621 (comment), but in the end it did not happen. And I didn't find any further discussion on this.
Well, you don't have to worry about that currently. There is no case that the
I ran into this problem again using the client in Thanos. The panic was preceded by many "context deadline exceeded" errors.
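For reference, those deadline errors come from the per-command context that the caller passes in; a minimal sketch, assuming a `client` as in the earlier snippets (the key and timeout are made up):

```go
ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
defer cancel()

// If the node does not respond before the deadline, the call returns a
// "context deadline exceeded" error instead of blocking indefinitely.
if err := client.Do(ctx, client.B().Get().Key("some-key").Build()).Error(); err != nil {
	log.Println("redis request failed:", err)
}
```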
Hi @verejoel, thanks for the report. I will try to figure it out. Do you have any other clue that could help reproduce the problem?
We're using Azure Cache for Redis, Standard tier, 26 GB (C5 instance). I just checked and the Redis version is 4, not the latest 6. On Monday I will ask our team to upgrade to 6 to see if this might resolve the issue. The Thanos version is the latest release (v0.30.1) and I'm using their default connection settings for the Redis client with TLS enabled, as documented here. Thanks for taking a look! I had a look in Loki and found another panic that seems to be related.
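For reference, a rueidis client with TLS enabled looks roughly like this (the endpoint and credentials are placeholders, not the actual Thanos/Azure configuration):

```go
package main

import (
	"crypto/tls"

	"github.com/rueian/rueidis"
)

func main() {
	client, err := rueidis.NewClient(rueidis.ClientOption{
		InitAddress: []string{"my-cache.redis.cache.windows.net:6380"}, // placeholder endpoint
		Password:    "access-key",                                      // placeholder credential
		TLSConfig:   &tls.Config{MinVersion: tls.VersionTLS12},
	})
	if err != nil {
		panic(err)
	}
	defer client.Close()
}
```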
Hi @verejoel, the bug should be fixed by #170. I will later open a PR to Thanos for updating rueidis as well. This bug won't be triggered if you have client-side caching enabled with Redis >= 6. So, if you can't wait for a new Thanos release, upgrading your Redis should also work around the problem.
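As a rough illustration of client-side caching in rueidis (the key and TTL are made up, and a `client` as in the earlier snippets is assumed): caching is enabled by default unless `DisableCache` is set, and commands issued through `DoCache` are served from the local cache when possible.

```go
// Requires Redis >= 6; served from the client-side cache while the entry is still valid.
resp := client.DoCache(context.Background(), client.B().Get().Key("example-key").Cache(), time.Minute)
if err := resp.Error(); err != nil && !rueidis.IsRedisNil(err) {
	log.Println("cached GET failed:", err)
}
```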
Got a panic like this after reconnecting to a Redis Cluster:
In my case I had a working Redis Cluster, then stopped the cluster (I am using the docker-compose setup from https://github.com/Grokzen/docker-redis-cluster) and ran it again. The newly created cluster was not properly functional due to a cluster setup error, so every operation on every node at this point returned:
I'll try to provide a minimal reproducer example soon. This may be a bit tricky to reproduce though.