Connection established but 'The specified endpoint is not defined' #2728

Timmoth opened this issue May 24, 2024 · 4 comments

Timmoth commented May 24, 2024

I'm running a three-node redis:7.2-alpine cluster on Kubernetes: 1 master, 2 replicas, 3 sentinels.
My config is here

In .NET I am using this code to connect:

    var sentinelConfig = new ConfigurationOptions
    {
        AbortOnConnectFail = false,
        AllowAdmin = true,
        ConnectTimeout = 5000,
        ConnectRetry = 10,
        ServiceName = "mymaster",
        Proxy = Proxy.None,
        Ssl = false,
        KeepAlive = 10,
        ResolveDns = true,
        SyncTimeout = 5000,
        TieBreaker = "",
        Password = redisSettings.Password
    };

    // Register every sentinel; the primary is resolved via ServiceName.
    foreach (var sentinel in redisSettings.Sentinels)
    {
        sentinelConfig.EndPoints.Add(sentinel.Host, sentinel.Port);
    }

    var redis = ConnectionMultiplexer.Connect(sentinelConfig, Console.Out);
    services.AddSingleton<IConnectionMultiplexer>(redis);

This works fine when running a Redis cluster in Docker Compose, and it has also worked on and off in the Kubernetes cluster.
When it doesn't work, the endpoint summary still looks correct. As far as I can tell from the logs, it has connected to the sentinels and resolved the correct IP and port for each Redis endpoint. The exception thrown at the end is the only thing that seems out of place:

06:42:16.9712: All 3 available tasks completed cleanly, IOCP: (Busy=0,Free=1000,Min=50,Max=1000), WORKER: (Busy=1,Free=32766,Min=50,Max=32767), POOL: (Threads=11,QueuedItems=0,CompletedItems=131,Timers=2
06:42:16.9714: Endpoint summary:
06:42:16.9716:   10.244.1.68:6379: Endpoint is (Interactive: ConnectedEstablished, Subscription: ConnectedEstablished)
06:42:16.9717:   10.244.0.85:6379: Endpoint is (Interactive: ConnectedEstablished, Subscription: ConnectedEstablished)
06:42:16.9718:   10.244.0.174:6379: Endpoint is (Interactive: ConnectedEstablished, Subscription: ConnectedEstablished)
06:42:16.9719: Task summary:
06:42:16.9720:   10.244.1.68:6379: Returned with success as Standalone primary (Source: Connection race)
06:42:16.9723:   10.244.0.85:6379: Returned with success as Standalone replica (Source: Already connected)
06:42:16.9724:   10.244.0.174:6379: Returned with success as Standalone replica (Source: Already connected)
06:42:16.9725: Election summary:
06:42:16.9727:   Election: Single primary detected: 10.244.1.68:6379
06:42:16.9728: 10.244.1.68:6379: Clearing as RedundantPrimary
06:42:16.9729: Endpoint Summary:
06:42:16.9731:   10.244.1.68:6379: Standalone v7.2.5, primary; 16 databases; keep-alive: 00:00:10; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:42:16.9732:   10.244.1.68:6379: int ops=13, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=7, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:42:16.9733:   10.244.1.68:6379: Circular op-count snapshot; int: 0+13=13 (1.30 ops/s; spans 10s); sub: 0+7=7 (0.70 ops/s; spans 10s)
06:42:16.9735:   10.244.0.85:6379: Standalone v7.2.5, replica; 16 databases; keep-alive: 00:00:10; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:42:16.9736:   10.244.0.85:6379: int ops=14, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=7, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:42:16.9738:   10.244.0.85:6379: Circular op-count snapshot; int: 0+14=14 (1.40 ops/s; spans 10s); sub: 0+7=7 (0.70 ops/s; spans 10s)
06:42:16.9739:   10.244.0.174:6379: Standalone v7.2.5, replica; 16 databases; keep-alive: 00:00:10; int: ConnectedEstablished; sub: ConnectedEstablished, 1 active
06:42:16.9741:   10.244.0.174:6379: int ops=14, qu=0, qs=0, qc=0, wr=0, socks=1; sub ops=7, qu=0, qs=0, qc=0, wr=0, subs=1, socks=1
06:42:16.9742:   10.244.0.174:6379: Circular op-count snapshot; int: 0+14=14 (1.40 ops/s; spans 10s); sub: 0+7=7 (0.70 ops/s; spans 10s)
06:42:16.9744: Sync timeouts: 0; async timeouts: 0; fire and forget: 0; last heartbeat: -1s ago
06:42:16.9745: Starting heartbeat...
06:42:16.9747: Total connect time: 35 ms
Unhandled exception. System.ArgumentException: The specified endpoint is not defined (Parameter 'endpoint')
   at StackExchange.Redis.ConnectionMultiplexer.GetServer(EndPoint endpoint, Object asyncState) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 1247
   at StackExchange.Redis.ConnectionMultiplexer.GetSentinelMasterConnection(ConfigurationOptions config, TextWriter log) in /_/src/StackExchange.Redis/ConnectionMultiplexer.Sentinel.cs:line 237
   at StackExchange.Redis.ConnectionMultiplexer.SentinelPrimaryConnect(ConfigurationOptions configuration, TextWriter log) in /_/src/StackExchange.Redis/ConnectionMultiplexer.Sentinel.cs:line 134
   at StackExchange.Redis.ConnectionMultiplexer.Connect(ConfigurationOptions configuration, TextWriter log) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 685

This suggests something might be wrong with my config, but the fact that it has worked on the cluster, and consistently works locally, has me confused.

Does anyone have any ideas, or could anyone point me in a direction to troubleshoot?

@NickCraver
Collaborator

This looks like Sentinel is not returning a valid endpoint (or one we recognize) when asked what the master is.

If you connect up directly and query sentinel master mymaster, what do you get back?
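
For readers who would rather run that check from .NET than from redis-cli, the following is a minimal sketch, not a verified reproduction. It assumes a reachable sentinel at a placeholder host and port (`sentinel-0.example.internal:26379` is hypothetical) and the service name `mymaster` from the original post. If the sentinels require authentication, the password would also need to be set on the options. It prints what each connected sentinel reports as the primary, so the result can be compared with the endpoints in the connect log.

```csharp
using System;
using StackExchange.Redis;

// Assumption: placeholder sentinel address; substitute a real sentinel host and port.
var options = new ConfigurationOptions
{
    AbortOnConnectFail = false,
    ConnectTimeout = 5000,
};
options.EndPoints.Add("sentinel-0.example.internal", 26379); // hypothetical host

// SentinelConnect talks to the sentinels themselves rather than to the data nodes.
using var sentinels = ConnectionMultiplexer.SentinelConnect(options, Console.Out);

foreach (var endPoint in sentinels.GetEndPoints())
{
    var server = sentinels.GetServer(endPoint);
    if (!server.IsConnected) continue; // skip sentinels we could not reach

    // Equivalent of SENTINEL GET-MASTER-ADDR-BY-NAME mymaster
    var primary = server.SentinelGetMasterAddressByName("mymaster");
    Console.WriteLine($"{endPoint} reports primary: {primary?.ToString() ?? "(none)"}");

    // Equivalent of SENTINEL MASTER mymaster: the full key/value state for the service
    foreach (var pair in server.SentinelMaster("mymaster"))
    {
        Console.WriteLine($"  {pair.Key} = {pair.Value}");
    }
}
```

If the address printed here does not match any endpoint in the connect log, that would be consistent with Sentinel handing back an endpoint the multiplexer does not recognize.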

@Tasteful
Contributor

@NickCraver We have identified something similar to this, see samcook/RedLock.net#112 (comment)

There are cases where Sentinel returns IP addresses that are no longer part of the cluster. The connection multiplexer handles this correctly and aborts those connections during initialization, but IConnectionMultiplexer.GetEndPoints() still includes them, and calling IConnectionMultiplexer.GetServer(endPoint) for an endpoint that never received an answer throws the ArgumentException.

Is the expectation that IConnectionMultiplexer.GetEndPoints() should return all entries that Sentinel knows about?

@kmcclellan

I am hitting this issue as well.

@Tasteful please correct my assumptions if they are wrong. This issue is likely to be encountered by anyone calling IConnectionMultiplexer.GetServer(...) while using Redis Sentinel running in Kubernetes. The only current workaround is to add code to catch ArgumentException and skip the endpoint, with the assumption that this indicates it is no longer an active node?

This seems like a pretty serious issue. I have a lot of code that uses SE.Redis, and I am hesitant to add this exception handling uniformly. ArgumentException is meant to be avoided, not caught. Aside from the code smell, I lose the ability to distinguish this situation from others that would indicate a bug in the consuming code, such as passing an endpoint that was never a valid node.

@Tasteful
Contributor

> @Tasteful please correct my assumptions if they are wrong. This issue is likely to be encountered by anyone calling IConnectionMultiplexer.GetServer(...) while using Redis Sentinel running in Kubernetes. The only current workaround is to add code to catch ArgumentException and skip the endpoint, with the assumption that this indicates it is no longer an active node?

Yes, that is correct.
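
A minimal sketch of that workaround, for code that enumerates servers itself. The extension class and method name (`MultiplexerExtensions`, `GetResolvableServers`) are hypothetical and not part of StackExchange.Redis. The exception filter on `ParamName == "endpoint"` is an assumption based on the `(Parameter 'endpoint')` text in the stack trace above, and is meant to keep the catch as narrow as possible. This does not help when the exception is thrown inside `ConnectionMultiplexer.Connect` itself, as in the original report.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using StackExchange.Redis;

public static class MultiplexerExtensions
{
    // Hypothetical helper: returns an IServer for every endpoint the multiplexer
    // can actually resolve, skipping endpoints that GetEndPoints() lists but
    // GetServer() rejects with "The specified endpoint is not defined".
    public static IReadOnlyList<IServer> GetResolvableServers(this IConnectionMultiplexer muxer)
    {
        var servers = new List<IServer>();

        foreach (EndPoint endPoint in muxer.GetEndPoints())
        {
            try
            {
                servers.Add(muxer.GetServer(endPoint));
            }
            catch (ArgumentException ex) when (ex.ParamName == "endpoint")
            {
                // Assumption: the endpoint is a stale address reported by Sentinel
                // and no longer an active node, so it is skipped.
            }
        }

        return servers;
    }
}
```

Calling code would then iterate `muxer.GetResolvableServers()` instead of pairing `GetEndPoints()` with `GetServer(...)` directly. Keeping the catch in one helper avoids scattering the exception handling, though it still cannot tell a stale node apart from an endpoint that was never valid.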
