
Different RedisFailover's sentinels join together #6

Open
dbackeus opened this issue Nov 12, 2024 · 4 comments

Comments

dbackeus commented Nov 12, 2024

I'm reopening what was probably the most critical bug before the fork: spotahome#550

It can result in complete data loss for a cluster. Not just in theory: it happened in our production k8s cluster. E.g., if cluster A's sentinels get mixed up with cluster B's and a node from cluster A ends up being elected master for all nodes, the previously existing data of cluster B gets completely overwritten by cluster A's.

I still believe the appropriate solution is to start relying on hostnames instead of injecting pod IPs into the config, as mentioned in spotahome#550 (comment)
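For reference, Redis and Sentinel have supported hostnames since 6.2. A minimal sketch of what a hostname-based sentinel config might look like (the master DNS name and namespace below are hypothetical, not what the operator currently generates):

```conf
# sentinel.conf -- sketch only; requires Redis >= 6.2
# Resolve and announce hostnames instead of pod IPs
sentinel resolve-hostnames yes
sentinel announce-hostnames yes
# Monitor the master by a stable DNS name (e.g. from a headless service),
# so a reused pod IP from another cluster can never be mistaken for it
sentinel monitor mymaster redis-cluster-a-0.redis-cluster-a.my-namespace.svc.cluster.local 6379 2
```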

samof76 (Collaborator) commented Dec 4, 2024

Will this fix it? #3

I was also planning to implement network policies so that no egress is allowed from the sentinels except within the namespace.

samof76 (Collaborator) commented Dec 4, 2024

@dbackeus Can you please test by simulating the failure with the changes in #3?

dbackeus (Author) commented Dec 4, 2024

Bad timing, we ditched the operator in favour of our own templates two weeks back. If you're interested here's what we ended up with: reclaim-the-stack/get-started@736a3af

So I won't be spending time on testing. This would be my feedback though:

Your PR should prevent sentinels from accidentally electing new masters across redis clusters, which is certainly better than what we have currently.

However, it can still be confusing that cross-cluster sentinels start interacting, and it would not prevent the same problem from occurring on the Redis side.

E.g., redis-cluster-A-1 is following redis-cluster-A-2. Some disaster occurs forcing some pods to reschedule, and you end up with redis-cluster-B-1 getting the old IP of redis-cluster-A-1, causing redis-cluster-A-2 to re-sync from cluster B.

The root problem is relying on IPs in the first place. Hostnames are really the way to go in Kubernetes, since IPs are neither reliable nor deterministic.
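To make the Redis-side fix concrete: since Redis 6.2 a replica can announce itself by hostname rather than IP, so even if its pod IP is reused by another cluster, the master and sentinels keep addressing the pod by its stable DNS name. A sketch, assuming a StatefulSet with a headless service (all names below are hypothetical):

```conf
# redis.conf on each replica -- sketch only; requires Redis >= 6.2
# Announce the pod's stable DNS name (from a headless service) instead of
# its pod IP, so rescheduling and IP reuse cannot cross cluster boundaries
replica-announce-ip redis-cluster-a-1.redis-cluster-a.my-namespace.svc.cluster.local
replica-announce-port 6379
```

For this to work end to end, the sentinels also need `sentinel resolve-hostnames yes` so they can resolve the announced names.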

samof76 (Collaborator) commented Dec 5, 2024

I agree with you. This definitely does not solve the entire problem, within the cluster or across clusters. If you have unique names for your redis failovers, as we currently do, you will not run into this issue. I would definitely pick hostname support as a feature to be enabled by a flag in the CRD. Do you have comments on this?

@samof76 samof76 pinned this issue Dec 5, 2024