Clean install does not work #265

Closed
Sieabah opened this issue May 11, 2020 · 9 comments


Sieabah commented May 11, 2020

Expected behaviour

To be able to connect to redis without the extra work of creating additional charts and services to work around the 127.0.0.1 "redis host".

Actual behaviour

All redis master instances are reported as 127.0.0.1, which fails 100% of the time and is incompatible with actual redis sidecar caches.

Steps to reproduce the behaviour

Clean install, directly into the default namespace. Use ioredis to connect to the sentinel; I'm pretty sure this doesn't work with any library (see the sketch below).
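
For reference, a minimal sketch of what any sentinel-aware client does, written in Go with go-redis rather than ioredis (the service name `rfs-redisfailover` and master set name `mymaster` are assumed defaults and may differ in your install):

```go
package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()

	// Ask a sentinel directly which address it advertises for the current master.
	sentinel := redis.NewSentinelClient(&redis.Options{
		Addr: "rfs-redisfailover:26379", // assumed sentinel service address
	})
	addr, err := sentinel.GetMasterAddrByName(ctx, "mymaster").Result()
	fmt.Println(addr, err) // prints [127.0.0.1 6379] when hitting this bug

	// A failover-aware client performs the same lookup internally and then
	// tries to connect to 127.0.0.1, i.e. the client's own pod, and fails.
	client := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName:    "mymaster",
		SentinelAddrs: []string{"rfs-redisfailover:26379"},
	})
	fmt.Println(client.Ping(ctx).Err())
}
```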

Environment

How are the pieces configured?

  • Redis Operator version: latest, whatever is available at the time of this issue
  • Kubernetes version: 1.15, latest GKE (cannot "upgrade" to fix the issue)
  • Kubernetes configuration used (eg: Is RBAC active?): whatever is in the master branch, unmodified except for adding a password
  • 3 nodes

Logs

Redis operates fine and the sentinels operate fine. The value returned when asking for the master IP is useless because all masters are local to the sentinels. (3 instances in a cluster?)

Is the idea to have more than 3 instances, with a minimum of 6 or more? Do we need to scale the cluster well past 6, and have the scheduler evict any pods that land on the same node, to absolutely ensure 127.0.0.1 is never the result?
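
For illustration only (this is not something the operator does today, and the label key/value are hypothetical), a hard node-level anti-affinity along these lines is roughly what it would take to keep a redis pod off any node that already runs a sentinel:

```go
package scheduling

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sentinelAntiAffinity keeps a redis pod off any node that already runs a
// pod labelled as a sentinel. The label selector is illustrative only.
func sentinelAntiAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app.kubernetes.io/component": "sentinel"},
				},
				// One redis pod per node relative to sentinel pods.
				TopologyKey: "kubernetes.io/hostname",
			}},
		},
	}
}
```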

In the meantime I'm going to go with another solution, as this seems fundamentally broken on smaller clusters.

The cluster permissions it requires are also excessive.

Is this operator meant to be used in production? It doesn't seem like it can prevent an outage if pods are scheduled on the same node.

Edit:
Additional issues were found with the "all resources" yaml. The order in which the resources are created creates a race for the operator deployment. If the resources are not created in time, the operator spits out an error that it doesn't have permission to run. (It's not a "failed state"; the pod must be terminated.)

Due to this failure the CRDs are never created, which also means you're unable to apply the redisfailover resource. Why are the CRDs not provided separately from the operator? Why must the operator be the one that creates the CRD?


Sieabah commented May 11, 2020

After bringing it up and down a few times, it seems to be hit or miss whether the master election actually occurs. It depends on the number of nodes and whether the requesting application is hitting a sentinel that is co-located with the master.


Sieabah commented May 14, 2020

Yeah, I just think this isn't workable on clusters with a small node count; the risk that a sentinel and a redis instance are scheduled together is too high.

You can get this error to happen easily by starting the operator in kubernetes on docker.

```
1:S 14 May 2020 03:59:57.407 * Connecting to MASTER 127.0.0.1:6379
1:S 14 May 2020 03:59:57.407 * MASTER <-> REPLICA sync started
1:S 14 May 2020 03:59:57.407 * Non blocking connect for SYNC fired the event.
1:S 14 May 2020 03:59:57.407 * Master replied to PING, replication can continue...
1:S 14 May 2020 03:59:57.407 * Partial resynchronization not possible (no cached master)
1:S 14 May 2020 03:59:57.407 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
```

@yuki-xin

same issue

@diego-maravankin

Are you storing to disk?

I had the same issue, and it was caused by bad permissions in the redis storage folder (host level). Once I corrected that, the cluster booted and synchronized. I'll keep testing, but that solved my issue.


XBeg9 commented Nov 4, 2020

I have the same issue, were you able to fix it? @Sieabah


chlunde commented Jan 8, 2021

In my tests it usually starts within ~3-4 minutes.

A workaround is to spin up the cluster with 1 replica, and then expand to 3 replicas when a node is ready.

I think the problem was introduced in #206 with `PodManagementPolicy: v1.ParallelPodManagement`. If I change the code to the default pod management policy, the StatefulSet will create one pod first, which will be elected as master and become ready, and then the cluster will expand.

This will hit the `if len(redisesIP) == 1` code path, which makes that node the master.

Now, instead, it calls `r.rfHealer.SetOldestAsMaster(rf)` after a while, but that typically fails the first time because the pod does not yet have an IP when the code is called, so it gets requeued. It would be nice if this code were executed as soon as the pod is scheduled and Running, and if it were requeued with a shorter delay when there is no master.

```
level=debug msg="time 360000000 more than expected. Not even one master, fixing..." operator=redis-operator src="handler.go:80"
level=debug msg="New master is rfr-redisfailover-0 with ip " operator=redis-operator src="checker.go:125"
```

```go
nMasters, err := r.rfChecker.GetNumberMasters(rf)
if err != nil {
	return err
}
switch nMasters {
case 0:
	redisesIP, err := r.rfChecker.GetRedisesIPs(rf)
	if err != nil {
		return err
	}
	if len(redisesIP) == 1 {
		if err := r.rfHealer.MakeMaster(redisesIP[0], rf); err != nil {
			return err
		}
		break
	}
	minTime, err2 := r.rfChecker.GetMinimumRedisPodTime(rf)
	if err2 != nil {
		return err2
	}
	if minTime > timeToPrepare {
		r.logger.Debugf("time %.f more than expected. Not even one master, fixing...", minTime.Round(time.Second).Seconds())
		// We can consider there's an error
		if err2 := r.rfHealer.SetOldestAsMaster(rf); err2 != nil {
			return err2
		}
	} else {
		// We'll wait until failover is done
		r.logger.Debug("No master found, wait until failover")
		return nil
	}
```

@github-actions

This issue is stale because it has been open for 45 days with no activity.

@github-actions github-actions bot added the stale label Jan 14, 2022
@github-actions

This issue was closed because it has been inactive for 14 days since being marked as stale.


Sieabah commented Feb 2, 2022

@XBeg9 Sorry for not catching this reply earlier. I was not; I wasn't using redis for anything other than pub/sub, so I ended up rolling my own exchange with Elixir and Phoenix.

Couldn't find any reliable solution for redis on small clusters. I use singleton instances, and when nodes cannot connect to redis they abruptly kill any in-progress work until they can reconnect. It's the only way I can get redis to work at all.
