Clean install does not work #265

Closed
Sieabah opened this issue May 11, 2020 · 9 comments


Sieabah commented May 11, 2020

Expected behaviour

To be able to connect to redis without the extra work of creating additional charts and services to work around the 127.0.0.1 "redis host".

Actual behaviour

All redis master instances are reported as 127.0.0.1, which fails 100% of the time and is incompatible with actual redis sidecar caches.

Steps to reproduce the behaviour

Clean install, directly into the default namespace. Use ioredis to connect to the sentinel; I'm pretty sure this doesn't work with any library (see the sketch below).
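
For reference, a minimal sketch of what any sentinel-aware client does, written in Go with go-redis rather than ioredis (the service name `rfs-redisfailover` and master set name `mymaster` are assumed defaults and may differ in your install):

```go
package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()

	// Ask a sentinel directly which address it advertises for the current master.
	sentinel := redis.NewSentinelClient(&redis.Options{
		Addr: "rfs-redisfailover:26379", // assumed sentinel service address
	})
	addr, err := sentinel.GetMasterAddrByName(ctx, "mymaster").Result()
	fmt.Println(addr, err) // prints [127.0.0.1 6379] when hitting this bug

	// A failover-aware client performs the same lookup internally and then
	// tries to connect to 127.0.0.1, i.e. the client's own pod, and fails.
	client := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName:    "mymaster",
		SentinelAddrs: []string{"rfs-redisfailover:26379"},
	})
	fmt.Println(client.Ping(ctx).Err())
}
```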

Environment

How are the pieces configured?

  • Redis Operator version: latest, whatever is available at the time of this issue
  • Kubernetes version: 1.15, latest GKE (cannot "upgrade" to fix the issue)
  • Kubernetes configuration used (eg: Is RBAC active?): whatever is in the master branch, unmodified except for adding a password
  • 3 nodes

Logs

Redis operates fine and the sentinels operate fine. The value returned when asking for the master IP is useless because all masters are local to the sentinels. (3 instances in a cluster?)

Is the idea to have more than 3 instances, with a minimum of 6 or more? Do we need to scale the cluster well past 6, and have the scheduler evict any pods that land on the same node, to absolutely ensure 127.0.0.1 is never the result?
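
For illustration only (this is not something the operator does today, and the label key/value are hypothetical), a hard node-level anti-affinity along these lines is roughly what it would take to keep a redis pod off any node that already runs a sentinel:

```go
package scheduling

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sentinelAntiAffinity keeps a redis pod off any node that already runs a
// pod labelled as a sentinel. The label selector is illustrative only.
func sentinelAntiAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app.kubernetes.io/component": "sentinel"},
				},
				// One redis pod per node relative to sentinel pods.
				TopologyKey: "kubernetes.io/hostname",
			}},
		},
	}
}
```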

In the meantime I'm going to go with another solution, as this seems fundamentally broken on smaller clusters.

The cluster permissions it requires are also excessive.

Is this operator meant to be used in production? It doesn't seem like it can prevent an outage if pods are scheduled on the same node.

Edit:
Additional issues were found with the "all resources" yaml. The order in which the resources are created creates a race for the operator deployment. If the resources are not created in time, the operator spits out an error that it doesn't have permission to run. (It's not a "failed state"; the pod must be terminated.)

Due to this failure the CRDs are never created, which also means you're unable to apply the redisfailover resource. Why are the CRDs not provided separately from the operator? Why must the operator be the one that creates the CRD?


Sieabah commented May 11, 2020

After bringing it up and down a few times, it seems to be hit or miss whether the master election actually occurs. It depends on the number of nodes and whether the requesting application is hitting a sentinel that is co-located with the master.


Sieabah commented May 14, 2020

Yeah, I just think this isn't workable on clusters with a small node count; the risk that a sentinel and a redis instance are scheduled together is too high.

You can get this error to happen easily by starting the operator in kubernetes on docker.

```
1:S 14 May 2020 03:59:57.407 * Connecting to MASTER 127.0.0.1:6379
1:S 14 May 2020 03:59:57.407 * MASTER <-> REPLICA sync started
1:S 14 May 2020 03:59:57.407 * Non blocking connect for SYNC fired the event.
1:S 14 May 2020 03:59:57.407 * Master replied to PING, replication can continue...
1:S 14 May 2020 03:59:57.407 * Partial resynchronization not possible (no cached master)
1:S 14 May 2020 03:59:57.407 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
```

@yuki-xin

same issue

@diego-maravankin

Are you storing to disk?

I had the same issue, and it was caused by bad permissions in the redis storage folder (host level). Once I corrected that, the cluster booted and synchronized. I'll keep testing, but that solved my issue.


XBeg9 commented Nov 4, 2020

I have the same issue, were you able to fix it? @Sieabah


chlunde commented Jan 8, 2021

In my tests it usually starts within ~3-4 minutes.

A workaround is to spin up the cluster with 1 replica, and then expand to 3 replicas when a node is ready.

I think the problem was introduced in #206 with `PodManagementPolicy: v1.ParallelPodManagement`. If I change the code to the default pod management policy, the StatefulSet will create one pod first, which will be elected as master and become ready, and then the cluster will expand.

This will hit the `if len(redisesIP) == 1` code path, which makes that node the master.

Now, instead, it calls `r.rfHealer.SetOldestAsMaster(rf)` after a while, but that typically fails the first time because the pod does not yet have an IP when the code is called, so it gets requeued. It would be nice if this code were executed as soon as the pod is scheduled and Running, and if it were requeued with a shorter delay when there is no master.

```
level=debug msg="time 360000000 more than expected. Not even one master, fixing..." operator=redis-operator src="handler.go:80"
level=debug msg="New master is rfr-redisfailover-0 with ip " operator=redis-operator src="checker.go:125"
```

```go
nMasters, err := r.rfChecker.GetNumberMasters(rf)
if err != nil {
	return err
}
switch nMasters {
case 0:
	redisesIP, err := r.rfChecker.GetRedisesIPs(rf)
	if err != nil {
		return err
	}
	if len(redisesIP) == 1 {
		if err := r.rfHealer.MakeMaster(redisesIP[0], rf); err != nil {
			return err
		}
		break
	}
	minTime, err2 := r.rfChecker.GetMinimumRedisPodTime(rf)
	if err2 != nil {
		return err2
	}
	if minTime > timeToPrepare {
		r.logger.Debugf("time %.f more than expected. Not even one master, fixing...", minTime.Round(time.Second).Seconds())
		// We can consider there's an error
		if err2 := r.rfHealer.SetOldestAsMaster(rf); err2 != nil {
			return err2
		}
	} else {
		// We'll wait until failover is done
		r.logger.Debug("No master found, wait until failover")
		return nil
	}
```

@github-actions

This issue is stale because it has been open for 45 days with no activity.

@github-actions github-actions bot added the stale label Jan 14, 2022
@github-actions

This issue was closed because it has been inactive for 14 days since being marked as stale.


Sieabah commented Feb 2, 2022

@XBeg9 Sorry for not catching this reply earlier. I was not; I wasn't using redis for anything other than pub/sub, so I ended up rolling my own exchange with Elixir and Phoenix.

Couldn't find any reliable solution for redis on small clusters. I use singleton instances, and when nodes cannot connect to redis they abruptly kill any in-progress work until they can reconnect. It's the only way I can get redis to work at all.
