Operator panicking during critical downtime #122
Hi @rhefner1, it looks like it happened while iterating over all pods from the statefulset to check how long they've been running: redis-operator/operator/redisfailover/service/check.go, lines 208 to 223 in a970bf6
I think it just got a pod that had no status or start time, so it panicked when trying to read that information. Did the operator restart more than once? Is your failover still having issues? Did the failover have a disruption of service?
@jchanam The Redis cluster was restarting because I changed the memory request. I suppose that if a pod is still initializing, it may not have this information yet; I'm not sure what guarantees the Kubernetes API makes about that. And yes, the operator restarted every time it iterated over that specific redis failover (I have about 8 failovers) until all of the pods had started up. After that, the operator worked correctly. The failover did have a disruption of service due to being out of memory (my fault). When I deployed the new memory request, I had to wait for all three Redis nodes to restart, and then they sat there for a few minutes. Finally, redis-operator configured the cluster and a master was elected. Full recovery happened just a few minutes after that. I expected recovery to start shortly after the first node began restarting. Maybe the panic caused this slowdown?
I'm running redis-operator 0.5.5 (latest version) in Kubernetes 1.11.
I had a production incident this morning in one of our Redis clusters. The nodes started running out of memory and there was
MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
in the logs. I deployed a change to increase the memory of the cluster and the statefulset began restarting. Never mind that part, I understand the problem there (though I think recovery could be greatly improved). What was concerning is that redis-operator started panicking right as the Redis cluster was restarting. Here is a stacktrace [1]. Any idea about the cause here?
[1]
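For context on the MISCONF error quoted above: with Redis's default setting `stop-writes-on-bgsave-error yes`, Redis refuses all writes whenever the most recent background RDB save failed, which commonly happens under memory pressure since the BGSAVE fork needs memory of its own. A sketch of the relevant configuration, as an illustration only and not this cluster's actual config:

```
# redis.conf (defaults shown) — the setting behind the MISCONF error:
stop-writes-on-bgsave-error yes   # refuse writes if the last BGSAVE failed
save 900 1                        # an RDB snapshot rule that triggers BGSAVE

# Temporary mitigation during an incident (accepts the risk of
# losing data that has not been snapshotted):
#   redis-cli CONFIG SET stop-writes-on-bgsave-error no
```

This only unblocks writes; the underlying memory pressure still has to be fixed, as the memory-request change in this incident did.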