
Operator panicking during critical downtime #122

Closed
rhefner1 opened this issue Feb 26, 2019 · 2 comments

@rhefner1 (Contributor)

I'm running redis-operator 0.5.5 (latest version) in Kubernetes 1.11.

I had a production incident this morning in one of our Redis clusters. The nodes started running out of memory and the logs showed `MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.` I deployed a change to increase the memory of the cluster and the statefulset began restarting.

Never mind that; I understand the problem there (though I think recovery could be greatly improved). What was concerning is that redis-operator started panicking right as the Redis cluster was restarting. Here is the stack trace [1]. Any idea about the cause?

[1]

E0226 08:29:38.703198       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:573
/usr/local/go/src/runtime/panic.go:505
/usr/local/go/src/runtime/panic.go:63
/usr/local/go/src/runtime/signal_unix.go:388
/go/src/github.com/spotahome/redis-operator/operator/redisfailover/service/check.go:215
/go/src/github.com/spotahome/redis-operator/operator/redisfailover/checker.go:48
/go/src/github.com/spotahome/redis-operator/operator/redisfailover/handler.go:82
/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:299
/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:277
/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:247
/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:223
/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:207
/go/src/github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:207
/usr/local/go/src/runtime/asm_amd64.s:2361
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xef0fe3]
 goroutine 68 [running]:
github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0x1023900, 0x19d33e0)
	/usr/local/go/src/runtime/panic.go:505 +0x229
github.com/spotahome/redis-operator/operator/redisfailover/service.(*RedisFailoverChecker).GetMinimumRedisPodTime(0xc420407710, 0xc4200d0000, 0xc420390560, 0x2, 0x2)
	/go/src/github.com/spotahome/redis-operator/operator/redisfailover/service/check.go:215 +0x163
github.com/spotahome/redis-operator/operator/redisfailover.(*RedisFailoverHandler).CheckAndHeal(0xc420215830, 0xc4200d0000, 0xc4206087b0, 0xc4201b1270)
	/go/src/github.com/spotahome/redis-operator/operator/redisfailover/checker.go:48 +0xcb8
github.com/spotahome/redis-operator/operator/redisfailover.(*RedisFailoverHandler).Add(0xc420215830, 0x1272080, 0xc420608750, 0x125cf80, 0xc4200d0000, 0x1284640, 0x1a03d10)
	/go/src/github.com/spotahome/redis-operator/operator/redisfailover/handler.go:82 +0x139
github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller.(*generic).handleAdd(0xc4201abe00, 0x1272080, 0xc420608750, 0xc42047dfe0, 0x1a, 0x125cf80, 0xc4200d0000, 0x0, 0x0)
	/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:299 +0x43b
github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller.(*generic).processJob(0xc4201abe00, 0x1272080, 0xc420608720, 0xc42047dfe0, 0x1a, 0xc420608720, 0x1284640)
	/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:277 +0xee
github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller.(*generic).getAndProcessNextJob(0xc4201abe00, 0xc4204fc100)
	/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:247 +0x219
github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller.(*generic).runWorker(0xc4201abe00)
	/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:223 +0x2b
github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller.(*generic).(github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller.runWorker)-fm()
	/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:207 +0x2a
github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc4204d17b0)
	/go/src/github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc4209bbfb0, 0x3b9aca00, 0x0, 0x1, 0xc4202366c0)
	/go/src/github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc4204d17b0, 0x3b9aca00, 0xc4202366c0)
	/go/src/github.com/spotahome/redis-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller.(*generic).run.func1(0xc4201abe00, 0xc4202366c0)
	/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:207 +0x54
created by github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller.(*generic).run
	/go/src/github.com/spotahome/redis-operator/vendor/github.com/spotahome/kooper/operator/controller/generic.go:206 +0x1d7
@jchanam (Collaborator) commented Feb 27, 2019

Hi @rhefner1,

It looks like it happened while iterating over all the pods of the statefulset to get how long they've been running:

func (r *RedisFailoverChecker) GetMinimumRedisPodTime(rf *redisfailoverv1alpha2.RedisFailover) (time.Duration, error) {
	minTime := 100000 * time.Hour // More than ten years
	rps, err := r.k8sService.GetStatefulSetPods(rf.Namespace, GetRedisName(rf))
	if err != nil {
		return minTime, err
	}
	for _, redisNode := range rps.Items {
		start := redisNode.Status.StartTime.Round(time.Second)
		alive := time.Now().Sub(start)
		r.logger.Debugf("Pod %s has been alive for %.f seconds", redisNode.Status.PodIP, alive.Seconds())
		if alive < minTime {
			minTime = alive
		}
	}
	return minTime, nil
}

I think it got a pod that didn't have a status or start time yet, so it panicked when trying to read that information.
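A guard on that loop would avoid the dereference. Just as a sketch of the nil check (not a tested patch; the names are taken from the snippet above):

func (r *RedisFailoverChecker) GetMinimumRedisPodTime(rf *redisfailoverv1alpha2.RedisFailover) (time.Duration, error) {
	minTime := 100000 * time.Hour // More than ten years
	rps, err := r.k8sService.GetStatefulSetPods(rf.Namespace, GetRedisName(rf))
	if err != nil {
		return minTime, err
	}
	for _, redisNode := range rps.Items {
		// Status.StartTime is a *metav1.Time and is nil until the kubelet
		// acknowledges the pod, which is what the stack trace points at.
		if redisNode.Status.StartTime == nil {
			continue
		}
		start := redisNode.Status.StartTime.Round(time.Second)
		alive := time.Now().Sub(start)
		r.logger.Debugf("Pod %s has been alive for %.f seconds", redisNode.Status.PodIP, alive.Seconds())
		if alive < minTime {
			minTime = alive
		}
	}
	return minTime, nil
}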

Did the operator restart more than once? Is your failover still having issues? Did the failover suffer a disruption of service?

@rhefner1 (Contributor, Author)

@jchanam The Redis cluster was restarting because I changed the memory request. I suppose that if a pod is still initializing, it may not have that information yet. I'm not sure what guarantees the Kubernetes API gives about those fields.
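For what it's worth, PodStatus.StartTime is declared in k8s.io/api/core/v1 as an optional *metav1.Time, so it being unset for a pod the kubelet hasn't acknowledged yet looks possible by design. A minimal standalone check (illustrative only, not code from the operator):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// PodStatus.StartTime is *metav1.Time (optional), so a pod that hasn't
	// been acknowledged by the kubelet yet comes back with a nil StartTime.
	var pod corev1.Pod // zero-value pod stands in for a just-created one
	if pod.Status.StartTime == nil {
		fmt.Println("no start time yet; skip this pod")
		return
	}
	fmt.Println("started at", pod.Status.StartTime.Time)
}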

And yes, the operator restarted every time it iterated over that specific redis failover (I have about 8 failovers) until all of the pods had started up. Then the operator worked correctly.

The failover did have a disruption of service due to being out of memory (my fault). When I deployed the new memory request, I had to wait for all three Redis nodes to restart and then they sat there for a few minutes. Finally, redis-operator configured the cluster and a master was elected. Full recovery after that happened just a few minutes later.

I expected that recovery would start shortly after the first node started restarting. Maybe the panic caused this slowdown?
