Operator panicking during critical downtime #122
Hi @rhefner1, it looks like it happened while iterating over all pods from the statefulset to check how long they've been running: redis-operator/operator/redisfailover/service/check.go, lines 208 to 223 in a970bf6
I think it just got a pod that had no status or start time, so it panicked when trying to read that information. Did the operator restart more than once? Is your failover still having issues? Did the failover have a disruption of service?
@jchanam The Redis cluster was restarting because I changed the memory request. I suppose that if a pod is still initializing, it may not have this information yet; I'm not sure what guarantees the Kubernetes API makes about that. And yes, the operator restarted every time it iterated over that specific redis failover (I have about 8 failovers) until all of the pods had started up. After that, the operator worked correctly. The failover did have a disruption of service due to being out of memory (my fault). When I deployed the new memory request, I had to wait for all three Redis nodes to restart, and then they sat there for a few minutes. Finally, redis-operator configured the cluster and a master was elected. Full recovery happened just a few minutes after that. I expected recovery to start shortly after the first node began restarting. Maybe the panic caused this slowdown?
I'm running redis-operator 0.5.5 (latest version) in Kubernetes 1.11.
I had a production incident this morning in one of our Redis clusters. The nodes started running out of memory and there was
MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
in the logs. I deployed a change to increase the memory of the cluster and the statefulset began restarting. Never mind that part, I understand the problem there (though I think recovery could be greatly improved). What was concerning is that redis-operator started panicking right as the Redis cluster was restarting. Here is a stacktrace [1]. Any idea about the cause here?
[1]
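For context on the MISCONF error quoted above: with Redis's default setting `stop-writes-on-bgsave-error yes`, Redis refuses all writes whenever the most recent background RDB save failed, which commonly happens under memory pressure since the BGSAVE fork needs memory of its own. A sketch of the relevant configuration, as an illustration only and not this cluster's actual config:

```
# redis.conf (defaults shown) — the setting behind the MISCONF error:
stop-writes-on-bgsave-error yes   # refuse writes if the last BGSAVE failed
save 900 1                        # an RDB snapshot rule that triggers BGSAVE

# Temporary mitigation during an incident (accepts the risk of
# losing data that has not been snapshotted):
#   redis-cli CONFIG SET stop-writes-on-bgsave-error no
```

This only unblocks writes; the underlying memory pressure still has to be fixed, as the memory-request change in this incident did.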