Panic while trying to reconcile Rollout, most stable pods restarted #1696
Thanks for the detailed information in the bug; it is helpful. Here's the more readable stack trace:
It's not that the label was missing. The only way it could have panicked is if `c.newRS` is nil, given this code:

```go
func (c *rolloutContext) createAnalysisRun(rolloutAnalysis *v1alpha1.RolloutAnalysis, infix string, labels map[string]string) (*v1alpha1.AnalysisRun, error) {
	args := analysisutil.BuildArgumentsForRolloutAnalysisRun(rolloutAnalysis.Args, c.stableRS, c.newRS, c.rollout)
	podHash := replicasetutil.GetPodTemplateHash(c.newRS)
	if podHash == "" {
		return nil, fmt.Errorf("Latest ReplicaSet '%s' has no pod hash in the labels", c.newRS.Name) // nil pointer dereference! c.newRS must be nil.
	}
	// ...
```
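For illustration only (a standalone sketch, not the project's actual code), the panic mechanism can be reproduced in a few lines: if the new ReplicaSet pointer is nil, the hash lookup returns an empty string, and formatting the error message then dereferences the nil pointer. A defensive nil check would avoid the panic:

```go
// Minimal illustrative sketch (not the argo-rollouts source): reproduces the
// nil-pointer pattern described above and shows a defensive guard.
package main

import "fmt"

type ReplicaSet struct {
	Name   string
	Labels map[string]string
}

// getPodTemplateHash mirrors the behavior assumed above: it returns "" when
// the ReplicaSet is nil or has no pod hash label.
func getPodTemplateHash(rs *ReplicaSet) string {
	if rs == nil {
		return ""
	}
	return rs.Labels["rollouts-pod-template-hash"]
}

func createAnalysisRun(newRS *ReplicaSet) error {
	podHash := getPodTemplateHash(newRS)
	if podHash == "" {
		if newRS == nil {
			// Without this guard, newRS.Name below would panic with a nil
			// pointer dereference, which matches the reported stack trace.
			return fmt.Errorf("latest ReplicaSet is nil")
		}
		return fmt.Errorf("latest ReplicaSet '%s' has no pod hash in the labels", newRS.Name)
	}
	return nil
}

func main() {
	fmt.Println(createAnalysisRun(nil)) // prints the error instead of panicking
}
```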
There is one interesting point in the logs. Scaling events are not as common, and they are detected when the Rollout's spec.replicas changes from its previous value. There is a separate code path to handle scaling events, which could be exposing a corner case.
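As a rough sketch of how such a check might work (illustrative only; the annotation key below is a placeholder, not necessarily what the controller uses), a scaling event can be recognized by comparing the Rollout's desired replica count with the value previously recorded on its ReplicaSets:

```go
// Illustrative sketch only; not the controller's actual implementation.
package main

import (
	"fmt"
	"strconv"
)

// isScalingEvent reports whether the desired replica count differs from the
// count previously recorded on a ReplicaSet's annotations.
func isScalingEvent(desiredReplicas int32, rsAnnotations map[string]string) bool {
	recorded, ok := rsAnnotations["example.io/desired-replicas"] // hypothetical key
	if !ok {
		return false
	}
	prev, err := strconv.ParseInt(recorded, 10, 32)
	if err != nil {
		return false
	}
	return int32(prev) != desiredReplicas
}

func main() {
	ann := map[string]string{"example.io/desired-replicas": "5"}
	fmt.Println(isScalingEvent(3, ann)) // true: desired replicas changed from 5 to 3
}
```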
I feel a fundamental fix is to reduce the work done here, so that we have fewer branches in the code path.
Summary
I'm not 100% sure how it got into this state, but I had a Rollout that was causing the controller to panic. I tried to follow the stack trace, which seems to point at this line, saying that the latest ReplicaSet didn't have a pod hash label, but I checked and it did have a correct `rollouts-pod-template-hash` label. So I'm not sure what happened there. I was able to get out of this situation, but it caused most of the stable pods to restart, and I was left with very few running pods, which would have caused an outage if this wasn't a test environment.
Details
Here's what the rollout looked like while the controller was panicking:
I was able to get the controller to stop panicking by updating the source deployment back to image:one (I'm using workload referencing). Right after I did that, the rollout was degraded, but it picked up the fact that I had stable-version pods running:
Soon afterwards, the status and message changed to:
But then immediately after that, all but one of the stable-version pods were restarted (seems somewhat similar to problem 2 from #1681?). Note that an AnalysisRun was created but immediately marked as successful by the controller; I'm not sure if that's related.
Then eventually, once all the replacement pods had started, the "extra" pod in revision 263 was terminated and the rollout became healthy again:
Diagnostics
Controller v1.1.1
https://gist.github.com/bpoland/7e3ef6ab03d42668ae8f0d1e867d4fa4
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.