Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Canary Rollout with errors when there's an scale event during the experiment #1596

Closed
flaviolemos78 opened this issue Oct 19, 2021 · 2 comments · Fixed by #1597
Closed

Canary Rollout with errors when there's an scale event during the experiment #1596

flaviolemos78 opened this issue Oct 19, 2021 · 2 comments · Fixed by #1597
Labels
bug Something isn't working

Comments

@flaviolemos78
Copy link
Contributor

Summary

When using canary rollout with experiments, if during the experiment there's an scale event, the rollout status.canary.experiment metadata is deleted. On the next reconcile cycle, the canary reconciler will try to (re)create/recue the experiment which results in an error. The error continues on further reconciles until the canary step is incremented. Since every time a reconcile job fails, it gets requeued with an exponential backoff retry. In the limit, any rollout change can take up to 5 minutes to take effect, e.g: auto-scale events, promoting a canary....

Syncing replicas only (userPaused false, isScaling: true)
Patched: {"status":{"canary":{"currentExperiment":null},"conditions":[{"lastTransitionTime":"2021-10-15T05:33:44Z","lastUpdateTime":"2021-10-15T10:57:45Z","message":"Created new replica set \"portal-core-portal-df6c8ccc9\"","reason":"NewReplicaSetCreated","status":"True","type":"Progressing"},

Diagnostics

Noticed the errors on argo-rollouts 0.10.2 and on 1.1.0

rollout.status.canary.experiment is deleted after a scale event:

time="2021-10-19T07:49:57Z" level=info msg="Scaled up ReplicaSet infrastructure-kubernetes-asfvalidator-54f9fbbcc8 (revision 13) from 3 to 4" event_reason=ScalingReplicaSet namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Not finished reconciling stableRS" namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Enqueueing parent of infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator-54f9fbbcc8: Rollout infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator"
time="2021-10-19T07:49:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"infrastructure-kubernetes\", Name:\"infrastructure-kubernetes-asfvalidator\", UID:\"0e36ba35-650b-455a-bed3-0d66cc024e74\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"253722\", FieldPath:\"\"}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up ReplicaSet infrastructure-kubernetes-asfvalidator-54f9fbbcc8 (revision 13) from 3 to 4"
time="2021-10-19T07:49:57Z" level=info msg="Start processing" resource=infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=error msg="Failed to process: secret \"argo-rollouts-notification-secret\" not found" resource=infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Patched: {\"status\":{\"canary\":{\"currentExperiment\":null},\"conditions\":[{\"lastTransitionTime\":\"2021-10-19T07:47:56Z\",\"lastUpdateTime\":\"2021-10-19T07:47:56Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2021-10-18T10:05:01Z\",\"lastUpdateTime\":\"2021-10-19T07:47:56Z\",\"message\":\"Created new replica set \\\"infrastructure-kubernetes-asfvalidator-74f9778454\\\"\",\"reason\":\"NewReplicaSetCreated\",\"status\":\"True\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2021-10-19T07:49:57Z\",\"lastUpdateTime\":\"2021-10-19T07:49:57Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"}],\"observedGeneration\":\"31\"}}" generation=31 namespace=infrastructure-kubernetes resourceVersion=253722 rollout=infrastructure-kubernetes-asfvalidator

Then on the next reconcile, the controller tries to "rescue" the canary experiment, but it fails when trying the canary rs changes:

time="2021-10-19T07:49:57Z" level=info msg="Started syncing rollout" generation=31 namespace=infrastructure-kubernetes resourceVersion=253727 rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Reconciling experiment step (stepIndex: 0)" namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Enqueueing parent of infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator-54f9fbbcc8: Rollout infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator"
time="2021-10-19T07:49:57Z" level=info msg="Enqueueing parent of infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator-54f9fbbcc8: Rollout infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator"
time="2021-10-19T07:49:57Z" level=info msg="Encountered collision of existing experiment infrastructure-kubernetes-asfvalidator-74f9778454-14-0 (phase: Running, equal: true, controllerUIDEqual: true)" namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Created Experiment 'infrastructure-kubernetes-asfvalidator-74f9778454-14-0'" event_reason=ExperimentCreated namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Rescued infrastructure-kubernetes-asfvalidator-74f9778454-14-0 from inadvertent termination" namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"infrastructure-kubernetes\", Name:\"infrastructure-kubernetes-asfvalidator\", UID:\"0e36ba35-650b-455a-bed3-0d66cc024e74\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"253727\", FieldPath:\"\"}): type: 'Normal' reason: 'ExperimentCreated' Created Experiment 'infrastructure-kubernetes-asfvalidator-74f9778454-14-0'"
time="2021-10-19T07:49:57Z" level=error msg="roCtx.reconcile err ReplicaSet.apps \"infrastructure-kubernetes-asfvalidator-74f9778454\" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{\"app.kubernetes.io/instance\":\"infrastructure-kubernetes-asfvalidator-zerotraffic\", \"app.kubernetes.io/name\":\"asfvalidator-zerotraffic\", \"rollouts-pod-template-hash\":\"74f9778454\"}: `selector` does not match template `labels`" generation=31 namespace=infrastructure-kubernetes resourceVersion=253727 rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Reconciliation completed" generation=31 namespace=infrastructure-kubernetes resourceVersion=253727 rollout=infrastructure-kubernetes-asfvalidator time_ms=57.6819
time="2021-10-19T07:49:57Z" level=info msg="Started syncing rollout" generation=31 namespace=infrastructure-kubernetes resourceVersion=253722 rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Syncing replicas only (userPaused false, isScaling: true)" namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Scaled up ReplicaSet infrastructure-kubernetes-asfvalidator-54f9fbbcc8 (revision 13) from 3 to 4" event_reason=ScalingReplicaSet namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Not finished reconciling stableRS" namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Enqueueing parent of infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator-54f9fbbcc8: Rollout infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator"
time="2021-10-19T07:49:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"infrastructure-kubernetes\", Name:\"infrastructure-kubernetes-asfvalidator\", UID:\"0e36ba35-650b-455a-bed3-0d66cc024e74\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"253722\", FieldPath:\"\"}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled up ReplicaSet infrastructure-kubernetes-asfvalidator-54f9fbbcc8 (revision 13) from 3 to 4"
time="2021-10-19T07:49:57Z" level=info msg="Start processing" resource=infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=error msg="Failed to process: secret \"argo-rollouts-notification-secret\" not found" resource=infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Patched: {\"status\":{\"canary\":{\"currentExperiment\":null},\"conditions\":[{\"lastTransitionTime\":\"2021-10-19T07:47:56Z\",\"lastUpdateTime\":\"2021-10-19T07:47:56Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"False\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2021-10-18T10:05:01Z\",\"lastUpdateTime\":\"2021-10-19T07:47:56Z\",\"message\":\"Created new replica set \\\"infrastructure-kubernetes-asfvalidator-74f9778454\\\"\",\"reason\":\"NewReplicaSetCreated\",\"status\":\"True\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2021-10-19T07:49:57Z\",\"lastUpdateTime\":\"2021-10-19T07:49:57Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"}],\"observedGeneration\":\"31\"}}" generation=31 namespace=infrastructure-kubernetes resourceVersion=253722 rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="persisted to informer" generation=31 namespace=infrastructure-kubernetes resourceVersion=253727 rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Reconciliation completed" generation=31 namespace=infrastructure-kubernetes resourceVersion=253722 rollout=infrastructure-kubernetes-asfvalidator time_ms=30.4438
time="2021-10-19T07:49:57Z" level=info msg="Started syncing rollout" generation=31 namespace=infrastructure-kubernetes resourceVersion=253727 rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Reconciling experiment step (stepIndex: 0)" namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Enqueueing parent of infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator-54f9fbbcc8: Rollout infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator"
time="2021-10-19T07:49:57Z" level=info msg="Enqueueing parent of infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator-54f9fbbcc8: Rollout infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator"
time="2021-10-19T07:49:57Z" level=info msg="Encountered collision of existing experiment infrastructure-kubernetes-asfvalidator-74f9778454-14-0 (phase: Running, equal: true, controllerUIDEqual: true)" namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Created Experiment 'infrastructure-kubernetes-asfvalidator-74f9778454-14-0'" event_reason=ExperimentCreated namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Rescued infrastructure-kubernetes-asfvalidator-74f9778454-14-0 from inadvertent termination" namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Event(v1.ObjectReference{Kind:\"Rollout\", Namespace:\"infrastructure-kubernetes\", Name:\"infrastructure-kubernetes-asfvalidator\", UID:\"0e36ba35-650b-455a-bed3-0d66cc024e74\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"253727\", FieldPath:\"\"}): type: 'Normal' reason: 'ExperimentCreated' Created Experiment 'infrastructure-kubernetes-asfvalidator-74f9778454-14-0'"
time="2021-10-19T07:49:57Z" level=error msg="roCtx.reconcile err ReplicaSet.apps \"infrastructure-kubernetes-asfvalidator-74f9778454\" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{\"app.kubernetes.io/instance\":\"infrastructure-kubernetes-asfvalidator-zerotraffic\", \"app.kubernetes.io/name\":\"asfvalidator-zerotraffic\", \"rollouts-pod-template-hash\":\"74f9778454\"}: `selector` does not match template `labels`" generation=31 namespace=infrastructure-kubernetes resourceVersion=253727 rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="Reconciliation completed" generation=31 namespace=infrastructure-kubernetes resourceVersion=253727 rollout=infrastructure-kubernetes-asfvalidator time_ms=57.6819
time="2021-10-19T07:49:57Z" level=error msg="rollout syncHandler error: ReplicaSet.apps \"infrastructure-kubernetes-asfvalidator-74f9778454\" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{\"app.kubernetes.io/instance\":\"infrastructure-kubernetes-asfvalidator-zerotraffic\", \"app.kubernetes.io/name\":\"asfvalidator-zerotraffic\", \"rollouts-pod-template-hash\":\"74f9778454\"}: `selector` does not match template `labels`" namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
time="2021-10-19T07:49:57Z" level=info msg="rollout syncHandler queue retries: 3 : key \"infrastructure-kubernetes/infrastructure-kubernetes-asfvalidator\"" namespace=infrastructure-kubernetes rollout=infrastructure-kubernetes-asfvalidator
E1019 07:49:57.623497       1 controller.go:174] ReplicaSet.apps "infrastructure-kubernetes-asfvalidator-74f9778454" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{"app.kubernetes.io/instance":"infrastructure-kubernetes-asfvalidator-zerotraffic", "app.kubernetes.io/name":"asfvalidator-zerotraffic", "rollouts-pod-template-hash":"74f9778454"}: `selector` does not match template `labels`

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@perenesenko
Copy link
Member

Hi, @flaviolemos78 I'm trying to understand the issue/reproduce it. Could you help me?
What does "there's a scale event" mean? Maybe you could provide an example of the config.

@flaviolemos78
Copy link
Contributor Author

consider a Rollout with the following spec:

spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/instance: stable
  strategy:
    canary:
      maxSurge: 25
      maxUnavailable: 0
      steps:
        - experiment:
            duration: 30m
            templates:
              - metadata:
                  labels:
                    app.kubernetes.io/instance: experiment
                name: experiment
                replicas: 1
                selector:
                  matchLabels:
                    app.kubernetes.io/instance: experiment
                specRef: canary

During the rollout, the experiment step will start just fine, then if you go to the Rollout, and change the desired replicas to 4 (same that HPA does -> scaling event), the reconciler will try to apply the desired state, and it does with success, but then in later reconciliations, it will enter in an error loop when trying to reconcile the experiment on step 0

khhirani pushed a commit that referenced this issue Nov 3, 2021
Fixes #1596 (#1597)

* fix: rollout experiment template changing reference rs template labels

Signed-off-by: Flavio Lemos <[email protected]>

* docs: Add Farfetch to USERS.md

Signed-off-by: Flavio Lemos <[email protected]>
alexmt pushed a commit that referenced this issue Nov 29, 2021
Fixes #1596 (#1597)

* fix: rollout experiment template changing reference rs template labels

Signed-off-by: Flavio Lemos <[email protected]>

* docs: Add Farfetch to USERS.md

Signed-off-by: Flavio Lemos <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants