Flux restarts primary deployment before canary analysis begins #928

Closed
dmsalomon opened this issue Jun 4, 2021 · 8 comments · Fixed by #936 or #939


Describe the bug

I have been experimenting with Flagger using the gitops-istio repository. I have found that when updating an image tag, Flux reprovisions the backend-primary deployment immediately on reconciliation, before canary analysis begins. During canary analysis, both the backend and backend-primary deployments are then running the new image, which obviously defeats the point of canary analysis.

If I update the image tag by editing the deployment manually (i.e. kubectl -n prod edit deployment backend), the canary analysis works as expected (backend deployment is updated and scaled up -> canary analysis proceeds -> backend-primary is updated if successful -> backend is scaled down).
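For reference, the flow above is driven by a Flagger Canary resource targeting the backend deployment. A minimal sketch of such a resource is below; the name, namespace, port, and analysis values are illustrative assumptions, not the exact manifest from gitops-istio:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: backend
  namespace: prod
spec:
  # Flagger clones the target into backend-primary and shifts traffic between them
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  service:
    port: 9898            # illustrative container port
  analysis:
    interval: 30s         # how often Flagger evaluates metrics
    threshold: 5          # failed checks before rollback
    maxWeight: 50         # maximum traffic weight routed to the canary
    stepWeight: 5         # traffic increment per interval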

To Reproduce

Run kubectl -n prod get deployment -w to watch the status of the deployments. Then update the image tag in git and wait for reconciliation. You should see that the backend-primary deployment is restarted and its image tag is immediately bumped to the new version.

Expected behavior

The backend-primary image should only be updated after the canary analysis completes successfully.

Additional context

Using all versions provided by 570060d on gitops-istio.

  • Flagger version: v1.11.0
  • Kubernetes version: 1.20 (EKS)
  • Service Mesh provider: istio (v1.10.0)
  • Ingress provider: istio-ingressgateway (not sure what this is asking about)

dmsalomon commented Jun 4, 2021

I was able to get the expected behavior by updating the kustomize.toolkit.fluxcd.io/v1beta1 Kustomization overlay in apps.yaml instead of the kustomize.config.k8s.io/v1beta1 Kustomization in kustomization.yaml. So I updated apps.yaml to look like this:

apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 30m0s
  dependsOn:
    - name: istio-system
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps
  prune: true
  validation: client
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: flagger-loadtester
      namespace: prod
  images:
  - name: ghcr.io/stefanprodan/podinfo
    newName: ghcr.io/stefanprodan/podinfo
    newTag: 5.0.1

and then I removed the images section from kustomization.yaml.
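For context, the block removed from kustomization.yaml was the standard kustomize image override; it looked roughly like the following (the exact tag and surrounding fields may differ):

# kustomization.yaml (kustomize.config.k8s.io/v1beta1): images section removed,
# so the image tag is now set only on the Flux Kustomization shown above
images:
  - name: ghcr.io/stefanprodan/podinfo
    newName: ghcr.io/stefanprodan/podinfo
    newTag: 5.0.1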

My guess is that making this change causes Flux to edit the deployment in place instead of overwriting the deployment resource when the kustomization is edited "outside" of Flux. I would appreciate any additional information about this behavior. I know this is more of a Flux-related issue, but since Flagger is designed to be interoperable with Flux, I would appreciate more clarity on how Flux and Flagger interact with each other.


Whyeasy commented Jun 17, 2021

I'm running into the same issue. Sadly, the fix by @stefanprodan didn't work for me. This is what I see happening at the moment with flagger v1.12.0:

  • The current primary deployment and pods are deleted.
  • A new primary deployment and pods are created.
  • After these are online, the canary is created.

I'm also using Flux in this case, and images are updated by bumping the image tag in kustomization.yaml.

stefanprodan (Member) commented:

@Whyeasy make sure to use flux v0.15.0


Whyeasy commented Jun 17, 2021

Updated flux, recreated all resources to start from scratch. Still the same sequence.

stefanprodan (Member) commented:

OK, can you please post here the output of kubectl get deployment <your-app>-primary -oyaml?


Whyeasy commented Jun 17, 2021

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    deployment.kubernetes.io/revision: "2"
    kustomize.toolkit.fluxcd.io/checksum: 0a40893bfdc545d62125bd3e74eeb2ebaa7097c2
  creationTimestamp: "2021-06-17T12:47:52Z"
  generation: 3
  labels:
    app: booking-service-primary
    cluster: flagship-dev
    kustomize.toolkit.fluxcd.io/name: starfleet-dev
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: booking-service-primary
  namespace: starfleet-dev
  ownerReferences:
  - apiVersion: flagger.app/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Canary
    name: booking-service
    uid: d1b9b88c-2571-48fb-a037-09613f1998d0
  resourceVersion: "828453816"
  selfLink: /apis/apps/v1/namespaces/starfleet-dev/deployments/booking-service-primary
  uid: 89bddb7b-f6a3-44bb-afa1-fb2470ad7b53
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: booking-service-primary
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
        flagger-id: 3e80a49a-4ec7-410b-a471-7e4514f6737a
      creationTimestamp: null
      labels:
        app: booking-service-primary
        cluster: flagship-dev
    spec:
      containers:
      - <containers>

stefanprodan (Member) commented:

@Whyeasy thanks. Just to be sure this bug is in the latest version, can you please post here the output of kubectl get deployment flagger -oyaml?


Whyeasy commented Jun 17, 2021

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    meta.helm.sh/release-name: flagger
    meta.helm.sh/release-namespace: linkerd
  creationTimestamp: "2021-05-17T08:31:03Z"
  generation: 3
  labels:
    app.kubernetes.io/instance: flagger
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flagger
    helm.sh/chart: flagger-1.12.0
    helm.toolkit.fluxcd.io/name: flagger
    helm.toolkit.fluxcd.io/namespace: linkerd
  name: flagger
  namespace: linkerd
  resourceVersion: "828428857"
  selfLink: /apis/apps/v1/namespaces/linkerd/deployments/flagger
  uid: be21a5c8-58a9-4721-b6d8-e3105754a668
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: flagger
      app.kubernetes.io/name: flagger
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        appmesh.k8s.aws/sidecarInjectorWebhook: disabled
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: flagger
        app.kubernetes.io/name: flagger
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/instance: flagger
                  app.kubernetes.io/name: flagger
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - command:
        - ./flagger
        - -log-level=info
        - -mesh-provider=linkerd
        - -metrics-server=http://prometheus-kube-prometheus-prometheus.monitoring:9090
        - -enable-config-tracking=true
        - -slack-user=flagger
        - -enable-leader-election=true
        - -leader-election-namespace=linkerd
        image: ghcr.io/fluxcd/flagger:1.12.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - wget
            - --quiet
            - --tries=1
            - --timeout=4
            - --spider
            - http://localhost:8080/healthz
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: flagger
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - wget
            - --quiet
            - --tries=1
            - --timeout=4
            - --spider
            - http://localhost:8080/healthz
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 10m
            memory: 32Mi
        securityContext:
          readOnlyRootFilesystem: true
          runAsUser: 10001
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        node_pool: preemptible
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: flagger
      serviceAccountName: flagger
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: preemptible
        operator: Equal
        value: "true"
