Flux restarts primary deployment before canary analysis begins #928

Closed
dmsalomon opened this issue Jun 4, 2021 · 8 comments · Fixed by #936 or #939


Describe the bug

I have been experimenting with Flagger using the gitops-istio repository. I have found that when updating an image tag, Flux reprovisions the backend-primary deployment immediately on reconciliation, before canary analysis begins. During canary analysis, both the backend and backend-primary deployments are then running the new image, which obviously defeats the point of canary analysis.

If I update the image tag by editing the deployment manually (i.e. kubectl -n prod edit deployment backend), the canary analysis works as expected (backend deployment is updated and scaled up -> canary analysis proceeds -> backend-primary is updated if successful -> backend is scaled down).
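For reference, the flow above is driven by a Flagger Canary resource targeting the backend deployment. A minimal sketch of such a resource is below; the name, namespace, port, and analysis values are illustrative assumptions, not the exact manifest from gitops-istio:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: backend
  namespace: prod
spec:
  # Flagger clones the target into backend-primary and shifts traffic between them
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  service:
    port: 9898            # illustrative container port
  analysis:
    interval: 30s         # how often Flagger evaluates metrics
    threshold: 5          # failed checks before rollback
    maxWeight: 50         # maximum traffic weight routed to the canary
    stepWeight: 5         # traffic increment per interval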

To Reproduce

Run kubectl -n prod get deployment -w to watch the status of the deployments. Then update the image tag in git and wait for reconciliation. You should see that the backend-primary deployment is restarted and its image tag is immediately bumped to the new version.

Expected behavior

The backend-primary image should only be updated after the canary analysis completes successfully.

Additional context

Using all versions provided by 570060d on gitops-istio.

  • Flagger version: v1.11.0
  • Kubernetes version: 1.20 (EKS)
  • Service Mesh provider: istio (v1.10.0)
  • Ingress provider: istio-ingressgateway (not sure what this is asking about)

dmsalomon commented Jun 4, 2021

I was able to get the expected behavior by updating the kustomize.toolkit.fluxcd.io/v1beta1 Kustomization overlay in apps.yaml instead of the kustomize.config.k8s.io/v1beta1 Kustomization in kustomization.yaml. So I updated apps.yaml to look like this:

apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 30m0s
  dependsOn:
    - name: istio-system
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps
  prune: true
  validation: client
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: flagger-loadtester
      namespace: prod
  images:
  - name: ghcr.io/stefanprodan/podinfo
    newName: ghcr.io/stefanprodan/podinfo
    newTag: 5.0.1

and then I removed the images section from kustomization.yaml.
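For context, the block removed from kustomization.yaml was the standard kustomize image override; it looked roughly like the following (the exact tag and surrounding fields may differ):

# kustomization.yaml (kustomize.config.k8s.io/v1beta1): images section removed,
# so the image tag is now set only on the Flux Kustomization shown above
images:
  - name: ghcr.io/stefanprodan/podinfo
    newName: ghcr.io/stefanprodan/podinfo
    newTag: 5.0.1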

My guess is that making this change causes Flux to edit the deployment in place instead of overwriting the deployment resource when the kustomization is edited "outside" of Flux. I would appreciate any additional information about this behavior. I know this is more of a Flux-related issue, but since Flagger is designed to be interoperable with Flux, I would appreciate more clarity on how Flux and Flagger interact with each other.


Whyeasy commented Jun 17, 2021

I'm running into the same issue. Sadly, the fix by @stefanprodan didn't work for me. This is what I see happening at the moment with flagger v1.12.0:

  • The current primary deployment and pods are deleted.
  • A new primary deployment and pods are created.
  • After these are online, the canary is created.

I'm also using Flux in this case, and images are updated by bumping the image tag in kustomization.yaml.

stefanprodan (Member) commented:

@Whyeasy make sure to use flux v0.15.0


Whyeasy commented Jun 17, 2021

Updated flux, recreated all resources to start from scratch. Still the same sequence.

stefanprodan (Member) commented:

OK, can you please post here the output of kubectl get deployment <your-app>-primary -oyaml?


Whyeasy commented Jun 17, 2021

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    deployment.kubernetes.io/revision: "2"
    kustomize.toolkit.fluxcd.io/checksum: 0a40893bfdc545d62125bd3e74eeb2ebaa7097c2
  creationTimestamp: "2021-06-17T12:47:52Z"
  generation: 3
  labels:
    app: booking-service-primary
    cluster: flagship-dev
    kustomize.toolkit.fluxcd.io/name: starfleet-dev
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: booking-service-primary
  namespace: starfleet-dev
  ownerReferences:
  - apiVersion: flagger.app/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Canary
    name: booking-service
    uid: d1b9b88c-2571-48fb-a037-09613f1998d0
  resourceVersion: "828453816"
  selfLink: /apis/apps/v1/namespaces/starfleet-dev/deployments/booking-service-primary
  uid: 89bddb7b-f6a3-44bb-afa1-fb2470ad7b53
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: booking-service-primary
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
        flagger-id: 3e80a49a-4ec7-410b-a471-7e4514f6737a
      creationTimestamp: null
      labels:
        app: booking-service-primary
        cluster: flagship-dev
    spec:
      containers:
      - <containers>

stefanprodan (Member) commented:

@Whyeasy thanks. Just to be sure this bug is in the latest version, can you please post here the output of kubectl get deployment flagger -oyaml?


Whyeasy commented Jun 17, 2021

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    meta.helm.sh/release-name: flagger
    meta.helm.sh/release-namespace: linkerd
  creationTimestamp: "2021-05-17T08:31:03Z"
  generation: 3
  labels:
    app.kubernetes.io/instance: flagger
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: flagger
    helm.sh/chart: flagger-1.12.0
    helm.toolkit.fluxcd.io/name: flagger
    helm.toolkit.fluxcd.io/namespace: linkerd
  name: flagger
  namespace: linkerd
  resourceVersion: "828428857"
  selfLink: /apis/apps/v1/namespaces/linkerd/deployments/flagger
  uid: be21a5c8-58a9-4721-b6d8-e3105754a668
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: flagger
      app.kubernetes.io/name: flagger
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        appmesh.k8s.aws/sidecarInjectorWebhook: disabled
        prometheus.io/port: "8080"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: flagger
        app.kubernetes.io/name: flagger
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/instance: flagger
                  app.kubernetes.io/name: flagger
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - command:
        - ./flagger
        - -log-level=info
        - -mesh-provider=linkerd
        - -metrics-server=http://prometheus-kube-prometheus-prometheus.monitoring:9090
        - -enable-config-tracking=true
        - -slack-user=flagger
        - -enable-leader-election=true
        - -leader-election-namespace=linkerd
        image: ghcr.io/fluxcd/flagger:1.12.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - wget
            - --quiet
            - --tries=1
            - --timeout=4
            - --spider
            - http://localhost:8080/healthz
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: flagger
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - wget
            - --quiet
            - --tries=1
            - --timeout=4
            - --spider
            - http://localhost:8080/healthz
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 10m
            memory: 32Mi
        securityContext:
          readOnlyRootFilesystem: true
          runAsUser: 10001
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        node_pool: preemptible
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: flagger
      serviceAccountName: flagger
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: preemptible
        operator: Equal
        value: "true"
