Argo DB init conflict when deploy workflow-controller with multiple replicas #11177

Closed
3 tasks done
astraw99 opened this issue Jun 4, 2023 · 6 comments · Fixed by #11178, #11569 or #11760
Labels
area/controller (Controller issues, panics) · P3 (Low priority) · type/bug

Comments

@astraw99
Contributor

astraw99 commented Jun 4, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

DB initialization hits a conflict when the workflow-controller is deployed with multiple replicas (leader election is enabled by default). All controller replicas end up in CrashLoopBackOff:

kubectl get po -n argo

NAME                                   READY   STATUS             RESTARTS   AGE
argo-server-849b557f68-9q8wk           1/1     Running            0          10m
argo-server-849b557f68-cjkw2           1/1     Running            0          10m
argo-server-849b557f68-v5vmp           1/1     Running            0          10m
minio-588f94977f-9bd6r                 1/1     Running            0          10m
workflow-controller-7796ff4cbb-87kbj   0/1     CrashLoopBackOff   6          10m
workflow-controller-7796ff4cbb-dfjgb   0/1     CrashLoopBackOff   6          10m
workflow-controller-7796ff4cbb-hwbd4   0/1     CrashLoopBackOff   6          10m

The conflict log is pasted below.

I will raise a PR to fix this by moving the DB init logic so that it runs only after leader election has completed.

Version

v3.4.6

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

No workflow is needed to reproduce this. The workflow-controller pods crash on startup, as shown by `kubectl get po -n argo`:

NAME                                   READY   STATUS             RESTARTS   AGE
workflow-controller-7796ff4cbb-87kbj   0/1     CrashLoopBackOff   6          10m
workflow-controller-7796ff4cbb-dfjgb   0/1     CrashLoopBackOff   6          10m
workflow-controller-7796ff4cbb-hwbd4   0/1     CrashLoopBackOff   6          10m

Logs from the workflow controller

kubectl logs -n argo workflow-controller-7796ff4cbb-87kbj

time="2023-05-26T09:11:26Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2023-05-26T09:11:26Z" level=info msg="cron config" cronSyncPeriod=10s
time="2023-05-26T09:11:26Z" level=info msg="Memoization caches will be garbage-collected if they have not been hit after" gcAfterNotHitDuration=30s
time="2023-05-26T09:11:26.551Z" level=info msg="enabling pprof debug endpoints - do not do this in production"
time="2023-05-26T09:11:26.558Z" level=info msg="Get configmaps 200"
time="2023-05-26T09:11:26.564Z" level=info msg="Configuration:\nartifactRepository:\n  archiveLogs: true\n  s3:\n    accessKeySecret:\n      key: accesskey\n      name: my-minio-cred\n    bucket: my-bucket\n    endpoint: minio.argo:9000\n    insecure: true\n    secretKeySecret:\n      key: secretkey\n      name: my-minio-cred\nexecutor:\n  imagePullPolicy: IfNotPresent\n  name: \"\"\n  resources:\n    requests:\n      cpu: 10m\n      memory: 64Mi\ninitialDelay: 0s\nlinks:\n- name: Workflow Link\n  scope: workflow\n  url: http://logging-facility?namespace=${metadata.namespace}&workflowName=${metadata.name}&startedAt=${status.startedAt}&finishedAt=${status.finishedAt}\n- name: Pod Link\n  scope: pod\n  url: http://logging-facility?namespace=${metadata.namespace}&podName=${metadata.name}&startedAt=${status.startedAt}&finishedAt=${status.finishedAt}\n- name: Pod Logs Link\n  scope: pod-logs\n  url: http://logging-facility?namespace=${metadata.namespace}&podName=${metadata.name}&startedAt=${status.startedAt}&finishedAt=${status.finishedAt}\n- name: Event Source Logs Link\n  scope: event-source-logs\n  url: http://logging-facility?namespace=${metadata.namespace}&podName=${metadata.name}&startedAt=${status.startedAt}&finishedAt=${status.finishedAt}\n- name: Sensor Logs Link\n  scope: sensor-logs\n  url: http://logging-facility?namespace=${metadata.namespace}&podName=${metadata.name}&startedAt=${status.startedAt}&finishedAt=${status.finishedAt}\nmetricsConfig:\n  enabled: true\n  path: /metrics\n  port: 9090\nnamespaceParallelism: 999999999\nnodeEvents: {}\npersistence:\n  archive: true\n  archiveTTL: 2160h0m0s\n  connectionPool:\n    connMaxLifetime: 30s\n    maxIdleConns: 10\n    maxOpenConns: 100\n  mysql:\n    database: argo\n    host: demo-db.demo.io\n    passwordSecret:\n      key: password\n      name: argo-mysql-config\n    port: 3306\n    tableName: argo_workflows\n    userNameSecret:\n      key: username\n      name: argo-mysql-config\n  nodeStatusOffLoad: 
true\npodSpecLogStrategy: {}\nsso:\n  clientId:\n    key: \"\"\n  clientSecret:\n    key: \"\"\n  issuer: \"\"\n  redirectUrl: \"\"\n  sessionExpiry: 0s\ntelemetryConfig: {}\nworkflowDefaults:\n  metadata:\n    annotations:\n      argo: workflows\n    creationTimestamp: null\n  spec:\n    arguments: {}\n    parallelism: 3\n    podGC:\n      strategy: OnWorkflowSuccess\n    ttlStrategy:\n      secondsAfterCompletion: 259200\n  status:\n    finishedAt: null\n    startedAt: null\n"
time="2023-05-26T09:11:26.564Z" level=info msg="Persistence configuration enabled"
time="2023-05-26T09:11:26.566Z" level=info msg="Get secrets 200"
time="2023-05-26T09:11:26.568Z" level=info msg="Get secrets 200"
time="2023-05-26T09:11:26.577Z" level=info msg="Persistence Session created successfully"
time="2023-05-26T09:11:26.586Z" level=info msg="Migrating database schema" clusterName=default dbType=mysql
time="2023-05-26T09:11:26.661Z" level=info msg="applying database change" change="create index argo_workflows_i1 on argo_workflows (clustername,namespace,updatedat)" changeSchemaVersion=55
time="2023-05-26T09:11:26.692Z" level=fatal msg="Failed to update config: Error 1061 (42000): Duplicate key name 'argo_workflows_i1'"

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@terrytangyuan
Member

@astraw99 Could you test the latest image tag?

@astraw99
Contributor Author

astraw99 commented Jul 2, 2023

> @astraw99 Could you test the latest image tag?

Sure, I tested with the latest image tag (master branch, commit 609539d).
The workflow-controller works fine with multiple replicas (tested with 3, 5, 10, and 20). With 10 replicas, for example:

workflow-controller-85bfb69457-94fzl   1/1     Running   0          47s
workflow-controller-85bfb69457-cwfmk   1/1     Running   0          47s
workflow-controller-85bfb69457-dj2z7   1/1     Running   0          47s
workflow-controller-85bfb69457-g2w5r   1/1     Running   0          47s
workflow-controller-85bfb69457-gl5mg   1/1     Running   0          47s
workflow-controller-85bfb69457-p6b7c   1/1     Running   0          47s
workflow-controller-85bfb69457-sw6z9   1/1     Running   0          47s
workflow-controller-85bfb69457-tvvl2   1/1     Running   0          47s
workflow-controller-85bfb69457-vkz76   1/1     Running   0          47s
workflow-controller-85bfb69457-z28cf   1/1     Running   0          47s

The Argo DB initializes without conflict:

show tables;
---------------------------------
argo_archived_workflows
argo_archived_workflows_labels
argo_workflows
schema_history

@terrytangyuan
Member

Thank you for confirming!

@terrytangyuan
Member

This is re-opened. See #11553

@terrytangyuan terrytangyuan reopened this Aug 10, 2023
@terrytangyuan
Member

This needs to be re-worked to focus on the DB init issue.

@astraw99
Contributor Author

Got it, I will follow this issue and try to fix it.

terrytangyuan pushed a commit that referenced this issue Sep 5, 2023
terrytangyuan pushed a commit that referenced this issue Sep 5, 2023
qudtjs0753 pushed a commit to qudtjs0753/argo-workflows that referenced this issue Sep 6, 2023
@agilgur5 agilgur5 added the area/controller Controller issues, panics label May 2, 2024
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this issue May 9, 2024
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this issue May 9, 2024