
Provide failFast flag, allow DAG to run all branches (either success or failure) #1442

Closed
xianlubird opened this issue Jun 20, 2019 · 2 comments


@xianlubird
Member

xianlubird commented Jun 20, 2019

I have a workflow YAML like this:

dag test case
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-primay-branch-
spec:
  entrypoint: statis
  templates:
  - name: a
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
  - name: b
    retryStrategy:
      limit: 2
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 30; echo haha"]
  - name: c
    retryStrategy:
      limit: 3
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo intentional failure; exit 2"]
  - name: d
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
  - name: statis
    dag:
      tasks:
      - name: A
        template: a
      - name: B
        dependencies: [A]
        template: b
      - name: C
        dependencies: [A]
        template: c
      - name: D
        dependencies: [B]
        template: d
      - name: E
        dependencies: [D]
        template: d

The dependencies are as follows:

step1:      A
          /     \
step2:  B       C
         |
step3:  D
         |
step4:  E

When node C fails, the workflow stops after B and does not proceed further. The output looks like this:

Name:                dag-primay-branch-b2l5l
Namespace:           default
ServiceAccount:      default
Status:              Failed
Created:             Thu Jun 20 19:21:39 +0800 (6 minutes ago)
Started:             Thu Jun 20 19:21:39 +0800 (6 minutes ago)
Finished:            Thu Jun 20 19:22:17 +0800 (6 minutes ago)
Duration:            38 seconds

STEP                        PODNAME                             DURATION  MESSAGE
 ✖ dag-primay-branch-b2l5l
 ├-✔ A                      dag-primay-branch-b2l5l-430930318   3s
 ├-✔ B(0)                   dag-primay-branch-b2l5l-1687888246  33s
 └-✖ C                                                                    No more retries left
   ├-✖ C(0)                 dag-primay-branch-b2l5l-2802686203  5s        failed with exit code 2
   ├-✖ C(1)                 dag-primay-branch-b2l5l-2333059966  4s        failed with exit code 2
   ├-✖ C(2)                 dag-primay-branch-b2l5l-3004311821  4s        failed with exit code 2
   └-✖ C(3)                 dag-primay-branch-b2l5l-118811616   3s        failed with exit code 2

But if I remove the retryStrategy from template b, so the YAML looks like this:

new test case
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-primay-branch-
spec:
  entrypoint: statis
  templates:
  - name: a
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
  - name: b
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 30; echo haha"]
  - name: c
    retryStrategy:
      limit: 3
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["echo intentional failure; exit 2"]
  - name: d
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
  - name: statis
    dag:
      tasks:
      - name: A
        template: a
      - name: B
        dependencies: [A]
        template: b
      - name: C
        dependencies: [A]
        template: c
      - name: D
        dependencies: [B]
        template: d
      - name: E
        dependencies: [D]
        template: d

then steps D and E continue even though C fails. The output looks like this:

Name:                dag-primay-branch-ksbcv
Namespace:           default
ServiceAccount:      default
Status:              Failed
Created:             Thu Jun 20 19:30:08 +0800 (48 seconds ago)
Started:             Thu Jun 20 19:30:08 +0800 (48 seconds ago)
Finished:            Thu Jun 20 19:30:56 +0800 (now)
Duration:            48 seconds

STEP                        PODNAME                             DURATION  MESSAGE
 ✖ dag-primay-branch-ksbcv
 ├-✔ A                      dag-primay-branch-ksbcv-4005668674  3s
 ├-✔ B                      dag-primay-branch-ksbcv-3988891055  35s
 ├-✖ C                                                                    No more retries left
 | ├-✖ C(0)                 dag-primay-branch-ksbcv-1785421399  3s        failed with exit code 2
 | ├-✖ C(1)                 dag-primay-branch-ksbcv-2389562778  3s        failed with exit code 2
 | ├-✖ C(2)                 dag-primay-branch-ksbcv-1718605113  3s        failed with exit code 2
 | └-✖ C(3)                 dag-primay-branch-ksbcv-2322746492  4s        failed with exit code 2
 ├-✔ D                      dag-primay-branch-ksbcv-3955335817  3s
 └-✔ E                      dag-primay-branch-ksbcv-3938558198  4s

This seems related to having a retryStrategy on node B, the parent of D.

Could you help me solve this problem? Thank you @sarabala1979 @jessesuen

@jessesuen
Member

jessesuen commented Jun 25, 2019

@xianlubird this is actually the intended behavior. Essentially, the DAG logic has a built-in "fail fast" feature: it stops scheduling new steps as soon as it detects that one of the DAG nodes has failed, then waits until all running DAG nodes are completed before failing the DAG itself.

The difference in behavior you see with the retryStrategy in your workflow is timing related, more than it is caused by the retryStrategy itself.

I believe what you are trying to achieve is to let a DAG run all of its branches to completion (either success or failure), regardless of failed outcomes in any branch. For this, we need a new feature that disables the "fail fast" behavior.

It needs to be a new flag, because the fail-fast behavior of DAGs is desirable for many use cases and we do not want to break backwards compatibility.
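
For illustration, a minimal sketch of what such a flag might look like on the DAG template in the workflow above (the field name `failFast` is the proposal here, not an implemented feature at the time of this comment):

```yaml
  - name: statis
    dag:
      failFast: false   # proposed flag: keep scheduling remaining branches after C fails
      tasks:
      - name: A
        template: a
      - name: B
        dependencies: [A]
        template: b
      - name: C
        dependencies: [A]
        template: c
      - name: D
        dependencies: [B]
        template: d
      - name: E
        dependencies: [D]
        template: d
```

With `failFast: false`, D and E would run to completion even though C exhausts its retries, and the workflow as a whole would still be marked Failed at the end.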

@xianlubird xianlubird changed the title BUG: dag will missing some nodes when another branch node fails. New Feature: provide failFast flag, allow a DAG to run all branches of the DAG (either success or failure) Jun 25, 2019
@agilgur5 agilgur5 changed the title New Feature: provide failFast flag, allow a DAG to run all branches of the DAG (either success or failure) Provide failFast flag, allow a DAG to run all branches of the DAG (either success or failure) Oct 28, 2024
@agilgur5 agilgur5 changed the title Provide failFast flag, allow a DAG to run all branches of the DAG (either success or failure) Provide failFast flag, allow DAG to run all branches (either success or failure) Oct 28, 2024