Design: Failure Strategy for TaskRuns in a PipelineRun #1684

Open
dibyom opened this issue Dec 4, 2019 · 56 comments

@dibyom
Member

dibyom commented Dec 4, 2019

The goal is to come up with a design to handle failing task runs in a pipelinerun. Today, we simply fail the entire pipelinerun if a single taskrun fails.

Current Status

Summary in this comment: #1684 (comment)

Ideas

Here are a couple of ideas from @sbwsg and me:

  1. Introduce an errorStrategy field in PipelineTasks similar to the idea in Allow steps to run regardless of previous step errors #1573
  2. The errorStrategy could be under the runAfter field.
  3. To start off, we could have two error strategies : FailPipeline which is the default for today, and ContinuePipeline which will continue running the whole pipeline
  4. Later on, we could add branch-based error strategies, e.g. fail on one branch of the graph but continue running the rest of the pipeline
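
To make the two starting strategies concrete, here is a minimal Python sketch of how they might behave for tasks that run one after another. It is only an illustration of the idea above, not Tekton code: `run_sequential`, the task dicts, and the status strings are all hypothetical, and only the names FailPipeline and ContinuePipeline come from the proposal.

```python
from typing import Dict, List

# Strategy names from the proposal above. Everything else is hypothetical.
FAIL_PIPELINE = "FailPipeline"
CONTINUE_PIPELINE = "ContinuePipeline"


def run_sequential(tasks: List[dict], outcomes: Dict[str, bool]) -> Dict[str, str]:
    """Simulate tasks that run strictly in list order.

    `outcomes[name]` says whether the task would succeed if it ran.
    A task without an errorStrategy defaults to FailPipeline,
    which matches the behaviour described as "today".
    """
    status: Dict[str, str] = {}
    halted = False
    for task in tasks:
        name = task["name"]
        if halted:
            # An earlier FailPipeline task failed, so nothing else starts.
            status[name] = "not-run"
        elif outcomes[name]:
            status[name] = "succeeded"
        else:
            status[name] = "failed"
            if task.get("errorStrategy", FAIL_PIPELINE) == FAIL_PIPELINE:
                halted = True
    return status
```

With `integration` marked ContinuePipeline, a failing integration task still lets a later `cleanup` task run, while a failing task with the default strategy stops everything after it.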

Additional Info

@sbwsg has some strawperson YAMLs:
RunNextTasks for an integration test cleanup scenario
FailPipeline(default) for a unit test failing before a deploy task

Use Cases

Related Issues

The Epic #1376 has all the related issues

@dibyom dibyom added area/api Indicates an issue or PR that deals with the API. design This task is about creating and discussing a design labels Dec 4, 2019
@vdemeester vdemeester added this to the Pipelines 1.0/beta 🐱 milestone Dec 4, 2019
@bobcatfish
Collaborator

Thanks for getting this started!!!

    - name: integration
      taskRef:
        name: run-integration-tests
      runAfter: uts
      errorStrategy: RunNextTasks # allow cleanup to occur
    - name: cleanup
      taskRef:
        name: cleanup-integration-test-junk
      runAfter: integration

A concern about this is for these two use cases:

  • Always run a step/task at the end e.g. to Report results
  • The example above: cleanup after an integration test

It seems like it doesn't work if there is one more Task in the Pipeline, e.g. (totally contrived?) but something like:

    - name: integration
      runAfter: uts
      errorStrategy: RunNextTasks # allow cleanup to occur
    - name: integration2 # pretend there was another set of tests?
      runAfter: uts
      errorStrategy: RunNextTasks # allow cleanup to occur
    - name: cleanup
      taskRef:
        name: cleanup-integration-test-junk
      runAfter: integration

RunNextTasks for the above set of Tasks means that if integration fails, integration2 will also run even tho what we really want is to jump straight to cleanup.

A couple different ideas:

  • An explicit finally clause you can put on a Task which forces it to run at the end (naive??)
  • An errorStrategy that lets you jump straight to a branch in the pipeline? e.g. like "resume from cleanup" or something
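
A rough Python sketch of the first idea, the explicit finally clause, might look like this. It assumes the regular tasks run in order and stop at the first failure, and that the finally tasks run afterwards no matter what. `run_with_finally` and the status strings are made up for illustration.

```python
from typing import Dict, List


def run_with_finally(tasks: List[str], finally_tasks: List[str],
                     outcomes: Dict[str, bool]) -> Dict[str, str]:
    """Run `tasks` in order, then always run `finally_tasks`.

    `outcomes[name]` says whether the task would succeed if it ran.
    """
    status: Dict[str, str] = {}
    failed = False
    for name in tasks:
        if failed:
            # After the first failure the remaining regular tasks are skipped.
            status[name] = "skipped"
        elif outcomes[name]:
            status[name] = "succeeded"
        else:
            status[name] = "failed"
            failed = True
    # The finally tasks run regardless of what happened above.
    for name in finally_tasks:
        status[name] = "succeeded" if outcomes[name] else "failed"
    return status
```

In this model a failing `integration` skips `integration2` and still reaches `cleanup`, which is the "jump straight to cleanup" behaviour asked for above.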

@bobcatfish
Collaborator

(I do think an errorStrategy field makes a lot of sense! I think there are lots of potential kinds of strategies we might want to express - e.g. failing the entire Pipeline immediately vs. allowing any other Tasks in flight to finish)

@dibyom
Member Author

dibyom commented Dec 4, 2019

RunNextTasks for the above set of Tasks means that if integration fails, integration2 will also run even tho what we really want is to jump straight to cleanup.

One workaround might be that integration2 is a conditional task that only runs if the previous step is successful? That being said, if this is a common pattern, I think a separate errorStrategy for jumping to another task might be a simpler way to do this.

@pritidesai
Member

pritidesai commented Dec 5, 2019

Defining errorStrategy has two sides to it: (1) dictating the behavior of the next tasks/steps in the queue, e.g. RunNextTasks (from the example above), and (2) dictating the task's own behavior, e.g. IgnorePriorTaskErrors (based on @sbwsg's step PR).

I am biased towards (2).

Defining these error strategies based on my understanding so far:

  • SkipOnPriorStepErrors: Within a task, halt the execution of a step if a prior step has failed. This strategy is only scoped to a Task.
  • IgnorePriorStepErrors: Within a task, continue execution of a step even if a prior step has failed. Just like SkipOnPriorStepErrors, this strategy is only scoped to a Task.
  • SkipOnPriorTaskErrors: Within a pipeline, halt the execution of a task if a prior task has failed. This gets a little tricky with conditions: here the task is marked as failed even when the associated condition fails, so the task is never executed. The same strategy can be applied to a pipeline with two groups of tasks (A, B, and C) and (X, Y, and Z). For example, task A has a conditional execution, and tasks B and C should be executed only if task A succeeds. In this scenario, Task A would refer to a condition using conditionRef without any errorStrategy, and Tasks B and C would have errorStrategy set to SkipOnPriorTaskErrors. When Task A executes successfully, Task B is next in the queue, and Task C is executed depending on Task B's result. When Task A fails, Task B is skipped because its errorStrategy says to skip execution if the prior Task (A) failed, and Task C is skipped too since Task B was never executed (pipeline marked Task B as failure?)
  • IgnorePriorTaskErrors: Within a pipeline, continue the execution of a task even if a prior task has failed.

Also, we have to be explicit and define what Next and Prior mean here (for both Pipeline and Task): does Next mean all following tasks/steps or just the next one, and does Prior mean all previous tasks/steps or just the one immediately before?
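
As a way to pin down one possible reading, here is a small Python sketch of the two task-level strategies. It assumes "prior" means only the tasks named in runAfter, that a skipped parent counts as a prior error, and that SkipOnPriorTaskErrors is the default. None of this is settled in the thread; `evaluate` and the task dicts are hypothetical.

```python
from typing import Dict, List

SKIP = "SkipOnPriorTaskErrors"
IGNORE = "IgnorePriorTaskErrors"


def evaluate(tasks: List[dict], outcomes: Dict[str, bool]) -> Dict[str, str]:
    """Decide, task by task, whether each task runs.

    Assumptions (not from the thread): `tasks` is in dependency order,
    "prior" means the direct runAfter parents, and a parent that was
    skipped is treated the same as a parent that failed.
    """
    status: Dict[str, str] = {}
    for task in tasks:
        name = task["name"]
        parents = task.get("runAfter", [])
        prior_error = any(status[p] != "succeeded" for p in parents)
        if prior_error and task.get("errorStrategy", SKIP) == SKIP:
            status[name] = "skipped"
        else:
            status[name] = "succeeded" if outcomes[name] else "failed"
    return status
```

Under these assumptions, an A -> B -> C chain with A failing skips both B and C, unless C declares IgnorePriorTaskErrors, in which case C still runs.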

I have collected my thoughts here based on talking to @dibyom on slack, step PR from @sbwsg, comments from issues themselves and working group recordings.

@dibyom
Member Author

dibyom commented Dec 10, 2019

@pritidesai Thanks for writing this up. I think your examples for error strategies are for tasks defining their own behavior and not for the subsequent tasks. With that approach, how do we model the scenario that @bobcatfish mentioned above, i.e. we have 3 tasks running sequentially A -> B -> C in the happy case, but when A fails we want to jump to C?

Also, another idea @sbwsg had was adding the errorStrategies in the runAfter or from fields:

spec:
  tasks:
    # ... other tasks ...
    - name: cleanup
      taskRef:
        name: cleanup-integration-test-junk
      runAfter:
        task: integration-tests
        errorStrategy: Continue # or Skip / Fail

@pritidesai
Member

how about modeling the scenario that @bobcatfish mentioned above with:

    - name: integration
      runAfter: uts
    - name: integration2 # pretend there was another set of tests?
      runAfter: uts
      errorStrategy: SkipOnPriorTaskErrors # do not execute if previous integration tests fail
    - name: cleanup
      taskRef:
        name: cleanup-integration-test-junk
      runAfter: integration
      errorStrategy: IgnorePriorTaskErrors

woo, I like errorStrategies in runAfter and from, let me give it a thought 🤔...

@ghost

ghost commented Dec 11, 2019

Another alternative to consider: Go's defer and recover keywords model quite similar behaviour to what we're discussing here. I can imagine DeferredPipelineTask and RecoveryPipelineTask types that perform work regardless of prior outcome (Deferred) and in response to a task's failure (Recovery). Examples:

DeferredPipelineTask

# In this example, a "deferred" task is used to clean up environment after integration tests.
# Deferred tasks run regardless of outcome in prior tasks
spec:
  tasks:
    - name: integration-tests # can fail!
      taskRef:
        name: run-some-tests
    - name: cleanup-integration-environment
      deferred: true # will run regardless of failure in integration-tests. Will not run if integration-tests is never run (i.e. because a task prior to integration-tests failed)
      runAfter: integration-tests
      taskRef:
        name: delete-integration-namespaces

RecoveryPipelineTask

# In this example, a "recovery" task is used to handle errors during deployment to staging.
# Recovery tasks only execute if the task they runAfter fails
spec:
  tasks:
    - name: deploy-to-staging
      taskRef:
        name: deploy-to-k8s
    - name: rollback-staging
      recovery: true # will run only if deploy-to-staging fails
      runAfter: deploy-to-staging
      taskRef:
        name: rollback-deployment

Two further tweaks to this idea: First, a DeferredPipelineTask that doesn't declare a runAfter will always execute at the end of the pipeline. This is the "finally" clause equivalent. Second, a RecoveryPipelineTask with no runAfter will handle any error case in the pipeline. This is the equivalent of a giant catch { } block wrapped around your pipeline. We could even pass the error to the RecoveryPipelineTask as a PipelineResource or something to help it with reporting.

Also worth keeping in mind that while a DeferredPipelineTask or RecoveryPipelineTask needs to be explicitly marked as such, I think they would also be allowed to be "roots" of their own trees. In other words another task could be runAfter a DeferredPipelineTask but does not need to include deferred: true. Similarly for recovery, a task could be runAfter a RecoveryPipelineTask but does not need to include recovery: true. In effect this allows entire branches of the execution DAG to be run only in the event of failure or for the purposes of cleanup etc.

So I think this would cover the following scenarios:

  1. Execute work after a specific task in the pipeline succeeds OR fails
    • DeferredPipelineTask with runAfter
    • Use cases: cleanup integration environment, upload unit test results
  2. Recover from failed tasks by jumping to a different branch
    • RecoveryPipelineTask with runAfter
    • Use case: roll back bad deployment
  3. Perform work at the end of a pipeline regardless of outcome
    • DeferredPipelineTask without runAfter
    • Use case: any naive finally scenario ("naive" here means it doesn't need specific knowledge of what ran or didn't run)
  4. Handle any error in the pipeline that occurs with a fallback task
    • RecoveryPipelineTask without runAfter
    • Use case: any naive catch { } scenario (example i can think of: send a message to slack that a pipeline has failed)

The deferred and recovery keys would need to be either-or in the yaml. I don't think you can support both recovery: true and deferred: true on the same task.

What I most like about this approach is that:

  1. it doesn't mess with runAfter, so avoids some possibly tricky schema changes in the yaml (particularly since from behaviour may also need to be modified to keep it in line with runAfter)
  2. it maintains the property that the "edge" in the graph is defined (with runAfter/from) in the same PipelineTask that the error handling or deferral behaviour is described
  3. it provides flexible catch-all handling to satisfy any jump / finally / catch requirements.
  4. It doesn't rely on tricky-to-remember constants like "IgnorePriorErrors".
  5. Finally (pun intended) what I like about this is that it drops the word "errorStrategy" completely. I think there are very legitimate use cases for these kinds of handlers that don't involve errors or failures or anything negative at all. It's just branching the DAG in response to specific outcomes of the graph nodes.
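
The four scenarios above can be sketched in Python roughly as follows. This is a toy model of the proposal, not Tekton's scheduler: `simulate`, the task dicts, and the status strings are hypothetical, tasks are assumed to be listed in dependency order, and a recovery task with several parents is assumed to run if any of them failed.

```python
from typing import Dict, List

FINISHED = ("succeeded", "failed")


def _execute(name: str, outcomes: Dict[str, bool]) -> str:
    return "succeeded" if outcomes[name] else "failed"


def simulate(tasks: List[dict], outcomes: Dict[str, bool]) -> Dict[str, str]:
    """Toy model of deferred / recovery PipelineTasks."""
    status: Dict[str, str] = {}
    trailing: List[dict] = []  # deferred/recovery tasks without runAfter
    for task in tasks:
        if task.get("deferred") and task.get("recovery"):
            raise ValueError("deferred and recovery are either-or")
        parents = task.get("runAfter", [])
        if (task.get("deferred") or task.get("recovery")) and not parents:
            trailing.append(task)  # handled at the end of the pipeline
            continue
        parent_states = [status[p] for p in parents]
        if task.get("deferred"):
            # Runs whatever the outcome, but only if the parents actually ran.
            should_run = all(s in FINISHED for s in parent_states)
        elif task.get("recovery"):
            # Runs only in response to a parent's failure.
            should_run = any(s == "failed" for s in parent_states)
        else:
            should_run = all(s == "succeeded" for s in parent_states)
        name = task["name"]
        status[name] = _execute(name, outcomes) if should_run else "skipped"
    any_failure = "failed" in status.values()
    for task in trailing:
        name = task["name"]
        # Deferred without runAfter is the "finally" case,
        # recovery without runAfter is the catch-all case.
        if task.get("deferred") or any_failure:
            status[name] = _execute(name, outcomes)
        else:
            status[name] = "skipped"
    return status
```

One design consequence this makes visible: a deferred task with a runAfter is skipped when its parent never ran, while a deferred task without runAfter always runs at the end.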

@ghost

ghost commented Dec 13, 2019

Another phrasing of the above approach that @dibyom and I discussed would be to use keywords for defer / recover / skip (the default):

spec:
  tasks:
    - name: deploy-to-staging
      taskRef:
        name: deploy-to-k8s
    - name: rollback-staging
      runAfter: deploy-to-staging
      strategy: Recover # or Defer or Skip
      taskRef:
        name: rollback-deployment

This ^ says that rollback-staging PipelineTask will only execute if deploy-to-staging fails (it "Recovers" from deploy-to-staging's failure).

Having thought about it for a couple days I'm still pretty sure we could describe all of the use cases we've talked about so far with just these three strategies.

@pritidesai
Member

pritidesai commented Dec 17, 2019

Another alternative to consider: Go's defer and recover keywords model quite similar behaviour to what we're discussing here. I can imagine DeferredPipelineTask and RecoveryPipelineTask types that perform work regardless of prior outcome (Deferred) and in response to a task's failure (Recovery). Examples:

DeferredPipelineTask

# In this example, a "deferred" task is used to clean up environment after integration tests.
# Deferred tasks run regardless of outcome in prior tasks
spec:
  tasks:
    - name: integration-tests # can fail!
      taskRef:
        name: run-some-tests
    - name: cleanup-integration-environment
      deferred: true # will run regardless of failure in integration-tests. Will not run if integration-tests is never run (i.e. because a task prior to integration-tests failed)
      runAfter: integration-tests
      taskRef:
        name: delete-integration-namespaces

thanks @sbwsg, defer, recover, and skip sounds great but at the same time will need a little bit of clarification, which can be provided with docs and examples.

Also, DeferredPipelineTask could be interpreted as always executed, i.e. cleanup-integration-environment always runs irrespective of the outcome of integration-tests or of any previous tasks. I am trying to justify "will not run because integration-tests never ran": in Go, a defer statement pushes a function call onto a list, and that list of calls is executed after the surrounding function returns. How would this impact tasks defined after integration-tests?

@pritidesai
Member

pritidesai commented Dec 17, 2019

Overall I like the idea of defining strategy with one of defer, recover, and skip.

@ghost

ghost commented Dec 17, 2019

defer, recover, and skip sounds great but at the same time will need a little bit of clarification

I agree, the keywords don't make much sense in isolation. How about "AlwaysRun" (defer), "RunOnFail" (recover), and "RunOnSuccess" (Tekton's current behaviour)?

I am trying to justify "will not run because integration-tests never ran": in Go, a defer statement pushes a function call onto a list, and that list of calls is executed after the surrounding function returns. How would this impact tasks defined after integration-tests?

I think the analogy here with Go's defer breaks down. I somewhat regret drawing the comparison. In my mind the strategy only describes a single relationship between a task and its "parents" (those it declares with "runAfter" or "from"). In other words, given the following tasks:

- name: Task A
- name: Task B
  runAfter:
    - Task A
  strategy: RunOnFail # Task B only executes if Task A errors out
- name: Task C
  runAfter:
    - Task B
  strategy: AlwaysRun

I expect the following behaviour:

  1. Task A runs
  2. Task B will only run if Task A fails.
  3. Task C will only run if Task B runs.
    • Because Task C declares "AlwaysRun" with "runAfter: Task B".
    • If Task B never ran (Task A succeeded and B is only RunOnFail) then Task C never runs.

So I think that's another reason why using the go keywords probably doesn't make sense after all - they don't map perfectly on to Tekton's meanings. But AlwaysRun / RunOnFail / RunOnSuccess are a bit clearer maybe, especially when we consider them paired with runAfter.
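
The expected behaviour listed above can be written down as a small Python sketch, where each strategy is just the set of parent states that allow a task to run. `resolve` and the status strings are hypothetical, tasks are assumed to be in dependency order, and the "no runAfter means run at the end" case is deliberately left out.

```python
from typing import Dict, List

# Each strategy maps to the parent states that allow the task to run.
ALLOWED_PARENT_STATES = {
    "RunOnSuccess": {"succeeded"},          # Tekton's current behaviour
    "RunOnFail": {"failed"},
    "AlwaysRun": {"succeeded", "failed"},   # parent must have actually run
}


def resolve(tasks: List[dict], outcomes: Dict[str, bool]) -> Dict[str, str]:
    """Apply each task's strategy to the states of its runAfter parents."""
    status: Dict[str, str] = {}
    for task in tasks:
        name = task["name"]
        allowed = ALLOWED_PARENT_STATES[task.get("strategy", "RunOnSuccess")]
        parents = task.get("runAfter", [])
        if all(status[p] in allowed for p in parents):
            status[name] = "succeeded" if outcomes[name] else "failed"
        else:
            status[name] = "skipped"
    return status
```

Because a skipped parent is in none of the allowed sets, Task C never runs when Task B was skipped, which matches step 3 above.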

@ghost

ghost commented Dec 17, 2019

Hrm. AlwaysRun isn't that great for the Finally case - it doesn't make as much sense. Deferred may be better after all. Here's a comparison:

AlwaysRun

# This pipeline pings a URL when the pipeline finishes.
# This ping happens regardless of the pipeline's outcome.
apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: test-pipeline
spec:
  tasks:
    - name: ping-url-on-complete
      taskRef:
        name: send-ping
      strategy: AlwaysRun # AlwaysRun without a runAfter. Executes at end of pipeline.
    - name: uts
      taskRef:
        name: run-unit-tests
    - name: integration
      taskRef:
        name: run-integration-tests
      runAfter: uts

Deferred

# This pipeline pings a URL when the pipeline finishes.
# This ping happens regardless of the pipeline's outcome.
apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: test-pipeline
spec:
  tasks:
    - name: ping-url-on-complete
      taskRef:
        name: send-ping
      strategy: Deferred # Deferred without a runAfter. Executes at end of pipeline.
    - name: uts
      taskRef:
        name: run-unit-tests
    - name: integration
      taskRef:
        name: run-integration-tests
      runAfter: uts

@pritidesai
Member

Yes I agree, AlwaysRun is misleading for the finally use case. How about introducing one more strategy called defer or finally 🤔 in addition to AlwaysRun, RunOnFail, and RunOnSuccess?

@pritidesai
Member

I kind of ran these strategies against the pipeline example you have and it looks something like this:

apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: ignore-errors-pipeline
spec:
  tasks:
    - name: uts
      taskRef:
        name: run-unit-tests
    - name: integration
      taskRef:
        name: run-integration-tests
      strategy: AlwaysRun # irrespective of unit test results, run integration tests
    - name: cleanup
      taskRef:
        name: cleanup-integration-test-junk
      runAfter: integration
      strategy: AlwaysRun # since cleanup is grouped with integration test, AlwaysRun strategy would fit here otherwise we would have to go for defer.

@pritidesai
Member

apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: ignore-errors-pipeline
spec:
  tasks:
    - name: uts
      taskRef:
        name: run-unit-tests
    - name: deploy
      taskRef:
        name: deploy-staging
      runAfter: uts
      strategy: RunOnSuccess
    - name: integration
      taskRef:
        name: run-integration-tests
      runAfter: uts
      strategy: RunOnSuccess # Run if unit tests succeed
    - name: cleanup
      taskRef:
        name: cleanup-integration-test-junk
      runAfter: integration
      strategy: AlwaysRun

@pritidesai
Member

pritidesai commented Dec 18, 2019

Adding one more use case: building a JavaScript application and/or a Java application depending on the runtime of the application.

Using strategy here to make sure the task and pipelinerun don't report failure if the condition fails and the chain of tasks doesn't get executed.

apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: pipeline-for-various-runtimes
spec:
  tasks:
# build application if the source code is written in NodeJS
# Run tasks in order (1) install dependencies (2) build zip file and (3) build an image
    - name: install-npm-packages
      taskRef:
        name: task-install-npm-packages
      conditions:
        - conditionRef: is-nodejs-runtime
    - name: build-archive
      taskRef:
        name: task-build-archive
      runAfter: install-npm-packages
      strategy: RunOnSuccess  
    - name: build-nodejs-app-image
      taskRef:
        name: build-image
      runAfter: build-archive  
      strategy: RunOnSuccess
# build application if the source code is written in Java
# Run tasks in order (1) Create Jar with Maven (2) Build runtime with Maven (3) Embed function into runtime (4) Build an image
    - name: create-jar-with-maven
      taskRef:
        name: task-create-jar-with-maven
      conditions:
        - conditionRef: is-java-runtime
    - name: build-runtime-with-gradle
      taskRef:
        name: task-build-runtime-with-gradle
      runAfter: create-jar-with-maven
      strategy: RunOnSuccess
    - name: finalize-runtime-with-function
      taskRef:
        name: task-finalize-runtime-with-function
      runAfter: build-runtime-with-gradle
      strategy: RunOnSuccess
    - name: build-java-app-image
      taskRef:
        name: build-image
      runAfter: finalize-runtime-with-function 
      strategy: RunOnSuccess  

@bigkevmcd
Member

bigkevmcd commented Jan 6, 2020

Another slightly different case:

For things like updating GitHub status notifications it would be nice if we could do something like the following...admittedly this is a bit repetitive, but passing the "success" or "failure" of a task might work with the "recover" strategy mentioned earlier, which would mean that after each task, somehow it'd use the success/failure of the previous task to update the GitHub status appropriately.

Updating these kinds of statuses would be really useful if you want your pipeline to determine whether or not a commit can be merged (if you're not familiar with these, you can require specific contexts to be successful before a PR can be merged).

This also adds a runAfter pipeline-scoped taskRef, which could do the cleanup in a "Go defer" way, i.e. always after the pipeline has ended, irrespective of what caused it to end.

The example below would trigger two parallel executions (lint and tests), which would report in their status to GitHub.

apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: pullrequest-pipeline
spec:
  runAfter:
    taskRef: cleanup-post-pullrequest
  tasks:
    - name: start-github-ci-status
      taskRef:
        name: update-github-status
      params:
      - name: STATUS
        value: pending
      - name: CONTEXT
        value: ci-tests
      - name: COMMIT_SHA
        value: $(inputs.params.commit_sha)
    - name: run-tests
      taskRef:
        name: golang-test
      errorStrategy:
        taskRef: update-commit-status
        params:
        - name: STATUS
          value: failed
        - name: CONTEXT
          value: ci-tests
        - name: COMMIT_SHA
          value: $(inputs.params.commit_sha)
    - name: mark-github-ci-status-success
      runAfter:
        - run-tests
      taskRef:
        name: update-github-status
      params:
      - name: STATUS
        value: success
      - name: CONTEXT
        value: ci-tests
      - name: COMMIT_SHA
        value: $(inputs.params.commit_sha)
    # repeat pending for ci-lint context
    - name: run-lint
      taskRef:
        name: golangci-lint
      errorStrategy:
        taskRef: update-commit-status
        params:
        - name: STATUS
          value: failed
        - name: CONTEXT
          value: ci-lint
        - name: COMMIT_SHA
          value: $(inputs.params.commit_sha)
    # repeat success for ci-lint context

@pierretasci

One specific use case that I don't think has been explicitly mentioned above is in a fan-in/out scenario.

For example, if my pipeline is

apiVersion: tekton.dev/v1alpha1
kind: Pipeline
metadata:
  name: sharded-tests
spec:
  tasks:
    - name: pre-work
      taskRef:
        name: pre-work-step
    - name: run-tests-shard-1
      taskRef:
        name: golang-test
      params:
        - name: SHARD_SPEC
          value: "1"
      runAfter: ["pre-work"]
    - name: run-tests-shard-2
      taskRef:
        name: golang-test
      params:
        - name: SHARD_SPEC
          value: "2"
      runAfter: ["pre-work"]
    - name: upload-test-results
      taskRef:
        name: upload-test-results-step
      runAfter: ["run-tests-shard-1", "run-tests-shard-2"]

Here, I would want to always run the upload-test-results task regardless of whether 0, 1, or both of the tasks preceding it failed.

To me, this reads a lot like conditional execution but more like, conditional failure. Perhaps, this could be served as an extension to the conditions that already exist. If you wanted to "always execute B after A" your condition could simply always return true to override the default behavior of "execute B after A if A is successful"
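
A small Python sketch of that idea: the fan-in task waits for all of its parents to finish, and a pluggable condition then decides whether it runs. The default condition keeps the "only if everything succeeded" behaviour, and an always-true condition overrides it. `should_run`, `all_succeeded`, and `always` are hypothetical names, not existing Tekton conditions.

```python
from typing import Callable, Dict

Condition = Callable[[Dict[str, str]], bool]


def all_succeeded(parent_statuses: Dict[str, str]) -> bool:
    """Default: execute B after A only if A was successful."""
    return all(s == "succeeded" for s in parent_statuses.values())


def always(parent_statuses: Dict[str, str]) -> bool:
    """Override: execute regardless of how many parents failed."""
    return True


def should_run(parent_statuses: Dict[str, str],
               condition: Condition = all_succeeded) -> bool:
    """A fan-in task waits for every parent to finish, then asks the condition."""
    finished = all(s in ("succeeded", "failed") for s in parent_statuses.values())
    return finished and condition(parent_statuses)
```

For the sharded test pipeline, `upload-test-results` would use the `always` condition, so it runs whether 0, 1, or both shards failed, but still not before both have finished.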

@dibyom
Member Author

dibyom commented Jan 13, 2020

@pritidesai @bigkevmcd @pierretasci Thanks a lot for adding such detailed use cases! Very very helpful 🙏

  • @pritidesai For your use case -- the current behavior for conditionals is that if a task is skipped, its dependents (identified using the runAfter and from fields) are automatically skipped. The overall pipelinerun status will be determined from the status of the non-skipped tasks. And the default and only strategy today is RunOnSuccess.
    Though I guess there could be strategies such as RunOnSuccessOrSkip or RunAlways which can be combined with conditionals for more complex pipelines.

  • @bigkevmcd Updating status is definitely a very important use case:

    • top level runAfter - this is the pipeline level finally use case. It seems like we'd have to add something like this. The alternative would be to have one task that has runAfters set so that it runs after all other tasks and a strategy set to RunAlways. This can be unwieldy since anytime you add a new Task to the pipeline, you'd have to manually make sure that the task is still that last thing that executes.

    • errorStrategy containing a taskRef - this is interesting! And in some ways more descriptive than adding a generic task with a runAfter and an errorStrategy: RunOnFailure. Are there other benefits? One thing I like about keeping the taskRefs separate is that then we can have multiple tasks that can run/be chained together (e.g. you can have both a cleanup-test-env task as well as an update-github-task that run when the test fails)

    • On passing status to tasks -- we had a proposal in Expose pipeline run metadata to Tasks and Conditions #1020 though the current way of doing so is to pass in the pipelineRun name and then using kubectl within the task to fetch the status. (I think @afrittoli might also be doing something here re: Notifications design work)

  • @pierretasci Sounds like the RunAlways strategy is what you'd need for the upload-test-results-step in your example. I do like the idea of using conditionals as sort of the extension mechanism for more complicated strategies -- the basic strategies such as RunAlways, RunOnSuccess/Failure/Skip etc. are built-in while a user can use those plus a conditional to describe complex strategies (e.g. a strategy of RunAlways plus a conditional for if two of the three tasks failed or whatever)

@dibyom
Member Author

dibyom commented Jan 14, 2020

One idea - instead of failure/error/executionStrategy, we could have a field like runOn (or simply on or when) that takes in a list of states that the parent taskruns have to be in for it to run (default is: success):

- name: task1
  conditions:
    - conditionRef: "condition-that-sometime-fails"
  taskRef: { name: "my-task" }

- name: runIfTask1Fails 
  runAfter: task1
  runOn: ["failure"]

- name: runIfTask1Succeeds
  runAfter: task1
  runOn: ["success"]

- name: runIfTask1IsSkipped
  runAfter: task1
  runOn: ["skip"]

What I like about this is that the field name is more succinct, and instead of having to remember a bunch of magic strings (is it RunOnSuccessOrSkip or RunOnSkipAndSuccess?), the user just needs to remember the 3 taskrun states: "success", "failure", "skip"
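
Here is a minimal Python sketch of how runOn could be evaluated, assuming a task runs only when every parent's state appears in its runOn list, and that a task whose condition fails ends up in the "skip" state. `resolve_run_on` and its arguments are hypothetical and only illustrate the idea.

```python
from typing import Dict, List


def resolve_run_on(tasks: List[dict], results: Dict[str, str],
                   condition_passes: Dict[str, bool]) -> Dict[str, str]:
    """Compute each task's state: "success", "failure" or "skip".

    `results[name]` is what the task would report if it ran.
    `condition_passes[name]` is False when the task's condition fails
    (tasks without an entry have no condition).
    Tasks are assumed to be listed in dependency order.
    """
    state: Dict[str, str] = {}
    for task in tasks:
        name = task["name"]
        run_on = task.get("runOn", ["success"])  # default from the proposal
        parents = task.get("runAfter", [])
        if any(state[p] not in run_on for p in parents):
            state[name] = "skip"
        elif not condition_passes.get(name, True):
            state[name] = "skip"
        else:
            state[name] = results[name]
    return state
```

Since the list can hold several states, a combination such as `["success", "skip"]` needs no new magic string.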

A few more examples here: https://gist.github.com/dibyom/92dfd6ea20f13f5c769a21389df53977

@bobcatfish
Collaborator

I feel like it's fair to consider this closed now that we have finally, tho there are more features to add, and to get the complete set of flexibility someone might want, i think we need to add in #2134 as well

@bobcatfish bobcatfish added the area/roadmap Issues that are part of the project (or organization) roadmap (usually an epic) label Aug 24, 2020
@bobcatfish bobcatfish removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 15, 2020
@RafaeLeal
Contributor

I feel that finally helps a lot in cleanup task scenarios, but there are other scenarios that are not covered, even if we add the ability to nest pipelines.
Two scenarios where I'm facing this problem right now:

  • non-required PR checks interrupting the schedule of other PR tasks, because they can fail.
  • multi-cluster deploy (branched pipeline). We have tasks to deploy to staging/prod for multiple clusters; if one fails to deploy, the others are expected to continue.
              build/test
                      |
                     push
                    /    \
deploy-staging-cluster1   deploy-staging-cluster2
          |                   |
deploy-prod-cluster1      deploy-prod-cluster2

If I understood correctly, #2134 could help us organize this a little better, but the failure strategy would be the same: if one pipeline fails, it could still interrupt the scheduling of other pipelines (instead of tasks).

@pritidesai
Member

pritidesai commented Mar 9, 2021

@RafaeLeal we have multiple TEPs proposed around failure strategies:

"non-required PR checks interrupting the schedule of other PR tasks, because they can fail."

This can be addressed by ignoring the execution status of non-required PR checks with TEP-0050.

"multi-cluster deploy (branched pipeline). We have tasks to deploy in staging/prod for multiple clusters, if one fails to deploy, it's expected the others to continue."

Combining pipelines in pipelines with allowing a task or sub-pipeline to fail while the rest of the graph continues executing could help address this.

Please let me know if these three proposals do not help solve your use cases.
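
The idea of ignoring a task's failure can be illustrated with a small Python sketch. It models the concept only, not Tekton's implementation or the syntax of any TEP; the names pipeline_succeeded and allowed_to_fail, and the example task names, are hypothetical.

```python
from typing import Mapping, Set


def pipeline_succeeded(outcomes: Mapping[str, str], allowed_to_fail: Set[str]) -> bool:
    """A pipeline passes if every failed task is one whose failure is ignored."""
    failed = {name for name, outcome in outcomes.items() if outcome == "failure"}
    return failed <= allowed_to_fail


# A non-required PR check ("lint") fails while the required checks pass.
outcomes = {"build": "success", "lint": "failure", "unit-tests": "success"}

print(pipeline_succeeded(outcomes, allowed_to_fail={"lint"}))  # True
print(pipeline_succeeded(outcomes, allowed_to_fail=set()))     # False
```

Which tasks are allowed to fail is decided when the pipeline is authored, so the same run outcome can pass or fail depending on that declaration.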

@email2smohanty

We have a strict requirement not to stop the pipeline if any task fails, and this cannot be achieved through finally; we also cannot run the tasks in parallel. Based on this issue and the Tekton documentation, I am assuming that there is no configuration or setting at the pipeline level to continue pipeline execution after a task failure. Can anyone please suggest how to tackle this issue?

@afrittoli
Member

@email2smohanty the current behaviour of a PipelineRun is that as soon as a Task fails, no new TaskRun will be scheduled, and the ones that are currently running will run to completion. Depending on the topology of the PipelineRun, there may be TaskRuns that could have been executed but are not, because we already know that the pipeline would fail.

If I understand correctly, you would like the PipelineRun to continue running as many tasks as the pipeline topology allows, even in case of failure. If task X fails, any task that depends on X in any way will not be executed, but any other task could still be executed.
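
As a rough illustration of that desired behaviour (not how the Tekton controller is implemented), here is a toy Python model in which a failure only blocks the tasks downstream of it. The function run_pipeline and its inputs are hypothetical; the example graph is the multi-cluster deploy pipeline drawn earlier in this thread.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def run_pipeline(deps: Dict[str, List[str]], failing: Set[str]) -> Dict[str, str]:
    """Simulate a run in which a failure only blocks the tasks that depend on it.

    deps maps each task to the tasks it runs after; failing is the set of
    tasks that fail when executed. Both are inputs to this toy model.
    """
    status: Dict[str, str] = {}
    # static_order() yields every task after all of its dependencies.
    for task in TopologicalSorter(deps).static_order():
        if any(status[parent] != "success" for parent in deps.get(task, [])):
            status[task] = "skipped"  # an ancestor failed or was skipped
        elif task in failing:
            status[task] = "failure"
        else:
            status[task] = "success"
    return status


deps = {
    "build-test": [],
    "push": ["build-test"],
    "deploy-staging-cluster1": ["push"],
    "deploy-staging-cluster2": ["push"],
    "deploy-prod-cluster1": ["deploy-staging-cluster1"],
    "deploy-prod-cluster2": ["deploy-staging-cluster2"],
}
result = run_pipeline(deps, failing={"deploy-staging-cluster1"})
print(result["deploy-prod-cluster1"])  # skipped
print(result["deploy-prod-cluster2"])  # success
```

In this model the cluster2 branch still deploys to prod even though the cluster1 staging deploy failed, which is the outcome the current fail-fast behaviour prevents.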

There are some features in Tekton today that you could use to achieve something like that - as mentioned in #1684 (comment) - but they require changes to Tasks and Pipeline.

If you need this feature, would you mind filing a separate issue about it?

@pritidesai
Member

Hey @email2smohanty we have a proposal in implementable state for ignoring a task failure at pipeline authoring time. Would this feature work for your use case?

@email2smohanty

There are some features in Tekton today

Yes @pritidesai, I am looking for the feature you mentioned, i.e. ignoring a task failure at pipeline authoring time. Is this feature currently available as an alpha feature?

@email2smohanty

@afrittoli thanks for responding to my comment. You mentioned that there are some features in Tekton today that we could use to achieve something like that, as mentioned in #1684 (comment). But my understanding is that the features mentioned in that comment are still in the Implementable or Proposed state. Or are these features available in an alpha release?

@pritidesai
Member

Hey @email2smohanty this feature is not implemented yet. We are looking for help; if someone is available, we can guide them on how to implement this. Once implemented, yes, it will be an alpha feature.

@samagana
Contributor

Hey @pritidesai! I just stumbled upon this since our organization is also looking for this feature of continuing a pipeline (executing tasks serially) in case one of the tasks fails. I would be very happy to help; could you please guide me a bit on where to start?

@jerop
Member

jerop commented Dec 20, 2022

@samagana -- @pritidesai is ooo right now

here's the proposal to solve this issue: https://github.com/tektoncd/community/blob/main/teps/0050-ignore-task-failures.md#proposal

@QuanZhang-William worked on the design and can help further

Also, we'd appreciate it if you could add your organization in https://github.com/tektoncd/community/blob/main/adopters.md

@QuanZhang-William
Member

Hi @samagana! Yeah, I can help you get started with this feature. I will also be OOO for most of the time until the end of the year, but please don't hesitate to contact me via Slack 😄.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/stale label on Mar 21, 2023
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Apr 20, 2023
@tculp

tculp commented May 15, 2023

/remove-lifecycle rotten
This is still a desired feature

@tekton-robot removed the lifecycle/rotten label on May 15, 2023
@tekton-robot added the lifecycle/stale label on Aug 13, 2023
@tculp

tculp commented Aug 28, 2023

/remove-lifecycle stale

@tekton-robot removed the lifecycle/stale label on Aug 28, 2023
mgoltzsche added a commit to mgoltzsche/pipeline that referenced this issue May 15, 2024
This is an attempt to support dependencies between finally Tasks but it is not fully functional!

Problems:
* Finally tasks cannot be cancelled by design but are supposed to run after cancellation (which doesn't meet our internal cb use-case).
* A finally task is not allowed to refer to normal DAG task's result for some reason. Though, this should be possible according to the documentation and may be a Tekton bug.

Relates to CBP-969
Relates to tektoncd#1684
Relates to tektoncd#4130
Relates to tektoncd#6919
Relates to tektoncd#6903