Allow finally tasks to execute after pipeline timeout #2989

gpaul · 2020-07-22T22:41:19Z

Feature request

Tekton recently introduced "Finally Tasks" that execute once a Pipeline succeeds or fails. The design document lists the following use cases:

Cleanup cluster resources after finishing (with success/failure) integration tests (Dogfooding Scenario)
Update Pull Request with what happened overall in the pipeline (pipeline level)
Report Test Results at the end of the test pipeline (Notifications Scenario)

These are all legitimate use cases that show the value of the "Finally Tasks at the Pipeline Level" feature.

However, these tasks will not execute in the event that a Pipeline's execution duration exceeds its Timeout value. Pipelines can have unpredictable or highly variable execution times for various reasons: network transfers, access to third-party services, or delayed execution due to resource contention can all lead to a Pipeline unexpectedly exceeding its configured Timeout.

For this reason, it is currently not possible to rely on Finally tasks for any of the aforementioned use cases. They are an elegant optimisation, but do not provide a contract that can be relied upon in general.

This feature request is to extend the existing Finally Task functionality to allow Finally Tasks to execute even if the enclosing PipelineRun times out. This rounds out the Finally Task feature by addressing the common "PipelineRun timeout" edge case.

Use case

Every reason that one wants Finally Tasks, with the added confidence that they will execute regardless of PipelineRun timeout.

dlorenc · 2020-07-23T00:30:24Z

This is an interesting one - I agree there should be a way to guarantee "finally" runs, but should there be a timeout for that as well?

vdemeester · 2020-07-23T08:29:45Z

/cc @bobcatfish @pritidesai

pritidesai · 2020-07-28T01:20:04Z

This is interesting indeed and I agree, we should be able to guarantee that finally tasks run even the Pipeline (that they part of) timeout. This can be configurable and disabled by default.

Today, pipelineRun timeout takes the highest precedence and exits with failure if its tasks not able to finish in time.

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: pipelinerun-with-timeout
spec:
  timeout: "0h0m3s"
  pipelineSpec:
    tasks:
      - name: echo-good-morning
        timeout: "0h0m30s"
        taskSpec:
          steps:
            - name: echo
              image: ubuntu
              script: |
                #!/usr/bin/env bash
                echo "Good Morning!"
                sleep 60
---

PipelineRun failure:

 tkn pr describe pipelinerun-with-timeout
Name:        pipelinerun-with-timeout
Namespace:   default
Timeout:     3s
Labels:
 tekton.dev/pipeline=pipelinerun-with-timeout

🌡️  Status

STARTED        DURATION    STATUS
1 minute ago   3 seconds   Failed(PipelineRunTimeout)

💌 Message

PipelineRun "pipelinerun-with-timeout" failed to finish within "3s" (TaskRun "pipelinerun-with-timeout-echo-good-morning-4ksv8" failed to finish within "30s")

📦 Resources

 No resources

⚓ Params

 No params

🗂  Taskruns

 NAME                                                 TASK NAME           STARTED        DURATION     STATUS
 ∙ pipelinerun-with-timeout-echo-good-morning-4ksv8   echo-good-morning   1 minute ago   30 seconds   Failed(TaskRunTimeout)

We can have finally task timeout take higher precedence (if specified) and overwrite pipelineRun timeout (default is 1 hour). finally task without any timeout would honor pipelineRun timeout (just like today).

But this option will break the backward compatibility of PipelineRun taking highest precedence and overwriting finally task timeout. If this is something we have to preserve, we can add a separate field finallyTimeout or something similar 🤔

gpaul · 2020-07-28T05:31:16Z

That seems very elegant.

vdemeester · 2020-07-28T10:13:55Z

We can have finally task timeout take higher precedence (if specified) and overwrite pipelineRun timeout (default is 1 hour). finally task without any timeout would honor pipelineRun timeout (just like today).

But this option will break the backward compatibility of PipelineRun taking highest precedence and overwriting finally task timeout. If this is something we have to preserve, we can add a separate field finallyTimeout or something similar thinking

I really like that idea 👼 I wonder how much it breaks the backward compatibility. How much users do use timeout in finally, … But I would love to have this 🙃

pritidesai · 2020-07-28T15:47:32Z

finally task timeout take higher precedence (if specified) and overwrite PipelineRun timeout

+1

@bobcatfish @afrittoli thoughts?

dlorenc · 2020-07-29T23:22:46Z

Just to makes sure I understand this - is this the idea?

A pipeline has a timeout set to 1 hour
It has a finally Task`, that has a finallyTimeout of 10 minutes.
The pipeline runs for an hour, and is terminated.
The finally task then gets executed, and can run for 10 minutes?

pritidesai · 2020-07-30T01:06:44Z

Yes.

Pipeline schedules finally after all non-finally are done within that hour otherwise non-finally tasks are terminated after an hour and finally runs for 10 minutes.

The idea is finallyTimeout does not modify timeout for non-finally tasks but confirm the execution of finally tasks for the specified amount of time.

dlorenc · 2020-08-01T01:36:11Z

+1 to that!

tekton-robot · 2020-10-30T01:54:07Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

pritidesai · 2020-10-30T02:59:06Z

/remove-lifecycle stale

jerop · 2021-01-26T16:50:08Z

/assign @souleb

thank you!!

tekton-robot · 2021-01-26T16:50:09Z

@jerop: GitHub didn't allow me to assign the following users: souleb.

Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @souleb

thank you!!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

souleb · 2021-01-26T16:56:05Z

/assign

tekton-robot · 2021-04-26T17:01:36Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

jerop · 2021-04-26T17:06:17Z

/remove-lifecycle stale

@souleb is actively working on this:
TEP: https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md
PR: #3843

thank you @souleb 😁

tekton-robot · 2021-10-14T23:44:33Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

jerop · 2021-10-15T15:26:03Z

@souleb solved this issue -- thank you 🎉

TEP: https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md
PR: #3843

@ornew

Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timedout before executing for their allocated timeout. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally timeout` has been allocated. As described by @ornew in the issue report: "The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`." This issue is caused by the the interaction between retries and timeouts. When a `TaskRun` is retried: - The status of its failed execution is stored in `RetriesStatus`. - The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`. Resetting the `StartTime` is useful so we can record the `StartTime` of each retry in the `RetriesStatus`. However, using it as the reset start time of the `TaskRun` to check whether it has timedout is incorrect and causes the issues described above. In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timedout. Alternative approaches considered include: - not resetting the start time of the `TaskRun` upon retry, but then it'll be challenging to know the execution times of each retry. - keeping track of the actual start time in an extra field, but this is information that's already available in the status References: - [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md) - [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071) - [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989) Co-authored-by: Lee Bernick <[email protected]>

@ornew

Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timedout before executing for as long as their allocated timeouts. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally timeout` has been allocated. @ornew described the problem in the issue report: "The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`." This prblem is caused by the the interaction between retries and timeouts. When a `TaskRun` is retried: - The status of its failed execution is stored in `RetriesStatus`. - The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`. Resetting the `StartTime` is useful in recording the `StartTime` of each retry in the `RetriesStatus`. However, using the reset time as the reset start time of the `TaskRun` when checking whether it has timedout is incorrect and causes the issues described above. In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timedout. We do this by checking the start time of previous attempts as well, instead of the current attempt only. Alternative approaches considered include: - not resetting the start time of the `TaskRun` upon retry, however it will be challenging to know the execution times of each retry. - keeping track of the actual start time in an extra field, but this is information that's already available in the status References: - [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md) - [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071) - [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989) Co-authored-by: Lee Bernick <[email protected]>

@ornew

Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timedout before executing for as long as their allocated timeouts. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally timeout` has been allocated. @ornew described the problem in the issue report: "The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`." This problem is caused by the the interaction between retries and timeouts. When a `TaskRun` is retried: - The status of its failed execution is stored in `RetriesStatus`. - The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`. Resetting the `StartTime` is useful in recording the `StartTime` of each retry in the `RetriesStatus`. However, using the reset time as the reset start time of the `TaskRun` when checking whether it has timedout is incorrect and causes the issues described above. In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timedout. We do this by checking the start time of previous attempts as well, instead of the current attempt only. Alternative approaches considered include: - not resetting the start time of the `TaskRun` upon retry, however it will be challenging to know the execution times of each retry. - keeping track of the actual start time in an extra field, but this is information that's already available in the status References: - [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md) - [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071) - [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989) Co-authored-by: Lee Bernick <[email protected]>

@ornew

Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timedout before executing for as long as their allocated timeouts. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally timeout` has been allocated. @ornew described the problem in the issue report: "The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`." This problem is caused by the the interaction between retries and timeouts. When a `TaskRun` is retried: - The status of its failed execution is stored in `RetriesStatus`. - The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`. Resetting the `StartTime` is useful in recording the `StartTime` of each retry in the `RetriesStatus`. However, using the reset time as the reset start time of the `TaskRun` when checking whether it has timedout is incorrect and causes the issues described above. In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timedout. We do this by checking the start time of previous attempts as well, instead of the current attempt only. Alternative approaches considered include: - not resetting the start time of the `TaskRun` upon retry, however it will be challenging to know the execution times of each retry. - keeping track of the actual start time in an extra field, but this is information that's already available in the status References: - [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md) - [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071) - [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989) Co-authored-by: Lee Bernick <[email protected]>

@ornew

Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timedout before executing for as long as their allocated timeouts. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally timeout` has been allocated. @ornew described the problem in the issue report: "The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`." This problem is caused by the the interaction between retries and timeouts. When a `TaskRun` is retried: - The status of its failed execution is stored in `RetriesStatus`. - The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`. Resetting the `StartTime` is useful in recording the `StartTime` of each retry in the `RetriesStatus`. However, using the reset time as the reset start time of the `TaskRun` when checking whether it has timedout is incorrect and causes the issues described above. In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timedout. We do this by checking the start time of previous attempts as well, instead of the current attempt only. Alternative approaches considered include: - not resetting the start time of the `TaskRun` upon retry, however it will be challenging to know the execution times of each retry. - keeping track of the actual start time in an extra field, but this is information that's already available in the status References: - [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md) - [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071) - [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989) Co-authored-by: Lee Bernick <[email protected]>

@ornew

Today, when a `Task` is retried, it executes past its timeout. Even worse, the `Task` executing past its timeout causes the execution of `Finally Tasks` to be timedout before executing for as long as their allocated timeouts. As such, the execution of `Finally Tasks` is not guaranteed even when a `finally timeout` has been allocated. @ornew described the problem in the issue report: "The problem for me is that Tekton still doesn't guarantee the execution of `finally` at all. This is a really serious problem for operating in production, as Tekton we operate has caused a lot of resource leaks and a lot of damage by not running finally. It's the biggest problem with existing `timeout`." This problem is caused by the the interaction between retries and timeouts. When a `TaskRun` is retried: - The status of its failed execution is stored in `RetriesStatus`. - The status of the `TaskRun` is updated, including marking it as `Running` and resetting its `StartTime`. Resetting the `StartTime` is useful in recording the `StartTime` of each retry in the `RetriesStatus`. However, using the reset time as the reset start time of the `TaskRun` when checking whether it has timedout is incorrect and causes the issues described above. In this change, we use the actual start time of the `TaskRun` to check whether the `TaskRun` has timedout. We do this by checking the start time of previous attempts as well, instead of the current attempt only. Alternative approaches considered include: - not resetting the start time of the `TaskRun` upon retry, however it will be challenging to know the execution times of each retry. - keeping track of the actual start time in an extra field, but this is information that's already available in the status References: - [TEP-0046: Finally tasks execution post pipelinerun timeout](https://github.com/tektoncd/community/blob/main/teps/0046-finallytask-execution-post-timeout.md) - [Issue: task time exceeded `timeouts.tasks` when task retried](tektoncd#4071) - [Issue: Allow finally tasks to execute after pipeline timeout](tektoncd#2989) Co-authored-by: Lee Bernick <[email protected]>

gpaul added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 22, 2020

tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2020

tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2020

jerop mentioned this issue Jan 21, 2021

WhenExpressions: support ANY match behavior #3591

Open

tekton-robot assigned souleb Jan 26, 2021

tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2021

tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2021

souleb mentioned this issue Apr 28, 2021

Add a Timeouts optional field to pipelinerun #3843

Merged

4 tasks

tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 14, 2021

jerop closed this as completed Oct 15, 2021

jerop mentioned this issue Nov 9, 2021

Stop executing retried Tasks past their timeouts #4368

Closed

5 tasks

abayer mentioned this issue Dec 13, 2021

TEP-0046: Finally tasks execution post pipelinerun timeout tektoncd/community#326

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow finally tasks to execute after pipeline timeout #2989

Allow finally tasks to execute after pipeline timeout #2989

gpaul commented Jul 22, 2020

dlorenc commented Jul 23, 2020

vdemeester commented Jul 23, 2020

pritidesai commented Jul 28, 2020 •

edited

Loading

gpaul commented Jul 28, 2020

vdemeester commented Jul 28, 2020

pritidesai commented Jul 28, 2020

dlorenc commented Jul 29, 2020

pritidesai commented Jul 30, 2020

dlorenc commented Aug 1, 2020

tekton-robot commented Oct 30, 2020

pritidesai commented Oct 30, 2020

jerop commented Jan 26, 2021

tekton-robot commented Jan 26, 2021

souleb commented Jan 26, 2021

tekton-robot commented Apr 26, 2021

jerop commented Apr 26, 2021

tekton-robot commented Oct 14, 2021

jerop commented Oct 15, 2021

Allow finally tasks to execute after pipeline timeout #2989

Allow finally tasks to execute after pipeline timeout #2989

Comments

gpaul commented Jul 22, 2020

Feature request

Use case

dlorenc commented Jul 23, 2020

vdemeester commented Jul 23, 2020

pritidesai commented Jul 28, 2020 • edited Loading

gpaul commented Jul 28, 2020

vdemeester commented Jul 28, 2020

pritidesai commented Jul 28, 2020

dlorenc commented Jul 29, 2020

pritidesai commented Jul 30, 2020

dlorenc commented Aug 1, 2020

tekton-robot commented Oct 30, 2020

pritidesai commented Oct 30, 2020

jerop commented Jan 26, 2021

tekton-robot commented Jan 26, 2021

souleb commented Jan 26, 2021

tekton-robot commented Apr 26, 2021

jerop commented Apr 26, 2021

tekton-robot commented Oct 14, 2021

jerop commented Oct 15, 2021

pritidesai commented Jul 28, 2020 •

edited

Loading