-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TEP-0114: Resolve the Flaky Test - TestWaitCustomTask_PipelineRun #5658
Conversation
/kind flake |
The following is the coverage report on the affected files.
|
6afa586
to
19d451c
Compare
The following is the coverage report on the affected files.
|
A clarification: a
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @XinruZhang for looking into this flake!
Please update the release notes to reflect that, with this change, implementors of Runs would have to fully implement Run timeouts themselves - PipelineRun reconciler would not set Run.Spec.Status == RunCancelled
upon Run timeout
Might be also worth clarifying that this change applies only to task-level timeouts, which should be handled by run reconcilers for custom tasks, in the same way that taskrun reconcilers handles task-level timeouts for tasks (since #5134)
Pipeline-level timeouts continue to work as is - through cancelling TaskRuns and Runs:
pipeline/pkg/reconciler/pipelinerun/timeout.go
Lines 129 to 136 in 58fd07c
for _, runName := range runNames { | |
logger.Infof("cancelling Run %s for timeout", runName) | |
if err := timeoutRun(ctx, runName, pr.Namespace, clientSet); err != nil { | |
errs = append(errs, fmt.Errorf("Failed to patch Run `%s` with cancellation: %s", runName, err).Error()) | |
continue | |
} | |
} |
A similar change (#5134) addressed flakiness in TestPipelineRunTimeout
cc @tektoncd/core-maintainers
Appreciate all the info @jerop! Updated the PR description and release note. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A more descriptive commit message could be something like "Remove PipelineRun cancelation of Runs when Pipeline Task timeout is reached"
19d451c
to
37d3bb3
Compare
The following is the coverage report on the affected files.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for investigating and addressing this flake @XinruZhang!
I support this change, but will wait for @ScrapCodes (who may depend on this functionality) to chime in - @tektoncd/core-maintainers please take a look
For future reference, the wait custom task that we use for testing has fully implemented timeouts itself (can serve as an example of not relying on pipelinerun reconciler) -
pipeline/test/custom-task-ctrls/wait-task/pkg/reconciler/reconciler.go
Lines 101 to 139 in b648a5b
timeout := r.GetTimeout() | |
if duration == timeout { | |
r.Status.MarkRunFailed("InvalidTimeOut", "Spec.Timeout shouldn't equal duration") | |
return nil | |
} | |
elapsed := c.Clock.Since(r.Status.StartTime.Time) | |
// Custom Task is running and not timed out | |
if r.Status.StartTime != nil && elapsed <= duration && elapsed <= timeout { | |
logger.Infof("The Custom Task Run %s is running", r.GetName()) | |
waitTime := duration.Nanoseconds() | |
if timeout.Nanoseconds() < waitTime { | |
waitTime = timeout.Nanoseconds() | |
} | |
return controller.NewRequeueAfter(time.Duration(waitTime)) | |
} | |
if r.Status.StartTime != nil && elapsed > duration && elapsed <= timeout { | |
logger.Infof("The Custom Task Run %v finished", r.GetName()) | |
r.Status.CompletionTime = &metav1.Time{Time: c.Clock.Now()} | |
r.Status.MarkRunSucceeded("DurationElapsed", "The wait duration has elapsed") | |
return nil | |
} | |
// Custom Task timed out | |
if r.Status.StartTime != nil && elapsed > timeout { | |
logger.Infof("The Custom Task Run %v timed out", r.GetName()) | |
r.Status.CompletionTime = &metav1.Time{Time: c.Clock.Now()} | |
r.Status.MarkRunFailed("TimedOut", WaitTaskCancelledByRunTimeoutMsg) | |
// Retry if the current RetriesStatus hasn't reached the retries limit | |
if r.Spec.Retries > len(r.Status.RetriesStatus) { | |
logger.Infof("Run timed out, retrying... %#v", r.Status) | |
retryRun(r) | |
return controller.NewRequeueImmediately() | |
} | |
return nil | |
} |
@@ -1772,7 +1765,7 @@ status: | |||
}, { | |||
Operation: "add", | |||
Path: "/spec/statusMessage", | |||
Value: string(v1alpha1.RunCancelledByPipelineMsg), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this constant be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel It's better if we keep it same as other test cases. Oh actually, we may want to update the previous one as v1alpha1.RunReasonCancelled
, nice catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I meant can RunCancelledByPipelineMsg
be removed from the v1alpha1 package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, you are totally right, sorry I should've asked. We'll need to move all related usage of v1alpha1.Run
to v1beta1
later on as part of #5153, added #5153 (comment) to make sure this will happen.
…reached TestWaitCustomTask_PipelineRun/Wait_Task_Retries_on_Timeout has been flaky for a while. This commit stops the PipelineRun reconciler from cancelling Run when it detects that the task-level Timeout configured for the Run has passed, which will address the flake (similar to tektoncd#5134 which addresses TestPipelineRunTimeout).
37d3bb3
to
a1da340
Compare
The following is the coverage report on the affected files.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jerop, lbernick The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test tekton-pipeline-unit-tests cloud events flake |
@jerop and @XinruZhang I was wondering, without this feature how will custom task controller "reconcile" and take action on timed-out task. Basically, this |
Hi Prashant, thanks for keeping eye on this issue! If the Run controller handles the timeout correctly -- mark the Run as Failed on timeout, then we wouldn't have any problem. |
Hi Xinru, |
Oh I see, great point! After thinking more about it, I think that would be a concern of Custom Run controller right? A custom controller can requeue a key after a period of time, similar to pipeline/test/custom-task-ctrls/wait-task/pkg/reconciler/reconciler.go Lines 108 to 116 in db4c05c
|
This is already used by pipeline controller, so we are good. |
Changes
Fix #5653
I'm openning this PR as a tentative solution to address #5653:
TL;DR:
TestWaitCustomTask_PipelineRun/Wait_Task_Retries_on_Timeout
has been flaky for a while. This PR stops the PipelineRun reconciler from cancelling Run when it detects that the task-level Timeout configured for the Run has passed, which will address the flake (similar to #5134 which addressesTestPipelineRunTimeout
).Some other useful information, many thanks to @abayer, @lbernick and @jerop:
Run
should only be canceled whenPipelineRun
as a whole has timed out, either due toPipelineRun.Spec.Timeout
's value orPipelineRun.Spec.Timeouts.Pipeline
's value.PipelineTask
,PipelineRun.Spec.Timeouts.Tasks
is set, and that value has elapsed.PipelineTask
,PipelineRun.Spec.Timeouts.Finally
is set, and that value has elapsed.TaskRun
, pipelinerun reconciler don't cancel it when it times out.This change only applies to task-level Timeout, i.e. when
PipelineTask.Timeout
is set for a Run.cc @pritidesai Will this commit have any other side effect?
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
functionality, content, code)
/kind <type>
. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tepRelease Notes