PipelineRun reconciler shouldn't care about how the `Timeout` field is implemented in Custom Task Run. #5653

XinruZhang · 2022-10-18T15:05:44Z

PipelineRun reconciler shouldn't care about how the Timeout field is implemented in Custom Task Run.

Currently, the PipelineRun reconciler cancels the Run when it detects that the Run has timed out:

pipeline/pkg/reconciler/pipelinerun/pipelinerun.go

Lines 657 to 659 in 2b54123

    
    if err := c.processRunTimeouts(ctx, pr, pipelineRunState); err != nil {
    	return err
    }

by calling

pipeline/pkg/apis/pipeline/v1alpha1/run_types.go

Lines 230 to 241 in 2b54123

    
    func (r *Run) HasTimedOut(c clock.PassiveClock) bool {
    	if r.Status.StartTime == nil || r.Status.StartTime.IsZero() {
    		return false
    	}
    	timeout := r.GetTimeout()
    	// If timeout is set to 0 or defaulted to 0, there is no timeout.
    	if timeout == apisconfig.NoTimeoutDuration {
    		return false
    	}
    	runtime := c.Since(r.Status.StartTime.Time)
    	return runtime > timeout
    }

This behavior leads to flaky unit test results now and then. See the test result of TestWaitCustomTask_PipelineRun/Wait_Task_Retries_on_Timeout in

Say we have defined the following Pipeline and PipelineRun:

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: custom-task-pipeline
spec:
  tasks:
  - name: run-wait
    timeout: "1s"
    retries: 1
    taskRef:
      apiVersion: wait.testing.tekton.dev/v1alpha1
      kind: Wait
    params:
    - name: duration
      value: "2s"
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: custom-task-pipelinerun
spec:
  pipelineRef:
    apiVersion: tekton.dev/v1beta1
    name: custom-task-pipeline

Expected Behavior

PR reconciler creates a Run: custom-task-pipelinerun-run-wait

pipeline/pkg/reconciler/pipelinerun/pipelinerun.go

Lines 790 to 801 in 8399472

    
    case rpt.IsCustomTask() && rpt.IsMatrixed():
    	rpt.Runs, err = c.createRuns(ctx, rpt, pr)
    	if err != nil {
    		recorder.Eventf(pr, corev1.EventTypeWarning, "RunsCreationFailed", "Failed to create Runs %q: %v", rpt.RunNames, err)
    		return fmt.Errorf("error creating Runs called %s for PipelineTask %s from PipelineRun %s: %w", rpt.RunNames, rpt.PipelineTask.Name, pr.Name, err)
    	}
    case rpt.IsCustomTask():
    	rpt.Run, err = c.createRun(ctx, rpt.RunName, nil, rpt, pr)
    	if err != nil {
    		recorder.Eventf(pr, corev1.EventTypeWarning, "RunCreationFailed", "Failed to create Run %q: %v", rpt.RunName, err)
    		return fmt.Errorf("error creating Run called %s for PipelineTask %s from PipelineRun %s: %w", rpt.RunName, rpt.PipelineTask.Name, pr.Name, err)
    	}

The Run reconciler reconciles the Run and keeps reconciling because the Run hasn't timed out yet.

After 1s, the Run reconciler detects that the Run has timed out and marks it as failed with reason TimedOut, for example:

pipeline/test/custom-task-ctrls/wait-task/pkg/reconciler/reconciler.go

Lines 126 to 139 in 8399472

    
    if r.Status.StartTime != nil && elapsed > timeout {
    	logger.Infof("The Custom Task Run %v timed out", r.GetName())
    	r.Status.CompletionTime = &metav1.Time{Time: c.Clock.Now()}
    	r.Status.MarkRunFailed("TimedOut", WaitTaskCancelledByRunTimeoutMsg)
    	// Retry if the current RetriesStatus hasn't reached the retries limit
    	if r.Spec.Retries > len(r.Status.RetriesStatus) {
    		logger.Infof("Run timed out, retrying... %#v", r.Status)
    		retryRun(r)
    		return controller.NewRequeueImmediately()
    	}
    	return nil
    }

The PR reconciler detects that the Run is marked as TimedOut and updates its ChildReferences or RunsStatus:

pipeline/pkg/reconciler/pipelinerun/pipelinerun.go

Lines 684 to 691 in 8399472

    
    if cfg.FeatureFlags.EmbeddedStatus == config.FullEmbeddedStatus || cfg.FeatureFlags.EmbeddedStatus == config.BothEmbeddedStatus {
    	pr.Status.TaskRuns = pipelineRunFacts.State.GetTaskRunsStatus(pr)
    	pr.Status.Runs = pipelineRunFacts.State.GetRunsStatus(pr)
    }
    if cfg.FeatureFlags.EmbeddedStatus == config.MinimalEmbeddedStatus || cfg.FeatureFlags.EmbeddedStatus == config.BothEmbeddedStatus {
    	pr.Status.ChildReferences = pipelineRunFacts.State.GetChildReferences()
    }

Actual Behavior

PR reconciler creates a Run: custom-task-pipelinerun-run-wait

pipeline/pkg/reconciler/pipelinerun/pipelinerun.go

Lines 790 to 801 in 8399472

    
    case rpt.IsCustomTask() && rpt.IsMatrixed():
    	rpt.Runs, err = c.createRuns(ctx, rpt, pr)
    	if err != nil {
    		recorder.Eventf(pr, corev1.EventTypeWarning, "RunsCreationFailed", "Failed to create Runs %q: %v", rpt.RunNames, err)
    		return fmt.Errorf("error creating Runs called %s for PipelineTask %s from PipelineRun %s: %w", rpt.RunNames, rpt.PipelineTask.Name, pr.Name, err)
    	}
    case rpt.IsCustomTask():
    	rpt.Run, err = c.createRun(ctx, rpt.RunName, nil, rpt, pr)
    	if err != nil {
    		recorder.Eventf(pr, corev1.EventTypeWarning, "RunCreationFailed", "Failed to create Run %q: %v", rpt.RunName, err)
    		return fmt.Errorf("error creating Run called %s for PipelineTask %s from PipelineRun %s: %w", rpt.RunName, rpt.PipelineTask.Name, pr.Name, err)
    	}

The Run reconciler reconciles the Run and keeps reconciling because the Run hasn't timed out yet.

[The flaky part] The PR reconciler calls processRunTimeouts:

pipeline/pkg/reconciler/pipelinerun/pipelinerun.go

Lines 657 to 659 in 2b54123

    
    if err := c.processRunTimeouts(ctx, pr, pipelineRunState); err != nil {
    	return err
    }

At this point the Run has timed out, but the Run reconciler hasn't updated the Run's status yet, so the PR reconciler tries to cancel the Run:

pipeline/pkg/reconciler/pipelinerun/pipelinerun.go

Lines 718 to 725 in 2b54123

    
    if rpt.Run != nil && !rpt.Run.IsCancelled() && rpt.Run.HasTimedOut(c.Clock) && !rpt.Run.IsDone() {
    	logger.Infof("Cancelling run task: %s due to timeout.", rpt.RunName)
    	err := cancelRun(ctx, rpt.RunName, pr.Namespace, c.PipelineClientSet)
    	if err != nil {
    		errs = append(errs,
    			fmt.Errorf("failed to patch Run `%s` with cancellation: %s", rpt.RunName, err).Error())
    	}
    }


TestWaitCustomTask_PipelineRun/Wait_Task_Retries_on_Timeout has been flaky for a while. see tektoncd#5653 for more details. This commit stops the PipelineRun reconciler from cancelling Run when it detects the Run times out.

XinruZhang · 2022-10-18T20:05:07Z

Took another look at the timeout section of TEP-0002:

For a PipelineRun with either a pipeline level timeout configured and/or the custom task level timout configuration, timeout is updated to the run with same policy as it is for task runs. On timeout, the running run's status is updated with "RunCancelled"

Therefore this is actually aligned with what's designed, but this exact implementation causes the flaky test results.

The statement "timeout is updated to the run with same policy as it is for task runs" in the design is NOT true in the current implementation. For TaskRun, the PipelineRun reconciler doesn't care about how TaskRunSpec.Timeout is enforced; the TaskRun reconciler handles that.

cc @lbernick @jerop @pritidesai

tekton-robot · 2022-10-18T20:06:25Z

@XinruZhang: The label(s) kind/flaky cannot be applied, because the repository doesn't have them.

In response to this:

/kind flaky

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

XinruZhang · 2022-10-18T20:06:55Z

/kind flake

XinruZhang · 2022-10-18T20:10:30Z

Do we want to remove

pipeline/pkg/reconciler/pipelinerun/pipelinerun.go

Lines 657 to 659 in 2b54123

    
    if err := c.processRunTimeouts(ctx, pr, pipelineRunState); err != nil {
    	return err
    }

and update TEP-0002 to reflect the change?

At the very least, we need to update TEP-0002 to reflect that the PipelineRun reconciler doesn't cancel a TaskRun when its run time exceeds TaskRunSpec.Timeout.

It is worth mentioning here: prior to #5134, the PipelineRun reconciler handled the timeout for TaskRuns.

XinruZhang · 2022-10-18T20:10:42Z

/assign


XinruZhang · 2022-10-19T17:14:31Z

Do we want to remove

pipeline/pkg/reconciler/pipelinerun/pipelinerun.go

Lines 657 to 659 in 2b54123

    if err := c.processRunTimeouts(ctx, pr, pipelineRunState); err != nil {
    	return err
    }

and update TEP-0002 to reflect the change?

At the very least, we need to update TEP-0002 to reflect that the PipelineRun reconciler doesn't cancel a TaskRun when its run time exceeds TaskRunSpec.Timeout.

It is worth mentioning here: prior to #5134, the PipelineRun reconciler handled the timeout for TaskRuns.

@ScrapCodes, wondering if you have any thoughts 😀

ScrapCodes · 2022-10-20T06:23:37Z

Hi @XinruZhang, I want to respond to this and the other PR as well. I am currently working in a limited capacity and on vacation until Tuesday. Thanks!

XinruZhang added the kind/bug Categorizes issue or PR as related to a bug. label Oct 18, 2022

xchapter7x added this to Tekton Community Roadmap Oct 18, 2022

xchapter7x moved this to Todo in Tekton Community Roadmap Oct 18, 2022

tekton-robot added the kind/flake Categorizes issue or PR as related to a flakey test label Oct 18, 2022

tekton-robot assigned XinruZhang Oct 18, 2022

XinruZhang mentioned this issue Oct 18, 2022

TEP-0114: Resolve the Flaky Test - TestWaitCustomTask_PipelineRun #5658

Merged

7 tasks

tekton-robot removed the kind/bug Categorizes issue or PR as related to a bug. label Oct 20, 2022

tekton-robot closed this as completed in #5658 Oct 25, 2022

Repository owner moved this from Todo to Done in Tekton Community Roadmap Oct 25, 2022

XinruZhang mentioned this issue Oct 31, 2022

Conformance Policy around Custom Task Run #5700

Open
