Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PipelineRun reconciler shouldn't care about how the Timeout field is implemented in Custom Task Run. #5653

Closed
XinruZhang opened this issue Oct 18, 2022 · 7 comments · Fixed by #5658
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flakey test

Comments

@XinruZhang
Copy link
Member

XinruZhang commented Oct 18, 2022

PipelineRun reconciler shouldn't care about how the Timeout field is implemented in Custom Task Run.

Currently in PipelineRun reconciler, it cancells the Run when it detects that Run is TimeOut:

if err := c.processRunTimeouts(ctx, pr, pipelineRunState); err != nil {
return err
}

by calling
func (r *Run) HasTimedOut(c clock.PassiveClock) bool {
if r.Status.StartTime == nil || r.Status.StartTime.IsZero() {
return false
}
timeout := r.GetTimeout()
// If timeout is set to 0 or defaulted to 0, there is no timeout.
if timeout == apisconfig.NoTimeoutDuration {
return false
}
runtime := c.Since(r.Status.StartTime.Time)
return runtime > timeout
}

This behavior leads to flaky unit test results now and then. See the test result of TestWaitCustomTask_PipelineRun/Wait_Task_Retries_on_Timeout in

Say we have defined the following PipelineRun

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: custom-task-pipeline
spec:
  tasks:
  - name: run-wait
    timeout: "1s"
    retries: 1
    taskRef:
      apiVersion: wait.testing.tekton.dev/v1alpha1
      kind: Wait
    params:
    - name: duration
      value: "2s"
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: custom-task-pipelinerun
spec:
  pipelineRef:
    apiVersion: tekton.dev/v1beta1
    name: custom-task-pipeline

Expected Behavior

  1. PR reconciler creates a Run: custom-task-pipelinerun-run-wait
    case rpt.IsCustomTask() && rpt.IsMatrixed():
    rpt.Runs, err = c.createRuns(ctx, rpt, pr)
    if err != nil {
    recorder.Eventf(pr, corev1.EventTypeWarning, "RunsCreationFailed", "Failed to create Runs %q: %v", rpt.RunNames, err)
    return fmt.Errorf("error creating Runs called %s for PipelineTask %s from PipelineRun %s: %w", rpt.RunNames, rpt.PipelineTask.Name, pr.Name, err)
    }
    case rpt.IsCustomTask():
    rpt.Run, err = c.createRun(ctx, rpt.RunName, nil, rpt, pr)
    if err != nil {
    recorder.Eventf(pr, corev1.EventTypeWarning, "RunCreationFailed", "Failed to create Run %q: %v", rpt.RunName, err)
    return fmt.Errorf("error creating Run called %s for PipelineTask %s from PipelineRun %s: %w", rpt.RunName, rpt.PipelineTask.Name, pr.Name, err)
    }
  2. Run reconciler reconcils the Run, continues reconciling because the Run hasn't TimedOut.
  3. Run reconciler detects the Run has timed out, marks it as failed on TimedOut after 1s, example:
    if r.Status.StartTime != nil && elapsed > timeout {
    logger.Infof("The Custom Task Run %v timed out", r.GetName())
    r.Status.CompletionTime = &metav1.Time{Time: c.Clock.Now()}
    r.Status.MarkRunFailed("TimedOut", WaitTaskCancelledByRunTimeoutMsg)
    // Retry if the current RetriesStatus hasn't reached the retries limit
    if r.Spec.Retries > len(r.Status.RetriesStatus) {
    logger.Infof("Run timed out, retrying... %#v", r.Status)
    retryRun(r)
    return controller.NewRequeueImmediately()
    }
    return nil
    }
  4. PR reconciler detects the Run is marked as TimedOut, update its ChildReferences or RunsStatus:
    if cfg.FeatureFlags.EmbeddedStatus == config.FullEmbeddedStatus || cfg.FeatureFlags.EmbeddedStatus == config.BothEmbeddedStatus {
    pr.Status.TaskRuns = pipelineRunFacts.State.GetTaskRunsStatus(pr)
    pr.Status.Runs = pipelineRunFacts.State.GetRunsStatus(pr)
    }
    if cfg.FeatureFlags.EmbeddedStatus == config.MinimalEmbeddedStatus || cfg.FeatureFlags.EmbeddedStatus == config.BothEmbeddedStatus {
    pr.Status.ChildReferences = pipelineRunFacts.State.GetChildReferences()
    }

Actual Behavior

  1. PR reconciler creates a Run: custom-task-pipelinerun-run-wait
    case rpt.IsCustomTask() && rpt.IsMatrixed():
    rpt.Runs, err = c.createRuns(ctx, rpt, pr)
    if err != nil {
    recorder.Eventf(pr, corev1.EventTypeWarning, "RunsCreationFailed", "Failed to create Runs %q: %v", rpt.RunNames, err)
    return fmt.Errorf("error creating Runs called %s for PipelineTask %s from PipelineRun %s: %w", rpt.RunNames, rpt.PipelineTask.Name, pr.Name, err)
    }
    case rpt.IsCustomTask():
    rpt.Run, err = c.createRun(ctx, rpt.RunName, nil, rpt, pr)
    if err != nil {
    recorder.Eventf(pr, corev1.EventTypeWarning, "RunCreationFailed", "Failed to create Run %q: %v", rpt.RunName, err)
    return fmt.Errorf("error creating Run called %s for PipelineTask %s from PipelineRun %s: %w", rpt.RunName, rpt.PipelineTask.Name, pr.Name, err)
    }
  2. Run reconciler reconcils the Run, continues reconciling because the Run hasn't TimedOut.
  3. [The flaky part] PR reconciler processRunTimeouts
    if err := c.processRunTimeouts(ctx, pr, pipelineRunState); err != nil {
    return err
    }
    at this time, The Run has TimedOut, but the Run reconciler hasn't update the Run's status yet, then PR reconciler tries to cancel the Run
    if rpt.Run != nil && !rpt.Run.IsCancelled() && rpt.Run.HasTimedOut(c.Clock) && !rpt.Run.IsDone() {
    logger.Infof("Cancelling run task: %s due to timeout.", rpt.RunName)
    err := cancelRun(ctx, rpt.RunName, pr.Namespace, c.PipelineClientSet)
    if err != nil {
    errs = append(errs,
    fmt.Errorf("failed to patch Run `%s` with cancellation: %s", rpt.RunName, err).Error())
    }
    }
@XinruZhang XinruZhang added the kind/bug Categorizes issue or PR as related to a bug. label Oct 18, 2022
XinruZhang added a commit to XinruZhang/pipeline that referenced this issue Oct 18, 2022
TestWaitCustomTask_PipelineRun/Wait_Task_Retries_on_Timeout has been
flaky for a while. see tektoncd#5653
for more details. This commit stops the PipelineRun reconciler from
cancelling Run when it detects the Run times out.
XinruZhang added a commit to XinruZhang/pipeline that referenced this issue Oct 18, 2022
TestWaitCustomTask_PipelineRun/Wait_Task_Retries_on_Timeout has been
flaky for a while. see tektoncd#5653
for more details. This commit stops the PipelineRun reconciler from
cancelling Run when it detects the Run times out.
@XinruZhang
Copy link
Member Author

XinruZhang commented Oct 18, 2022

Took another look at the timeout section of TEP-0002:

For a PipelineRun with either a pipeline level timeout configured and/or the custom task level timout configuration, timeout is updated to the run with same policy as it is for task runs. On timeout, the running run's status is updated with "RunCancelled"

Therefore this is actually aligned with what's designed, but this exact implementation causes the flaky test results.

The statement "timeout is updated to the run with same policy as it is for task runs" in the design is NOT true in the current implementation. For TaskRun, the PipelineRun reconciler doesn't care about how the TaskRunSpec.TimeOut -- TaskRun reconciler handles that.

cc @lbernick @jerop @pritidesai

@tekton-robot
Copy link
Collaborator

@XinruZhang: The label(s) kind/flaky cannot be applied, because the repository doesn't have them.

In response to this:

/kind flaky

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@XinruZhang
Copy link
Member Author

/kind flake

@tekton-robot tekton-robot added the kind/flake Categorizes issue or PR as related to a flakey test label Oct 18, 2022
@XinruZhang
Copy link
Member Author

XinruZhang commented Oct 18, 2022

Do we want to remove

if err := c.processRunTimeouts(ctx, pr, pipelineRunState); err != nil {
return err
}
and update TEP-0002 to reflect the change?

At least we need to update TEP-0002 to reflect that PipelineRun reconciler doesn't cancel TaskRun when its run time bigger than TaskRunSpec.Timeout.

It is worth mentiong here: prior to #5134, pipelinerun reconciler handles the timeout for taskrun.

@XinruZhang
Copy link
Member Author

/assign

XinruZhang added a commit to XinruZhang/pipeline that referenced this issue Oct 19, 2022
TestWaitCustomTask_PipelineRun/Wait_Task_Retries_on_Timeout has been
flaky for a while. see tektoncd#5653
for more details. This commit stops the PipelineRun reconciler from
cancelling Run when it detects the Run times out.
@XinruZhang
Copy link
Member Author

Do we want to remove

if err := c.processRunTimeouts(ctx, pr, pipelineRunState); err != nil {
return err
}

and update TEP-0002 to reflect the change?
At least we need to update TEP-0002 to reflect that PipelineRun reconciler doesn't cancel TaskRun when its run time bigger than TaskRunSpec.Timeout.

It is worth mentiong here: prior to #5134, pipelinerun reconciler handles the timeout for taskrun.

@ScrapCodes, wondering if you have any thoughts 😀

@ScrapCodes
Copy link
Contributor

Hi @XinruZhang , I want to respond to this and the other PR as well, currently I am working in a limited capacity and enjoying vacations upto Tuesday. Thanks!

@tekton-robot tekton-robot removed the kind/bug Categorizes issue or PR as related to a bug. label Oct 20, 2022
Repository owner moved this from Todo to Done in Tekton Community Roadmap Oct 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/flake Categorizes issue or PR as related to a flakey test
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants