
TestPipelineRunTimeout is flaky #3460

Closed
mattmoor opened this issue Oct 26, 2020 · 6 comments · Fixed by #3461
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mattmoor
Member

Expected Behavior

TestPipelineRunTimeout passes consistently, observing the expected failure mode (the PipelineRun times out).

Actual Behavior

TestPipelineRunTimeout often observes a "Failed" status on the PipelineRun(!!!).

Steps to Reproduce the Problem

Run the test a lot.
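For example, something like the following (the package path and flags are assumptions; adjust to wherever the e2e test lives):

go test -count=20 -run '^TestPipelineRunTimeout$' ./test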

I will post my analysis here shortly, as I believe I know WHY this is happening.

mattmoor added the kind/bug label Oct 26, 2020
@mattmoor
Member Author

Every so often this test will fail with:

timeout_test.go:102: Error waiting for PipelineRun pipeline-run-timeout-qaakhkfe to finish: "pipeline-run-timeout-qaakhkfe" completed with the wrong reason: Failed

The setup is a Task that sleeps for 10s and a PipelineRun with a 5s timeout:

t.Logf("Creating Task in namespace %s", namespace)
task := tb.Task(helpers.ObjectNameForTest(t), tb.TaskSpec(
tb.Step("busybox", tb.StepCommand("/bin/sh"), tb.StepArgs("-c", "sleep 10"))))
if _, err := c.TaskClient.Create(ctx, task, metav1.CreateOptions{}); err != nil {
t.Fatalf("Failed to create Task %q: %s", task.Name, err)
}
pipeline := tb.Pipeline(helpers.ObjectNameForTest(t),
tb.PipelineSpec(tb.PipelineTask("foo", task.Name)),
)
pipelineRun := tb.PipelineRun(helpers.ObjectNameForTest(t), tb.PipelineRunSpec(pipeline.Name,
tb.PipelineRunTimeout(5*time.Second),
))

This check expects the "Reason" to be "PipelineRunTimeout":

t.Logf("Waiting for PipelineRun %s in namespace %s to be timed out", pipelineRun.Name, namespace)
if err := WaitForPipelineRunState(ctx, c, pipelineRun.Name, timeout, FailedWithReason("PipelineRunTimeout", pipelineRun.Name), "PipelineRunTimedOut"); err != nil {
t.Errorf("Error waiting for PipelineRun %s to finish: %s", pipelineRun.Name, err)
}

However, when the test intermittently fails, the status shows that the TaskRun timed out but the PipelineRun Failed:

        spec:
          pipelineRef:
            name: pipeline-run-timeout-kjlunngp
          serviceAccountName: default
          timeout: 5s     <-- Expected to timeout (the task sleeps for 10s)
        status:
          completionTime: "2020-10-24T12:37:13Z"
          conditions:
          - lastTransitionTime: "2020-10-24T12:37:13Z"
            message: 'Tasks Completed: 1 (Failed: 1, Cancelled 0), Skipped: 0'
            reason: Failed    <-- We get Failed instead of PipelineRunTimeout
            status: "False"
            type: Succeeded
          pipelineSpec:
            tasks:
            - name: foo
              taskRef:
                kind: Task
                name: pipeline-run-timeout-lfsmutbs
          startTime: "2020-10-24T12:37:09Z"
          taskRuns:
            pipeline-run-timeout-mncrmugc-foo-74kf8:
              pipelineTaskName: foo
              status:
                completionTime: "2020-10-24T12:37:13Z"
                conditions:
                - lastTransitionTime: "2020-10-24T12:37:13Z"
                  message: TaskRun "pipeline-run-timeout-mncrmugc-foo-74kf8" failed to finish within "5s"
                  reason: TaskRunTimeout   <-- The task run timed out (as expected) though
                  status: "False"
                  type: Succeeded

The irony is that the TaskRun gets its timeout from the PipelineRun here:

func getTaskRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun, rprt *resources.ResolvedPipelineRunTask) *metav1.Duration {
	var taskRunTimeout = &metav1.Duration{Duration: apisconfig.NoTimeoutDuration}
	var timeout time.Duration
	if pr.Spec.Timeout == nil {
		defaultTimeout := time.Duration(config.FromContextOrDefaults(ctx).Defaults.DefaultTimeoutMinutes)
		timeout = defaultTimeout * time.Minute
	} else {
		timeout = pr.Spec.Timeout.Duration
	}
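	// NOTE: the excerpt above is truncated. A hypothetical sketch of the
	// remainder (not the verbatim source): the copied-down budget would
	// account for time already elapsed since the PipelineRun's recorded
	// StartTime, so the TaskRun's deadline is tied to that StartTime.
	if pr.Status.StartTime != nil {
		elapsed := time.Since(pr.Status.StartTime.Time)
		taskRunTimeout = &metav1.Duration{Duration: timeout - elapsed}
	}
	return taskRunTimeout
}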

When we transition the PipelineRun to "Failed" we skip this check:

// We have 4 different states here:
// 1. Timed out -> Failed
// 2. All tasks are done and at least one has failed or has been cancelled -> Failed
// 3. All tasks are done or are skipped (i.e. condition check failed) -> Success
// 4. A Task or Condition is running right now or there are things left to run -> Running
if pr.IsTimedOut() {
	return &apis.Condition{
		Type:    apis.ConditionSucceeded,
		Status:  corev1.ConditionFalse,
		Reason:  v1beta1.PipelineRunReasonTimedOut.String(),
		Message: fmt.Sprintf("PipelineRun %q failed to finish within %q", pr.Name, pr.Spec.Timeout.Duration.String()),
	}
}
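For reference, the timeout check is, in essence, a comparison against the StartTime recorded in status (a simplified sketch, not the verbatim source):

// Simplified sketch of IsTimedOut: the run only counts as timed out once
// more than Spec.Timeout has elapsed since the StartTime recorded in status.
func isTimedOut(pr *v1beta1.PipelineRun) bool {
	if pr.Status.StartTime.IsZero() || pr.Spec.Timeout == nil {
		return false
	}
	return time.Since(pr.Status.StartTime.Time) >= pr.Spec.Timeout.Duration
}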

Digging into why: the check consults the StartTime in status, and in the failing cases that StartTime is less than 5s before the time of completion(!):

          	PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
          		StartTime:      s"2020-10-25 20:35:07 +0000 UTC",
        - 		CompletionTime: nil,
        + 		CompletionTime: s"2020-10-25 20:35:11.874768406 +0000 UTC m=+601.330431507",
          		TaskRuns: map[string]*v1beta1.PipelineRunTaskRunStatus{

The same status update shows us propagating the following for the TaskRun:

          					TaskRunStatusFields: v1beta1.TaskRunStatusFields{
          						PodName:        "pipeline-run-timeout-qaakhkfe-foo-zzlbf-pod-577ph",
          						StartTime:      s"2020-10-25 20:35:06 +0000 UTC",
        - 						CompletionTime: nil,
        + 						CompletionTime: s"2020-10-25 20:35:11 +0000 UTC",

@mattmoor
Member Author

Generally this means that any PipelineRun specifying a Timeout may simply show up as Failed due to this race.

I think that my $0.02 on the appropriate "fix" would be that the PipelineRun's "StartTime" should never be later than the "StartTime" of its constituent TaskRuns. This should guarantee that any copied-down timeout never manifests this way.
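A minimal sketch of that clamping (hypothetical helper and names, not the code from #3461 verbatim):

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clampStartTime returns a StartTime that is never later than the earliest
// CreationTimestamp among the child TaskRuns.
func clampStartTime(startTime *metav1.Time, childCreationTimestamps []metav1.Time) *metav1.Time {
	for _, ct := range childCreationTimestamps {
		ct := ct // copy so we can safely take its address
		if startTime == nil || ct.Before(startTime) {
			startTime = &ct
		}
	}
	return startTime
}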

cc @vdemeester @afrittoli @imjasonh @bobcatfish for thoughts 🙏

@mattmoor
Member Author

Sorry, I should mention that the first bit of YAML is actually a dump from a prior run, lifted from the older bug that this issue replaces, so don't try to reconcile it with the later diffs.

The complete run for the latest failure (that I debugged) was here: https://github.com/mattmoor/mink/runs/1305814401?check_suite_focus=true#step:17:58

@mattmoor
Member Author

I think that this is derived from stale informer caches and the retry logic for updating status that we have.

On a subsequent reconciliation of the PipelineRun (+1s), starting from a stale, uninitialized version of the object, we re-run InitializeConditions, which resets StartTime, even though a prior reconciliation had already created the TaskRuns (likely what triggered us to be reconciled).
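Presumably InitializeConditions only stamps StartTime when it is unset, so reconciling from a stale copy whose status is still empty stamps a fresh one (a simplified sketch, assuming behavior along these lines):

// Simplified sketch: on a stale object whose status is still empty, this
// stamps a fresh StartTime even though an earlier reconciliation already
// recorded one on the live object.
func initializeConditions(pr *v1beta1.PipelineRun, now metav1.Time) {
	if pr.Status.StartTime.IsZero() {
		pr.Status.StartTime = &now
	}
	// ... conditions are initialized to Succeeded=Unknown / Running ...
}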

The diff for this status update looks like:

    kubelogs.go:197: D 20:35:07.329 controlplane-2 [pipelinerun/reconciler.go:322] [arendelle-px7ff/pipeline-run-timeout-qaakhkfe] Updating status with:   v1beta1.PipelineRunStatus{
          	Status: v1beta1.Status{
          		ObservedGeneration: 0,
        - 		Conditions:         nil,
        + 		Conditions: v1beta1.Conditions{
        + 			{
        + 				Type:               "Succeeded",
        + 				Status:             "Unknown",
        + 				LastTransitionTime: apis.VolatileTime{Inner: s"2020-10-25 20:35:07.188165477 +0"...},
        + 				Reason:             "Running",
        + 				Message:            "Tasks Completed: 0 (Failed: 0, Cancelled 0), Incomplete: 1, Skip"...,
        + 			},
        + 		},
          		Annotations: nil,
          	},
          	PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
        - 		StartTime:      nil,
        + 		StartTime:      s"2020-10-25 20:35:07.055779791 +0000 UTC m=+596.511442892",
          		CompletionTime: nil,
        - 		TaskRuns:       nil,
        + 		TaskRuns: map[string]*v1beta1.PipelineRunTaskRunStatus{
        + 			"pipeline-run-timeout-qaakhkfe-foo-zzlbf": &{PipelineTaskName: "foo", Status: &v1beta1.TaskRunStatus{}},
        + 		},
          		PipelineResults: nil,
        - 		PipelineSpec:    nil,
        + 		PipelineSpec:    &v1beta1.PipelineSpec{Tasks: []v1beta1.PipelineTask{{Name: "foo", TaskRef: &v1beta1.TaskRef{...}}}},
          		SkippedTasks:    nil,
          	},
          }

Note the empty StartTime in the base, despite this in a prior update:

          	PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
        - 		StartTime:      nil,
        + 		StartTime:      s"2020-10-25 20:35:06.759884985 +0000 UTC m=+596.215548086",

mattmoor added a commit to mattmoor/pipeline that referenced this issue Oct 26, 2020
Occasionally, it is possible for us to be reconciling a PipelineRun and have the
status we intend to report reflect an inaccurate StartTime (see issue for
details).  This corrects for those circumstances by ensuring that the StartTime
we report for a PipelineRun is never later than the smallest CreationTimestamp
of a child TaskRun.

Fixes: tektoncd#3460
tekton-robot pushed a commit that referenced this issue Oct 27, 2020 (same commit message as above; Fixes: #3460)
@jerop
Member

jerop commented Aug 8, 2022

TestPipelineRunTimeout test is still flaky

@jerop
Member

jerop commented Aug 19, 2022

fixed in #5134 🤞🏾

thanks @abayer 🎉

jerop closed this as completed Aug 19, 2022