
TestPipelineRunTimeout is flaky #3460

Closed
mattmoor opened this issue Oct 26, 2020 · 6 comments · Fixed by #3461
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mattmoor
Member

Expected Behavior

TestPipelineRunTimeout passes consistently, observing the expected failure mode (the PipelineRun times out).

Actual Behavior

TestPipelineRunTimeout often observes a "Failed" status on the PipelineRun(!!!).

Steps to Reproduce the Problem

Run the test a lot.
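For example, something like the following (the package path and flags are assumptions; adjust to wherever the e2e test lives):

go test -count=20 -run '^TestPipelineRunTimeout$' ./test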

I will post my analysis here shortly, as I believe I know WHY this is happening.

mattmoor added the kind/bug label Oct 26, 2020
@mattmoor
Member Author

Every so often this test will fail with:

timeout_test.go:102: Error waiting for PipelineRun pipeline-run-timeout-qaakhkfe to finish: "pipeline-run-timeout-qaakhkfe" completed with the wrong reason: Failed

The setup is a Task that sleeps for 10s and a PipelineRun with a 5s timeout:

t.Logf("Creating Task in namespace %s", namespace)
task := tb.Task(helpers.ObjectNameForTest(t), tb.TaskSpec(
tb.Step("busybox", tb.StepCommand("/bin/sh"), tb.StepArgs("-c", "sleep 10"))))
if _, err := c.TaskClient.Create(ctx, task, metav1.CreateOptions{}); err != nil {
t.Fatalf("Failed to create Task %q: %s", task.Name, err)
}
pipeline := tb.Pipeline(helpers.ObjectNameForTest(t),
tb.PipelineSpec(tb.PipelineTask("foo", task.Name)),
)
pipelineRun := tb.PipelineRun(helpers.ObjectNameForTest(t), tb.PipelineRunSpec(pipeline.Name,
tb.PipelineRunTimeout(5*time.Second),
))

This check expects the "Reason" to be "PipelineRunTimeout":

t.Logf("Waiting for PipelineRun %s in namespace %s to be timed out", pipelineRun.Name, namespace)
if err := WaitForPipelineRunState(ctx, c, pipelineRun.Name, timeout, FailedWithReason("PipelineRunTimeout", pipelineRun.Name), "PipelineRunTimedOut"); err != nil {
t.Errorf("Error waiting for PipelineRun %s to finish: %s", pipelineRun.Name, err)
}

However, when the test intermittently fails, the status shows that the TaskRun timed out but the PipelineRun Failed:

        spec:
          pipelineRef:
            name: pipeline-run-timeout-kjlunngp
          serviceAccountName: default
          timeout: 5s     <-- Expected to timeout (the task sleeps for 10s)
        status:
          completionTime: "2020-10-24T12:37:13Z"
          conditions:
          - lastTransitionTime: "2020-10-24T12:37:13Z"
            message: 'Tasks Completed: 1 (Failed: 1, Cancelled 0), Skipped: 0'
            reason: Failed    <-- We get Failed instead of PipelineRunTimeout
            status: "False"
            type: Succeeded
          pipelineSpec:
            tasks:
            - name: foo
              taskRef:
                kind: Task
                name: pipeline-run-timeout-lfsmutbs
          startTime: "2020-10-24T12:37:09Z"
          taskRuns:
            pipeline-run-timeout-mncrmugc-foo-74kf8:
              pipelineTaskName: foo
              status:
                completionTime: "2020-10-24T12:37:13Z"
                conditions:
                - lastTransitionTime: "2020-10-24T12:37:13Z"
                  message: TaskRun "pipeline-run-timeout-mncrmugc-foo-74kf8" failed to finish within "5s"
                  reason: TaskRunTimeout   <-- The task run timed out (as expected) though
                  status: "False"
                  type: Succeeded

The irony is that the TaskRun gets its timeout from the PipelineRun here:

func getTaskRunTimeout(ctx context.Context, pr *v1beta1.PipelineRun, rprt *resources.ResolvedPipelineRunTask) *metav1.Duration {
	var taskRunTimeout = &metav1.Duration{Duration: apisconfig.NoTimeoutDuration}
	var timeout time.Duration
	if pr.Spec.Timeout == nil {
		defaultTimeout := time.Duration(config.FromContextOrDefaults(ctx).Defaults.DefaultTimeoutMinutes)
		timeout = defaultTimeout * time.Minute
	} else {
		timeout = pr.Spec.Timeout.Duration
	}
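	// NOTE: the excerpt above is truncated. A hypothetical sketch of the
	// remainder (not the verbatim source): the copied-down budget would
	// account for time already elapsed since the PipelineRun's recorded
	// StartTime, so the TaskRun's deadline is tied to that StartTime.
	if pr.Status.StartTime != nil {
		elapsed := time.Since(pr.Status.StartTime.Time)
		taskRunTimeout = &metav1.Duration{Duration: timeout - elapsed}
	}
	return taskRunTimeout
}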

When we transition the PipelineRun to "Failed" we skip this check:

// We have 4 different states here:
// 1. Timed out -> Failed
// 2. All tasks are done and at least one has failed or has been cancelled -> Failed
// 3. All tasks are done or are skipped (i.e. condition check failed) -> Success
// 4. A Task or Condition is running right now or there are things left to run -> Running
if pr.IsTimedOut() {
	return &apis.Condition{
		Type:    apis.ConditionSucceeded,
		Status:  corev1.ConditionFalse,
		Reason:  v1beta1.PipelineRunReasonTimedOut.String(),
		Message: fmt.Sprintf("PipelineRun %q failed to finish within %q", pr.Name, pr.Spec.Timeout.Duration.String()),
	}
}
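For reference, the timeout check is, in essence, a comparison against the StartTime recorded in status (a simplified sketch, not the verbatim source):

// Simplified sketch of IsTimedOut: the run only counts as timed out once
// more than Spec.Timeout has elapsed since the StartTime recorded in status.
func isTimedOut(pr *v1beta1.PipelineRun) bool {
	if pr.Status.StartTime.IsZero() || pr.Spec.Timeout == nil {
		return false
	}
	return time.Since(pr.Status.StartTime.Time) >= pr.Spec.Timeout.Duration
}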

Digging into why: the check consults the StartTime in status, and in the failing cases that StartTime is less than 5s before the time of completion(!):

          	PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
          		StartTime:      s"2020-10-25 20:35:07 +0000 UTC",
        - 		CompletionTime: nil,
        + 		CompletionTime: s"2020-10-25 20:35:11.874768406 +0000 UTC m=+601.330431507",
          		TaskRuns: map[string]*v1beta1.PipelineRunTaskRunStatus{

The same status update shows us propagating the following for the TaskRun:

          					TaskRunStatusFields: v1beta1.TaskRunStatusFields{
          						PodName:        "pipeline-run-timeout-qaakhkfe-foo-zzlbf-pod-577ph",
          						StartTime:      s"2020-10-25 20:35:06 +0000 UTC",
        - 						CompletionTime: nil,
        + 						CompletionTime: s"2020-10-25 20:35:11 +0000 UTC",

@mattmoor
Member Author

Generally this means that any PipelineRun specifying a Timeout may simply show up as Failed due to this race.

I think that my $0.02 on the appropriate "fix" would be that the PipelineRun's "StartTime" should never be later than the "StartTime" of its constituent TaskRuns. This should guarantee that any copied-down timeout never manifests this way.
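A minimal sketch of that clamping (hypothetical helper and names, not the code from #3461 verbatim):

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clampStartTime returns a StartTime that is never later than the earliest
// CreationTimestamp among the child TaskRuns.
func clampStartTime(startTime *metav1.Time, childCreationTimestamps []metav1.Time) *metav1.Time {
	for _, ct := range childCreationTimestamps {
		ct := ct // copy so we can safely take its address
		if startTime == nil || ct.Before(startTime) {
			startTime = &ct
		}
	}
	return startTime
}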

cc @vdemeester @afrittoli @imjasonh @bobcatfish for thoughts 🙏

@mattmoor
Member Author

Sorry, I should mention that the first bit of YAML is actually a dump from a prior run, lifted from the older bug that this issue replaces, so don't try to reconcile it with the later diffs.

The complete run for the latest failure (that I debugged) was here: https://github.com/mattmoor/mink/runs/1305814401?check_suite_focus=true#step:17:58

@mattmoor
Member Author

I think that this is derived from stale informer caches and the retry logic for updating status that we have.

On a subsequent reconciliation of the PipelineRun (+1s), starting from a stale, uninitialized version of the object, we re-run InitializeConditions, which resets StartTime, even though a prior reconciliation had already created the TaskRuns (likely what triggered us to be reconciled).
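Presumably InitializeConditions only stamps StartTime when it is unset, so reconciling from a stale copy whose status is still empty stamps a fresh one (a simplified sketch, assuming behavior along these lines):

// Simplified sketch: on a stale object whose status is still empty, this
// stamps a fresh StartTime even though an earlier reconciliation already
// recorded one on the live object.
func initializeConditions(pr *v1beta1.PipelineRun, now metav1.Time) {
	if pr.Status.StartTime.IsZero() {
		pr.Status.StartTime = &now
	}
	// ... conditions are initialized to Succeeded=Unknown / Running ...
}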

The diff for this status update looks like:

    kubelogs.go:197: D 20:35:07.329 controlplane-2 [pipelinerun/reconciler.go:322] [arendelle-px7ff/pipeline-run-timeout-qaakhkfe] Updating status with:   v1beta1.PipelineRunStatus{
          	Status: v1beta1.Status{
          		ObservedGeneration: 0,
        - 		Conditions:         nil,
        + 		Conditions: v1beta1.Conditions{
        + 			{
        + 				Type:               "Succeeded",
        + 				Status:             "Unknown",
        + 				LastTransitionTime: apis.VolatileTime{Inner: s"2020-10-25 20:35:07.188165477 +0"...},
        + 				Reason:             "Running",
        + 				Message:            "Tasks Completed: 0 (Failed: 0, Cancelled 0), Incomplete: 1, Skip"...,
        + 			},
        + 		},
          		Annotations: nil,
          	},
          	PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
        - 		StartTime:      nil,
        + 		StartTime:      s"2020-10-25 20:35:07.055779791 +0000 UTC m=+596.511442892",
          		CompletionTime: nil,
        - 		TaskRuns:       nil,
        + 		TaskRuns: map[string]*v1beta1.PipelineRunTaskRunStatus{
        + 			"pipeline-run-timeout-qaakhkfe-foo-zzlbf": &{PipelineTaskName: "foo", Status: &v1beta1.TaskRunStatus{}},
        + 		},
          		PipelineResults: nil,
        - 		PipelineSpec:    nil,
        + 		PipelineSpec:    &v1beta1.PipelineSpec{Tasks: []v1beta1.PipelineTask{{Name: "foo", TaskRef: &v1beta1.TaskRef{...}}}},
          		SkippedTasks:    nil,
          	},
          }

Note the empty StartTime in the base, despite this in a prior update:

          	PipelineRunStatusFields: v1beta1.PipelineRunStatusFields{
        - 		StartTime:      nil,
        + 		StartTime:      s"2020-10-25 20:35:06.759884985 +0000 UTC m=+596.215548086",

mattmoor added a commit to mattmoor/pipeline that referenced this issue Oct 26, 2020
Occasionally, it is possible for us to be reconciling a PipelineRun and have the
status we intend to report reflect an inaccurate StartTime (see issue for
details).  This corrects for those circumstances by ensuring that the StartTime
we report for a PipelineRun is never later than the smallest CreationTimestamp
of a child TaskRun.

Fixes: tektoncd#3460
tekton-robot pushed a commit that referenced this issue Oct 27, 2020 (same commit message as above; Fixes: #3460)
@jerop
Member

jerop commented Aug 8, 2022

TestPipelineRunTimeout test is still flaky

@jerop
Member

jerop commented Aug 19, 2022

fixed in #5134 🤞🏾

thanks @abayer 🎉

jerop closed this as completed Aug 19, 2022