TaskRun taking too long to complete #6316

Closed
RafaeLeal opened this issue Mar 8, 2023 · 5 comments
Labels
area/performance  Issues or PRs that are related to performance aspects.
kind/bug          Categorizes issue or PR as related to a bug.

Comments

@RafaeLeal
Contributor

Expected Behavior

TaskRuns to finish shortly after the last step finishes executing

Actual Behavior

There are times when a TaskRun can take several minutes to update.
In the most extreme scenario, I found this TaskRun.
Let me summarize it here:

{ ...
 "status":{
      "conditions":[
         {
            "type":"Succeeded",
            "status":"True",
            "lastTransitionTime":"2023-03-07T18:24:53Z",
            "reason":"Succeeded",
            "message":"All Steps have completed executing"
         }
      ],
      "startTime":"2023-03-07T16:35:23Z",
      "completionTime":"2023-03-07T18:24:53Z",
      "steps":[
          {
            "terminated":{
               "exitCode":0,
               "reason":"Completed",
               "message": "..."
               "startedAt":"2023-03-07T16:35:49Z",
               "finishedAt":"2023-03-07T16:35:51Z",
               "containerID":"..."
            },
            "name":"checkout",
            "container":"step-checkout",
            "imageID": "..."
         },
         ...
         {
            "terminated":{
               "exitCode":0,
               "reason":"Completed",
               "message":"...",
               "startedAt":"2023-03-07T16:35:58Z",
               "finishedAt":"2023-03-07T16:35:58Z",
               "containerID":"docker://51cc488b8ec3bed2edd7d61e399ea22653634f797bf8dc4844d9fece0dd1df54"
            },
            "name":"notify-user",
            "container":"step-notify-user",
            "imageID":"..."
         }
      ]
   }
}

Note that:

  • The TaskRun succeeded at 2023-03-07T18:24:53Z
  • From status.startTime to status.completionTime is a duration of 1h49m30s
    (2023-03-07T16:35:23Z to 2023-03-07T18:24:53Z)
  • From the first step's status.steps[0].terminated.startedAt to the last step's
    status.steps[11].terminated.finishedAt is only 9s (2023-03-07T16:35:49Z to 2023-03-07T16:35:58Z)
  • We have a very large Tekton cluster, running up to 600k TaskRuns every week.
    It might be related to load, but I couldn't confirm this with metrics.
  • I've investigated the workqueue of the Tekton controller; its depth is 1 (or zero)
    throughout this execution.
  • Other metrics from the tekton-pipelines-controller also seem pretty stable,
    except for the reconcile error count. I've looked into that: because we return
    controller.NewRequeueAfter during execution, those returns are counted as errors
    (as sketched below), but that shouldn't be a problem.
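
For context, here is a minimal sketch of that requeue path. It assumes the
knative.dev/pkg/controller helper that Tekton's reconcilers build on; the
function and the 10s delay are illustrative, not Tekton's actual code:

package main

import (
	"fmt"
	"time"

	"knative.dev/pkg/controller"
)

// reconcile mimics a reconciler that is still waiting for step containers to
// finish: returning controller.NewRequeueAfter asks the work queue to retry
// the key later, but a metrics pipeline that counts every non-nil return as a
// failure reports it as a reconcile error.
func reconcile(stepsDone bool) error {
	if !stepsDone {
		return controller.NewRequeueAfter(10 * time.Second)
	}
	return nil
}

func main() {
	fmt.Println(reconcile(false)) // non-nil: shows up in the error count
	fmt.Println(reconcile(true))  // nil: a normal successful reconcile
}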

Steps to Reproduce the Problem

I'm still not sure how to reproduce this problem, but it does happen every day
in our infrastructure, so I can try some fixes.
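
As a way to spot affected TaskRuns at scale, here is a minimal detection sketch
(plain Go, stdlib only, not part of Tekton): it reads the output of
kubectl get taskruns -A -o json on stdin and prints TaskRuns whose
completionTime trails the last step's finishedAt by more than a minute. The
field names follow the status shown above; the one-minute threshold is arbitrary.

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// Only the fields needed for the check; matches the TaskRun status layout above.
type taskRunList struct {
	Items []struct {
		Metadata struct {
			Namespace string `json:"namespace"`
			Name      string `json:"name"`
		} `json:"metadata"`
		Status struct {
			CompletionTime time.Time `json:"completionTime"`
			Steps          []struct {
				Terminated struct {
					FinishedAt time.Time `json:"finishedAt"`
				} `json:"terminated"`
			} `json:"steps"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	var list taskRunList
	if err := json.NewDecoder(os.Stdin).Decode(&list); err != nil {
		panic(err)
	}
	const threshold = time.Minute // arbitrary cut-off for "too long"
	for _, tr := range list.Items {
		var lastFinished time.Time
		for _, s := range tr.Status.Steps {
			if s.Terminated.FinishedAt.After(lastFinished) {
				lastFinished = s.Terminated.FinishedAt
			}
		}
		if lastFinished.IsZero() || tr.Status.CompletionTime.IsZero() {
			continue // still running, or status not populated yet
		}
		if lag := tr.Status.CompletionTime.Sub(lastFinished); lag > threshold {
			fmt.Printf("%s/%s: %s between last step finishing and completionTime\n",
				tr.Metadata.Namespace, tr.Metadata.Name, lag)
		}
	}
}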

Additional Info

  • Kubernetes version: 1.22.16-eks-ffeb93d
clientVersion:
  buildDate: "2022-10-12T10:47:25Z"
  compiler: gc
  gitCommit: 434bfd82814af038ad94d62ebe59b133fcb50506
  gitTreeState: clean
  gitVersion: v1.25.3
  goVersion: go1.19.2
  major: "1"
  minor: "25"
  platform: darwin/amd64
kustomizeVersion: v4.5.7
serverVersion:
  buildDate: "2022-11-29T18:41:42Z"
  compiler: gc
  gitCommit: 52e500d139bdef42fbc4540c357f0565c7867a81
  gitTreeState: clean
  gitVersion: v1.22.16-eks-ffeb93d
  goVersion: go1.16.15
  major: "1"
  minor: 22+
  platform: linux/amd64
  • Tekton Pipeline version:
v0.35.1
@RafaeLeal RafaeLeal added the kind/bug Categorizes issue or PR as related to a bug. label Mar 8, 2023
@lbernick lbernick added the area/performance Issues or PRs that are related to performance aspects. label Mar 8, 2023
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 6, 2023
@vdemeester
Member

/remove-lifecycle stale

@tekton-robot tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 20, 2023
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 18, 2023
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 18, 2023
@vdemeester vdemeester removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 19, 2024
@vdemeester
Member

The time it takes to process a TaskRun (or any other object) depends on a lot of things (such as the load on the API server, etcd, …), and a higher number of Tekton objects is a known factor there. Every 10 hours there is a global resync that goes through all objects; see #3676 or #8023 on this.

Given the version of this report, and the activity, I am going to close this.
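
For readers hitting the same symptom, here is an illustrative client-go sketch of
what such a global resync does. Tekton's controller is actually wired through
knative.dev/pkg (which defaults to the 10h resync mentioned above), so treat the
factory, the Pod informer, and the handler below as an illustration rather than
Tekton's real code: on each resync, every object in the informer cache is
re-delivered to UpdateFunc and re-enqueued, so a cluster with hundreds of
thousands of TaskRuns gets a large burst of reconcile work.

package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (illustration only).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A 10h resync period: roughly every 10 hours the informer replays every
	// cached object through the update handler, even if nothing changed.
	factory := informers.NewSharedInformerFactory(client, 10*time.Hour)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// On a periodic resync, oldObj and newObj are the same cached
			// object; a real controller re-enqueues the key here, which is
			// where the extra load on large clusters comes from.
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop // block; a real controller would run its workers here
}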
