Unexpected, irregular reprocessing of resources by tekton-pipelines-controller #3676

Closed
Fabian-K opened this issue Jan 12, 2021 · 7 comments
Labels: kind/bug, lifecycle/stale

Comments

@Fabian-K
Contributor

Hi,

In our Tekton deployment, I noticed that the work queue (tekton_workqueue_depth) of the tekton-pipelines-controller sometimes spikes. Judging by the size of the spike, this looks like a full reprocessing of, e.g., all PipelineRuns in the cluster. When this happens, I see the following messages in the log. It happens irregularly.

2021-01-12 14:11:58 Trace[1752598205]: [18.704479331s] [18.704479331s] END
2021-01-12 14:11:58 Trace[1752598205]: ---"Objects listed" 18681ms (13:11:00.652)
2021-01-12 14:11:58 I0112 13:11:58.675267       1 trace.go:201] Trace[1752598205]: "Reflector ListAndWatch" name:runtime/asm_amd64.s:1374 (12-Jan-2021 13:11:00.970) (total time: 18704ms):

I suspect that the connection between the controller and the API server fails, but this is just a guess. Is there any way to find out what causes the reprocessing?

Thanks,
Fabian

Additional Info

  • Kubernetes version: v1.18.12
  • Tekton Pipeline version: v0.19.0
Fabian-K added the kind/bug label on Jan 12, 2021
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot added the lifecycle/stale label on Apr 12, 2021
@imjasonh
Member

Sorry for not responding earlier.

This should be considered expected behavior. Reconcilers are configured to periodically scan all resources in case they missed a previous update. Normally this has no effect other than a (hopefully) brief spike in the workqueue depth, with no noticeable additional latency for in-progress PipelineRuns.
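For anyone trying to picture the mechanism: below is a minimal, hedged sketch using plain client-go shared informers, with core Pods standing in for Tekton's generated PipelineRun informers (illustrative only, not Tekton's actual wiring). The resync period passed to the factory makes the informer periodically replay its entire cache through the event handlers, which re-enqueues every object and produces exactly the kind of tekton_workqueue_depth spike reported above.

```go
// Minimal sketch of periodic informer resync with client-go.
// Not Tekton's actual wiring; Pods stand in for PipelineRuns.
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func enqueue(q workqueue.RateLimitingInterface, obj interface{}) {
	if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
		q.Add(key)
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The second argument is the resync period: every 10h the informer
	// replays every object in its cache through the event handlers,
	// even if nothing changed on the API server.
	factory := informers.NewSharedInformerFactory(client, 10*time.Hour)

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { enqueue(queue, obj) },
		// UpdateFunc also fires on resync, with old and new being identical,
		// which is what re-fills the workqueue all at once.
		UpdateFunc: func(_, newObj interface{}) { enqueue(queue, newObj) },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // a real controller would run workers draining the queue here
}
```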

@Fabian-K
Contributor Author

Thank you @imjasonh for picking this up!

Do you know if the periodic reprocessing is deterministic, e.g. every x hours after the container started? Is this basically https://github.com/knative/pkg/blob/main/controller/controller.go#L52, or did I miss where this is overridden by Tekton?

Some background info: we currently have ~15k TaskRuns and ~8k PipelineRuns in the cluster. With these numbers, the reprocessing takes ~10 min, which is not ideal but OK. To lower the impact, we also shard into 3 buckets. (Really looking forward to the Results project in this context ;) )

@imjasonh
Member

I believe the resync period is measured from the time the informer started (which is the time the container started), and not, for example, from the last time the object was reconciled.

That 10-hour resync period is the one Tekton uses; we don't override it AFAIK.

Results is probably going to be your best bet long-term; if you want to try it out and give feedback, that would probably be helpful to that project. Until then, sharding is a reasonable band-aid. I'm sorry I don't have better solutions than that at the moment.
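For reference, the 10-hour default mentioned above lives in knative.dev/pkg (the constant linked earlier in the thread). Paraphrased from recent versions (check the version pinned in Tekton's go.mod for the exact wording), it reads roughly like this, so any controller built on knative.dev/pkg inherits it unless explicitly overridden:

```go
// Paraphrased from knative.dev/pkg/controller.
package controller

import "time"

// DefaultResyncPeriod is the default duration used when no resync period
// is associated with a controller's initialization context.
const DefaultResyncPeriod = 10 * time.Hour
```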

@Fabian-K
Contributor Author

That's perfectly fine :). I'm just not sure if I'm seeing the (expected) periodic reprocessing or something in addition. My guess: lost connection to the API server. I'll try to monitor this on my end and potentially re-open the issue.

@gerrnot

gerrnot commented May 24, 2024

We have learned to live with this, but the 10h window is still quite annoying: it is impossible for us to find a timeframe during a business day in which no developer is interrupted (some start their day early, some late). We currently use scheduled controller restarts as a workaround.

IMO it would be good to either:

  • extend the full reconciliation loop to something like 12h+,
  • make it configurable,
  • or smooth it out (only add items to the queue when the controller has little work to do, in small chunks over a longer period).

I see it as a defect, and the user experience is quite bad; nobody likes to see a stuck build.

I suggest this gets reopened and improved.

@prgss

prgss commented Jun 4, 2024

We have 1500 Pipelines here and are heavily impacted.
Devs and ops sometimes need to wait 20 minutes before a pipeline launches if they hit the sync-loop issue.

  • +1 to extending the loop to 12h or 24h, to have greater control over when the controller performs the resync
  • +1 to making it configurable

vdemeester added a commit to vdemeester/tektoncd-pipeline that referenced this issue Jun 5, 2024
This should allow advanced users/cluster-admins to configure the
resyncPeriod to a value that fits their cluster instead of relying on
the default 10h one.

This is related to tektoncd#3676.

Signed-off-by: Vincent Demeester <[email protected]>
tekton-robot pushed a commit that referenced this issue Jun 24, 2024
This should allow advanced users/cluster-admins to configure the
resyncPeriod to a value that fits their cluster instead of relying on
the default 10h one.

This is related to #3676.

Signed-off-by: Vincent Demeester <[email protected]>
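The merged change above makes the resync period configurable. As a rough illustration of the general idea only (the RESYNC_PERIOD environment variable and all of the wiring below are hypothetical, not what the linked commit actually does), a controller binary can parse a duration at startup and hand it to its informer machinery instead of the hard-coded 10h default:

```go
// Hypothetical sketch: read a resync period from an environment variable
// (RESYNC_PERIOD is an illustrative name, not a real Tekton setting) and
// fall back to the 10h default when it is unset or invalid.
package main

import (
	"log"
	"os"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const defaultResync = 10 * time.Hour

func resyncPeriod() time.Duration {
	if v := os.Getenv("RESYNC_PERIOD"); v != "" {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			return d
		}
		log.Printf("invalid RESYNC_PERIOD %q, falling back to %s", v, defaultResync)
	}
	return defaultResync
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// In a knative.dev/pkg-based controller such as Tekton's, the chosen
	// period would instead be attached to the startup context so the
	// injected informers pick it up; the plain client-go factory here
	// just keeps the sketch self-contained.
	factory := informers.NewSharedInformerFactory(client, resyncPeriod())
	_ = factory // a real binary would register informers and start the factory

	log.Printf("using resync period %s", resyncPeriod())
}
```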