Unexpected, irregular reprocessing of resources by tekton-pipelines-controller #3676

Closed
Fabian-K opened this issue Jan 12, 2021 · 7 comments
Labels: kind/bug, lifecycle/stale

Comments

@Fabian-K
Contributor

Hi,

In our Tekton deployment, I noticed that the work queue (tekton_workqueue_depth) of the tekton-pipelines-controller sometimes spikes. Judging by the size of the spike, this looks like a full reprocessing of, e.g., all PipelineRuns in the cluster. When this happens, I see the following messages in the log. It happens irregularly.

2021-01-12 14:11:58 Trace[1752598205]: [18.704479331s] [18.704479331s] END
2021-01-12 14:11:58 Trace[1752598205]: ---"Objects listed" 18681ms (13:11:00.652)
2021-01-12 14:11:58 I0112 13:11:58.675267       1 trace.go:201] Trace[1752598205]: "Reflector ListAndWatch" name:runtime/asm_amd64.s:1374 (12-Jan-2021 13:11:00.970) (total time: 18704ms):

I suspect that the connection between the controller and the API server fails, but this is just a guess. Is there any way to find out what causes the reprocessing?

Thanks,
Fabian

Additional Info

  • Kubernetes version: v1.18.12
  • Tekton Pipeline version: v0.19.0
Fabian-K added the kind/bug label on Jan 12, 2021
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot added the lifecycle/stale label on Apr 12, 2021
@imjasonh
Member

Sorry for not responding earlier.

This should be considered expected behavior. Reconcilers are configured to periodically scan all resources in case they missed a previous update. Normally this has no effect other than a (hopefully) brief spike in the workqueue depth, with no noticeable additional latency for in-progress PipelineRuns.
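For anyone trying to picture the mechanism: below is a minimal, hedged sketch using plain client-go shared informers, with core Pods standing in for Tekton's generated PipelineRun informers (illustrative only, not Tekton's actual wiring). The resync period passed to the factory makes the informer periodically replay its entire cache through the event handlers, which re-enqueues every object and produces exactly the kind of tekton_workqueue_depth spike reported above.

```go
// Minimal sketch of periodic informer resync with client-go.
// Not Tekton's actual wiring; Pods stand in for PipelineRuns.
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/workqueue"
)

func enqueue(q workqueue.RateLimitingInterface, obj interface{}) {
	if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
		q.Add(key)
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The second argument is the resync period: every 10h the informer
	// replays every object in its cache through the event handlers,
	// even if nothing changed on the API server.
	factory := informers.NewSharedInformerFactory(client, 10*time.Hour)

	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) { enqueue(queue, obj) },
		// UpdateFunc also fires on resync, with old and new being identical,
		// which is what re-fills the workqueue all at once.
		UpdateFunc: func(_, newObj interface{}) { enqueue(queue, newObj) },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // a real controller would run workers draining the queue here
}
```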

@Fabian-K
Contributor Author

Thank you @imjasonh for picking this up!

Do you know if the periodic reprocessing is deterministic, e.g. every x hours after the container started? Is this basically https://github.com/knative/pkg/blob/main/controller/controller.go#L52, or did I miss where this is overridden by Tekton?

Some background info: we currently have ~15k TaskRuns and ~8k PipelineRuns in the cluster. With these numbers, the reprocessing takes ~10 min, which is not ideal but OK. To lower the impact, we also shard into 3 buckets. (Really looking forward to the Results project in this context ;) )

@imjasonh
Member

I believe the resync period is measured from the time the informer started (which is the time the container started), and not, for example, from the last time the object was reconciled.

That 10-hour resync period is the one Tekton uses; we don't override it AFAIK.

Results is probably going to be your best bet long-term; if you want to try it out and give feedback, that would probably be helpful to that project. Until then, sharding is a reasonable band-aid. I'm sorry I don't have better solutions than that at the moment.
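For reference, the 10-hour default mentioned above lives in knative.dev/pkg (the constant linked earlier in the thread). Paraphrased from recent versions (check the version pinned in Tekton's go.mod for the exact wording), it reads roughly like this, so any controller built on knative.dev/pkg inherits it unless explicitly overridden:

```go
// Paraphrased from knative.dev/pkg/controller.
package controller

import "time"

// DefaultResyncPeriod is the default duration used when no resync period
// is associated with a controller's initialization context.
const DefaultResyncPeriod = 10 * time.Hour
```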

@Fabian-K
Contributor Author

That's perfectly fine :). I'm just not sure if I'm seeing the (expected) periodic reprocessing or something in addition. My guess: lost connection to the API server. I'll try to monitor this on my end and potentially re-open the issue.

@gerrnot

gerrnot commented May 24, 2024

We have learned to live with this, but the 10h window is still quite annoying: it is impossible for us to find a timeframe during a business day in which no developer is interrupted (some start their day early, some late). We currently use scheduled controller restarts as a workaround.

IMO it would be good to either:

  • extend the full reconciliation loop to something like 12h+,
  • make it configurable,
  • or smooth it out (only add items to the queue when the controller has little work to do, in small chunks over a longer period).

I see it as a defect, and the user experience is quite bad; nobody likes to see a stuck build.

I suggest this gets reopened and improved.

@prgss

prgss commented Jun 4, 2024

We have 1500 Pipelines here and are heavily impacted.
Devs and ops sometimes need to wait 20 minutes before a pipeline launches if they hit the sync-loop issue.

  • +1 to extending the loop to 12h or 24h, to have greater control over when the controller performs the resync
  • +1 to making it configurable

vdemeester added a commit to vdemeester/tektoncd-pipeline that referenced this issue Jun 5, 2024
This should allow advanced users/cluster-admins to configure the
resyncPeriod to a value that fits their cluster instead of relying on
the default 10h one.

This is related to tektoncd#3676.

Signed-off-by: Vincent Demeester <[email protected]>
tekton-robot pushed a commit that referenced this issue Jun 24, 2024
This should allow advanced users/cluster-admins to configure the
resyncPeriod to a value that fits their cluster instead of relying on
the default 10h one.

This is related to #3676.

Signed-off-by: Vincent Demeester <[email protected]>
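The merged change above makes the resync period configurable. As a rough illustration of the general idea only (the RESYNC_PERIOD environment variable and all of the wiring below are hypothetical, not what the linked commit actually does), a controller binary can parse a duration at startup and hand it to its informer machinery instead of the hard-coded 10h default:

```go
// Hypothetical sketch: read a resync period from an environment variable
// (RESYNC_PERIOD is an illustrative name, not a real Tekton setting) and
// fall back to the 10h default when it is unset or invalid.
package main

import (
	"log"
	"os"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const defaultResync = 10 * time.Hour

func resyncPeriod() time.Duration {
	if v := os.Getenv("RESYNC_PERIOD"); v != "" {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			return d
		}
		log.Printf("invalid RESYNC_PERIOD %q, falling back to %s", v, defaultResync)
	}
	return defaultResync
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// In a knative.dev/pkg-based controller such as Tekton's, the chosen
	// period would instead be attached to the startup context so the
	// injected informers pick it up; the plain client-go factory here
	// just keeps the sketch self-contained.
	factory := informers.NewSharedInformerFactory(client, resyncPeriod())
	_ = factory // a real binary would register informers and start the factory

	log.Printf("using resync period %s", resyncPeriod())
}
```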