
TaskRun retries: Improve separation of concerns between PipelineRun and TaskRun reconciler #5248

Closed
lbernick opened this issue Aug 1, 2022 · 11 comments

@lbernick
Member

lbernick commented Aug 1, 2022

Today, retries in Pipeline Tasks are implemented via the following process:

  • If a TaskRun fails and the Pipeline Task has retries specified, the PipelineRun controller will clear the status of the TaskRun, add its old status to taskRun.status.retriesStatus, and mark it as running. (happens here)
  • The TaskRun reconciler will use the length of its RetriesStatus to determine what to name the pod. (happens here)

This pattern is a bit awkward because it means both reconcilers are partially responsible for implementing retries, when only the reconciler for a given CRD should be the one updating that CRD's status.
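
For concreteness, here is a rough sketch (not the actual Tekton source) of the pattern described above, assuming the v1beta1 TaskRun API; the helper name is made up:

```go
package reconciler

import (
	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// markForRetry illustrates the current pattern: the PipelineRun reconciler
// archives the failed attempt into taskRun.status.retriesStatus and clears
// the fields that make the TaskRun look finished, so it runs again.
func markForRetry(tr *v1beta1.TaskRun) {
	// Keep a copy of the failed attempt (without its own retry history).
	prev := *tr.Status.DeepCopy()
	prev.RetriesStatus = nil
	tr.Status.RetriesStatus = append(tr.Status.RetriesStatus, prev)

	// Reset the live status; the TaskRun reconciler will later derive the
	// next pod name from len(tr.Status.RetriesStatus).
	tr.Status.StartTime = nil
	tr.Status.CompletionTime = nil
	tr.Status.PodName = ""
}
```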

One way around this would be to create a new TaskRun for each retry instead. This was discussed as a potential implementation strategy when retries were created (in issue 221, initial PR 510, and final PR 658), but it's not clear why the final decision went the way it did.

If we'd like to go this route, here's what I think we should do:

  • update the PipelineRun reconciler to create a new TaskRun for each retry, without changing the old TaskRun's status
  • The new TaskRun can still have RetriesStatus based on the status of the old TaskRun.
  • Because the new TaskRun will need a different name, it will create a different pod name anyway, so the logic around naming the pod based on the length of the retries status can be deleted.
  • Now, RetriesStatus is solely informational. It serves no purpose to the TaskRun reconciler, so it can be removed from TaskRun.Status.
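
Roughly, the PipelineRun reconciler's side of this could look like the following hypothetical sketch; the naming scheme and helper are illustrative only, not an agreed design:

```go
package reconciler

import (
	"fmt"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// newRetryTaskRun sketches the "new TaskRun per retry" option: instead of
// rewriting the failed TaskRun's status, the PipelineRun reconciler creates
// a fresh TaskRun for the next attempt.
func newRetryTaskRun(failed *v1beta1.TaskRun, attempt int) *v1beta1.TaskRun {
	retry := failed.DeepCopy()
	// A distinct name yields a distinct pod name on its own, so the
	// pod-naming-by-retry-count logic in the TaskRun reconciler goes away.
	retry.Name = fmt.Sprintf("%s-retry%d", failed.Name, attempt)
	retry.ResourceVersion = ""
	retry.UID = ""
	// Start from a clean status; only the TaskRun reconciler updates it.
	retry.Status = v1beta1.TaskRunStatus{}
	return retry
}
```
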
@XinruZhang
Member

/assign XinruZhang

@lbernick
Member Author

@XinruZhang I was just talking about this with @jerop: An equally valid way to support TaskRun retries and have separation of concerns is to add retries to taskrun.spec, and have the TaskRun reconciler implement retries. The advantage of this strategy is that we'd be consistent with how we handle retries for Runs. This might need a bit more discussion before you start implementing -- sorry!
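
For illustration only, the new field might look something like this (a sketch, not the final API shape):

```go
package sketch

// TaskRunSpec here sketches the alternative: a retries count declared on the
// TaskRun spec itself and acted on by the TaskRun reconciler, mirroring how
// retries are expressed for Runs. Field name and placement are illustrative.
type TaskRunSpec struct {
	// Retries is the number of times to retry this TaskRun after a failure.
	Retries int `json:"retries,omitempty"`

	// ... the rest of the existing TaskRunSpec fields ...
}
```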

@XinruZhang
Member

Thanks @lbernick for bringing up this issue! Indeed the behavior here is a little bit weird 😅. I'm more than happy to discuss it more :)

I totally agree this is really about which controller should be responsible for taking care of the retries functionality.

I agree either option would make sense here because it decouples the two reconcilers on this functionality. I'm leaning a bit towards the latter one -- adding retries to taskrun.spec -- because it kills two birds with one stone: both TaskRuns and Pipeline Tasks would be retriable after this implementation.

@lbernick
Member Author

sg! @afrittoli @abayer @pritidesai just want to make sure you don't have any concerns about adding retries to taskRun.spec-- do you think this needs a TEP? (From talking to @jerop I think this is her preferred solution as well but ofc feel free to comment with any concerns)

@abayer
Contributor

abayer commented Aug 11, 2022

I'm in favor of this approach. It does kinda feel like it deserves a TEP, but it's borderline to me.

@vdemeester
Member

Agreeing with @abayer, I think it deserves a TEP 👼🏼

@XinruZhang
Member

Thanks for everyone's input! I'll write a TEP for the new retries field XD.

@afrittoli
Member

afrittoli commented Oct 3, 2022

@pritidesai
Member

Thanks @lbernick!

> This pattern is a bit awkward because it means both reconcilers are partially responsible for implementing retries

It feels awkward because there is no concept of a pipelineTask reconciler. For each pipelineTask, the pipelineRun controller creates a taskRun. In case of retries, the pipelineRun controller creates a new attempt which is embedded in the existing taskRun under taskRun.status.

> only the reconciler for a given CRD should be the one updating that CRD's status.

+1

> update the PipelineRun reconciler to create a new TaskRun for each retry, without changing the old TaskRun's status

+1

> Because the new TaskRun will need a different name, it will create a different pod name anyway, so the logic around naming the pod based on the length of the retries status can be deleted.

I find it very useful to have the pod name based on the length of the retries. The retry count signifies how many attempts were executed for a particular pipelineTask. Out of thousands of pods created, the pods belonging to a pipelineTask can be identified without querying the labels of each pod.

> Now, RetriesStatus is solely informational. It serves no purpose to the TaskRun reconciler, so it can be removed from TaskRun.Status.

How?

The pipelineRun controller creating a single taskRun and updating the status of that same taskRun (addRetryHistory followed by clearStatus) makes it possible to know the status of that pipelineTask by querying a single taskRun. It's absolutely reasonable to move this logic to the taskRun reconciler. But how? I am proposing an additional field and labels to keep this mapping available, instead of constantly querying all the taskRuns in the cluster to identify which pipelineTask each belongs to and losing the retry count.
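
Purely to illustrate the "additional field and labels" idea (the tekton.dev/retry-attempt key below is hypothetical; tekton.dev/pipelineTask is the existing pipelineTask label):

```go
package reconciler

import (
	"strconv"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// labelAttempt sketches the proposal above: keep the pipelineTask mapping
// and the attempt number on each TaskRun as labels, so a pipelineTask's
// attempts can be found with a label selector instead of scanning every
// TaskRun in the cluster. The retry-attempt key is hypothetical.
func labelAttempt(tr *v1beta1.TaskRun, pipelineTaskName string, attempt int) {
	if tr.Labels == nil {
		tr.Labels = map[string]string{}
	}
	tr.Labels["tekton.dev/pipelineTask"] = pipelineTaskName       // existing label
	tr.Labels["tekton.dev/retry-attempt"] = strconv.Itoa(attempt) // hypothetical
}
```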

@lbernick
Member Author

> Because the new TaskRun will need a different name, it will create a different pod name anyway, so the logic around naming the pod based on the length of the retries status can be deleted.

> I find it very useful to have the pod name based on the length of the retries. The retry count signifies how many attempts were executed for a particular pipelineTask. Out of thousands of pods created, the pods belonging to a pipelineTask can be identified without querying the labels of each pod.

Pod naming for taskruns isn't part of our API -- we don't make any guarantees about pod naming not changing, and I don't think we should. Are other projects getting the pod associated with a taskrun by making an API call for a pod named taskrunname-pod, or are they using taskrun.status.podName? Maybe you could give a bit more detail on how the pod name is being used?
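
For reference, a hedged sketch of the podName route, assuming the generated Tekton and Kubernetes clientsets:

```go
package client

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"

	pipelineclient "github.com/tektoncd/pipeline/pkg/client/clientset/versioned"
)

// podForTaskRun resolves the pod via taskRun.status.podName rather than
// guessing the pod name from the TaskRun name.
func podForTaskRun(ctx context.Context, tekton pipelineclient.Interface, kube kubernetes.Interface, ns, taskRunName string) (*corev1.Pod, error) {
	tr, err := tekton.TektonV1beta1().TaskRuns(ns).Get(ctx, taskRunName, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	return kube.CoreV1().Pods(ns).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
}
```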

> Now, RetriesStatus is solely informational. It serves no purpose to the TaskRun reconciler, so it can be removed from TaskRun.Status.

> How?

I was thinking the pipelinerun controller would keep track of the number of retries of a taskrun, and keep track of each taskrun created for an attempt. The taskrun controller wouldn't need to use retries status for anything. However, your comment is making me realize that what I had in mind probably doesn't work, because I'm not sure the pipelinerun controller can create a taskrun with a status already set.

> The pipelineRun controller creating a single taskRun and updating the status of that same taskRun (addRetryHistory followed by clearStatus) makes it possible to know the status of that pipelineTask by querying a single taskRun. It's absolutely reasonable to move this logic to the taskRun reconciler. But how?

I don't think the taskrun reconciler should handle retries or know anything about retries, and I don't think we should move this logic to the taskrun reconciler.

> I am proposing an additional field and labels to keep this mapping available, instead of constantly querying all the taskRuns in the cluster to identify which pipelineTask each belongs to and losing the retry count.

I would imagine that with the idea laid out here, the multiple taskruns would be referenced in the pipelinerun status, so you wouldn't have to query all the taskruns.
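
As a purely hypothetical illustration of what such a reference in the PipelineRun status could carry:

```go
package statussketch

// PipelineTaskAttempts is a hypothetical shape, not the actual PipelineRun
// API: the PipelineRun status would reference every TaskRun created for a
// pipelineTask, so the retry count is just the number of references and no
// cluster-wide TaskRun query is needed.
type PipelineTaskAttempts struct {
	PipelineTaskName string   // the pipelineTask these attempts belong to
	TaskRunNames     []string // one TaskRun per attempt, in order
}
```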

@XinruZhang
Member

Fixed by #5844

Repository owner moved this from Todo to Done in Tekton Community Roadmap Dec 21, 2022