
TaskRun retries: Improve separation of concerns between PipelineRun and TaskRun reconciler #5248

Closed
lbernick opened this issue Aug 1, 2022 · 11 comments

@lbernick
Member

lbernick commented Aug 1, 2022

Today, retries in Pipeline Tasks are implemented via the following process:

  • If a TaskRun fails and the Pipeline Task has retries specified, the PipelineRun controller will clear the status of the TaskRun, add its old status to taskRun.status.retriesStatus, and mark it as running. (happens here)
  • The TaskRun reconciler will use the length of its RetriesStatus to determine what to name the pod. (happens here)

This pattern is a bit awkward because it means both reconcilers are partially responsible for implementing retries, when only the reconciler for a given CRD should be the one updating that CRD's status.
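
For concreteness, here is a rough sketch (not the actual Tekton source) of the pattern described above, assuming the v1beta1 TaskRun API; the helper name is made up:

```go
package reconciler

import (
	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// markForRetry illustrates the current pattern: the PipelineRun reconciler
// archives the failed attempt into taskRun.status.retriesStatus and clears
// the fields that make the TaskRun look finished, so it runs again.
func markForRetry(tr *v1beta1.TaskRun) {
	// Keep a copy of the failed attempt (without its own retry history).
	prev := *tr.Status.DeepCopy()
	prev.RetriesStatus = nil
	tr.Status.RetriesStatus = append(tr.Status.RetriesStatus, prev)

	// Reset the live status; the TaskRun reconciler will later derive the
	// next pod name from len(tr.Status.RetriesStatus).
	tr.Status.StartTime = nil
	tr.Status.CompletionTime = nil
	tr.Status.PodName = ""
}
```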

One way around this would be to create a new TaskRun for each retry instead. This was discussed as a potential implementation strategy when retries were created (in issue 221, initial PR 510, and final PR 658), but it's not clear why the final decision went the way it did.

If we'd like to go this route, here's what I think we should do:

  • update the PipelineRun reconciler to create a new TaskRun for each retry, without changing the old TaskRun's status
  • The new TaskRun can still have RetriesStatus based on the status of the old TaskRun.
  • Because the new TaskRun will need a different name, it will create a different pod name anyway, so the logic around naming the pod based on the length of the retries status can be deleted.
  • Now, RetriesStatus is solely informational. It serves no purpose to the TaskRun reconciler, so it can be removed from TaskRun.Status.
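
Roughly, the PipelineRun reconciler's side of this could look like the following hypothetical sketch; the naming scheme and helper are illustrative only, not an agreed design:

```go
package reconciler

import (
	"fmt"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// newRetryTaskRun sketches the "new TaskRun per retry" option: instead of
// rewriting the failed TaskRun's status, the PipelineRun reconciler creates
// a fresh TaskRun for the next attempt.
func newRetryTaskRun(failed *v1beta1.TaskRun, attempt int) *v1beta1.TaskRun {
	retry := failed.DeepCopy()
	// A distinct name yields a distinct pod name on its own, so the
	// pod-naming-by-retry-count logic in the TaskRun reconciler goes away.
	retry.Name = fmt.Sprintf("%s-retry%d", failed.Name, attempt)
	retry.ResourceVersion = ""
	retry.UID = ""
	// Start from a clean status; only the TaskRun reconciler updates it.
	retry.Status = v1beta1.TaskRunStatus{}
	return retry
}
```
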
@XinruZhang
Member

/assign XinruZhang

@lbernick
Member Author

@XinruZhang I was just talking about this with @jerop: An equally valid way to support TaskRun retries and have separation of concerns is to add retries to taskrun.spec, and have the TaskRun reconciler implement retries. The advantage of this strategy is that we'd be consistent with how we handle retries for Runs. This might need a bit more discussion before you start implementing -- sorry!
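
For illustration only, the new field might look something like this (a sketch, not the final API shape):

```go
package sketch

// TaskRunSpec here sketches the alternative: a retries count declared on the
// TaskRun spec itself and acted on by the TaskRun reconciler, mirroring how
// retries are expressed for Runs. Field name and placement are illustrative.
type TaskRunSpec struct {
	// Retries is the number of times to retry this TaskRun after a failure.
	Retries int `json:"retries,omitempty"`

	// ... the rest of the existing TaskRunSpec fields ...
}
```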

@XinruZhang
Member

Thanks @lbernick for bringing up this issue! Indeed the behavior here is a little bit weird 😅. I'm more than happy to discuss it more :)

I totally agree this is really about which controller should be responsible for taking care of the retries functionality.

I agree either option would make sense here because it decouples the two reconcilers on this functionality. I'm leaning a bit towards the latter one -- adding retries to taskrun.spec -- because it kills two birds with one stone: both TaskRuns and Pipeline Tasks would be retriable after this implementation.

@lbernick
Member Author

sg! @afrittoli @abayer @pritidesai just want to make sure you don't have any concerns about adding retries to taskRun.spec-- do you think this needs a TEP? (From talking to @jerop I think this is her preferred solution as well but ofc feel free to comment with any concerns)

@abayer
Contributor

abayer commented Aug 11, 2022

I'm in favor of this approach. It does kinda feel like it deserves a TEP, but it's borderline to me.

@vdemeester
Member

Agreeing with @abayer, I think it deserves a TEP 👼🏼

@XinruZhang
Member

Thanks for everyone's input! I'll write a TEP for the new retries field XD.

@afrittoli
Member

afrittoli commented Oct 3, 2022

@pritidesai
Member

Thanks @lbernick!

> This pattern is a bit awkward because it means both reconcilers are partially responsible for implementing retries

It feels awkward because there is no concept of a pipelineTask reconciler. For each pipelineTask, the pipelineRun controller creates a taskRun. In case of retries, the pipelineRun controller creates a new attempt which is embedded in the existing taskRun under taskRun.status.

> only the reconciler for a given CRD should be the one updating that CRD's status.

+1

> update the PipelineRun reconciler to create a new TaskRun for each retry, without changing the old TaskRun's status

+1

> Because the new TaskRun will need a different name, it will create a different pod name anyway, so the logic around naming the pod based on the length of the retries status can be deleted.

I find it very useful to have the pod name based on the length of the retries. The retry count signifies how many attempts were executed for a particular pipelineTask. Out of thousands of pods created, the pods belonging to a pipelineTask can be identified without querying the labels of each pod.

> Now, RetriesStatus is solely informational. It serves no purpose to the TaskRun reconciler, so it can be removed from TaskRun.Status.

How?

The pipelineRun controller creating a single taskRun and updating the status of that same taskRun (addRetryHistory followed by clearStatus) makes it possible to know the status of that pipelineTask by querying a single taskRun. It's absolutely reasonable to move this logic to the taskRun reconciler. But how? I am proposing an additional field and labels to keep this mapping available, instead of constantly querying all the taskRuns in the cluster to identify which pipelineTask each belongs to and losing the retry count.
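
Purely to illustrate the "additional field and labels" idea (the tekton.dev/retry-attempt key below is hypothetical; tekton.dev/pipelineTask is the existing pipelineTask label):

```go
package reconciler

import (
	"strconv"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// labelAttempt sketches the proposal above: keep the pipelineTask mapping
// and the attempt number on each TaskRun as labels, so a pipelineTask's
// attempts can be found with a label selector instead of scanning every
// TaskRun in the cluster. The retry-attempt key is hypothetical.
func labelAttempt(tr *v1beta1.TaskRun, pipelineTaskName string, attempt int) {
	if tr.Labels == nil {
		tr.Labels = map[string]string{}
	}
	tr.Labels["tekton.dev/pipelineTask"] = pipelineTaskName       // existing label
	tr.Labels["tekton.dev/retry-attempt"] = strconv.Itoa(attempt) // hypothetical
}
```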

@lbernick
Member Author

> Because the new TaskRun will need a different name, it will create a different pod name anyway, so the logic around naming the pod based on the length of the retries status can be deleted.

> I find it very useful to have the pod name based on the length of the retries. The retry count signifies how many attempts were executed for a particular pipelineTask. Out of thousands of pods created, the pods belonging to a pipelineTask can be identified without querying the labels of each pod.

Pod naming for taskruns isn't part of our API -- we don't make any guarantees about pod naming not changing, and I don't think we should. Are other projects getting the pod associated with a taskrun by making an API call for a pod named taskrunname-pod, or are they using taskrun.status.podName? Maybe you could give a bit more detail on how the pod name is being used?
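
For reference, a hedged sketch of the podName route, assuming the generated Tekton and Kubernetes clientsets:

```go
package client

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"

	pipelineclient "github.com/tektoncd/pipeline/pkg/client/clientset/versioned"
)

// podForTaskRun resolves the pod via taskRun.status.podName rather than
// guessing the pod name from the TaskRun name.
func podForTaskRun(ctx context.Context, tekton pipelineclient.Interface, kube kubernetes.Interface, ns, taskRunName string) (*corev1.Pod, error) {
	tr, err := tekton.TektonV1beta1().TaskRuns(ns).Get(ctx, taskRunName, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	return kube.CoreV1().Pods(ns).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
}
```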

> Now, RetriesStatus is solely informational. It serves no purpose to the TaskRun reconciler, so it can be removed from TaskRun.Status.

> How?

I was thinking the pipelinerun controller would keep track of the number of retries of a taskrun, and keep track of each taskrun created for an attempt. The taskrun controller wouldn't need to use retries status for anything. However, your comment is making me realize that what I had in mind probably doesn't work, because I'm not sure the pipelinerun controller can create a taskrun with a status already set.

> The pipelineRun controller creating a single taskRun and updating the status of that same taskRun (addRetryHistory followed by clearStatus) makes it possible to know the status of that pipelineTask by querying a single taskRun. It's absolutely reasonable to move this logic to the taskRun reconciler. But how?

I don't think the taskrun reconciler should handle retries or know anything about retries, and I don't think we should move this logic to the taskrun reconciler.

> I am proposing an additional field and labels to keep this mapping available, instead of constantly querying all the taskRuns in the cluster to identify which pipelineTask each belongs to and losing the retry count.

I would imagine that with the idea laid out here, the multiple taskruns would be referenced in the pipelinerun status, so you wouldn't have to query all the taskruns.
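
As a purely hypothetical illustration of what such a reference in the PipelineRun status could carry:

```go
package statussketch

// PipelineTaskAttempts is a hypothetical shape, not the actual PipelineRun
// API: the PipelineRun status would reference every TaskRun created for a
// pipelineTask, so the retry count is just the number of references and no
// cluster-wide TaskRun query is needed.
type PipelineTaskAttempts struct {
	PipelineTaskName string   // the pipelineTask these attempts belong to
	TaskRunNames     []string // one TaskRun per attempt, in order
}
```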

@XinruZhang
Member

Fixed by #5844

Repository owner moved this from Todo to Done in Tekton Community Roadmap Dec 21, 2022