Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP-0004: Tekton Integrations - Annotations and Statuses #147

Closed
wants to merge 7 commits into from

Conversation

wlynch
Copy link
Member

@wlynch wlynch commented Jul 13, 2020

TEP proposal for configuring post-pipeline integrations for Tekton
Pipelines via annotations and statuses.

/cc @afrittoli @dlorenc @imjasonh @bobcatfish @dibyom @bigkevmcd

@tekton-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign imjasonh
You can assign the PR to them by writing /assign @imjasonh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jul 13, 2020
TEP proposal for configuring post-pipeline integrations for Tekton
Pipelines via annotations and statuses.
@wlynch wlynch changed the title TEP-0004: Tekton Integrations TEP-0004: Tekton Integrations - Annotations and Statuses Jul 14, 2020
@bobcatfish
Copy link
Contributor

Update from discussion offline: have been discussing similarities b/w this and tektoncd/pipeline#1740, might make sense to have this proposal directly tackle notification related requirements

@imjasonh
Copy link
Member

Do we need to do something to ensure that updating the Run with new integration update status doesn't trigger new reconciliations for all watchers?

For example, a Run starts, and the GitHub status reconciler is notified, and it POSTs to GitHub and updates .status.integrations[name=github] to indicate that GitHub was notified of the status change. This results in a new version of the object (new observedGeneration) which means the GitHub status reconciler gets another update, etc., ...

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that is "half" clear for me is what this TEP propose overall:

  • standard reserved annotation *.integration.tekton.dev
  • a new integration status field (optional)
  • more than this ? (who is responsible for what, …)

As commented (below), I don't see too much value in the integration status part of the *Run status. Or at least I don't see its value — other than "from a *Run what integration did run ; but I don't see too much use case for this (enlighten me please 👼), I don't see why we would "couple" integration with the *Runs.

More generic comments:

  • We could use we instead of I 😛
  • I think we can, most of the time, use the present tense (instead of the future), as, if the TEP gets accepted and implementable, it will be "the present" (aka what it's doing).

Comment on lines 230 to 246
**Examples:**

- GitHub
```
github.integrations.tekton.dev/owner: wlynch
github.integrations.tekton.dev/repo: test
github.integrations.tekton.dev/sha: 4d2f673c600a0f08c3f6737bc3f5c6a6a15d62e6
```
- Webhook
```
webhook.integrations.tekton.dev/url: https://example.com:8080
```
- Slack
```
slack.integrations.tekton.dev/url: https://tektoncd.slack.com
slack.integrations.tekton.dev/channel: build-cop
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who is responsible for adding those ? Are those annotation provided by the user (or something like triggers, or both) or by the integration, or by both ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, my expectation is the integration would provide the annotation it supports in its documentation, then users would set these fields in their Runs (filling them in with params / TriggerTemplates).

Eventually, I'd like to get to a point where Triggers can do this automatically for users. Initially via ObjectMeta templating, but eventually via interceptors (or possibly some other mechanism) that can also be provided by an integration. e.g. If we know that a Trigger is for GitHub events, and we can filter for these types of events, we should be able to provide the relevant annotations to configure related integration functionality automatically. If users want to configure one-off builds for testing, they can still set the annotations manually.

This also designed in a way that other non-Tekton Trigger mechanisms could also leverage this if they wanted, or we can even mix and match mechanisms to trigger builds and related integrations (e.g. want to build a slack bot that retests pull requests? just set these annotations and you're good to go).

I think the exact details of how Triggers implements these features is a bit hand-wavy at the moment since a lot of it is work actively being discussed / designed, so I didn't want to go too deep in this proposal. The main goal here is to carve out the areas within the Runs API to define this data.

Comment on lines 265 to 269
- What integration the status is for (e.g. `github.integrations.tekton.dev`).
- What object generation the status is for.
- What was the outcome of the integration for this object (e.g. success,
failure).
- (optional) More details of status outcome.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to keep in mind is that TaskRun and PipelineRun are more or less "immutable", so I am not sure how fully useful the object generation will be as… except from the cancel field in spec, the rest of the spec won't change.

Copy link
Member Author

@wlynch wlynch Jul 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For TaskRun and PipelineRun spec, I 100% agree. status is where I see room for flexibility (since these are mutated during Pipeline execution anyway).

Re object/observed generation: I'm not particularly tied to this. I'm just borrowing this from our existing TaskRunStatus type which uses the Knative Status duck type.

1. Push a commit, notice pull request wasn't updated for a run.
2. Look for all Runs with Integration annotation for the repo / commit (if
empty, then you know there was no build kicked off).
3. Find latest Run, `kubectl get` it, inspect integrations status field(s) for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only works when you got access to the cluster (which, might "usually" not be the case)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but I think it's a reasonable assumption to say that users should be able to get at the Run object (i.e. if not through the k8s API, by Tekton Results), and that this will generally be a common pattern for debugging.

Updated to mention results / alternative mechanisms.

pipeline run itself. This gives us an easy mechanism for users to do the
following:

1. Push a commit, notice pull request wasn't updated for a run.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independently of this (but maybe related), an integration should ensure to do the proper thing whenever the *Run is created. e.g. for GitHub, as soon as the *Run is created, the checks that this *Run provides should be reseted (to pending). But then, the question is how do you know which checks this *Run provides ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an integration should ensure to do the proper thing whenever the *Run is created

Agreed. There should definitely be some common behavior here to reset the state. As a first pass, I think clearing out the integration data should be sufficient, and should allow integrations to pick up a Run as if it was new.

But then, the question is how do you know which checks this *Run provides ?

I don't think we can or should enforce that the Run must declare the integrations it provides (though it declare some integrations in line if it wants to). This is to support integrations that are enforced automatically for users (e.g. CloudEvents, Results).

teps/0004-tekton-integrations-annotations-statuses.md Outdated Show resolved Hide resolved
retryCount: 1
```

IntegrationStatus:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first tmie we introduce this type / field — the detail about it is lower, so it feels a bit weird (because at that point of reading, I don't know what is this type, where it is and what's is "goal".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree this is weird. Rearranged this a bit to not reference the type exactly until later. Let me know if this makes more sense or if I should make more changes.

Comment on lines 470 to 474
Similar to `TaskRunResults` and `PipelineResourceResults`, integration statuses
are another form of external artifact generated as a result of the Pipeline.
Because of this, it makes sense to store these statuses as a part of the
Pipeline/TaskRun status, so that it is easy to find and see this data when
something goes wrong.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not entirely sure about this. TaskRunResults and PipelineResourceResults are, by definition, results that are coming from the execution of the Task/Pipeline (aka what the task is supposed to do, like clone a git repository, …) through TaskRun/PipelineRun) This would not be the case with integration, as they do not necessary result from the execution of it (aka what the task/pipeline does), but only if it executed or not, …

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt that this section was redundant with User Stories, so I took it out for the time being.

The reason why I feel this belongs in the Run status is because these integration actions are directly caused by the Run itself. Storing related data with it helps us couple the data to the object lifecycle, and makes it easy for users to get this information regardless of the tool they use.

I found out that this actually models an existing pattern of subresources (https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#types-kinds), so I added another section about this under design details.


## Design Details

### Integration Status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a little bit of trouble around this new field in status and how useful they are from the PipelineRun/TaskRun perspective. I would "tend" to think, that I would need to look at the "integration" service/dashboard/… to see what failed and dig from there.

In my head, I am comparing this to the "food supply tracability chain" 🙃. When you buy your cereals, you can trace back to where/when/how each element of your cereals got in (aka where was it planted, where was it transformed, by whom it was shipped, …). From the top, you can get to the bottom — but from the bottom you don't really have a way (nor the need) to get to the top. I see PipelineRun and TaskRun as "the bottom" in there (from the integration perspective).

Let's take GitHub status checks as examples, and let's assume a Github integration (à-la commit-status-tracker). I submit my PR, the checks appear as pending. If there is any failure (before the pod execution, during, …), it's the integration responsability to give the user the information back, pointing to what went wrong — and in there, which PipelineRun it is, it's state, how to dig into it, … The PipelineRun itself doesn't need to know what integration is watching it, and how it handles thing.

In addition, writing the status from the integration "part" means possible race, possible reconciling loop for all the watchers, etc…

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I actually agree that users will generally start from the integration dashboard and working their way back to the Run.

That said, I think providing information the other way is equally as important for investigating when things do not work as expected. For example, say the access token we are using for GitHub exceeded quota (or the token was revoked, GitHub is down, etc)- this could manifest as statuses never showing up in GitHub, so we lose that link from the integration back to the Run. We currently don't have a mechanism to let users know that something is wrong.
Today our answer is "go look at the controller logs for errors", but this isn't really a good solution if the user doesn't have access to the pipeline controller (and they likely won't if this is a multitenant environment). Having a place to say: "we tried to update the status for , but we got this error" within a Run helps integrations give users transparency about what is happening as a result of their Task/PipelineRun, even though these things aren't necessarily executing as part of their Run, regardless of the state of the external integration.

I concede that in cases where everything worked this is less meaningful, but I think still provides benefit for consistency to help users verify that something actually happened.

To use your cereal analogy - this is like distributors keeping a log of outgoing shipments to grocery stores, so that if stores claim they never received a shipment, they can prove when the shipment left the warehouse, what was inside, tracking numbers, etc.

Comment on lines 618 to 628
### Automatically injecting Tasks

This alternative keeps the idea of running user tasks inside of the Pipeline,
but we make it easier for users by automatically appending integration tasks to
the end of their pipeline / adding them as finally tasks.

This has a few drawbacks:

- We can't call tasks if the pipeline never runs.
- The user pipeline needs to have access to all secrets, which limits what we
can do with shared multitenant secrets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This describes what PipelineResource are doing 👼

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but with the caveats:

  • PipelineResources can't do anything if the Pipeline fails to start (e.g. invalid TaskRefs)
  • They can't do anything in response to "accepted, but not running yet" (e.g. GitHub CheckRuns make a distinction between Queued and In Progress pending statuses)
  • Requires users to insert all integration lifecycle steps (the eventual goal is for Triggers to make this easier for users).

Comment on lines +632 to +636
As an alternative to storing integration status information within a
Task/PipelineRun status, we could store this information in another CRD type
that references back to the run. The advantage of this is that it avoids making
an API change to the Task/PipelineRun object directly and the consequences that
may arise.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need another type for this. It would be on the integration itself to decide how it wants to store/display/… the status.

@bobcatfish
Copy link
Contributor

What do you think if we were to address the notifications feature before we implement this TEP and then revisit it?

This TEP is aiming to solve these problems:

  • If the pipeline fails to start because of a configuration error, the external
    integration is never notified.
  • If the pipeline fails a step and never reaches the notification step, the
    external integration is never notified.
  • If a transient error (e.g. GitHub is down) prevents the task from completing
    successfully, this may result in the pipeline failing.

My understanding of the notifications feature is that the notifications feature is all about creating integrations with external systems, but not just integrations: integrations with a level of importance greater than other integrations, such as Tasks within a Pipeline. Since notifications are about "notifying", i.e. telling the outside world about something that is happening, the problems you listed above feel to me like legit requirements for the notifications feature, e.g.:

  • "If the pipeline fails to start because of a configuration error, the external
    integration is never notified" = if a pipeline fails to start, notifications should still occur
  • "If the pipeline fails a step and never reaches the notification step, the
    external integration is never notified." = we don't really have a "notification step" yet since we have no notifications features, but again I think this goes back to the requirement above, notifications should occur even when other things fail. We might even want to go further and say that failure of a notification should not be tolerated, e.g. it should be retried, there should be an external alerting system triggered in this case
  • If a transient error (e.g. GitHub is down) prevents the task from completing successfully, this may result in the pipeline failing. = failures of notifications should not change the status of the pipeline / task execution they are notifying about

@wlynch
Copy link
Member Author

wlynch commented Jul 16, 2020

@n3wscott You might be interested in this proposal, given your involvement in kubernetes/kubernetes#72637 (comment) and https://docs.google.com/document/d/1b1hRpJpSUE3xIG0v2IHFD0NBaxvRBTBnw3An7oZBV9I 😄

@wlynch
Copy link
Member Author

wlynch commented Jul 16, 2020

@bobcatfish

What do you think if we were to address the notifications feature before we implement this TEP and then revisit it?

Depends on how notifications intend on handling status / receipt of delivery data. To me notifications is just a type of integration, so there's an argument to be made that this should happen soon rather than later so it's consistent with other integrations we add later on.

I need to learn more about the current state/plans of the notifications effort to have a more informed opinion about this.
(/cc @afrittoli It'd be great to chat more about this!)

This TEP is aiming to solve these problems:

  • If the pipeline fails to start because of a configuration error, the external
    integration is never notified.
  • If the pipeline fails a step and never reaches the notification step, the
    external integration is never notified.
  • If a transient error (e.g. GitHub is down) prevents the task from completing
    successfully, this may result in the pipeline failing.

I would say this TEP is aiming to define a space within the Run API to enable other tools to address these problems, but not necessarily solve them directly (but I 100% agree that we should solve them). Even if we're only focused on in-pipeline notifications/actions for the time being, I think there's a benefit for having a space within the Run status / subresource to allow writing of status information and its effect on external services.

I'll look into adding a subsection to detail a more concrete example that can leverage this design (likely PR status updates and/or uploading of Tekton Results), but I don't want to get too bogged down by the exact details of the example.

@bobcatfish
Copy link
Contributor

To me notifications is just a type of integration, so there's an argument to be made that this should happen soon rather than later so it's consistent with other integrations we add later on.

Without a specific "integration" to talk about, I find it very hard to endorse this proposal.

If you start looking at other parts of Tekton Pipelines there are a lot of pieces you could call "integrations", for example:

  • Triggers are "integrations" with outside sources of triggering
  • Emitting cloud events enables "integration" with whatever is consuming the cloud events
  • PipelineResources (which sometimes we talk about renaming as "plugins") "integrate" with the outside world by 'representing it on disk'
  • Custom Tasks allow "integration" with other non-tekton systems, e.g. writing a custom controller that interacts with GCB
  • Tasks themselves allow "integration" with a many of the common integrations listed above, e.g. code review tools, chat tools, bug trackers

So this proposal being about "integrations" in general isn't clear to me.

I don't think this is a direction we should go in until we have a clear need for it, which I don't see. Any tools outside of Tekton can store information in whatever way they see fit; I'm not convinced they need to be able to mutate Tekton types to do that.

One situation which I think would convince me is if we built a tool inside of Tekton that needs it, for example notifications, and I think you've highlighted some very important requirements for that feature, but I don't think it makes sense to address those requirements on their own.

@bobcatfish
Copy link
Contributor

Any tools outside of Tekton can store information in whatever way they see fit; I'm not convinced they need to be able to mutate Tekton types to do that.

Side note that the annotations part of the proposal is something that a tool can do today, without requiring us to explicitly accept a TEP about it.

update Tasks into their Pipelines. In practice, this ends up not being a good
mechanism for ensuring that external integration notifications always occur, for
example:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+100 to all of these problems!

Annotations / how to run integrations is a separate problem, and can
come later. This helps address shorter term ideas with existing
Notifications / CloudEvent work, while still enabling additional
integrations to build on top of it independently.
@wlynch
Copy link
Member Author

wlynch commented Jul 20, 2020

Hi everyone! After reading into more of the current CloudEvents implementation, I decided to table the annotations / reconciler bits included in this proposal and instead focus on just integration statuses and the discoverability of this information in Runs.

The CloudEvents implementation actually already follows much of the reconciler pattern previously described for some of the concerns raised above (e.g. to run on Pipeline failure) since it's currently baked into the Pipelines controller, with a separate discussion to pull it out into its own controller. I think the piece still missing here is exposing and making this information discoverable to Runs in a way that can also be used for other integrations.
I've also tried to add more context into this proposal where this fits in to existing work with Notifications/Actions/CloudEvents.

I'll likely raise the GitHub annotations/reconciler bits again in the future as a different TEP, reframed towards improving UX for GitHub CI statuses. 😃

@wlynch
Copy link
Member Author

wlynch commented Jul 21, 2020

After some more discussion with @bobcatfish, I'm going to close this proposal. It's still early days for notifications / integrations, and it's not necessarily clear that modifying the Run status / using sub-resources is necessary / the best way to go about representing this.

I'll continue to think about how we surface this kind of information (e.g. whether a webhook/status update happened successfully), and see where we can fit this in with other notification efforts. Stay tuned!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants