
Start measuring Tekton Pipelines performance #540

Open

bobcatfish opened this issue Feb 21, 2019 · 10 comments
Labels

  • area/roadmap: Issues that are part of the project (or organization) roadmap (usually an epic)
  • design: This task is about creating and discussing a design
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • meaty-juicy-coding-work: This task is mostly about implementation!!! And docs and tests of course but that's a given
  • okr: This is for some internal Google project tracking
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@bobcatfish (Collaborator) commented Feb 21, 2019

Expected Behavior

We should be measuring performance for Pipelines. This task includes both adding the actual measurement mechanism and designing what exactly we want to measure.

Some ideas for measurement:

  • Null Task / null Pipeline (i.e. it doesn't actually do anything; see the sketch after this list)
  • Null Tasks that have linked inputs and outputs
  • Stress testing (make recommendations about cluster size)
  • ?
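
For concreteness, a "null" TaskRun could be a single step that exits immediately, so almost all of the measured latency is controller and scheduling overhead rather than workload. A minimal sketch (the field names follow the TaskRun spec; the API version, step image, and names here are placeholder choices):

```go
// nullTaskRun builds a TaskRun whose only step exits immediately, so any
// measured duration is almost entirely Tekton/Kubernetes overhead.
// Sketch only: apiVersion, image, and names are placeholder choices.
package perf

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func nullTaskRun(name string) *unstructured.Unstructured {
	return &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "tekton.dev/v1beta1",
		"kind":       "TaskRun",
		"metadata":   map[string]interface{}{"name": name},
		"spec": map[string]interface{}{
			"taskSpec": map[string]interface{}{
				"steps": []interface{}{
					map[string]interface{}{
						"name":    "noop",
						"image":   "busybox",
						"command": []interface{}{"true"}, // do nothing, exit 0
					},
				},
			},
		},
	}}
}
```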

Requirements

  • We should have a set of "happy SLOs" defined for Task and Pipeline execution
  • We should be regularly measuring these SLOs
  • Maintainers should be made aware when we are in violation of these SLOs

Actual Behavior

We do not measure or track this.

Additional Info

@bobcatfish added the design and meaty-juicy-coding-work labels Feb 21, 2019
@bobcatfish added the okr label Feb 21, 2019
@pradeepitm12

Hello @bobcatfish
Need your thoughts on this.
1- A service outside of tekton that watches tekton objects and exposes them to Prometheus.
2- Introduce an endpoint in the tekton pipeline itself to expose all the metrics to Prometheus (rough sketch below).
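
A minimal sketch of what option 2 could look like, assuming prometheus/client_golang (the project might equally use knative's `pkg/metrics`, as later commits on this issue do); the metric name and port are placeholders:

```go
// Option 2 sketch: the controller process itself serves Prometheus-format
// metrics on /metrics. Metric name and port are placeholders.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var taskRunDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "tekton_taskrun_duration_seconds", // placeholder name
		Help:    "Time from TaskRun creation to completion.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"status"},
)

func main() {
	prometheus.MustRegister(taskRunDuration)

	// The reconciler would observe a value on TaskRun completion, e.g.:
	// taskRunDuration.WithLabelValues("succeeded").Observe(d.Seconds())

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```

Nothing in this shape forces Prometheus on users: anything that speaks the scrape protocol can collect from the endpoint.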

@bobcatfish (Collaborator, Author)

My gut feeling is that I'd lean more toward exposing the metrics from Pipelines itself:

2- Introduce an endpoint in the tekton pipeline itself to expose all the metrics to Prometheus.

Question: I'm not super familiar with Prometheus, how vital would it be to making metrics usable? Could we simply emit the metrics and allow the user to provide their own metrics-gathering mechanism (which could be Prometheus but could be something else), or would it make more sense for us to include Prometheus out of the box? (I'm very sensitive to adding new dependencies, esp. since I'm under the impression that managing Prometheus is a job in itself, but maybe I'm wrong!)

Another option, which I think is a variation on your first suggestion @pradeepitm12 :
3 - (For now) only measure the performance in tests we write specifically for this purpose (i.e. we don't expose anything new for users of Tekton Pipelines, but we start doing our own measurements)
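
A sketch of what option 3 might look like in practice: a test that creates a throwaway "null" TaskRun with the Kubernetes dynamic client and measures wall-clock time until it reaches a terminal state. The GroupVersionResource is Tekton's; the namespace, polling interval, and helper names are test scaffolding invented for illustration:

```go
// Option 3 sketch: a dedicated performance test that times a TaskRun end
// to end. Assumes a dynamic.Interface built from the test cluster's
// kubeconfig; polling interval and helper names are placeholders.
package perf

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var taskRunGVR = schema.GroupVersionResource{
	Group: "tekton.dev", Version: "v1beta1", Resource: "taskruns",
}

// timeTaskRun creates tr and returns how long it took to finish.
func timeTaskRun(ctx context.Context, dc dynamic.Interface, ns string, tr *unstructured.Unstructured) (time.Duration, error) {
	start := time.Now()
	created, err := dc.Resource(taskRunGVR).Namespace(ns).Create(ctx, tr, metav1.CreateOptions{})
	if err != nil {
		return 0, err
	}
	for {
		got, err := dc.Resource(taskRunGVR).Namespace(ns).Get(ctx, created.GetName(), metav1.GetOptions{})
		if err != nil {
			return 0, err
		}
		// A TaskRun is terminal once its "Succeeded" condition is True or False.
		conds, _, _ := unstructured.NestedSlice(got.Object, "status", "conditions")
		for _, ci := range conds {
			cond, _ := ci.(map[string]interface{})
			if cond["type"] == "Succeeded" && cond["status"] != "Unknown" {
				return time.Since(start), nil
			}
		}
		select {
		case <-ctx.Done():
			return 0, ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```

Run repeatedly against null TaskRuns, the resulting durations would give a baseline distribution to hang SLOs off, without exposing anything new to users.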

@bobcatfish removed this from the Pipelines 0.2 🎉 🎉 🎉 milestone Apr 25, 2019
@rawlingsj (Contributor) commented May 17, 2019

+1 we're looking at the same thing and just started looking at prometheus too, hopefully we can help each other out here.

📈

@bobcatfish (Collaborator, Author)

+1 we're looking at the same thing and just started looking at prometheus too, hopefully we can help each other out here.

Maybe the first thing to do would be to identify the metrics we're interested in? I'm not super familiar with prometheus but I would think before we want to monitor the metrics, we'd want to figure out what needs monitoring (maybe there's a Jenkins/Jenkins X precedent we can draw on :D?)

@ghost commented Sep 19, 2019

We had our first meeting regarding observability, specifically metrics, today and work is now underway. There are a couple of other issues that overlap in theme with this one. I am linking them together here for us to review later and figure out which to keep and which to close.

Related issues:
#164
#540
#855

Metrics Design Doc

Notes from the initial metrics meeting

hrishin added a commit to hrishin/tekton-pipeline that referenced this issue Oct 7, 2019
Often, as a developer or administrator (ops) I want some insights
about pipeline behavior in terms of the time taken to execute a pipelinerun/taskrun,
its success or failure ratio, pod latencies, etc.
At present tekton pipelines has very limited ways to surface such information,
or it's hard to get those details by looking at resource yamls.

This patch exposes the above-mentioned pipeline metrics on a '/metrics'
endpoint using the knative `pkg/metrics` package. Users can collect such
metrics using prometheus, stackdriver or another supported metrics system.

To some extent it solves
 - tektoncd#540
 - tektoncd#164
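
For readers following along, the knative `pkg/metrics` approach in that patch builds on OpenCensus measures and views. A rough sketch of the shape of that plumbing (the measure name, bucket boundaries, and helper function are placeholders, not the ones the patch uses):

```go
// Rough shape of the metrics plumbing described in the commit above, using
// the OpenCensus APIs that knative's pkg/metrics builds on. The measure
// name, bucket boundaries, and helper function are placeholders.
package metrics

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

// A measure is a raw measurement; a view aggregates it into an exportable
// metric (e.g. a Prometheus histogram via whatever exporter is registered).
var trDurationMs = stats.Float64(
	"taskrun_duration_ms", // placeholder measure name
	"Time from TaskRun start to completion in milliseconds",
	stats.UnitMilliseconds,
)

func init() {
	if err := view.Register(&view.View{
		Measure:     trDurationMs,
		Aggregation: view.Distribution(100, 1000, 10000, 60000, 300000),
	}); err != nil {
		panic(err)
	}
}

// RecordTaskRunDuration would be called by the reconciler when a TaskRun
// reaches a terminal state.
func RecordTaskRunDuration(ctx context.Context, ms float64) {
	stats.Record(ctx, trDurationMs.M(ms))
}
```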
hrishin added seven more commits to hrishin/tekton-pipeline that referenced this issue between Oct 7 and Oct 17, 2019, each with the same commit message as above.

@hrishin mentioned this issue Oct 7, 2019
3 tasks

tekton-robot pushed a commit that referenced this issue Oct 17, 2019, with the same commit message (referencing #540 and #164).
@tekton-robot (Collaborator)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/stale label Aug 12, 2020
@bobcatfish (Collaborator, Author)

We haven't worked on this lately but it is an item in our roadmap and I think we should keep it open.

/lifecycle frozen

@tekton-robot added the lifecycle/frozen label and removed the lifecycle/stale label Aug 12, 2020
@bobcatfish added the area/roadmap label Aug 24, 2020
chmouel pushed a commit to chmouel/tektoncd-pipeline that referenced this issue Oct 7, 2020
Use patch -p1 instead of git am to apply patch
@bobcatfish self-assigned this Nov 11, 2020
@bobcatfish (Collaborator, Author)

I want to start gathering some requirements around this and get it moving :D

@bobcatfish (Collaborator, Author)

#3521 has some use cases that we might be able to use

bobcatfish added a commit to bobcatfish/community that referenced this issue Nov 20, 2020
This PR starts a TEP to begin to measure tekton pipelines performance
and address tektoncd/pipeline#540

This first iteration just tries to describe the problem vs suggesting
the solution.

It DOES recommend measuring SLOs and SLIs as a goal, which is kind of
part of the solution, so if we think it's useful we could step back even
further, but I think this is a reasonable path forward, curious what
other folks think!
bobcatfish added three more commits to bobcatfish/community that referenced this issue on Dec 1, 2020 and Jan 4, 2021, each with the same commit message as above.
@bobcatfish removed their assignment Jan 4, 2021
tekton-robot pushed a commit to tektoncd/community that referenced this issue Jan 6, 2021, with the same commit message.
@mengjieli0726

@bobcatfish, is there a Tekton performance white paper so far? For example, how many PipelineRuns or Runs can we support in a mid-size cluster (e.g. 1 master + 1 compute node, where each node has 8 cores, 64 GB memory, and a 250 GB disk)?

7 participants