Add Metrics for Running PipelineRuns at Pipeline and Namespace level #8280

Merged · 1 commit · Oct 17, 2024
1 change: 1 addition & 0 deletions config/config-observability.yaml
@@ -59,3 +59,4 @@ data:
metrics.pipelinerun.level: "pipeline"
metrics.pipelinerun.duration-type: "histogram"
metrics.count.enable-reason: "false"
metrics.running-pipelinerun.level: ""
31 changes: 18 additions & 13 deletions docs/metrics.md
@@ -41,26 +41,31 @@ A sample config-map has been provided as [config-observability](./../config/conf
metrics.taskrun.level: "task"
metrics.taskrun.duration-type: "histogram"
metrics.pipelinerun.level: "pipeline"
metrics.running-pipelinerun.level: ""
metrics.pipelinerun.duration-type: "histogram"
metrics.count.enable-reason: "false"
```
The following values are available in the configmap:
| configmap data | value | description |
| -- | ----------- | ----------- |
| metrics.taskrun.level | `taskrun` | Level of metrics is taskrun |
| metrics.taskrun.level | `task` | Level of metrics is task and taskrun label isn't present in the metrics |
| metrics.taskrun.level | `namespace` | Level of metrics is namespace, and task and taskrun label isn't present in the metrics
| metrics.pipelinerun.level | `pipelinerun` | Level of metrics is pipelinerun |
| metrics.pipelinerun.level | `pipeline` | Level of metrics is pipeline and pipelinerun label isn't present in the metrics |
| metrics.pipelinerun.level | `namespace` | Level of metrics is namespace, pipeline and pipelinerun label isn't present in the metrics |
| metrics.taskrun.duration-type | `histogram` | `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds` and `tekton_pipelines_controller_taskrun_duration_seconds` is of type histogram |
| configmap data | value | description |
| -- | ----------- |--------------------------------------------------------------------------------------------------------------------------------------------------------------|
| metrics.taskrun.level | `taskrun` | Level of metrics is taskrun |
| metrics.taskrun.level | `task` | Level of metrics is task and taskrun label isn't present in the metrics |
| metrics.taskrun.level | `namespace` | Level of metrics is namespace, and task and taskrun label isn't present in the metrics |
| metrics.pipelinerun.level | `pipelinerun` | Level of metrics is pipelinerun |
| metrics.pipelinerun.level | `pipeline` | Level of metrics is pipeline and pipelinerun label isn't present in the metrics |
| metrics.pipelinerun.level | `namespace` | Level of metrics is namespace, pipeline and pipelinerun label isn't present in the metrics |
| metrics.running-pipelinerun.level | `pipelinerun` | Level of running-pipelinerun metrics is pipelinerun |
| metrics.running-pipelinerun.level | `pipeline` | Level of running-pipelinerun metrics is pipeline and pipelinerun label isn't present in the metrics |
| metrics.running-pipelinerun.level | `namespace` | Level of running-pipelinerun metrics is namespace, pipeline and pipelinerun label isn't present in the metrics |
| metrics.running-pipelinerun.level | `` | Level of running-pipelinerun metrics is cluster; namespace, pipeline and pipelinerun labels aren't present in the metrics |
| metrics.taskrun.duration-type | `histogram` | `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds` and `tekton_pipelines_controller_taskrun_duration_seconds` is of type histogram |
| metrics.taskrun.duration-type | `lastvalue` | `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds` and `tekton_pipelines_controller_taskrun_duration_seconds` is of type gauge or lastvalue |
| metrics.pipelinerun.duration-type | `histogram` | `tekton_pipelines_controller_pipelinerun_duration_seconds` is of type histogram |
| metrics.pipelinerun.duration-type | `lastvalue` | `tekton_pipelines_controller_pipelinerun_duration_seconds` is of type gauge or lastvalue |
| metrics.count.enable-reason | `false` | Sets if the `reason` label should be included on count metrics |
| metrics.taskrun.throttle.enable-namespace | `false` | Sets if the `namespace` label should be included on the `tekton_pipelines_controller_running_taskruns_throttled_by_quota` metric |
| metrics.pipelinerun.duration-type | `histogram` | `tekton_pipelines_controller_pipelinerun_duration_seconds` is of type histogram |
| metrics.pipelinerun.duration-type | `lastvalue` | `tekton_pipelines_controller_pipelinerun_duration_seconds` is of type gauge or lastvalue |
| metrics.count.enable-reason | `false` | Sets if the `reason` label should be included on count metrics |
| metrics.taskrun.throttle.enable-namespace | `false` | Sets if the `namespace` label should be included on the `tekton_pipelines_controller_running_taskruns_throttled_by_quota` metric |

Histogram value isn't available when pipelinerun or taskrun labels are selected; the Lastvalue or Gauge type is provided instead. A histogram would serve no purpose because it would generate a single bar. TaskRun and PipelineRun level metrics aren't recommended because they lead to unbounded cardinality, which degrades the observability database.
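To make the effect of `metrics.running-pipelinerun.level` concrete, here is a minimal sketch in plain Go of how each level maps to the tag keys attached to the running PipelineRuns gauge. It mirrors the switch added to `viewRegister` in this PR, but the helper name `runningPRTagKeys` is illustrative and not part of the Tekton API:

```go
package main

import "fmt"

// runningPRTagKeys is an illustrative helper (not part of Tekton) that mirrors
// the switch added to viewRegister in this PR: it returns the tag keys carried
// by the running PipelineRuns gauge for a given configured level.
func runningPRTagKeys(level string) []string {
	switch level {
	case "pipelinerun":
		return []string{"pipelinerun", "pipeline", "namespace"}
	case "pipeline":
		return []string{"pipeline", "namespace"}
	case "namespace":
		return []string{"namespace"}
	default: // "" means cluster level: a single series with no extra labels
		return nil
	}
}

func main() {
	for _, level := range []string{"pipelinerun", "pipeline", "namespace", ""} {
		fmt.Printf("level=%q -> tags=%v\n", level, runningPRTagKeys(level))
	}
}
```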

11 changes: 11 additions & 0 deletions pkg/apis/config/metrics.go
@@ -29,6 +29,9 @@ const (
// metricsPipelinerunLevel determines to what level to aggregate metrics
// for pipelinerun
metricsPipelinerunLevelKey = "metrics.pipelinerun.level"
// metricsRunningPipelinerunLevelKey determines to what level to aggregate metrics
// for running pipelineruns
metricsRunningPipelinerunLevelKey = "metrics.running-pipelinerun.level"
// metricsDurationTaskrunType determines what type of
// metrics to use for aggregating duration for taskrun
metricsDurationTaskrunType = "metrics.taskrun.duration-type"
@@ -55,6 +58,9 @@ const (
// DefaultPipelinerunLevel determines to what level to aggregate metrics
// when it isn't specified in configmap
DefaultPipelinerunLevel = PipelinerunLevelAtPipeline
// DefaultRunningPipelinerunLevel determines to what level to aggregate metrics
// when it isn't specified in configmap
DefaultRunningPipelinerunLevel = ""
// PipelinerunLevelAtPipelinerun specify that aggregation will be done at
// pipelinerun level
PipelinerunLevelAtPipelinerun = "pipelinerun"
@@ -96,6 +102,7 @@ var DefaultMetrics, _ = newMetricsFromMap(map[string]string{})
type Metrics struct {
TaskrunLevel string
PipelinerunLevel string
RunningPipelinerunLevel string
DurationTaskrunType string
DurationPipelinerunType string
CountWithReason bool
@@ -130,6 +137,7 @@ func newMetricsFromMap(cfgMap map[string]string) (*Metrics, error) {
tc := Metrics{
TaskrunLevel: DefaultTaskrunLevel,
PipelinerunLevel: DefaultPipelinerunLevel,
RunningPipelinerunLevel: DefaultRunningPipelinerunLevel,
DurationTaskrunType: DefaultDurationTaskrunType,
DurationPipelinerunType: DefaultDurationPipelinerunType,
CountWithReason: false,
@@ -143,6 +151,9 @@ func newMetricsFromMap(cfgMap map[string]string) (*Metrics, error) {
if pipelinerunLevel, ok := cfgMap[metricsPipelinerunLevelKey]; ok {
tc.PipelinerunLevel = pipelinerunLevel
}
if runningPipelinerunLevel, ok := cfgMap[metricsRunningPipelinerunLevelKey]; ok {
tc.RunningPipelinerunLevel = runningPipelinerunLevel
}
if durationTaskrun, ok := cfgMap[metricsDurationTaskrunType]; ok {
tc.DurationTaskrunType = durationTaskrun
}
5 changes: 5 additions & 0 deletions pkg/apis/config/metrics_test.go
@@ -36,6 +36,7 @@ func TestNewMetricsFromConfigMap(t *testing.T) {
expectedConfig: &config.Metrics{
TaskrunLevel: config.TaskrunLevelAtTaskrun,
PipelinerunLevel: config.PipelinerunLevelAtPipelinerun,
RunningPipelinerunLevel: config.DefaultRunningPipelinerunLevel,
DurationTaskrunType: config.DurationPipelinerunTypeHistogram,
DurationPipelinerunType: config.DurationPipelinerunTypeHistogram,
CountWithReason: false,
@@ -47,6 +48,7 @@
expectedConfig: &config.Metrics{
TaskrunLevel: config.TaskrunLevelAtNS,
PipelinerunLevel: config.PipelinerunLevelAtNS,
RunningPipelinerunLevel: config.PipelinerunLevelAtNS,
DurationTaskrunType: config.DurationTaskrunTypeHistogram,
DurationPipelinerunType: config.DurationPipelinerunTypeLastValue,
CountWithReason: false,
@@ -58,6 +60,7 @@
expectedConfig: &config.Metrics{
TaskrunLevel: config.TaskrunLevelAtNS,
PipelinerunLevel: config.PipelinerunLevelAtNS,
RunningPipelinerunLevel: config.DefaultRunningPipelinerunLevel,
DurationTaskrunType: config.DurationTaskrunTypeHistogram,
DurationPipelinerunType: config.DurationPipelinerunTypeLastValue,
CountWithReason: true,
@@ -69,6 +72,7 @@
expectedConfig: &config.Metrics{
TaskrunLevel: config.TaskrunLevelAtNS,
PipelinerunLevel: config.PipelinerunLevelAtNS,
RunningPipelinerunLevel: config.PipelinerunLevelAtPipeline,
DurationTaskrunType: config.DurationTaskrunTypeHistogram,
DurationPipelinerunType: config.DurationPipelinerunTypeLastValue,
CountWithReason: true,
@@ -88,6 +92,7 @@ func TestNewMetricsFromEmptyConfigMap(t *testing.T) {
expectedConfig := &config.Metrics{
TaskrunLevel: config.TaskrunLevelAtTask,
PipelinerunLevel: config.PipelinerunLevelAtPipeline,
RunningPipelinerunLevel: config.DefaultRunningPipelinerunLevel,
DurationTaskrunType: config.DurationPipelinerunTypeHistogram,
DurationPipelinerunType: config.DurationPipelinerunTypeHistogram,
CountWithReason: false,
@@ -27,4 +27,5 @@ data:
metrics.taskrun.level: "namespace"
metrics.taskrun.duration-type: "histogram"
metrics.pipelinerun.level: "namespace"
metrics.running-pipelinerun.level: "namespace"
metrics.pipelinerun.duration-type: "lastvalue"
@@ -27,6 +27,7 @@ data:
metrics.taskrun.level: "namespace"
metrics.taskrun.duration-type: "histogram"
metrics.pipelinerun.level: "namespace"
metrics.running-pipelinerun.level: "pipeline"
metrics.pipelinerun.duration-type: "lastvalue"
metrics.count.enable-reason: "true"
metrics.taskrun.throttle.enable-namespace: "true"
64 changes: 59 additions & 5 deletions pkg/pipelinerunmetrics/metrics.go
@@ -39,6 +39,13 @@ import (
"knative.dev/pkg/metrics"
)

const (
runningPRLevelPipelinerun = "pipelinerun"
runningPRLevelPipeline = "pipeline"
runningPRLevelNamespace = "namespace"
runningPRLevelCluster = ""
)

var (
pipelinerunTag = tag.MustNewKey("pipelinerun")
pipelineTag = tag.MustNewKey("pipeline")
@@ -134,6 +141,7 @@ func NewRecorder(ctx context.Context) (*Recorder, error) {
}

cfg := config.FromContextOrDefaults(ctx)
r.cfg = cfg.Metrics
errRegistering = viewRegister(cfg.Metrics)
if errRegistering != nil {
r.initialized = false
@@ -149,7 +157,6 @@ func viewRegister(cfg *config.Metrics) error {
defer r.mutex.Unlock()

var prunTag []tag.Key

switch cfg.PipelinerunLevel {
case config.PipelinerunLevelAtPipelinerun:
prunTag = []tag.Key{pipelinerunTag, pipelineTag}
@@ -164,6 +171,18 @@
return errors.New("invalid config for PipelinerunLevel: " + cfg.PipelinerunLevel)
}

var runningPRTag []tag.Key
switch cfg.RunningPipelinerunLevel {
case config.PipelinerunLevelAtPipelinerun:
runningPRTag = []tag.Key{pipelinerunTag, pipelineTag, namespaceTag}
case config.PipelinerunLevelAtPipeline:
runningPRTag = []tag.Key{pipelineTag, namespaceTag}
case config.PipelinerunLevelAtNS:
runningPRTag = []tag.Key{namespaceTag}
default:
runningPRTag = []tag.Key{}
}

distribution := view.Distribution(10, 30, 60, 300, 900, 1800, 3600, 5400, 10800, 21600, 43200, 86400)

if cfg.PipelinerunLevel == config.PipelinerunLevelAtPipelinerun {
@@ -213,6 +232,7 @@ func viewRegister(cfg *config.Metrics) error {
Description: runningPRs.Description(),
Measure: runningPRs,
Aggregation: view.LastValue(),
TagKeys: runningPRTag,
}

runningPRsWaitingOnPipelineResolutionCountView = &view.View{
@@ -326,7 +346,7 @@ func (r *Recorder) updateConfig(cfg *config.Metrics) {

// DurationAndCount logs the duration of PipelineRun execution and
// count for number of PipelineRuns succeed or failed
// returns an error if its failed to log the metrics
// returns an error if it fails to log the metrics
func (r *Recorder) DurationAndCount(pr *v1.PipelineRun, beforeCondition *apis.Condition) error {
if !r.initialized {
return fmt.Errorf("ignoring the metrics recording for %s , failed to initialize the metrics recorder", pr.Name)
@@ -379,11 +399,10 @@ func (r *Recorder) DurationAndCount(pr *v1.PipelineRun, beforeCondition *apis.Co
}

// RunningPipelineRuns logs the number of PipelineRuns running right now
// returns an error if its failed to log the metrics
// returns an error if it fails to log the metrics
func (r *Recorder) RunningPipelineRuns(lister listers.PipelineRunLister) error {
r.mutex.Lock()
defer r.mutex.Unlock()

if !r.initialized {
return errors.New("ignoring the metrics recording, failed to initialize the metrics recorder")
}
@@ -396,9 +415,38 @@ func (r *Recorder) RunningPipelineRuns(lister listers.PipelineRunLister) error {
var runningPipelineRuns int
var trsWaitResolvingTaskRef int
var prsWaitResolvingPipelineRef int
countMap := map[string]int{}

for _, pr := range prs {
pipelineName := getPipelineTagName(pr)
pipelineRunKey := ""
mutators := []tag.Mutator{
tag.Insert(namespaceTag, pr.Namespace),
tag.Insert(pipelineTag, pipelineName),
tag.Insert(pipelinerunTag, pr.Name),
}
if r.cfg != nil {
switch r.cfg.RunningPipelinerunLevel {
case runningPRLevelPipelinerun:
pipelineRunKey = pipelineRunKey + "#" + pr.Name
fallthrough
case runningPRLevelPipeline:
pipelineRunKey = pipelineRunKey + "#" + pipelineName
fallthrough
case runningPRLevelNamespace:
pipelineRunKey = pipelineRunKey + "#" + pr.Namespace
case runningPRLevelCluster:
default:
return fmt.Errorf("RunningPipelineRunLevel value \"%s\" is not valid ", r.cfg.RunningPipelinerunLevel)
}
}
ctx_, err_ := tag.New(context.Background(), mutators...)
if err_ != nil {
return err
}
if !pr.IsDone() {
countMap[pipelineRunKey]++
metrics.Record(ctx_, runningPRs.M(float64(countMap[pipelineRunKey])))
Review thread:

Member: The countMap is only complete at the end of the loop, so why is the metric reported here?

Member: The metric might have different tags. For instance, if we configure the metric to be at "namespace" level, it will have one counter for each namespace. How does the metric API know which tag value we're counting here? Shouldn't we pass the pipelineRunKey somehow as well?

Contributor Author (re: tags): pipelineRunKey is generated based on the configuration; if the config is at namespace level, then the namespace is the key.

Contributor Author (re: reporting inside the loop): There is no way of knowing whether there are more PipelineRuns for the same key. If more PipelineRuns turn up for the same key, the metric is reported again with the updated value on the next iteration, and since only the last value of the metric is kept, it is updated accordingly.

Member: As discussed, we'll keep this as it is for now. I still think that recording the correct value once would be better than recording incremental values, but we can address that separately.
runningPipelineRuns++
succeedCondition := pr.Status.GetCondition(apis.ConditionSucceeded)
if succeedCondition != nil && succeedCondition.Status == corev1.ConditionUnknown {
Expand All @@ -409,6 +457,13 @@ func (r *Recorder) RunningPipelineRuns(lister listers.PipelineRunLister) error {
prsWaitResolvingPipelineRef++
}
}
} else {
// In case there are no running PipelineRuns for the pipelineRunKey, set the metric value to 0 to ensure
// the metric is set for the key.
if _, exists := countMap[pipelineRunKey]; !exists {
countMap[pipelineRunKey] = 0
Review thread:

Member: The countMap is only complete at the end of the loop, so why is the metric reported here?

Contributor Author: Same reason as above; this code resets a metric that was already reported. Let's say we have 4 PipelineRuns running for a key at a given point in time, so the metric is reported as 4. In the next call to the function all of those PipelineRuns have completed, so they won't be reported in the if block, which means the graph would still show 4, which is wrong data. Resetting the value to 0 here solves that: if there is any running PipelineRun for the key, it is handled by the if block; if not, we set it to 0 anyway.
metrics.Record(ctx_, runningPRs.M(0))
}
}
}

Expand All @@ -421,7 +476,6 @@ func (r *Recorder) RunningPipelineRuns(lister listers.PipelineRunLister) error {
metrics.Record(ctx, runningPRsWaitingOnTaskResolutionCount.M(float64(trsWaitResolvingTaskRef)))
metrics.Record(ctx, runningPRsWaitingOnTaskResolution.M(float64(trsWaitResolvingTaskRef)))
metrics.Record(ctx, runningPRsCount.M(float64(runningPipelineRuns)))
metrics.Record(ctx, runningPRs.M(float64(runningPipelineRuns)))

return nil
}
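As a companion to the review threads above, here is a minimal, self-contained sketch of the aggregation logic in `RunningPipelineRuns`: deriving the key from the configured level via fallthrough, counting running PipelineRuns per key, and resetting keys with only completed runs to zero. The `pr` struct, `keyFor`, and `countRunning` are simplified stand-ins for illustration only, not the PR code or the Tekton API.

```go
package main

import "fmt"

// pr is a simplified stand-in for a v1.PipelineRun.
type pr struct {
	name, pipeline, namespace string
	done                      bool
}

// keyFor mirrors the fallthrough switch in RunningPipelineRuns: the key grows
// from the most specific component down to the namespace, depending on level.
func keyFor(level string, p pr) string {
	key := ""
	switch level {
	case "pipelinerun":
		key += "#" + p.name
		fallthrough
	case "pipeline":
		key += "#" + p.pipeline
		fallthrough
	case "namespace":
		key += "#" + p.namespace
	}
	return key // "" for cluster level
}

// countRunning counts running PipelineRuns per key; a key seen only for
// completed runs is reset to 0 so a previously reported gauge does not go stale.
func countRunning(level string, prs []pr) map[string]int {
	counts := map[string]int{}
	for _, p := range prs {
		k := keyFor(level, p)
		if !p.done {
			counts[k]++ // in the PR, the gauge is recorded with this value
		} else if _, ok := counts[k]; !ok {
			counts[k] = 0 // the reset-to-zero discussed in the review thread
		}
	}
	return counts
}

func main() {
	prs := []pr{
		{name: "build-1", pipeline: "build", namespace: "dev"},
		{name: "build-2", pipeline: "build", namespace: "dev"},
		{name: "deploy-1", pipeline: "deploy", namespace: "prod", done: true},
	}
	fmt.Println(countRunning("namespace", prs)) // map[#dev:2 #prod:0]
}
```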