Metrics Transform Processor Proposal #332

JingboWangGoogle · 2020-06-12T19:30:21Z

Metrics Transform Processor

The first version of aggregation would only support single-batch processing on metrics to avoid conflicts of different batches with the same time-series (more description of this corner case).

Objectives

The objective of this metrics transform processor is to give OpenTelemetry Collector users the flexibility to rename and aggregate metrics in desired ways so that the metrics are more relevant and cheaper to the users.

Background

There are cases where metrics are overly detailed for a user’s needs, which also impose unnecessary costs. In these cases, the user wants to compress the metrics written into the collector in some meaningful way before writing them out to the metrics backend. In addition, users may want to significantly transform the format that metrics appear in. This may even be a requirement due to backends requiring metrics to match specific existing formats and names.
Related Work: Filter Processor (Issue, Pull Request).

Requirements

The specific requirements include:

rename metrics (e.g. rename cpu/usage to cpu/usage_time)
rename labels (e.g. rename cpu to core)
rename label values (e.g. rename done to complete)
aggregate across label sets (e.g. only want the label usage, but don’t care about the labels core, and cpu)
- aggregation_type: sum, average, max
aggregate across label values (e.g. want memory{slab}, but don’t care about memory{slab_reclaimable} & memory{slab_unreclaimable})
- aggregation_type: sum, average, max

Design Ideas

Proposed Configuration

transforms:
  # name is used to match with the metric to operate on. This implementation doesn’t # utilize the filtermetric’s MatchProperties struct because it doesn’t match well 
  # with what I need at this phase. All is needed for this processor at this stage is # a single name string that can be used to match with selected metrics. The list of # metric names and the match type in the filtermetric’s MatchProperties struct are # unnecessary. Also, based on the issue about improving filtering configuration, it # seems like this struct is subject to be slightly modified.
  - name: <current_metric_name>

  # action specifies if the operations are performed on the current copy of the 
  # metric or on a newly created metric that will be inserted
    action: {update, insert}

  # new_name is used to rename metrics (e.g. rename cpu/usage to cpu/usage_time)
  # if action is insert, new_name is required
    new_name: <new_metric_name_inserted>

  # operations contain a list of operations that will be performed on the selected 
  # metrics. Each operation block is a key-value pair, where the key can be any 
  # arbitrary string set by the users for readability, and the value is a struct with # fields required for operations. The action field is important for the processor 
  # to identify exactly which operation to perform 
    operations:

    # update_label action can be used to update the name of a label or the values           
    # of this label (e.g. rename label `cpu` to `core`)
      - action: update_label
        label: <current_label1>
        new_label: <new_label>
        value_actions:
         - value: <current_label_value>
           new_value: <new_label_value>

    # aggregate_labels action aggregates metrics across labels (e.g. only want  
    # the label `usage`, but don’t care about the labels `core`, and `cpu`)
      - action: aggregate_labels
     # label_set contains a list of labels that will remain after the aggregation.    
     # The excluded labels will be aggregated by the way specified by  
     # aggregation_type.
        label_set: [labels...]
        aggregation_type: {sum, average, max}

    # aggregate_label_values action aggregates labels across label values (e.g. want  
    # memory{slab}, but don’t care about memory{slab_reclaimable} &   
    # memory{slab_unreclaimable})
      - action: aggregate_label_values
        label: <label>
     # aggregated_values contains a list of label values that will be aggregated by  
     # the way specified by aggregation_type into new_value. The excluded label  
     # values will remain.
       aggregated_values: [values...]
       new_value: <new_value> 
       aggregation_type: {sum, average, max}

Examples

Insert New Metric

# create host/cpu/usage_time from host/cpu/usage
name: host/cpu/usage
action: insert
new_name: host/cpu/utilization
operations:
  ...

Rename Labels

# rename the label cpu to core
operations:
  - action: update_label
    label: cpu
    new_label: core

Aggregate Labels

# aggregate away everything but `state` using summation
...
operations:
  -action: aggregate_labels
   label_set: [ state ]
   aggregation_type: sum

Aggregate Label Values

# combine slab_reclaimable & slab_unreclaimable by summation
...
operations:
  -action: aggregate_label_values
   label: state
   aggregated_values: [ slab_reclaimable, slab_unreclaimable ]
   new_value: slab 
   aggregation_type: sum

Possible Extensions

Supporting custom types of aggregation (e.g. defined by a formula)
Support aggregations of non-simple metric types (distributions, etc)
Support aggregation over time
Utilizing regex to select metrics

Thanks to

@draffensperger @james-bebbington @quentinmit

The text was updated successfully, but these errors were encountered:

jrcamp · 2020-06-12T21:23:53Z

I wonder if this should be broken up into two different processors? One that's analogous to attributesprocessor but for metrics. I think this should live in core:

rename metrics (e.g. rename cpu/usage to cpu/usage_time)
rename labels (e.g. rename cpu to core)
rename label values (e.g. rename done to complete)

And a separate one for aggregation that lives in contrib for now as it may take some trial and error to get the right config and functionality:

aggregate across label sets (e.g. only want the label usage, but don’t care about the labels core, and cpu)
aggregation_type: sum, average, max
aggregate across label values (e.g. want memory{slab}, but don’t care about memory{slab_reclaimable} & memory{slab_unreclaimable})
aggregation_type: sum, average, max

Note that there are discussions about changes to attributesprocessor that we would probably want to mirror: open-telemetry/opentelemetry-collector#979 (comment)

bogdandrutu · 2020-06-14T15:20:16Z

aggregate across label sets (e.g. only want the label usage, but don’t care about the labels core, and cpu)
aggregation_type: sum, average, max

Keep in mind that this is not easy. There are corner cases where metrics received in two different batches and after label removing will generate the same timeseries.

First batch:
usage=foo;core=1

Second batch:
usage=foo;core=2

Without keeping a state between different batches will produce wrong aggregations, so this will work only if the receiving batch contains all the values.

JingboWangGoogle · 2020-06-15T18:45:59Z

@jrcamp I really appreciate your feedback on this. Initially, I thought that it might make sense to put aggregation and rename together because after aggregation, the semantics of that metric will very likely change, thus a more accurate name should be reassigned. But separating these makes it a cleaner design because these two are essentially doing different things. Therefore, I have made adjustments to the issue proposed. :)

quentinmit · 2020-06-15T19:13:29Z

@jrcamp @JingboWangGoogle
I think I was one of the people who suggested that rename and aggregate should be part of the same operation. More specifically I suggested that because it allows you to use the same source metric to produce two different destination metrics with different aggregation. I'm not sure how you do that if they're separate steps.

JingboWangGoogle · 2020-06-15T20:16:11Z

@bogdandrutu Thank you for pointing this out! For now, I am aiming to have the aggregation processor to only handle single-batch processing (e.g. the collector collects the host metrics, which is the only source of metrics into the collector). The corner case here can be solved by "aggregation over time" under the possible extensions section.

jrcamp · 2020-06-15T20:47:22Z

@JingboWangGoogle

btw, have you taken a look at the spec relating to aggregations in the SDK:

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/metrics/api.md#aggregations
open-telemetry/oteps#89

I'm not deeply familiar with them and don't see anything conflicting between them but wanted to make sure you saw them.

@bogdandrutu it feels like we should ideally in the long run have parity between aggregations of collector and SDK's, what do you think?

I think I was one of the people who suggested that rename and aggregate should be part of the same operation. More specifically I suggested that because it allows you to use the same source metric to produce two different destination metrics with different aggregation. I'm not sure how you do that if they're separate steps.

@quentinmit With aggregation you are either 1) producing a new (derived) metric or 2) replacing the existing metric with the aggregated values. When renaming you're keeping all the same data but changing the metric name/labels. It seems they're different enough to warrant treating separately. Especially for somebody who only wanting to rename I think it'd be confusing to try and do renaming via aggregation (presumably by an aggregation that is actually a noop?) Am I missing something?

quentinmit · 2020-06-15T21:21:26Z

@quentinmit With aggregation you are either 1) producing a new (derived) metric or 2) replacing the existing metric with the aggregated values. When renaming you're keeping all the same data but changing the metric name/labels. It seems they're different enough to warrant treating separately. Especially for somebody who only wanting to rename I think it'd be confusing to try and do renaming via aggregation (presumably by an aggregation that is actually a noop?) Am I missing something?

@jrcamp I don't think the user would need to think about aggregation if they only want to rename, or renaming if they only want to aggregate. The default config for each operation should already be a noop. Maybe it's better to illustrate with some strawman configs:

metricsprocessor:
# Remove the core label from the CPU metric
- metric_name: cpu/utilization
  aggregate:
   fields: [core]
   invert_fields: true
   reducer: MEAN  # MEAN gives you 0-100, SUM/default gives you 0-n00
# Rename request_count
- metric_name: app/request_count
  source_metric_name: apache/request_count
- metric_name: apache/request_count
  drop: true  # overlaps with filter processor
# Denormalize request count to save space
- metric_name: istio/request_count_by_response_code
  source_metric_name: istio/request_count
  aggregate:
   fields: [response_code]
- metric_name: istio/request_count_by_source_workload
  source_metric_name: istio/request_count
  aggregate:
   fields: [source_workload]
- metric_name: istio/request_count
  drop: true

Remember that when filtering labels, you always need to simultaneously perform aggregation (since there can be two or more time series that turn into one).

I also think there's an argument to be made that modularity is worth trading off for the simplified configuration.

jrcamp · 2020-06-15T23:16:04Z

It also needs to be able to rename labels (perhaps scoped to a particular metric). How would that fit into the config?

quentinmit · 2020-06-16T17:47:37Z

@jrcamp I think renaming labels and especially renaming label values also is intertwined with aggregation (because renamed label values can collide with existing time series that have that value, or you might want to explicitly map two values to one).

There's a couple different ways you could do renaming labels; there's what was suggested above by @JingboWangGoogle:

  - action: update_label
    label: cpu
    new_label: core

and I could also imagine for some usecases wanting to do remap the labels at once:

- map_labels:
  core: cpu
  socket: index

(and therefore any label not mentioned would be dropped) I'm not sure which or both should be supported; I'm inclined to err on the side of making the config as easy to understand as possible, which weakly suggests that we should support both. I'm open to your thoughts on that.

jrcamp · 2020-06-16T18:42:54Z

In my mind renaming may happen as a side effect of aggregation. But you may also want to rename without aggregating. For example, the OT metric is named gc.runtime and you expected that metric to be named java/gc/runtime (just making up an example). Yes the user would have to make sure there are no collisions when renaming metrics, just as they must not emit metrics in the first place with collisions.

I think the renaming part is pretty straight forward and should go in core whereas the aggregation may need to be iterated on. Given that structural situation I think they must be separate (at least for now).

JingboWangGoogle · 2020-06-17T19:58:21Z

As a record to our conclusion from our meeting, we have decided to keep it as one processor, but this processor will stay in the contrib repo for now.

jmacd · 2020-06-23T17:30:33Z

In #290 we are discussing how to incorporate a (dog)statsd receiver into the collector. One of the features of statsd data is that it arrives unaggregated, as point data. Feeding statsd data directly into the collector is probably not what users want. I think what users do want is to configure which aggregations are applied to the data. This could be implemented specifically inside a statsd receiver, but it's appealing to think we could transform statsd data points into raw OTLP data and then rely on a downstream processor to aggregate the data in a way the user configures.

Does this sound like a reasonable use-case for the collector metrics processor?

quentinmit · 2020-06-23T17:35:57Z

It sounds quite reasonable to me, though can you clarify what you mean by "point data"?

We're initially targeting aggregation-across-dimensions with this processor, and not aggregation-across-time. So if you get e.g. 5 separate points in a minute and want an output of "5" that would be aggregation-across-time. We definitely want to support that but it's more complicated so it's not likely to be in the first version of this processor.

jmacd · 2020-06-23T17:47:41Z

For example, when statsd sends histogram data, it sends one value at a time (with an "h" designation) and the receiver is expected to build the histogram. We could have a statsd receiver emit raw data points into the OTLP representation, but I wouldn't expect the exporters to transform data back into a suitable format. For example, I would not expect a Prometheus exporter to have to compute a histogram over the raw data, that sounds like part of a processing pipeline.

jmacd · 2020-06-23T17:50:49Z

The "Across time" part is indeed trickier. The OTel metric clients are already doing similar things, so I understand why this is hard. You'd have to buffer a window of data (SDK term for this: an Accumulator) before applying temporal aggregations.

The statsd receiver could be configured with a small buffer of time, to work around this matter. It would accumulate 10 seconds of data as a single OTLP batch and send it for processing. The processing would not necessarily remove any labels, but it would be required to compute a histogram (or other aggregation) from raw points.

Sounds like this is a good fit.

jmacd · 2020-06-24T02:07:56Z

After reading through this proposal again, it appears to be another form of the Views API in disguise.

open-telemetry/oteps#89

It seems like we could try to find a common configuration language for describing views here.

draffensperger · 2020-06-24T14:51:42Z

Hi @jmacd that's a good point about this being related to views. @JingboWangGoogle is a Google intern working on this over the summer and our hope is that we can get a basic version of this that doesn't do across-time aggregations but that only does single-batch renames/aggregations.

I think since it's in contrib it wouldn't be viewed as final and should be refactored to better conform to the views spec. Maybe we could meet to chat more if you're interested!

JingboWangGoogle · 2020-07-14T16:53:58Z

Regarding the feature brought up by @jmacd , we have agreed to move this to a later version, specifically after #417, which concludes the MVP for this proposed processor.

* Allow metric processors to be specified in pipelines * Fix typo * Fix formatting * Add fix for unit test

jrcamp · 2020-07-30T22:47:46Z

@JingboWangGoogle is there any more work that needs to be done to close out this issue given the above PRs have been merged?

JingboWangGoogle · 2020-07-31T01:17:59Z

@jrcamp Based on the content in this proposal and the features implemented in the merged PRs, this proposal issue can be closed as all the proposals here are realized in the PRs. Also @draffensperger @quentinmit just for transparency.

bogdandrutu transferred this issue from open-telemetry/opentelemetry-collector Jun 16, 2020

JingboWangGoogle mentioned this issue Jun 17, 2020

metrics transform processor README #339

Closed

JingboWangGoogle mentioned this issue Jun 19, 2020

Rename feature of the metrics transform processor #347

Closed

sonofachamp mentioned this issue Jun 24, 2020

Add StatsD receiver #290

Closed

jmacd mentioned this issue Jun 24, 2020

Standard system metrics and semantic conventions open-telemetry/oteps#119

Merged

JingboWangGoogle mentioned this issue Jun 30, 2020

Rename feature of the metrics transform processor (metrics and labels) #376

Merged

jmacd mentioned this issue Jul 1, 2020

Metrics aggregation for same metric with 2 or more nodes open-telemetry/opentelemetry-collector#433

Closed

james-bebbington mentioned this issue Jul 7, 2020

Add option to change the data type (from int to float or vice-versa) #402

Merged

JingboWangGoogle mentioned this issue Jul 8, 2020

Aggregation across labels and label values in metrics transform processor #417

Merged

jrcamp assigned JingboWangGoogle Jul 21, 2020

jrcamp added the enhancement New feature or request label Jul 21, 2020

mxiamxia referenced this issue in mxiamxia/opentelemetry-collector-contrib Jul 22, 2020

Allow metric processors to be specified in pipelines (#332)

4db0414

* Allow metric processors to be specified in pipelines * Fix typo * Fix formatting * Add fix for unit test

jrcamp added this to the GA 1.0 milestone Jul 30, 2020

jrcamp closed this as completed Jul 31, 2020

This was referenced Aug 6, 2020

update README in metrics transform processor #660

Merged

keeping timeseries and points in order after aggregation in metrics transform processor #663

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metrics Transform Processor Proposal #332

Metrics Transform Processor Proposal #332

JingboWangGoogle commented Jun 12, 2020 •

edited

Loading

jrcamp commented Jun 12, 2020

bogdandrutu commented Jun 14, 2020 •

edited

Loading

JingboWangGoogle commented Jun 15, 2020

quentinmit commented Jun 15, 2020

JingboWangGoogle commented Jun 15, 2020

jrcamp commented Jun 15, 2020 •

edited

Loading

quentinmit commented Jun 15, 2020 •

edited

Loading

jrcamp commented Jun 15, 2020

quentinmit commented Jun 16, 2020

jrcamp commented Jun 16, 2020 •

edited

Loading

JingboWangGoogle commented Jun 17, 2020

jmacd commented Jun 23, 2020

quentinmit commented Jun 23, 2020

jmacd commented Jun 23, 2020

jmacd commented Jun 23, 2020

jmacd commented Jun 24, 2020

draffensperger commented Jun 24, 2020

JingboWangGoogle commented Jul 14, 2020

jrcamp commented Jul 30, 2020

JingboWangGoogle commented Jul 31, 2020

Metrics Transform Processor Proposal #332

Metrics Transform Processor Proposal #332

Comments

JingboWangGoogle commented Jun 12, 2020 • edited Loading