ability to capture all metric raw data without aggregation #617

mxiamxia · 2020-05-21T21:53:23Z

In OT Metric API Instruments Spec, we defined that each type of metric instrument should be binding with a default aggregation behavior / aggregator. In AWS CloudWatch, we support to capture all raw metric measurement events and assemble it as Embedded Metric Format(EMF) in a structured log and so CloudWatch backend can process all raw metric data and provide all metrics visualization capabilities.

We'd like to implement such Exporters for SDK and Collector to support EMF metric format but it requires that we can capture all the raw metric events without aggregating. I see we have Views API WIP to allow customers customizing aggregator. We do want to start the work on this ASAP and looking for the suggestions here if there is a way we can capture all the raw metrics event data(especially not SummaryDataPoint) and pass it to OT collector so we can bring in AWS exporters to support it? Thanks.

cijothomas · 2020-05-22T04:08:25Z

I believe this would be part of SDK spec. I am also interested in getting a hook in the Metric SDK to get raw metric events and also aggregated metric values.
For Traces, the SpanProcessors are the way to get access to Spans. For Metrics, we may need MetricProcessor (which gets metrics after aggregation), and Aggregator (which gets raw metrics)

My use case for this is different:
I have a need where same metric need to be aggregated more than once, with different labelsets, and with different aggregation interval too. View may help with former, but not the latter.

Having hooks to get raw events( similar to aggregator) and aggregated events (MetricProcessor) will help me achieve my scenario as well.

jmacd · 2020-05-26T22:17:08Z

The Aggregator interface that is currently described in the draft SDK specification and in the OTel-Go library does receive raw events. An aggregator that handles raw events and does not compute checkpoints could be referred to as a "pass-through" aggregator. This has been requested, but no prototypes exist.

The metrics SDK describes an "Integrator" which is like the span processor. The integrator receives checkpoints output by the aggregators (one per distinct label set). I believe it is possible to compute multiple aggregations as described without a pass-through aggregator. To achieve multiple aggregations for different label sets, one checkpoint (for one label set) will be merged into multiple output aggregations according to the labels. To achieve multiple intervals, combine shorter aggregations into longer aggregations.

Example: counter x.y.z has labels "a" and "b".
Over one collection interval, multiple events are aggregated corresponding to the distinct label sets, leading to intermediate checkpoints:

x.y.z{a=1,b=1} = 250
x.y.z{a=1,b=0} = 250
x.y.z{a=0,b=1} = 250
x.y.z{a=0,b=0} = 250

The collection interval has four aggregators, each with a count of 250 and a distinct label set.

To expose an aggregation over "a", dropping "b", combine two pairs:
x.y.z{a=1} = 500
x.y.z{a=0} = 500

Likewise, the aggregation over "b", dropping "a", combine two pairs:
x.y.z{b=1} = 500
x.y.z{b=0} = 500

Now suppose this covers a 1 second interval. You'd like to expose a 1 minute interval. Now take the output from the previous step, and combine 60 consecutive intervals together.

As demonstrated (I hope), you shouldn't need a pass-through aggregator to perform multiple aggregations--your goals can be accomplished by merging aggregator checkpoints I think.

The request for a pass-through aggregator would make more sense in an aggregator that computes something derived from time and is not window-oriented, like computing an exponential moving average over the last 5 minutes (regardless of interval).

jmacd · 2020-05-26T22:22:44Z

@cijothomas Mostly my response above was for your statement.
@mxiamxia Given your requirements, it would be perhaps a little simpler to implement your own SDK, since it will do work that you apparently do not want done. I am imagining a simple SDK that just emits a structured log record for every metric event. The default SDK has a lot of machinery for coalescing events that you will not need, so it would be additional expense to use the default SDK.

mxiamxia · 2020-05-26T23:15:24Z

@cijothomas Mostly my response above was for your statement.
@mxiamxia Given your requirements, it would be perhaps a little simpler to implement your own SDK, since it will do work that you apparently do not want done. I am imagining a simple SDK that just emits a structured log record for every metric event. The default SDK has a lot of machinery for coalescing events that you will not need, so it would be additional expense to use the default SDK.

Thanks @jmacd. If we want to have it work with OTel Collector, I assume there is no proper way that we can send the metrics with structured logs in the current metrics data model until Logs data model spec is ready for Collector.

I looked Metrics SDK Draft Spec, if I get it right will the new aggregator Exact be able to export all the raw event data with quantile in a collection interval?

Exact: This aggregator MUST store an array of all recorded measurements. This aggregator MUST support exact quantile computations and it MUST support exporting raw values in the order they were recorded, however it is not required to support both of these modes simultaneously (since computing quantiles requires sorting the measurements).

mxiamxia · 2020-08-05T04:08:32Z

// TODO: Decide if support for RawMeasurements (measurements recorded using
// the synchronous instruments) is necessary. It can be used to delegate the
// aggregation from the application to the agent/collector. See
// #617

It will be very nice If we can get the synchronous instruments for raw measurements support in GA. So it will help us to fully support EMF metrics with logs format in the collector and it is also possible the solution for us to support AWS Lambda customers with OT integration

bogdandrutu · 2020-08-05T17:39:16Z

@jmacd are you ok to support this for GA?

jmacd · 2020-08-08T03:52:02Z

I am conflicted about a few things here. We have @cnnradams working on an exemplars prototype, and I encouraged him to go a little further than simply capturing trace/span IDs in histogram buckets. As far as exemplars go, we can assert that there are sensible reasons to include exemplars of any measurement when there has been dimensionality reduction, and that we can include statistical weight to make such exemplars more useful. That said, not many consumers of this statistical sample data are asking for this, and there is not a widespread belief that we need this. What's the minimum viable product here? If we just fall back to trace_id and span_id, would we include these in the data points directly? Should we be limited to only one trace_id/span_id per data point? Should we try to include the removed labels when exporting exemplars? Do we care about statistical exemplars?

All of these questions come up when I think about how to represent raw values, since exemplars are practically the same where the statistical count of each exemplar is 1. There has been some debate about how such values should appear in OTLP: there is a concern that raw values will tend to repeat their label sets and that the current shape of the Metric data point does not offer a way to repeatedly use label sets. Does it matter if every raw data point copies the effective label set, or is it important to combine raw values into a group and list their label set only once? If we don't care about this performance/cost concern, then it's stragithforward to represent raw values and exemplars as a repeated RawValue in the Metric, alongside the scalar, Histogram, and Summary repeated values. In this case, if the data point is actually raw data, then the value Type field should reflect it (e.g., as RAW_INT64 or RAW_DOUBLE), whereas if the data point is actually another value type and the raw values are merely exemplars, then the value type should indicate a different type.

@bogdandrutu and @open-telemetry/specs-metrics-approvers please consider. I don't want this issue to block OTLP, but I want to give @cnnradams a target he can hit at the same time.

MrAlias · 2020-08-11T16:11:49Z

What's the minimum viable product here?

I think this is the important question here.

Does it matter if every raw data point copies the effective label set, or is it important to combine raw values into a group and list their label set only once?

For an MVP, I don't think so. It seems like from what @mxiamxia is asking for it would be satisfied with what you described:

If we don't care about this performance/cost concern, then it's stragithforward to represent raw values and exemplars as a repeated RawValue

It seems to me that if you are asking for "all the data" from the SDK you likely already assume it to be sent in a non-optimized format. My guess is this will be seen as a benefit as the users is looking for full control in all statistical sampling decisions here.

As for trace info:

If we just fall back to trace_id and span_id, would we include these in the data points directly? Should we be limited to only one trace_id/span_id per data point?

Could this be added afterwards in a backwards compatible way? I think it would be useful, but if it blocks the MVP I don't see it being that useful.

bogdandrutu · 2020-08-19T19:18:08Z

@MrAlias in the protocol we can add any new data type in a backwards compatible way (there is a need to deploy the collector first, but that is true for any distributed system). In terms of the SDK, this is another aggregation type "AllRawValues" so I don't see why cannot be added later.

MrAlias · 2020-08-20T02:47:41Z

@MrAlias in the protocol we can add any new data type in a backwards compatible way (there is a need to deploy the collector first, but that is true for any distributed system). In terms of the SDK, this is another aggregation type "AllRawValues" so I don't see why cannot be added later.

These are all good points.

IIRC AWS engineers were asking for this. I feel like they should probably be a part of the conversation.

cc @alolita (not sure who else to ping here).

jsuereth · 2021-04-06T16:09:38Z

From metrics data model sig:

This is non-blocking for declaring stable metrics protocol
We can add this in post-stability-of-protocol.
I'm going to remove the TODO from the metrics.proto around this, but we can still add this feature in the future.

reyang · 2021-06-08T16:23:39Z

The SDK spec allows processors to have access to the raw data with this PR #1673.
Removing the area:sdk tag and leave it for the data model / OTLP.

ramonguiu · 2022-08-04T13:13:00Z

There is an interesting discussion in CNCF community slack. Some folks have shared some potential workarounds for now. Linking it here in case if may be useful to readers of this issue.

jamielennox · 2022-11-29T15:33:34Z

The majority of discussion around this happened long before the stabilization of the metrics APIs and the OTLP protocol for metrics. Is there an appetite now for instating something like the Exact (the old name, i don't mind what it's called) aggregator and equivalent OTLP format now and how would we go about it?

At it's core our metrics system doesn't do bucketted histograms and so we're blocked on exporting OTLP from the SDK today which is very limiting. Exporting the complete measurement array is the best way to get the full data into the collector which allows us to write the export in only one place.

reyang · 2023-02-15T06:37:42Z

I feel this is interesting, I do understand the need, but I don't see a high demand/priority.

@jmacd @jsuereth I know you're busy with other workstreams, just curious if you think this can be achieved with either one of the following approaches?

Hacky way - folks can always write a custom ExemplarFilter and grab all the raw measurements, then do whatever they want.
A bit more systematic - maybe we should introduce the "MeasurementProcessor" concept, such processor will see all the measurements reported by sync/async APIs (metrics via producer API is not flowing through this path). All the existing aggregations and exemplars can be modeled as special types of MeasurementProcessor.

MatisseHack · 2023-05-13T02:14:31Z

I would really love to see this feature come to fruition! My team maintains a service with a very bursty usage pattern and we loose a lot of fidelity in our metrics because there is no way to turn off aggregation.

Similar to AWS CloudWatch, Azure Application Insights has powerful tools to query and visualize raw metric data. I just wish I could get all my metric data there more easily.

jmacd added api vs sdk bounds bug Something isn't working spec:metrics Related to the specification/metrics directory and removed bug Something isn't working labels May 26, 2020

jmacd mentioned this issue Jun 5, 2020

Integrate Exemplars with Metrics SDK open-telemetry/oteps#113

Merged

bogdandrutu removed the api vs sdk bounds label Jun 12, 2020

jmacd added the release:required-for-ga Must be resolved before GA release, or nice to have before GA label Jun 29, 2020

andrewhsu added the priority:p1 Highest priority level label Jul 24, 2020

bogdandrutu added priority:p2 Medium priority level and removed priority:p1 Highest priority level labels Jul 24, 2020

jmacd mentioned this issue Jul 29, 2020

Add Gauge. Change Metric type to be a more detailed structure with details about the aggregation and temporality. open-telemetry/opentelemetry-proto#182

Merged

bogdandrutu self-assigned this Aug 5, 2020

This was referenced Aug 8, 2020

Add Exemplar support to Metrics proto open-telemetry/opentelemetry-proto#159

Merged

Exemplars prototype open-telemetry/opentelemetry-python#936

Closed

bogdandrutu added priority:p1 Highest priority level and removed priority:p2 Medium priority level labels Oct 16, 2020

jmacd mentioned this issue Dec 18, 2020

OTLP: Support for exporting raw metric data open-telemetry/opentelemetry-go#1413

Closed

reyang added the area:sdk Related to the SDK label Mar 3, 2021

reyang added this to the Metrics API/SDK Feature Freeze milestone Mar 3, 2021

mxiamxia mentioned this issue Mar 11, 2021

REQUEST: New membership for mxiamxia open-telemetry/community#677

Closed

6 tasks

jsuereth added area:data-model For issues related to data model and removed release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Apr 6, 2021

jsuereth added the release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs label Apr 6, 2021

jsuereth self-assigned this Apr 6, 2021

reyang removed the area:sdk Related to the SDK label Jun 8, 2021

reyang removed this from the Metrics API/SDK Feature Freeze milestone Jun 8, 2021

jsuereth mentioned this issue Jul 27, 2021

Metric Exemplars SDK Specification #1828

Merged

reyang mentioned this issue Aug 3, 2021

Remove MetricProcessor from the SDK spec (for now) #1840

Merged

reyang added this to the Metrics Future Release milestone Nov 7, 2021

trask mentioned this issue Apr 25, 2023

Add namespace for process metric attributes #3431

Closed

trask mentioned this issue May 24, 2023

Should metric attributes be required to be namespaced? open-telemetry/semantic-conventions#51

Closed

jack-berg mentioned this issue Dec 18, 2023

Include the raw measurement values as part of a histogram data point. #3797

Closed

trask unassigned jsuereth, bogdandrutu and reyang Apr 2, 2024

trask added triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor and removed priority:p1 Highest priority level release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs labels Apr 2, 2024

cijothomas mentioned this issue May 15, 2024

Disable Cardinality limit / pre-aggregation? open-telemetry/opentelemetry-dotnet#5618

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ability to capture all metric raw data without aggregation #617

ability to capture all metric raw data without aggregation #617

mxiamxia commented May 21, 2020 •

edited

Loading

cijothomas commented May 22, 2020

jmacd commented May 26, 2020

jmacd commented May 26, 2020

mxiamxia commented May 26, 2020 •

edited

Loading

mxiamxia commented Aug 5, 2020 •

edited

Loading

bogdandrutu commented Aug 5, 2020

jmacd commented Aug 8, 2020

MrAlias commented Aug 11, 2020

bogdandrutu commented Aug 19, 2020

MrAlias commented Aug 20, 2020

jsuereth commented Apr 6, 2021

reyang commented Jun 8, 2021

ramonguiu commented Aug 4, 2022

jamielennox commented Nov 29, 2022

reyang commented Feb 15, 2023

MatisseHack commented May 13, 2023

ability to capture all metric raw data without aggregation #617

ability to capture all metric raw data without aggregation #617

Comments

mxiamxia commented May 21, 2020 • edited Loading

cijothomas commented May 22, 2020

jmacd commented May 26, 2020

jmacd commented May 26, 2020

mxiamxia commented May 26, 2020 • edited Loading

mxiamxia commented Aug 5, 2020 • edited Loading

bogdandrutu commented Aug 5, 2020

jmacd commented Aug 8, 2020

MrAlias commented Aug 11, 2020

bogdandrutu commented Aug 19, 2020

MrAlias commented Aug 20, 2020

jsuereth commented Apr 6, 2021

reyang commented Jun 8, 2021

ramonguiu commented Aug 4, 2022

jamielennox commented Nov 29, 2022

reyang commented Feb 15, 2023

MatisseHack commented May 13, 2023

mxiamxia commented May 21, 2020 •

edited

Loading

mxiamxia commented May 26, 2020 •

edited

Loading

mxiamxia commented Aug 5, 2020 •

edited

Loading