Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ability to capture all metric raw data without aggregation #617

Open
mxiamxia opened this issue May 21, 2020 · 16 comments
Open

ability to capture all metric raw data without aggregation #617

mxiamxia opened this issue May 21, 2020 · 16 comments
Labels
area:data-model For issues related to data model spec:metrics Related to the specification/metrics directory triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor

Comments

@mxiamxia
Copy link
Member

mxiamxia commented May 21, 2020

In OT Metric API Instruments Spec, we defined that each type of metric instrument should be binding with a default aggregation behavior / aggregator. In AWS CloudWatch, we support to capture all raw metric measurement events and assemble it as Embedded Metric Format(EMF) in a structured log and so CloudWatch backend can process all raw metric data and provide all metrics visualization capabilities.

We'd like to implement such Exporters for SDK and Collector to support EMF metric format but it requires that we can capture all the raw metric events without aggregating. I see we have Views API WIP to allow customers customizing aggregator. We do want to start the work on this ASAP and looking for the suggestions here if there is a way we can capture all the raw metrics event data(especially not SummaryDataPoint) and pass it to OT collector so we can bring in AWS exporters to support it? Thanks.

@cijothomas
Copy link
Member

I believe this would be part of SDK spec. I am also interested in getting a hook in the Metric SDK to get raw metric events and also aggregated metric values.
For Traces, the SpanProcessors are the way to get access to Spans. For Metrics, we may need MetricProcessor (which gets metrics after aggregation), and Aggregator (which gets raw metrics)

My use case for this is different:
I have a need where same metric need to be aggregated more than once, with different labelsets, and with different aggregation interval too. View may help with former, but not the latter.

Having hooks to get raw events( similar to aggregator) and aggregated events (MetricProcessor) will help me achieve my scenario as well.

@jmacd
Copy link
Contributor

jmacd commented May 26, 2020

The Aggregator interface that is currently described in the draft SDK specification and in the OTel-Go library does receive raw events. An aggregator that handles raw events and does not compute checkpoints could be referred to as a "pass-through" aggregator. This has been requested, but no prototypes exist.

The metrics SDK describes an "Integrator" which is like the span processor. The integrator receives checkpoints output by the aggregators (one per distinct label set). I believe it is possible to compute multiple aggregations as described without a pass-through aggregator. To achieve multiple aggregations for different label sets, one checkpoint (for one label set) will be merged into multiple output aggregations according to the labels. To achieve multiple intervals, combine shorter aggregations into longer aggregations.

Example: counter x.y.z has labels "a" and "b".
Over one collection interval, multiple events are aggregated corresponding to the distinct label sets, leading to intermediate checkpoints:

x.y.z{a=1,b=1} = 250
x.y.z{a=1,b=0} = 250
x.y.z{a=0,b=1} = 250
x.y.z{a=0,b=0} = 250

The collection interval has four aggregators, each with a count of 250 and a distinct label set.

To expose an aggregation over "a", dropping "b", combine two pairs:
x.y.z{a=1} = 500
x.y.z{a=0} = 500

Likewise, the aggregation over "b", dropping "a", combine two pairs:
x.y.z{b=1} = 500
x.y.z{b=0} = 500

Now suppose this covers a 1 second interval. You'd like to expose a 1 minute interval. Now take the output from the previous step, and combine 60 consecutive intervals together.

As demonstrated (I hope), you shouldn't need a pass-through aggregator to perform multiple aggregations--your goals can be accomplished by merging aggregator checkpoints I think.

The request for a pass-through aggregator would make more sense in an aggregator that computes something derived from time and is not window-oriented, like computing an exponential moving average over the last 5 minutes (regardless of interval).

@jmacd jmacd added api vs sdk bounds bug Something isn't working spec:metrics Related to the specification/metrics directory and removed bug Something isn't working labels May 26, 2020
@jmacd
Copy link
Contributor

jmacd commented May 26, 2020

@cijothomas Mostly my response above was for your statement.
@mxiamxia Given your requirements, it would be perhaps a little simpler to implement your own SDK, since it will do work that you apparently do not want done. I am imagining a simple SDK that just emits a structured log record for every metric event. The default SDK has a lot of machinery for coalescing events that you will not need, so it would be additional expense to use the default SDK.

@mxiamxia
Copy link
Member Author

mxiamxia commented May 26, 2020

@cijothomas Mostly my response above was for your statement.
@mxiamxia Given your requirements, it would be perhaps a little simpler to implement your own SDK, since it will do work that you apparently do not want done. I am imagining a simple SDK that just emits a structured log record for every metric event. The default SDK has a lot of machinery for coalescing events that you will not need, so it would be additional expense to use the default SDK.

Thanks @jmacd. If we want to have it work with OTel Collector, I assume there is no proper way that we can send the metrics with structured logs in the current metrics data model until Logs data model spec is ready for Collector.

I looked Metrics SDK Draft Spec, if I get it right will the new aggregator Exact be able to export all the raw event data with quantile in a collection interval?

Exact: This aggregator MUST store an array of all recorded measurements. This aggregator MUST support exact quantile computations and it MUST support exporting raw values in the order they were recorded, however it is not required to support both of these modes simultaneously (since computing quantiles requires sorting the measurements).

@mxiamxia
Copy link
Member Author

mxiamxia commented Aug 5, 2020

// TODO: Decide if support for RawMeasurements (measurements recorded using
// the synchronous instruments) is necessary. It can be used to delegate the
// aggregation from the application to the agent/collector. See
// #617

It will be very nice If we can get the synchronous instruments for raw measurements support in GA. So it will help us to fully support EMF metrics with logs format in the collector and it is also possible the solution for us to support AWS Lambda customers with OT integration

@bogdandrutu bogdandrutu self-assigned this Aug 5, 2020
@bogdandrutu
Copy link
Member

@jmacd are you ok to support this for GA?

@jmacd
Copy link
Contributor

jmacd commented Aug 8, 2020

I am conflicted about a few things here. We have @cnnradams working on an exemplars prototype, and I encouraged him to go a little further than simply capturing trace/span IDs in histogram buckets. As far as exemplars go, we can assert that there are sensible reasons to include exemplars of any measurement when there has been dimensionality reduction, and that we can include statistical weight to make such exemplars more useful. That said, not many consumers of this statistical sample data are asking for this, and there is not a widespread belief that we need this. What's the minimum viable product here? If we just fall back to trace_id and span_id, would we include these in the data points directly? Should we be limited to only one trace_id/span_id per data point? Should we try to include the removed labels when exporting exemplars? Do we care about statistical exemplars?

All of these questions come up when I think about how to represent raw values, since exemplars are practically the same where the statistical count of each exemplar is 1. There has been some debate about how such values should appear in OTLP: there is a concern that raw values will tend to repeat their label sets and that the current shape of the Metric data point does not offer a way to repeatedly use label sets. Does it matter if every raw data point copies the effective label set, or is it important to combine raw values into a group and list their label set only once? If we don't care about this performance/cost concern, then it's stragithforward to represent raw values and exemplars as a repeated RawValue in the Metric, alongside the scalar, Histogram, and Summary repeated values. In this case, if the data point is actually raw data, then the value Type field should reflect it (e.g., as RAW_INT64 or RAW_DOUBLE), whereas if the data point is actually another value type and the raw values are merely exemplars, then the value type should indicate a different type.

@bogdandrutu and @open-telemetry/specs-metrics-approvers please consider. I don't want this issue to block OTLP, but I want to give @cnnradams a target he can hit at the same time.

@MrAlias
Copy link
Contributor

MrAlias commented Aug 11, 2020

What's the minimum viable product here?

I think this is the important question here.

Does it matter if every raw data point copies the effective label set, or is it important to combine raw values into a group and list their label set only once?

For an MVP, I don't think so. It seems like from what @mxiamxia is asking for it would be satisfied with what you described:

If we don't care about this performance/cost concern, then it's stragithforward to represent raw values and exemplars as a repeated RawValue

It seems to me that if you are asking for "all the data" from the SDK you likely already assume it to be sent in a non-optimized format. My guess is this will be seen as a benefit as the users is looking for full control in all statistical sampling decisions here.

As for trace info:

If we just fall back to trace_id and span_id, would we include these in the data points directly? Should we be limited to only one trace_id/span_id per data point?

Could this be added afterwards in a backwards compatible way? I think it would be useful, but if it blocks the MVP I don't see it being that useful.

@bogdandrutu
Copy link
Member

@MrAlias in the protocol we can add any new data type in a backwards compatible way (there is a need to deploy the collector first, but that is true for any distributed system). In terms of the SDK, this is another aggregation type "AllRawValues" so I don't see why cannot be added later.

@MrAlias
Copy link
Contributor

MrAlias commented Aug 20, 2020

@MrAlias in the protocol we can add any new data type in a backwards compatible way (there is a need to deploy the collector first, but that is true for any distributed system). In terms of the SDK, this is another aggregation type "AllRawValues" so I don't see why cannot be added later.

These are all good points.

IIRC AWS engineers were asking for this. I feel like they should probably be a part of the conversation.

cc @alolita (not sure who else to ping here).

@bogdandrutu bogdandrutu added priority:p1 Highest priority level and removed priority:p2 Medium priority level labels Oct 16, 2020
@reyang reyang added the area:sdk Related to the SDK label Mar 3, 2021
@reyang reyang added this to the Metrics API/SDK Feature Freeze milestone Mar 3, 2021
@jsuereth jsuereth added area:data-model For issues related to data model and removed release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Apr 6, 2021
@jsuereth jsuereth added the release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs label Apr 6, 2021
@jsuereth
Copy link
Contributor

jsuereth commented Apr 6, 2021

From metrics data model sig:

  • This is non-blocking for declaring stable metrics protocol
  • We can add this in post-stability-of-protocol.
  • I'm going to remove the TODO from the metrics.proto around this, but we can still add this feature in the future.

@jsuereth jsuereth self-assigned this Apr 6, 2021
@reyang reyang removed the area:sdk Related to the SDK label Jun 8, 2021
@reyang reyang removed this from the Metrics API/SDK Feature Freeze milestone Jun 8, 2021
@reyang
Copy link
Member

reyang commented Jun 8, 2021

The SDK spec allows processors to have access to the raw data with this PR #1673.
Removing the area:sdk tag and leave it for the data model / OTLP.

@ramonguiu
Copy link

There is an interesting discussion in CNCF community slack. Some folks have shared some potential workarounds for now. Linking it here in case if may be useful to readers of this issue.

@jamielennox
Copy link

The majority of discussion around this happened long before the stabilization of the metrics APIs and the OTLP protocol for metrics. Is there an appetite now for instating something like the Exact (the old name, i don't mind what it's called) aggregator and equivalent OTLP format now and how would we go about it?

At it's core our metrics system doesn't do bucketted histograms and so we're blocked on exporting OTLP from the SDK today which is very limiting. Exporting the complete measurement array is the best way to get the full data into the collector which allows us to write the export in only one place.

@reyang
Copy link
Member

reyang commented Feb 15, 2023

I feel this is interesting, I do understand the need, but I don't see a high demand/priority.

@jmacd @jsuereth I know you're busy with other workstreams, just curious if you think this can be achieved with either one of the following approaches?

  1. Hacky way - folks can always write a custom ExemplarFilter and grab all the raw measurements, then do whatever they want.
  2. A bit more systematic - maybe we should introduce the "MeasurementProcessor" concept, such processor will see all the measurements reported by sync/async APIs (metrics via producer API is not flowing through this path). All the existing aggregations and exemplars can be modeled as special types of MeasurementProcessor.

@MatisseHack
Copy link

I would really love to see this feature come to fruition! My team maintains a service with a very bursty usage pattern and we loose a lot of fidelity in our metrics because there is no way to turn off aggregation.

Similar to AWS CloudWatch, Azure Application Insights has powerful tools to query and visualize raw metric data. I just wish I could get all my metric data there more easily.

@trask trask added triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor and removed priority:p1 Highest priority level release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs labels Apr 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:data-model For issues related to data model spec:metrics Related to the specification/metrics directory triage:accepted:needs-sponsor Ready to be implemented, but does not yet have a specification sponsor
Projects
Status: Spec - Accepted
Development

No branches or pull requests