Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValueRecorder - default aggregation #636

Closed
obecny opened this issue Jun 5, 2020 · 46 comments
Closed

ValueRecorder - default aggregation #636

obecny opened this issue Jun 5, 2020 · 46 comments
Labels
area:api Cross language API specification issue priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:metrics Related to the specification/metrics directory

Comments

@obecny
Copy link
Member

obecny commented Jun 5, 2020

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/metrics/api.md#interpretation

Is this correct that for ValueObserver we have aggregator MinMaxSumCount instead of LastValue ?.
If this is not a bug can you please explain how would you track metric if you are only interested in last value as currently if I understand this correctly there is no such possibility.

@obecny obecny changed the title ValueObserver - default aggragation ValueObserver - default aggregation Jun 5, 2020
@jmacd
Copy link
Contributor

jmacd commented Jun 9, 2020

There's a slight distinction to make between the aggregation that's selected and whether it's possible to know the last value. If you use an observer instrument, there can be at most one value per label set per instrument. Therefore, the Max equals the Min value equals the Last value. I would stipulate that the appropriate representation for MinMaxSumCount when Count=1 is as a single value.

When aggregations are applied locally to an asynchronous instrument, then MinMaxSumCount produces data other than the LastValue aggregator would, because Count can be greater than 1. In these cases, I suspect LastValue is not what the user wants. I believe conceptually MinMaxSumCount is always a good choice.

I believe the real issue that you're raising is that users want a mapping from OTel metric instrument to a Gauge exporter in a legacy system. One approach that I've considered is to simply state that when exporting ValueObserver directly into a legacy metrics system from a client library, choose a Gauge instrument. This is probably correct in most cases. A slight modification would be to state that Gauge should be selected only when there is no local aggregation, that otherwise MinMaxSumCount should be chosen. In these situations, I would still specify that MinMaxSumCount is the correct default aggregation for the collector to use when receiving data via OTLP and there are no other configurations.

The other approach that's been discussed came up in #608. If we create two new Grouping instruments, one synchronous and one asynchronous (#625), then users will have a choice of whether they want their Grouping instrument to have MinMaxSumCount or LastValue aggregation by default. I think #608 is unlikely to pass given the current feedback, so this probably won't happen.

@jmacd
Copy link
Contributor

jmacd commented Jun 9, 2020

@bogdandrutu I'm interested in your thoughts here.

@jmacd
Copy link
Contributor

jmacd commented Jun 11, 2020

There was agreement in the SIG call today that we should change defaults to use the LastValue aggregation for ValueRecorder and ValueObserver. (I may propose an alternative, which would be called RandomValue, but either way I encourage you to go ahead and implement LastValue, since users are expecting Gauges in Prometheus and Statsd and it is the least expensive option.)

@jmacd
Copy link
Contributor

jmacd commented Jun 11, 2020

I will file an issue stating this later today.

@jmacd jmacd added the spec:metrics Related to the specification/metrics directory label Jun 11, 2020
@jkwatson
Copy link
Contributor

Why would ValueRecorder have LastValue as the default aggregation? ValueObserver I understand, but what is the rationale for ValueRecoder not having MMSC as the default?

@jmacd
Copy link
Contributor

jmacd commented Jun 12, 2020

I'm struggling with this issue for ValueRecorder. The issue is that there are three conventional ways to expose data from a synchronous grouping instrument in Prometheus: as a Gauge, a Histogram, or a Summary. As it stands, ValueRecorder generates MMSC which looks like a Summary, so otel-collector exports a Prometheus summary which users are not enjoying.

I'm against creating new instruments that have identical semantics and only are distinguished by their default aggregator. If it were not for OTLP and the collector, I would try to change the spec to say that exporters will have a say in the default aggregator: if the Exporter has a good way to represent MMSC, then it should take that option, but if you're Prometheus, using a Gauge is the only "Good, Cheap" option. With OTLP and the collector involved, I am struggling to find a default behavior that will result in "Good, Cheap" behavior for Prometheus.

One idea I had was to change the default to MinMaxSumCountLast, i.e., add the last value to MMSC so that it can generate a Gauge or a MMSC, but that idea is bad from a statistical point of view, plus its five numbers not one, so it loses some of the advantages of LastValue.

A better idea I had is to introduce a RandomValue aggregator to use as the default for ValueRecorder. It will select a random item from the stream, uniformly, and record it as a single exemplar for the distribution. This qualifies as cheap, but is it good? I think it is, because aggregating this data gives us the distribution--with enough hosts or collection intervals the empirical distribution emerges. I'm afraid LastValue won't do in this case, because it is biased to the moment of collection.

To summarize, I'm interested in changing the spec to say that the operator (i.e., installer of the SDK) and/or Exporters have a say in what the default aggregation should be, since even at the cheap end of the scale, systems differ in how they want to treat these measurements. Prometheus would prescribe Gauge for ValueRecorder. NewRelic would prescribe MinMaxSumCount for ValueRecorder, but this is not very satisfactory because OTLP has to make a decision. If OTLP used RandomValue as the default ValueRecorder aggregator, then exporters on the collector side will have to work with it. Prometheus can export the random value as a gauge value, but now we have a situation where Prometheus acts somewhat differently if exported locally vs through the collector. Maybe Prometheus should always export a random value as a gauge.

@jmacd jmacd changed the title ValueObserver - default aggregation ValueObserver and ValueRecorder - default aggregation Jun 12, 2020
@jkwatson
Copy link
Contributor

As it stands, ValueRecorder generates MMSC which looks like a Summary, so otel-collector exports a Prometheus summary which users are not enjoying.

I don't know much about Prometheus. Why is this such a bad experience?

@jmacd
Copy link
Contributor

jmacd commented Jun 12, 2020

I'd like to dig deeper into this, but my understanding is that Summary types have traditionally been used for quantiles, but the quantile summaries cannot be merged and are therefore generally discouraged. If Prometheus were changed to support a mergeable Summary this might not be a problem. Of course MMSC is trivially mergeable, but it's a special case from Prometheus' point of view.

Another "solution" here would be to use MMSC to compute the average value, and let Prometheus report a Gauge with the average. (I still favor this RandomValue idea.)

@jkwatson
Copy link
Contributor

I guess my concern is that if we're making a fairly draconian decision based solely on a prometheus limitation, that might not be the best reason to make that decision.

Ending up with only have a last-or-random value aggregation be the default for a ValueRecorder would be a really non-optimal out-of-the-box for our customers, for example. If I'm recording latency timings for something of medium throughput and only got a random value each reporting period, that seems like a terrible default, rather than providing some kind of statistically interesting result by default. If we could add a mean to MMSC (making it MMMSC?), that seems much more generally useful out of the box across a wide spectrum of backends.

@jmacd
Copy link
Contributor

jmacd commented Jun 12, 2020

If we can find support for exporting a Gauge of the mean value in Prometheus, then I think it's a good solution. MMSC already has the average value (i.e., Sum / Count) so we don't have to change much other than specify how to export MMSC values to Prometheus.

@jmacd
Copy link
Contributor

jmacd commented Jun 12, 2020

I am not sure if "draconian" is entirely fair. :-) I think "frugal" or "thrifty" could also apply.

@jkwatson
Copy link
Contributor

ah, yes, of course, you can compute the mean from the MMSC. foggy meeting brain got in the way of thinking. ;).

@dcwangmit01
Copy link

dcwangmit01 commented Jun 12, 2020

If we can find support for exporting a Gauge of the mean value in Prometheus, then I think it's a good solution. MMSC already has the average value (i.e., Sum / Count) so we don't have to change much other than specify how to export MMSC values to Prometheus.

Exporting the mean value could be very misleading. For long-running metrics that have collected a lot of measurements, recent measurements that are far from the mean will have a hard time triggering alerts which would be based off of mean instead of LastValue. For this same reason, RandomValue is also not what people will expect. Most of the open-source world is coming from Prometheus where gauges provide a LastValue snapshot in time. I truly believe LastValue is what people expect.

@jkwatson
Copy link
Contributor

Why are thinking that a ValueRecorder should map to a Gauge? This seems extremely counter intuitive to me. I really want a distribution as the output, and MMSC is the cheapest distribution money can buy.

@jmacd
Copy link
Contributor

jmacd commented Jun 13, 2020

I think we've been thinking too much about distributions as being stable with respect to time, and/or we've been assuming relatively short collection intervals. I've come around on this: knowing the last value tells us something time-dependent whereas otherwise the distribution statistics do not.

I will propose that we add a Last value to MMSC, making it something like MinMaxLastSumCount, and let it be the new default. This will allow exporting either value on the other side of an OTLP connection. In a local configuration, probably the extra work can be avoided.

In particular, I'd like it if the OTLP protocol specified that MMLSC of a single data point can be encoded as a single data point where Min == Max == Last == Sum and Count == 1.

@jkwatson
Copy link
Contributor

I think you lost me at "being stable with respect to time". Can you explain, or point this non-statistics-guru at some description of what this means?

If the MMSC has a start and end time on it, that seems to pretty well define the time period for what it is describing. I'm guessing your meaning something else, though.

@jmacd
Copy link
Contributor

jmacd commented Jun 16, 2020

@jkwatson I was thinking that a "stable" distribution as one where there is no change in the distribution with respect to time--nothing time dependent. We'll know the start and end time of the observation window, and knowing the last timestamped event value inside the window tells us still more about the events. (Knowing the first value is similar, knowing both is still more information).

Comparing the last value with the mean value could tell you something about a tendency in the data, for example a rising mean with respect to time.

The last timestamp can be useful on its own. If there is a large gap between the last timestamp and the end of the window (compared with the expected arrival time), it might reveal that something bursty was happening earlier in the window.

@jkwatson
Copy link
Contributor

I like the idea of including the "last" recording exemplar. 👍

@carlosalberto carlosalberto added the area:api Cross language API specification issue label Jun 26, 2020
@jmacd jmacd added the release:required-for-ga Must be resolved before GA release, or nice to have before GA label Jun 29, 2020
@obecny
Copy link
Member Author

obecny commented Aug 12, 2020

@bogdandrutu is there already an updated spec for that or at least draft ?

@bogdandrutu
Copy link
Member

No, this is what I am trying to discuss here. Is this proposal good? Should we move forward and propose this in the specs?

@obecny
Copy link
Member Author

obecny commented Aug 12, 2020

If we are about discussing then I would always add lastValue to MinMaxSumCount so it will be MinMaxLastSumCount. Why? many reason but even both of the metrics have a "value" in name so I would always be interested in knowing lastValue for both. With MinMaxSumCount I can't get the last value which is weird. I'm fine for LastValue or MinMaxLastSumCount
otherwise I cannot even display it easily to see if they are changing or not - just try to make a simple linear graph to show how the value was changing over time - you can't do it.

@bogdandrutu
Copy link
Member

Ok, so if I understand correctly you are saying that even for ValueRecorder (where the name Value may be a confusing) you would want to see the LastValue, this may be arguable but I can see some value in it.

But for the ValueObserver/GaugeObserver why do you need MinMaxSumCount?

@bogdandrutu
Copy link
Member

Also if I have to always export the LastValue and the MinMaxSumCount I would do it in 2 different metrics, because the first is a Gauge and the second is trying to represent a distribution of values.

@obecny
Copy link
Member Author

obecny commented Aug 12, 2020

Ok, so if I understand correctly you are saying that even for ValueRecorder (where the name Value may be a confusing) you would want to see the LastValue, this may be arguable but I can see some value in it.

But for the ValueObserver/GaugeObserver why do you need MinMaxSumCount?

Bogdan I'm fine with either LastValue or MinMaxLastSumCount. Having only LastValue is much more valuable then MinMaxSumCount as I can count then the rest myself (min, max etc.) but without lastValue this is not possible and I cannot do any calculation.

I can see some value in having already the basic aggregators (min, max etc.) but only when I also have the lastValue, otherwise I would get rid of MinMaxSumCount in favor of LastValue only.

@jmacd
Copy link
Contributor

jmacd commented Aug 12, 2020

I'm a bit confused @obecny. Are you suggesting that if there are 1000 values recorded with a ValueRecorder over an interval, that we should still only report the LastValue? If so, you can't compute the min or the max or the count, which are all relevant and useful for ValueRecorder data points.

@jmacd
Copy link
Contributor

jmacd commented Aug 12, 2020

Also

Also if I have to always export the LastValue and the MinMaxSumCount I would do it in 2 different metrics, because the first is a Gauge and the second is trying to represent a distribution of values.

This is not a conclusion that I reach. LastValue does represent a distribution of values. It's the maximum timestamp that was present in the data set. It's also a gauge, but you can think of Min, and Max as gauges. The Summary type as we had it (IMO) was able to represent N > 1 values in a single data point.

@obecny
Copy link
Member Author

obecny commented Aug 12, 2020

I'm a bit confused @obecny. Are you suggesting that if there are 1000 values recorded with a ValueRecorder over an interval, that we should still only report the LastValue? If so, you can't compute the min or the max or the count, which are all relevant and useful for ValueRecorder data points.

Imagine simple situation, you want to create a CPU usage graph that will show current value (last value), and min, max.
How would you do it ?

  1. If I only have lastValue I can display this and add some calculation. I would choose ValueObserver with LastValue aggregator and then do the rest calculation myself.

  2. Now let's improve point 1 - I don't want to make any calculation, how can I achieve that ? I would think about choosing something that can save more points and do some aggregation for me, here comes the ValueRecorder. But because you want the ValueRecorder not to have lastValue I need to use 2 metrics. Now what will be cheaper. Using 2 metrics with 2 aggregators or just one with MinMaxLastSumCount so ValueRecorder with MinMaxLastSumCount.

If ValueRecorder has MinMaxSumCount aggregator, why it cannot simply hold the lastValue what is the cost of that?

If ValueRecorder won't have MinMaxLastSumCount and I still don't want to use 2 metrics, I would rather create my own batcher where I change the aggregator for ValueRecorder so that I can have it. But why would I do it, if the ValueRecorder can already support that ?

And just to be clear I'm only talking about ValueRecorder to have MinMaxLastSumCount aggregator, for ValueObserver LastValue is perfectly fine - originally it had MinMaxSumCount which was the reason why all of this conversation started.

@bogdandrutu
Copy link
Member

@obecny very nice example, I would personally solve it differently:

  1. A lot of the backends have the possibility to calculate Min/Max on demand for a Gauge/LastValue metric, so you don't actually need to calculate anything by yourself.
  2. In case you want to report more granular measurements (more often) you still can use ValueObserver and configure an export time for this metric to be more often (not yet supported in our SDK, but this will be supported). Once you do that you not only have the ability to see min/max for the more granular metric but also when display the gauge metric see spikes.
  3. I would never use ValueRecorded for CPU (and for any metric that is calculated outside of our library). If the solution 2 is not acceptable because of the storage cost on the backend you choose, using "to be fully specified" views, you will be able to set "scrape time" to be every 1 second for this ValueObserver and export time to be every 10 seconds and produce a MinMaxSumCount + LastValue (explain next why I am using 2 different "views").

If ValueRecorder has MinMaxSumCount aggregator, why it cannot simply hold the lastValue what is the cost of that?

Here you are talking about an "available" Aggregator in the SDK, that we can offer. Which I agree that we can easily build an aggregator that is called MinMaxLastSumCount. If you don't mind can you tell me what backend do you have in mind that supports reporting in only one Metric MinMaxSumCount and LastValue/Gauge? I don't know about any to support this, so most likely you will end up with 2 metrics on the backend side.

When it comes to "implementation" detail, I would say that users should be aware that they configure 2 different "views/metrics" (based on my observation that in all the backends these will be 2 different metrics at least), but internally we can "merge" these 2 metrics requested by the user into only one "Aggregator" - similar argument can be made if user asks for "LastValue" + "Histogram" we can still have a smart logic and have a "combined" Aggregator. So I don't see why "MinMaxLastSumCount" is a special case compare to any other combination of 2 views/metrics.

And just to be clear I'm only talking about ValueRecorder to have MinMaxLastSumCount aggregator, for ValueObserver LastValue is perfectly fine - originally it had MinMaxSumCount which was the reason why all of this conversation started.

Happy that we agree on the ValueObserver.

@jmacd
Copy link
Contributor

jmacd commented Aug 12, 2020

Given the proposal in open-telemetry/opentelemetry-proto#199, I'm getting confused about how we could represent Min, Max, Sum, and Count fields. There has been a motion to consider Histogram or DDSketch as defaults for ValueRecorder, to which I'd add that an Exact aggregation (i.e., exporting raw data points) would also work, but that's a performance problem we'd like to avoid as a default.

@obecny
Copy link
Member Author

obecny commented Aug 12, 2020

@bogdandrutu thx for detailed explanation it makes sense and I'm glad that ValueObserver will have LastValue only :)

@jmacd jmacd changed the title ValueObserver and ValueRecorder - default aggregation ValueRecorder - default aggregation Aug 19, 2020
@jmacd
Copy link
Contributor

jmacd commented Aug 19, 2020

I retitled this because we have wide agreement that the default aggregator for ValueObserver should be LastValue (i.e., produce Gauge data points).

@bogdandrutu
Copy link
Member

Closing this as a duplicate of #982

@haarakdy
Copy link

haarakdy commented May 20, 2022

How do I replace the default MinMaxSumCount to the LastValue for valuerecorder in Python.

Is it possible to disable aggregation, collecting and shipping all values, the whole timeseries?
At the moment I have a monkeypatch

class DefaultAggregator:
    """
    The default aggregator is MinMaxSumCount. I need ValueObserverAggregator
    """

    initialized: bool = False
    _get_default_aggregator: Callable = view.get_default_aggregator

    def __init__(self):
        if DefaultAggregator.initialized:
            return
        DefaultAggregator.initialized = True
        DefaultAggregator._get_default_aggregator = view.get_default_aggregator
        view.get_default_aggregator = self.get_default_aggregator

    def get_default_aggregator(self, instrument: view.InstrumentT) -> Aggregator:
        instrument_type = instrument.__class__
        if issubclass(instrument_type, (UpDownCounter, ValueRecorder)):
            return ValueObserverAggregator
        return DefaultAggregator._get_default_aggregator(instrument)


DefaultAggregator()

I will appreciate a tip/link
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:api Cross language API specification issue priority:p1 Highest priority level release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:metrics Related to the specification/metrics directory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants