ValueRecorder - default aggregation #636
There's a slight distinction to make between the aggregation that's selected and whether it's possible to know the last value. If you use an observer instrument, there can be at most one value per label set per instrument. Therefore, the Max equals the Min equals the Last value. I would stipulate that the appropriate representation for MinMaxSumCount when Count=1 is as a single value. When aggregations are applied locally to an asynchronous instrument, MinMaxSumCount produces different data than the LastValue aggregator would, because Count can be greater than 1. In these cases, I suspect LastValue is not what the user wants. I believe conceptually MinMaxSumCount is always a good choice.

I believe the real issue that you're raising is that users want a mapping from an OTel metric instrument to a Gauge exporter in a legacy system. One approach that I've considered is to simply state that when exporting ValueObserver directly into a legacy metrics system from a client library, choose a Gauge instrument. This is probably correct in most cases. A slight modification would be to state that Gauge should be selected only when there is no local aggregation, and that otherwise MinMaxSumCount should be chosen. In these situations, I would still specify that MinMaxSumCount is the correct default aggregation for the collector to use when receiving data via OTLP and there is no other configuration.

The other approach that's been discussed came up in #608. If we create two new Grouping instruments, one synchronous and one asynchronous (#625), then users will have a choice of whether they want their Grouping instrument to have MinMaxSumCount or LastValue aggregation by default. I think #608 is unlikely to pass given the current feedback, so this probably won't happen.
@bogdandrutu I'm interested in your thoughts here.
There was agreement in the SIG call today that we should change defaults to use the LastValue aggregation for ValueRecorder and ValueObserver. (I may propose an alternative, which would be called RandomValue, but either way I encourage you to go ahead and implement LastValue, since users are expecting Gauges in Prometheus and Statsd and it is the least expensive option.)
I will file an issue stating this later today.
Why would ValueRecorder have LastValue as the default aggregation? ValueObserver I understand, but what is the rationale for ValueRecorder not having MMSC as the default?
I'm struggling with this issue for ValueRecorder. The issue is that there are three conventional ways to expose data from a synchronous grouping instrument in Prometheus: as a Gauge, a Histogram, or a Summary. As it stands, ValueRecorder generates MMSC, which looks like a Summary, so otel-collector exports a Prometheus summary, which users are not enjoying. I'm against creating new instruments that have identical semantics and are distinguished only by their default aggregator.

If it were not for OTLP and the collector, I would try to change the spec to say that exporters have a say in the default aggregator: if the exporter has a good way to represent MMSC, then it should take that option, but if you're Prometheus, using a Gauge is the only "Good, Cheap" option. With OTLP and the collector involved, I am struggling to find a default behavior that results in "Good, Cheap" behavior for Prometheus.

One idea I had was to change the default to MinMaxSumCountLast, i.e., add the last value to MMSC so that it can generate either a Gauge or an MMSC, but that idea is bad from a statistical point of view, plus it's five numbers, not one, so it loses some of the advantages of LastValue.

A better idea I had is to introduce a RandomValue aggregator to use as the default for ValueRecorder. It will select a random item from the stream, uniformly, and record it as a single exemplar for the distribution. This qualifies as cheap, but is it good? I think it is, because aggregating this data gives us the distribution: with enough hosts or collection intervals, the empirical distribution emerges. I'm afraid LastValue won't do in this case, because it is biased to the moment of collection.

To summarize, I'm interested in changing the spec to say that the operator (i.e., the installer of the SDK) and/or exporters have a say in what the default aggregation should be, since even at the cheap end of the scale, systems differ in how they want to treat these measurements. Prometheus would prescribe Gauge for ValueRecorder. NewRelic would prescribe MinMaxSumCount for ValueRecorder, but this is not very satisfactory because OTLP has to make a decision. If OTLP used RandomValue as the default ValueRecorder aggregator, then exporters on the collector side would have to work with it. Prometheus can export the random value as a gauge value, but now we have a situation where Prometheus acts somewhat differently when data is exported locally vs. through the collector. Maybe Prometheus should always export a random value as a gauge.
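For illustration only, here is a minimal sketch of what such a RandomValue aggregator could look like (the class and method names are hypothetical, not part of any existing SDK). It keeps one uniformly random value from the stream using single-item reservoir sampling, so the exported exemplar is an unbiased sample of the interval:

```python
import random


class RandomValueAggregator:
    """Keep one uniformly random value recorded during the interval."""

    def __init__(self):
        self._count = 0
        self._value = None

    def update(self, value):
        self._count += 1
        # Single-item reservoir sampling: replace the kept value with
        # probability 1/count, so after n updates each recorded value has
        # had an equal 1/n chance of being the one that survives.
        if random.randrange(self._count) == 0:
            self._value = value

    def checkpoint(self):
        # The exported exemplar for the interval (None if nothing was recorded).
        return self._value
```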
I don't know much about Prometheus. Why is this such a bad experience?
I'd like to dig deeper into this, but my understanding is that Summary types have traditionally been used for quantiles, but the quantile summaries cannot be merged and are therefore generally discouraged. If Prometheus were changed to support a mergeable Summary, this might not be a problem. Of course MMSC is trivially mergeable, but it's a special case from Prometheus' point of view. Another "solution" here would be to use MMSC to compute the average value, and let Prometheus report a Gauge with the average. (I still favor this RandomValue idea.)
I guess my concern is that if we're making a fairly draconian decision based solely on a Prometheus limitation, that might not be the best reason to make that decision. Ending up with only a last-or-random-value aggregation as the default for a ValueRecorder would be a really non-optimal out-of-the-box experience for our customers, for example. If I'm recording latency timings for something of medium throughput and only get a random value each reporting period, that seems like a terrible default, rather than providing some kind of statistically interesting result by default. If we could add a mean to MMSC (making it MMMSC?), that seems much more generally useful out of the box across a wide spectrum of backends.
If we can find support for exporting a Gauge of the mean value in Prometheus, then I think it's a good solution. MMSC already has the average value (i.e., Sum / Count) so we don't have to change much other than specify how to export MMSC values to Prometheus.
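As a small worked example of that point (illustrative only; the function name here is made up, not an SDK API), the gauge value is already implied by the MMSC fields:

```python
def mmsc_to_gauge_value(minimum, maximum, total, count):
    """Value a Prometheus gauge could report from an MMSC checkpoint: the mean."""
    if count == 0:
        return None  # nothing was recorded this interval
    # Min and Max are unused here; they could be exported as separate series.
    return total / count


# An interval that recorded the values 2, 3, and 7:
# min=2, max=7, sum=12, count=3 -> the gauge reports 4.0 (the mean).
assert mmsc_to_gauge_value(2, 7, 12, 3) == 4.0
```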
I am not sure if "draconian" is entirely fair. :-) I think "frugal" or "thrifty" could also apply.
ah, yes, of course, you can compute the mean from the MMSC. foggy meeting brain got in the way of thinking. ;)
Exporting the
Why are we thinking that a ValueRecorder should map to a Gauge? This seems extremely counterintuitive to me. I really want a distribution as the output, and MMSC is the cheapest distribution money can buy.
I think we've been thinking too much about distributions as being stable with respect to time, and/or we've been assuming relatively short collection intervals. I've come around on this: knowing the last value tells us something time-dependent, whereas the other distribution statistics do not.

I will propose that we add a Last value to MMSC, making it something like MinMaxLastSumCount, and let it be the new default. This will allow exporting either value on the other side of an OTLP connection. In a local configuration, the extra work can probably be avoided. In particular, I'd like it if the OTLP protocol specified that MMLSC of a single data point can be encoded as a single data point where Min == Max == Last == Sum and Count == 1.
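A rough sketch of what such a MinMaxLastSumCount aggregation could look like (field and method names are illustrative, not a spec proposal), including the single-point case described above:

```python
from dataclasses import dataclass


@dataclass
class MinMaxLastSumCount:
    min: float = float("inf")
    max: float = float("-inf")
    last: float = 0.0
    sum: float = 0.0
    count: int = 0

    def update(self, value: float) -> None:
        self.min = min(self.min, value)
        self.max = max(self.max, value)
        self.last = value  # the most recent recording in the interval
        self.sum += value
        self.count += 1

    def is_single_point(self) -> bool:
        # The compact encoding discussed above: with exactly one recording,
        # Min == Max == Last == Sum and Count == 1, so one number suffices.
        return self.count == 1
```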
I think you lost me at "being stable with respect to time". Can you explain, or point this non-statistics-guru at some description of what this means? If the MMSC has a start and end time on it, that seems to pretty well define the time period for what it is describing. I'm guessing you mean something else, though.
@jkwatson I was thinking of a "stable" distribution as one where there is no change in the distribution with respect to time: nothing time-dependent. We'll know the start and end time of the observation window, and knowing the last timestamped event value inside the window tells us still more about the events. (Knowing the first value is similar; knowing both is still more information.) Comparing the last value with the mean value could tell you something about a tendency in the data, for example a rising mean with respect to time. The last timestamp can be useful on its own: if there is a large gap between the last timestamp and the end of the window (compared with the expected arrival time), it might reveal that something bursty was happening earlier in the window.
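A tiny worked example of that comparison (the numbers are made up purely for illustration):

```python
# (timestamp, value) pairs observed inside one collection window
window = [(0.0, 10.0), (2.5, 12.0), (5.0, 14.0), (7.5, 16.0)]

values = [v for _, v in window]
mean = sum(values) / len(values)  # 13.0
last = window[-1][1]              # 16.0

# last > mean hints that the series was rising toward the end of the window.
# Min/Max/Sum/Count alone give identical numbers for the reversed (falling)
# sequence, so the last value adds genuinely time-dependent information.
print(mean, last)
```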
I like the idea of including the "last" recording exemplar. 👍
@bogdandrutu is there already an updated spec for that, or at least a draft?
No, this is what I am trying to discuss here. Is this proposal good? Should we move forward and propose this in the specs?
If we are discussing this, then I would always add LastValue.
Ok, so if I understand correctly you are saying that even for But for the
Also, if I have to always export the LastValue and the MinMaxSumCount, I would do it in 2 different metrics, because the first is a Gauge and the second is a Summary.
Bogdan, I'm fine with either LastValue or MinMaxLastSumCount. Having only LastValue is much more valuable than MinMaxSumCount, as I can then compute the rest myself (min, max, etc.), but without the last value this is not possible and I cannot do any calculation. I can see some value in already having the basic aggregations (min, max, etc.), but only when I also have the last value; otherwise I would get rid of MinMaxSumCount in favor of LastValue only.
I'm a bit confused @obecny. Are you suggesting that if there are 1000 values recorded with a ValueRecorder during a collection interval, you only want to see the last one?
Also
This is not a conclusion that I reach. LastValue does represent a distribution of values: it's the value with the maximum timestamp that was present in the data set. It's also a gauge, but you can think of Min and Max as gauges too. The Summary type as we had it (IMO) was able to represent N > 1 values in a single data point.
Imagine a simple situation: you want to create a CPU usage graph that shows the current value (last value) together with the min and max.
If ValueRecorder has MinMaxLastSumCount, I can cover this case with a single metric. If ValueRecorder won't have MinMaxLastSumCount and I still don't want to use 2 metrics, I would rather create my own batcher where I change the aggregator for ValueRecorder so that I can have it. But why would I do that if the default could already give me this? And just to be clear, I'm only talking about ValueRecorder.
@obecny very nice example, I would personally solve it differently:
Here you are talking about an "available" Aggregator in the SDK that we can offer, and I agree that we can easily build an aggregator called MinMaxLastSumCount. When it comes to the "implementation" detail, I would say that users should be aware that they configure 2 different "views/metrics" (based on my observation that in all the backends these will be at least 2 different metrics), but internally we can "merge" these 2 metrics requested by the user into only one "Aggregator". A similar argument can be made if a user asks for "LastValue" + "Histogram": we can still have smart logic and a "combined" Aggregator. So I don't see why "MinMaxLastSumCount" is a special case compared to any other combination of 2 views/metrics.
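As an illustration of that "merge" idea (the class below is a made-up sketch, not an actual SDK type): a combined aggregator can fan a single recording out to several underlying aggregators while each configured view is still exported as its own metric.

```python
class CombinedAggregator:
    """Fan one recorded value out to several underlying aggregators."""

    def __init__(self, *aggregators):
        self._aggregators = aggregators

    def update(self, value):
        # A single recording feeds every requested view exactly once.
        for aggregator in self._aggregators:
            aggregator.update(value)

    def checkpoint(self):
        # One checkpoint per underlying view/metric (e.g. a LastValue
        # checkpoint and a histogram checkpoint from the same stream).
        return tuple(aggregator.checkpoint() for aggregator in self._aggregators)
```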
Happy that we agree on the ValueObserver default.
Given the proposal in open-telemetry/opentelemetry-proto#199, I'm getting confused about how we could represent Min, Max, Sum, and Count fields. There has been a motion to consider Histogram or DDSketch as defaults for ValueRecorder, to which I'd add that an Exact aggregation (i.e., exporting raw data points) would also work, but that's a performance problem we'd like to avoid as a default.
@bogdandrutu thx for the detailed explanation, it makes sense, and I'm glad that ValueObserver will have LastValue only :)
I retitled this because we have wide agreement that the default aggregator for ValueObserver should be LastValue (i.e., produce Gauge data points).
Closing this as a duplicate of #982
How do I replace the default MinMaxSumCount with LastValue for ValueRecorder in Python? Is it possible to disable aggregation and collect and ship all values, i.e. the whole time series?
I would appreciate a tip/link.
https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/metrics/api.md#interpretation
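Not authoritative, but as a rough sketch: in current versions of the Python SDK the ValueRecorder instrument has since become Histogram, and the default aggregation can be overridden per instrument with a View (exact module paths and class names depend on the SDK version you have installed; the instrument name below is just an example):

```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import LastValueAggregation, View

# Route the instrument named "latency" through LastValue instead of the
# default aggregation for its instrument kind.
provider = MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(ConsoleMetricExporter())],
    views=[View(instrument_name="latency", aggregation=LastValueAggregation())],
)
```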
Is it correct that for ValueObserver we have the MinMaxSumCount aggregator instead of LastValue? If this is not a bug, can you please explain how you would track a metric when you are only interested in the last value? As I currently understand it, there is no such possibility.