fix(metrics): remove in_memory settings #946
The `memory` setting controls whether instruments are kept when they receive no updates in a collection interval. This should instead be configured via temporality: `memory = true` -> Cumulative, `memory = false` -> Delta.
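The mapping above can be written down as a small helper. This is only a sketch: the `Temporality` enum here is a local stand-in and `temporality_for_legacy_memory` is a hypothetical name, neither is taken from the opentelemetry-rust API.

```rust
/// Local stand-in for the SDK's temporality setting (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Temporality {
    /// Every export covers the whole lifetime of the instrument.
    Cumulative,
    /// Every export covers only the most recent collection interval.
    Delta,
}

/// Translate the removed `memory` flag into the temporality the PR description
/// says gives the same behavior: `true` -> Cumulative, `false` -> Delta.
fn temporality_for_legacy_memory(memory: bool) -> Temporality {
    if memory {
        Temporality::Cumulative
    } else {
        Temporality::Delta
    }
}
```

A caller that used to pass `true` would, under this reading, configure cumulative temporality instead.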
Codecov Report
Base: 67.7% // Head: 69.3% // Increases project coverage by
Additional details and impacted files
@@           Coverage Diff           @@
##            main    #946    +/-   ##
=======================================
+ Coverage   67.7%   69.3%   +1.6%
=======================================
  Files        116     116
  Lines       9103    9090     -13
=======================================
+ Hits        6164    6308    +144
+ Misses      2939    2782    -157
Will this affect gauges? |
What will happen if this stops reporting counter values for deltas? e.g. say at a regular reporting interval we're pushing metrics. So we have a counter where at T0 (time 0)=0, it goes up by 30 so T1 reports T1=30, then it doesn't report because it's not changed ever. So T2=null, T3=null, etc. If I'm graphing this out and I try to use my last known value, it'd be 30 and stay there. Shouldn't it at least emit one more time (T2=0) before not emitting again until the value changes? IOW, for deltas, I'd expect: Instead of:
IOW, shouldn't it emit a zero at least once after getting reset? That is, it should report at least one more time when there's been any change (and in this case there was indeed a change: when a delta's counter has gone from its current value to zero)? |
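To make the two behaviors in this question concrete, here is a minimal sketch of a delta counter's exports over a few collection intervals. It is a toy model, not the SDK's code: `Interval`, `export_skip_untouched`, and `export_with_trailing_zero` are hypothetical names, and `None` stands for "nothing exported in this interval".

```rust
/// One collection interval: `Some(sum)` if the counter recorded any measurement
/// during the interval (the sum may be 0), `None` if it was not touched at all.
type Interval = Option<u64>;

/// Untouched instruments are simply not exported, so the output mirrors the input.
fn export_skip_untouched(intervals: &[Interval]) -> Vec<Option<u64>> {
    intervals.to_vec()
}

/// Alternative raised in the question: after an interval with updates, emit one
/// explicit zero for the first untouched interval, then go silent again.
fn export_with_trailing_zero(intervals: &[Interval]) -> Vec<Option<u64>> {
    let mut out = Vec::with_capacity(intervals.len());
    let mut previous_was_update = false;
    for interval in intervals {
        match interval {
            Some(value) => {
                out.push(Some(*value));
                previous_was_update = true;
            }
            None if previous_was_update => {
                // First quiet interval after activity: assert the zero once.
                out.push(Some(0));
                previous_was_update = false;
            }
            None => out.push(None),
        }
    }
    out
}
```

With the intervals `[Some(30), None, None]` from the T1..T3 example, the first function yields `[Some(30), None, None]` and the second yields `[Some(30), Some(0), None]`.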
That's a good point. I don't think so but I can add some tests in this PR
Note that in delta temporality, we will export the metrics when there is any change for the instrument even if the value is 0. So if we emit one more time after the final round of change it can be confusing. i.e If your use case doesn't need to differentiate |
That's my concern, that there is a hidden delta here that isn't emitted. Null, to me, isn't the same as zero. Zero is an assertion and null implies lack of information/data. Here there's a known change -- the count has reverted to zero but it isn't asserted. |
I see your point.
For So given the following data points: I think one thing worth mentioning is the OTLP metrics data model: aggregated metrics are associated with a range of time instead of a single timestamp. The difference between the temporalities is just where this range starts. For So in your previous example, if SDK exporting
|
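The point about time ranges can be illustrated with a small model in which every exported data point carries a start and an end time. This is a sketch under simplifying assumptions: time is counted in whole collection intervals, every interval is exported, and `Point` and `points` are hypothetical names rather than OTLP or SDK types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Temporality {
    Cumulative,
    Delta,
}

/// A data point covering the time range (start, end], in units of collection intervals.
#[derive(Debug, PartialEq, Eq)]
struct Point {
    start: u64,
    end: u64,
    value: u64,
}

/// `increments[i]` is the amount added to a counter during interval `i`,
/// which spans (i, i + 1]. Returns one data point per interval.
fn points(increments: &[u64], temporality: Temporality) -> Vec<Point> {
    let mut total: u64 = 0;
    increments
        .iter()
        .enumerate()
        .map(|(i, inc)| {
            total += *inc;
            let end = i as u64 + 1;
            match temporality {
                // The range always starts at the beginning; the value is the running total.
                Temporality::Cumulative => Point { start: 0, end, value: total },
                // The range starts where the previous one ended; the value is the change.
                Temporality::Delta => Point { start: i as u64, end, value: *inc },
            }
        })
        .collect()
}
```

For increments `[30, 0]`, the cumulative points are (0, 1] = 30 and (0, 2] = 30, while the delta points are (0, 1] = 30 and (1, 2] = 0. Only the start of each range differs.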
Thank you for your patient/full explanation -- it was very helpful. 😄 In your example, is there any way to force it to report 0s for T2-T7 for the counters it knows about? Or at least a way for an exporter to access the available counters (even if they haven't been changed)? |
Yeah, the exporter implementation can "remember" all seen instruments should it choose to. I am working on refactoring the metrics APIs so that we are aligned with the spec. TBH I am a little worried that if we allow the SDK to "remember" seen instruments, it can easily cause memory issues. For example, if someone accidentally includes the request id as an attribute, the number of instruments will grow along with the number of requests, even if each instrument only ever has one data point. |
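The memory concern can be shown with a toy exporter that remembers every series it has ever seen. This is a sketch, not a real exporter: `RememberingExporter`, `record`, and `tracked_series` are hypothetical names, and attributes are flattened to a plain string for brevity.

```rust
use std::collections::HashMap;

/// Toy exporter that never forgets a series once it has seen it.
#[derive(Default)]
struct RememberingExporter {
    /// Running total per (instrument name, attribute set) pair.
    last_seen: HashMap<(String, String), u64>,
}

impl RememberingExporter {
    /// Record a measurement for the series identified by name and attributes.
    fn record(&mut self, name: &str, attributes: &str, value: u64) {
        *self
            .last_seen
            .entry((name.to_string(), attributes.to_string()))
            .or_insert(0) += value;
    }

    /// Number of distinct series currently held in memory.
    fn tracked_series(&self) -> usize {
        self.last_seen.len()
    }
}
```

Recording 1000 requests with a `request_id=...` attribute leaves 1000 remembered series, each with a single data point, whereas the same 1000 requests recorded with a fixed attribute such as `route=/graphql` leave exactly one.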
The update requires a change to the implementation and test update as follows:

- In otel 0.18.0, processor factories had a `with_memory(bool)` method which we were using when building our prometheus exporter. AFAICT, this used to be a mechanism for controlling how metrics handled stale gauges. In 0.19.0, [this method was removed](open-telemetry/opentelemetry-rust#946) and now gauges are all assumed to be as though they were created with `false`. We had been providing `true` on our call. I'm not 100% certain of the impact of this change, but it appears that we can ignore it. We may need to consider it more carefully if problems arise.
- There are now two standard OTEL attributes: `otel_scope_name="apollo/router",otel_scope_version=""` added to output and a number of tests had to be updated to accommodate that change.
- One of our tests appeared to be searching for `apollo_router_cache_hit_count` (and this was working) when it should have been searching for `apollo_router_cache_hit_count_total` (likewise for miss). I've updated the test and think this is the correct thing to do. It looks like a bug was fixed in otel and this change matches the fix.

Regarding that last point. The prometheus spec mandates naming format and the change was part of the compliance with that spec. This PR made the change: open-telemetry/opentelemetry-rust#952

The two affected counters in the router were:

apollo_router_cache_hit_count -> apollo_router_cache_hit_count_total
apollo_router_cache_miss_count -> apollo_router_cache_miss_count_total

It's good that our prometheus metrics are now spec compliant, but we should note this in the release notes and (if possible) somewhere in our documentation. I'll add it to the changeset at least.

The upgrade fixes many of the outstanding issues related to opentelemetry and various APM vendors:

Fixes: #2878
Fixes: #2066
Fixes: #2959
Fixes: #2225
Fixes: #1520

<!-- start metadata -->

**Checklist**

Complete the checklist (and note appropriate exceptions) before a final PR is raised.

- [x] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [x] Unit Tests
    - [x] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]. Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]. Tick whichever testing boxes are applicable. If you are adding Manual Tests:
    - please document the manual testing (extensively) in the Exceptions.
    - please raise a separate issue to automate the test and label it (or ask for it to be labeled) as `manual test`
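The counter rename described in the upgrade notes above amounts to appending a `_total` suffix to counter names that lack one. A minimal sketch of that rule follows; `prometheus_counter_name` is a hypothetical helper, and how the opentelemetry-rust Prometheus exporter actually implements the rename is not shown in this thread.

```rust
/// Return the counter name with a `_total` suffix, adding it only if it is missing.
fn prometheus_counter_name(name: &str) -> String {
    if name.ends_with("_total") {
        name.to_string()
    } else {
        format!("{name}_total")
    }
}
```

Tests or dashboards that matched the old names would need the same adjustment as the two router counters listed above.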
The `memory` setting controls whether the processor should process instruments that were not updated in the current collection interval. Users can achieve the same behavior using temporality. It also causes bugs when `memory` is `true` but the temporality is `Delta`. In this case, I think we should delete the record if the instrument didn't update. But if `memory` is `true`, it will prevent the stale record from being deleted.
Does this fix #939?
Well, sort of. The root cause of #939 is that we don't delete the stale record for `Delta` temporality because `memory` is set to `true`, which should be fixed now that we remove the memory settings. However, for #939 the bigger issue is that we shouldn't allow users to use `Delta` temporality with the Prometheus exporter.