
fix(metrics): remove in_memory settings #946

Merged
merged 4 commits into open-telemetry:main on Jan 24, 2023

Conversation

TommyCpp
Contributor

The memory setting controls whether the processor should keep processing instruments that were not updated in the current collection interval.

Users can achieve the same behavior using temporality. The setting also causes bugs when memory is true but the temporality is Delta: in that case we should delete the record if the instrument wasn't updated, but memory = true prevents the stale record from being deleted.

Does this fix #939?
Well, sort of. The root cause of #939 is that we don't delete the stale record for Delta temporality because memory is set to true, which should be fixed now that the memory setting is removed. However, the bigger issue in #939 is that we shouldn't allow users to use Delta temporality with the Prometheus exporter.

A Prometheus Exporter MUST only support Cumulative Temporality.

The memory setting is used to control whether to keep instruments that have no updates in the current collection interval.

This should instead be configured via temporality:
memory = true -> Cumulative
memory = false -> Delta
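
For illustration, a rough sketch of the replacement, assuming the 0.18/0.19-era `controllers`/`processors`/`selectors` layout (module paths and selector names are from memory and may differ slightly in your version; the metrics API is likely behind a feature flag):

```rust
// Hedged sketch: instead of calling `.with_memory(true)` on the processor
// factory, pick a temporality selector when building the basic controller.
use opentelemetry::sdk::export::metrics::aggregation;
use opentelemetry::sdk::metrics::{controllers, processors, selectors};

fn main() {
    let _controller = controllers::basic(processors::factory(
        selectors::simple::inexpensive(),
        // Cumulative keeps reporting unchanged instruments (old `memory = true`);
        // use `aggregation::delta_temporality_selector()` for the old
        // `memory = false` behavior.
        aggregation::cumulative_temporality_selector(),
    ))
    .build();
}
```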

codecov bot commented Jan 16, 2023

Codecov Report

Base: 67.7% // Head: 69.3% // Increases project coverage by +1.6% 🎉

Coverage data is based on head (90d583d) compared to base (b67a873).
Patch coverage: 60.0% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##            main    #946     +/-   ##
=======================================
+ Coverage   67.7%   69.3%   +1.6%     
=======================================
  Files        116     116             
  Lines       9103    9090     -13     
=======================================
+ Hits        6164    6308    +144     
+ Misses      2939    2782    -157     
Impacted Files Coverage Δ
opentelemetry-sdk/src/metrics/controllers/basic.rs 63.8% <ø> (+15.1%) ⬆️
opentelemetry-sdk/src/metrics/processors/basic.rs 72.7% <60.0%> (+10.2%) ⬆️
opentelemetry-http/src/lib.rs 8.7% <0.0%> (+5.2%) ⬆️
opentelemetry-sdk/src/metrics/selectors/simple.rs 43.7% <0.0%> (+6.2%) ⬆️
opentelemetry-sdk/src/metrics/mod.rs 78.3% <0.0%> (+14.6%) ⬆️
opentelemetry-sdk/src/metrics/sdk_api/wrap.rs 38.4% <0.0%> (+16.4%) ⬆️
opentelemetry-sdk/src/util.rs 81.2% <0.0%> (+18.7%) ⬆️
opentelemetry-sdk/src/metrics/registry.rs 75.4% <0.0%> (+28.0%) ⬆️
... and 3 more


@TommyCpp TommyCpp marked this pull request as ready for review January 16, 2023 23:20
@TommyCpp TommyCpp requested a review from a team January 16, 2023 23:20
@davidhoyt

Will this affect gauges?


davidhoyt commented Jan 18, 2023

What will happen if this stops reporting counter values for deltas? E.g. say we're pushing metrics at a regular reporting interval. We have a counter where at T0 (time 0) the value is 0, it goes up by 30 so T1 reports T1=30, and then it never reports again because it never changes. So T2=null, T3=null, etc. If I'm graphing this and I try to use my last known value, it'd be 30 and stay there. Shouldn't it at least emit one more time (T2=0) before not emitting again until the value changes?

IOW, for deltas, I'd expect:
T0=0, T1=30, T2=0, T3=null, ..., TN=null

Instead of:

T0=0, T1=30, T2=null, T3=null, ..., TN=null

IOW, shouldn't it emit a zero at least once after getting reset? That is, it should report at least one more time when there's been any change (and in this case there was indeed a change: when a delta's counter has gone from its current value to zero)?

@TommyCpp
Contributor Author

Will this affect gauges?

That's a good point. I don't think so, but I can add some tests in this PR.

Shouldn't it at least emit one more time (T2=0) before not emitting again until the value changes?

Note that in delta temporality we export the metric whenever there is any change for the instrument, even if the value is 0. So emitting one more time after the final round of changes could be confusing: null means "there was no change in this round", while 0 means "the change in this round was 0".

If your use case doesn't need to differentiate null from 0, most dashboards can convert null to 0 for you 🙂

@davidhoyt

That's my concern, that there is a hidden delta here that isn't emitted. Null, to me, isn't the same as zero. Zero is an assertion and null implies lack of information/data. Here there's a known change -- the count has reverted to zero but it isn't asserted.

@TommyCpp
Contributor Author

I see your point.

Here there's a known change -- the count has reverted to zero but it isn't asserted.

For Delta temporality this is implicit: logically, the SDK "forgets" everything after each collection. The delta here is computed against an initial value (for a counter, 0) instead of the last known value.

So given the following data points:
[T0, T1, 30], [T1, T2, 20], the total sum between T0 and T2 is 50
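
As a tiny, hypothetical illustration of that arithmetic (plain tuples, not SDK types):

```rust
fn main() {
    // Delta data points as (start, end, value): the cumulative total over
    // [T0, T2] is just the sum of the deltas in that window.
    let delta_points = [("T0", "T1", 30), ("T1", "T2", 20)];
    let total: i32 = delta_points.iter().map(|(_, _, v)| *v).sum();
    assert_eq!(total, 50);
}
```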

One thing worth mentioning is that in the OTLP metrics data model, aggregated metrics are associated with a range of time instead of a single timestamp. The difference between temporalities is just when this range starts. For Cumulative, the start of the range is always T0 (in other words, the start time of the process). For Delta, the start time is the end of the last collection interval.

So in your previous example, if the SDK exports:

  • [T0, T1, 30], [T1, T2, 0], it means the sum of the counter between T0 and T1 is 30, and the sum of the counter between T1 and T2 is 0.
  • [T0, T1, 30] alone, it means the counter has not been called since T1.
  • [T0, T1, 30], [T7, T8, 20], it means the counter was not called after T1 until sometime between T7 and T8, when it was incremented by 20.

@davidhoyt

Thank you for your patient/full explanation -- it was very helpful. 😄

In your example, is there any way to force it to report 0s for T2-T7 for the counters it knows about? Or at least a way for an exporter to access the available counters (even if they haven't been changed)?

@TommyCpp
Contributor Author

Or at least a way for an exporter to access the available counters (even if they haven't been changed)?

Yeah, the exporter implementation can "remember" all seen instruments should it choose to. I am working on refactoring the metrics APIs so that we align with the spec.

TBH I am a little worried that if we allow the SDK to "remember" seen instruments it could easily cause memory issues. For example, if someone accidentally includes the request id as an attribute, the number of tracked instruments grows with the number of requests even though each instrument will only ever have one data point.
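
As a toy model of that concern (this is not the SDK's actual bookkeeping, just an illustration of unbounded attribute cardinality):

```rust
use std::collections::HashMap;

fn main() {
    // If a per-request id leaks into the attributes, an SDK that "remembers"
    // every seen attribute set keeps one entry per unique id forever, even
    // though each entry only ever receives a single data point.
    let mut remembered_series: HashMap<String, u64> = HashMap::new();
    for request_id in 0..10_000u64 {
        let attribute_set = format!("request_id={request_id}");
        *remembered_series.entry(attribute_set).or_insert(0) += 1;
    }
    assert_eq!(remembered_series.len(), 10_000); // grows with traffic, never shrinks
}
```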

@TommyCpp TommyCpp merged commit 482edb6 into open-telemetry:main Jan 24, 2023
@TommyCpp TommyCpp mentioned this pull request Jan 25, 2023
garypen added a commit to apollographql/router that referenced this pull request Jun 5, 2023
garypen added a commit to apollographql/router that referenced this pull request Jul 12, 2023
The update requires changes to the implementation and tests, as
follows:

- In otel 0.18.0, processor factories had a `with_memory(bool)` method
which we were using when building our prometheus exporter. AFAICT, this
used to be a mechanism for controlling how metrics handled stale gauges.
In 0.19.0, [this method was
removed](open-telemetry/opentelemetry-rust#946)
and now all gauges behave as though they were created with
`false`. We had been providing `true` in our call. I'm not 100% certain
of the impact of this change, but it appears that we can ignore it. We
may need to consider it more carefully if problems arise.
- There are now two standard OTEL attributes,
`otel_scope_name="apollo/router"` and `otel_scope_version=""`, added to
the output, and a number of tests had to be updated to accommodate that
change.
- One of our tests appeared to be searching for
`apollo_router_cache_hit_count` (and this was working) when it should
have been searching for `apollo_router_cache_hit_count_total` (likewise
for miss). I've updated the test and think this is the correct thing to
do. It looks like a bug was fixed in otel and this change matches the
fix.
 
Regarding that last point: the Prometheus spec mandates a naming format,
and this change was part of compliance with that spec. The change was
made in this PR:
open-telemetry/opentelemetry-rust#952

The two affected counters in the router were:

apollo_router_cache_hit_count -> apollo_router_cache_hit_count_total
apollo_router_cache_miss_count -> apollo_router_cache_miss_count_total

It's good that our prometheus metrics are now spec compliant, but we
should note this in the release notes and (if possible) somewhere in our
documentation. I'll add it to the changeset at least.
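
As a hedged sketch of what the test update amounts to (the helper and the scraped sample below are hypothetical, not the router's actual test code):

```rust
// Counters are exposed with a `_total` suffix in the Prometheus text format
// after the upgrade, so string matchers must target the new names.
fn cache_hit_counter_present(scraped_metrics: &str) -> bool {
    // Previously the test searched for "apollo_router_cache_hit_count".
    scraped_metrics.contains("apollo_router_cache_hit_count_total")
}

fn main() {
    let sample = "apollo_router_cache_hit_count_total{kind=\"memory\"} 12";
    assert!(cache_hit_counter_present(sample));
}
```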

The upgrade fixes many of the outstanding issues related to
opentelemetry and various APM vendors:

Fixes: #2878
Fixes: #2066 
Fixes: #2959 
Fixes: #2225 
Fixes: #1520 

<!-- start metadata -->

**Checklist**

Complete the checklist (and note appropriate exceptions) before a final
PR is raised.

- [x] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [x] Unit Tests
    - [x] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding
Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or
ask for it to be labeled) as `manual test`
Successfully merging this pull request may close these issues.

Metrics counter with delta_temporality_selector unexpected behavior