
fix(metrics): remove in_memory settings #946

Merged
merged 4 commits into open-telemetry:main on Jan 24, 2023

Conversation

TommyCpp
Contributor

The memory setting controls whether the processor should keep processing instruments that were not updated in the current collection interval.

Users can achieve the same behavior using temporality. The setting also causes bugs when memory is true but the temporality is Delta: in that case we should delete the record if the instrument wasn't updated, but memory = true prevents the stale record from being deleted.

Does this fix #939?
Well, sort of. The root cause of #939 is that we don't delete the stale record for Delta temporality because memory is set to true, which should be fixed now that the memory setting is removed. However, the bigger issue in #939 is that we shouldn't allow users to use Delta temporality with the Prometheus exporter.

A Prometheus Exporter MUST only support Cumulative Temporality.

The memory setting is used to control whether to keep instruments that have no updates in the current collection interval.

This should instead be configured via temporality:
memory = true -> Cumulative
memory = false -> Delta
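
For illustration, a rough sketch of the replacement, assuming the 0.18/0.19-era `controllers`/`processors`/`selectors` layout (module paths and selector names are from memory and may differ slightly in your version; the metrics API is likely behind a feature flag):

```rust
// Hedged sketch: instead of calling `.with_memory(true)` on the processor
// factory, pick a temporality selector when building the basic controller.
use opentelemetry::sdk::export::metrics::aggregation;
use opentelemetry::sdk::metrics::{controllers, processors, selectors};

fn main() {
    let _controller = controllers::basic(processors::factory(
        selectors::simple::inexpensive(),
        // Cumulative keeps reporting unchanged instruments (old `memory = true`);
        // use `aggregation::delta_temporality_selector()` for the old
        // `memory = false` behavior.
        aggregation::cumulative_temporality_selector(),
    ))
    .build();
}
```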

codecov bot commented Jan 16, 2023

Codecov Report

Base: 67.7% // Head: 69.3% // Increases project coverage by +1.6% 🎉

Coverage data is based on head (90d583d) compared to base (b67a873).
Patch coverage: 60.0% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##            main    #946     +/-   ##
=======================================
+ Coverage   67.7%   69.3%   +1.6%     
=======================================
  Files        116     116             
  Lines       9103    9090     -13     
=======================================
+ Hits        6164    6308    +144     
+ Misses      2939    2782    -157     
Impacted Files Coverage Δ
opentelemetry-sdk/src/metrics/controllers/basic.rs 63.8% <ø> (+15.1%) ⬆️
opentelemetry-sdk/src/metrics/processors/basic.rs 72.7% <60.0%> (+10.2%) ⬆️
opentelemetry-http/src/lib.rs 8.7% <0.0%> (+5.2%) ⬆️
opentelemetry-sdk/src/metrics/selectors/simple.rs 43.7% <0.0%> (+6.2%) ⬆️
opentelemetry-sdk/src/metrics/mod.rs 78.3% <0.0%> (+14.6%) ⬆️
opentelemetry-sdk/src/metrics/sdk_api/wrap.rs 38.4% <0.0%> (+16.4%) ⬆️
opentelemetry-sdk/src/util.rs 81.2% <0.0%> (+18.7%) ⬆️
opentelemetry-sdk/src/metrics/registry.rs 75.4% <0.0%> (+28.0%) ⬆️
... and 3 more


@TommyCpp TommyCpp marked this pull request as ready for review January 16, 2023 23:20
@TommyCpp TommyCpp requested a review from a team January 16, 2023 23:20
@davidhoyt

Will this affect gauges?


davidhoyt commented Jan 18, 2023

What will happen if this stops reporting counter values for deltas? E.g. say we're pushing metrics at a regular reporting interval. We have a counter where at T0 (time 0) the value is 0, it goes up by 30 so T1 reports T1=30, and then it never reports again because it never changes. So T2=null, T3=null, etc. If I'm graphing this and I try to use my last known value, it'd be 30 and stay there. Shouldn't it at least emit one more time (T2=0) before not emitting again until the value changes?

IOW, for deltas, I'd expect:
T0=0, T1=30, T2=0, T3=null, ..., TN=null

Instead of:

T0=0, T1=30, T2=null, T3=null, ..., TN=null

IOW, shouldn't it emit a zero at least once after getting reset? That is, it should report at least one more time when there's been any change (and in this case there was indeed a change: when a delta's counter has gone from its current value to zero)?

@TommyCpp
Contributor Author

Will this affect gauges?

That's a good point. I don't think so, but I can add some tests in this PR.

Shouldn't it at least emit one more time (T2=0) before not emitting again until the value changes?

Note that in delta temporality we export the metric whenever there is any change for the instrument, even if the value is 0. So emitting one more time after the final round of changes could be confusing: null means "there was no change in this round", while 0 means "the change in this round was 0".

If your use case doesn't need to differentiate null from 0, most dashboards can convert null to 0 for you 🙂

@davidhoyt

That's my concern, that there is a hidden delta here that isn't emitted. Null, to me, isn't the same as zero. Zero is an assertion and null implies lack of information/data. Here there's a known change -- the count has reverted to zero but it isn't asserted.

@TommyCpp
Contributor Author

I see your point.

Here there's a known change -- the count has reverted to zero but it isn't asserted.

For Delta temporality this is implicit: logically, the SDK "forgets" everything after each collection. The delta here is computed against an initial value (for a counter, 0) instead of the last known value.

So given the following data points:
[T0, T1, 30], [T1, T2, 20], the total sum between T0 and T2 is 50
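
As a tiny, hypothetical illustration of that arithmetic (plain tuples, not SDK types):

```rust
fn main() {
    // Delta data points as (start, end, value): the cumulative total over
    // [T0, T2] is just the sum of the deltas in that window.
    let delta_points = [("T0", "T1", 30), ("T1", "T2", 20)];
    let total: i32 = delta_points.iter().map(|(_, _, v)| *v).sum();
    assert_eq!(total, 50);
}
```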

One thing worth mentioning is that in the OTLP metrics data model, aggregated metrics are associated with a range of time instead of a single timestamp. The difference between temporalities is just when this range starts. For Cumulative, the start of the range is always T0 (in other words, the start time of the process). For Delta, the start time is the end of the last collection interval.

So in your previous example, if the SDK exports:

  • [T0, T1, 30], [T1, T2, 0], it means the sum of the counter between T0 and T1 is 30, and the sum of the counter between T1 and T2 is 0.
  • [T0, T1, 30] alone, it means the counter has not been called since T1.
  • [T0, T1, 30], [T7, T8, 20], it means the counter was not called after T1 until sometime between T7 and T8, when it was incremented by 20.

@davidhoyt

Thank you for your patient/full explanation -- it was very helpful. 😄

In your example, is there any way to force it to report 0s for T2-T7 for the counters it knows about? Or at least a way for an exporter to access the available counters (even if they haven't been changed)?

@TommyCpp
Contributor Author

Or at least a way for an exporter to access the available counters (even if they haven't been changed)?

Yeah, the exporter implementation can "remember" all seen instruments should it choose to. I am working on refactoring the metrics APIs so that we align with the spec.

TBH I am a little worried that if we allow the SDK to "remember" seen instruments it could easily cause memory issues. For example, if someone accidentally includes the request id as an attribute, the number of tracked instruments grows with the number of requests even though each instrument will only ever have one data point.
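
As a toy model of that concern (this is not the SDK's actual bookkeeping, just an illustration of unbounded attribute cardinality):

```rust
use std::collections::HashMap;

fn main() {
    // If a per-request id leaks into the attributes, an SDK that "remembers"
    // every seen attribute set keeps one entry per unique id forever, even
    // though each entry only ever receives a single data point.
    let mut remembered_series: HashMap<String, u64> = HashMap::new();
    for request_id in 0..10_000u64 {
        let attribute_set = format!("request_id={request_id}");
        *remembered_series.entry(attribute_set).or_insert(0) += 1;
    }
    assert_eq!(remembered_series.len(), 10_000); // grows with traffic, never shrinks
}
```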

@TommyCpp TommyCpp merged commit 482edb6 into open-telemetry:main Jan 24, 2023
@TommyCpp TommyCpp mentioned this pull request Jan 25, 2023
garypen added a commit to apollographql/router that referenced this pull request Jun 5, 2023
garypen added a commit to apollographql/router that referenced this pull request Jul 12, 2023
The update requires changes to the implementation and tests, as
follows:

- In otel 0.18.0, processor factories had a `with_memory(bool)` method
which we were using when building our prometheus exporter. AFAICT, this
used to be a mechanism for controlling how metrics handled stale gauges.
In 0.19.0, [this method was
removed](open-telemetry/opentelemetry-rust#946)
and now all gauges behave as though they were created with
`false`. We had been providing `true` in our call. I'm not 100% certain
of the impact of this change, but it appears that we can ignore it. We
may need to consider it more carefully if problems arise.
- There are now two standard OTEL attributes,
`otel_scope_name="apollo/router"` and `otel_scope_version=""`, added to
the output, and a number of tests had to be updated to accommodate that
change.
- One of our tests appeared to be searching for
`apollo_router_cache_hit_count` (and this was working) when it should
have been searching for `apollo_router_cache_hit_count_total` (likewise
for miss). I've updated the test and think this is the correct thing to
do. It looks like a bug was fixed in otel and this change matches the
fix.
 
Regarding that last point: the Prometheus spec mandates a naming format,
and this change was part of compliance with that spec. The change was
made in this PR:
open-telemetry/opentelemetry-rust#952

The two affected counters in the router were:

apollo_router_cache_hit_count -> apollo_router_cache_hit_count_total
apollo_router_cache_miss_count -> apollo_router_cache_miss_count_total

It's good that our prometheus metrics are now spec compliant, but we
should note this in the release notes and (if possible) somewhere in our
documentation. I'll add it to the changeset at least.
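
As a hedged sketch of what the test update amounts to (the helper and the scraped sample below are hypothetical, not the router's actual test code):

```rust
// Counters are exposed with a `_total` suffix in the Prometheus text format
// after the upgrade, so string matchers must target the new names.
fn cache_hit_counter_present(scraped_metrics: &str) -> bool {
    // Previously the test searched for "apollo_router_cache_hit_count".
    scraped_metrics.contains("apollo_router_cache_hit_count_total")
}

fn main() {
    let sample = "apollo_router_cache_hit_count_total{kind=\"memory\"} 12";
    assert!(cache_hit_counter_present(sample));
}
```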

The upgrade fixes many of the outstanding issues related to
opentelemetry and various APM vendors:

Fixes: #2878
Fixes: #2066 
Fixes: #2959 
Fixes: #2225 
Fixes: #1520 

<!-- start metadata -->

**Checklist**

Complete the checklist (and note appropriate exceptions) before a final
PR is raised.

- [x] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [x] Unit Tests
    - [x] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding
Manual Tests:
- please document the manual testing (extensively) in the Exceptions.
- please raise a separate issue to automate the test and label it (or
ask for it to be labeled) as `manual test`
Successfully merging this pull request may close these issues.

Metrics counter with delta_temporality_selector unexpected behavior