
Reuse AggregatorHandle with cumulative temporality #5142

Merged (3 commits) · Jan 30, 2023

Conversation

jack-berg (Member) commented on Jan 20, 2023:

The metric SDK's storage code has an unnecessary layer of abstraction called TemporalMetricStorage.

AsynchronousMetricStorage and DefaultSynchronousMetricStorage are responsible for storing metric state for asynchronous and synchronous instruments, respectively. If you look at their code, you'll notice that each has a reference to TemporalMetricStorage. This class takes in a map of accumulations and spits out a MetricData whose points have the appropriate temporality for the reader.

How it produces points with the appropriate temporality differs for async vs. sync instruments (a sketch of both paths follows the list):

  • Async instruments pass in cumulative accumulations (because async instruments observe cumulative sums). If the reader wants cumulative metrics, temporal metric storage returns the accumulations as is. If the reader wants delta, temporal metric storage computes the difference between the current accumulations and the last recorded accumulations (which it holds onto).
  • Sync instruments pass in delta accumulations. If the reader wants cumulative metrics, it merges the deltas with the previous cumulatives (which it holds onto). If the reader wants deltas, it returns the deltas as is.
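
To make the two paths concrete, here is a minimal, self-contained sketch of the reconciliation logic. The class, method names, and the plain Map<String, Double> point model are illustrative stand-ins, not the SDK's actual TemporalMetricStorage API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: points are modeled as doubles keyed by an attribute
// string; the real SDK uses Attributes and aggregator-specific accumulations.
class TemporalityReconciliationSketch {
  enum Temporality { DELTA, CUMULATIVE }

  // Async instruments report cumulative values on every collection.
  static Map<String, Double> forAsync(
      Map<String, Double> current, Map<String, Double> previous, Temporality reader) {
    if (reader == Temporality.CUMULATIVE) {
      return current; // already cumulative: pass through as is
    }
    // delta = current - previous (previous is what the storage held onto)
    Map<String, Double> delta = new HashMap<>();
    current.forEach((key, value) -> delta.put(key, value - previous.getOrDefault(key, 0.0)));
    return delta;
  }

  // Sync instruments report deltas accumulated since the last collection.
  static Map<String, Double> forSync(
      Map<String, Double> current, Map<String, Double> previous, Temporality reader) {
    if (reader == Temporality.DELTA) {
      return current; // already delta: pass through as is
    }
    // cumulative = previous + current (previous is what the storage held onto)
    Map<String, Double> cumulative = new HashMap<>(previous);
    current.forEach((key, value) -> cumulative.merge(key, value, Double::sum));
    return cumulative;
  }
}
```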

When you think about it, it's weird that temporal metric storage has to merge deltas into cumulatives. Why does DefaultSynchronousMetricStorage provide TemporalMetricStorage with deltas? Well, because it implements MetricStorage#collectAndReset and needs to reset.

But it doesn't actually need to reset. Instead of always resetting, DefaultSynchronousMetricStorage could reset on collection only if the temporality is delta. If cumulative, don't reset and let each AggregatorHandle continue to accumulate measurements.
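
A minimal sketch of that idea, using a plain sum to stand in for an AggregatorHandle (the class and method names here are illustrative, not the actual SDK code): collection decides whether to reset based on the reader's temporality, so with cumulative temporality the same handle keeps accumulating across collections instead of being reallocated.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of "reset only when the reader wants deltas".
class SumHandle {
  private double sum;

  synchronized void record(double value) {
    sum += value;
  }

  // With reset=false the running state is kept, so the handle is reused
  // across collections instead of being reallocated.
  synchronized double collect(boolean reset) {
    double result = sum;
    if (reset) {
      sum = 0; // delta readers start a fresh interval
    }
    return result;
  }
}

class SynchronousStorageSketch {
  enum Temporality { DELTA, CUMULATIVE }

  private final Map<String, SumHandle> handles = new ConcurrentHashMap<>();
  private final Temporality readerTemporality;

  SynchronousStorageSketch(Temporality readerTemporality) {
    this.readerTemporality = readerTemporality;
  }

  void record(String attributes, double value) {
    handles.computeIfAbsent(attributes, key -> new SumHandle()).record(value);
  }

  Map<String, Double> collect() {
    boolean reset = readerTemporality == Temporality.DELTA;
    Map<String, Double> points = new HashMap<>();
    handles.forEach((attrs, handle) -> points.put(attrs, handle.collect(reset)));
    return points;
  }
}
```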

Switching to this approach has some pretty significant benefits:

  • TemporalMetricStorage goes away. Lots of code is removed and it's much clearer what's happening.
  • Memory allocation is dramatically reduced when temporality is cumulative (50%+ reduction), since AggregatorHandles don't need to be reallocated after collections.
  • Aggregator#merge is no longer used, so tons of boilerplate merge code can be deleted.

Check out the before and after of HistogramCollectBenchmark, which is currently the best JMH test for measuring memory allocation during collection.

Before:

Benchmark                                                                            (aggregationGenerator)  (aggregationTemporality)  Mode  Cnt           Score           Error   Units
HistogramCollectBenchmark.recordAndCollect                                        EXPLICIT_BUCKET_HISTOGRAM                CUMULATIVE    ss    5  3514191866.600 ±  38228608.252   ns/op
HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate                         EXPLICIT_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           8.361 ±         0.177  MB/sec
HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm                    EXPLICIT_BUCKET_HISTOGRAM                CUMULATIVE    ss    5    30810020.800 ±    502992.137    B/op
HistogramCollectBenchmark.recordAndCollect:·gc.count                              EXPLICIT_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           2.000                  counts
HistogramCollectBenchmark.recordAndCollect:·gc.time                               EXPLICIT_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           3.000                      ms
HistogramCollectBenchmark.recordAndCollect                             DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5  9230705283.800 ± 722815039.635   ns/op
HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate              DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           3.352 ±         0.266  MB/sec
HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm         DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5    32430996.800 ±    501402.773    B/op
HistogramCollectBenchmark.recordAndCollect:·gc.count                   DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           2.000                  counts
HistogramCollectBenchmark.recordAndCollect:·gc.time                    DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           5.000                      ms
HistogramCollectBenchmark.recordAndCollect                      ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5  2730342374.800 ± 190729093.001   ns/op
HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate       ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           7.605 ±         0.528  MB/sec
HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm  ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5    21769772.800 ±    411890.913    B/op
HistogramCollectBenchmark.recordAndCollect:·gc.count            ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           2.000                  counts
HistogramCollectBenchmark.recordAndCollect:·gc.time             ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           4.000                      ms

After:

Benchmark                                                                            (aggregationGenerator)  (aggregationTemporality)  Mode  Cnt           Score           Error   Units
HistogramCollectBenchmark.recordAndCollect                                        EXPLICIT_BUCKET_HISTOGRAM                CUMULATIVE    ss    5  3559941425.200 ± 221984528.370   ns/op
HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate                         EXPLICIT_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           3.770 ±         0.342  MB/sec
HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm                    EXPLICIT_BUCKET_HISTOGRAM                CUMULATIVE    ss    5    14069625.600 ±    558794.139    B/op
HistogramCollectBenchmark.recordAndCollect:·gc.count                              EXPLICIT_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           1.000                  counts
HistogramCollectBenchmark.recordAndCollect:·gc.time                               EXPLICIT_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           2.000                      ms
HistogramCollectBenchmark.recordAndCollect                             DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5  9153833758.200 ± 163073300.897   ns/op
HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate              DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           1.060 ±         0.091  MB/sec
HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm         DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5    10177576.000 ±    702819.724    B/op
HistogramCollectBenchmark.recordAndCollect:·gc.count                   DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           1.000                  counts
HistogramCollectBenchmark.recordAndCollect:·gc.time                    DEFAULT_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           2.000                      ms
HistogramCollectBenchmark.recordAndCollect                      ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5  2663111633.600 ±  10858906.517   ns/op
HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate       ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           3.609 ±         0.122  MB/sec
HistogramCollectBenchmark.recordAndCollect:·gc.alloc.rate.norm  ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5    10078862.400 ±    316114.952    B/op
HistogramCollectBenchmark.recordAndCollect:·gc.count            ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           1.000                  counts
HistogramCollectBenchmark.recordAndCollect:·gc.time             ZERO_MAX_SCALE_EXPONENTIAL_BUCKET_HISTOGRAM                CUMULATIVE    ss    5           2.000                      ms

Note that I've focused on the cumulative results because there's no material change for delta (as expected).

This is one of those changes that manages to both reduce complexity and increase performance.

  }
- T accumulation = entry.getValue().accumulateThenReset(entry.getKey());
+ T accumulation = entry.getValue().accumulateThenReset(entry.getKey(), reset);
jack-berg (Member, Author) commented:
This code is pretty hard to understand and only exists to serve the Bound{Instrument} concept, which has been shelved. If we don't see a realistic chance of this being incorporated into the spec, I suggest we rip it out and reap the benefits of reduced complexity.
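
For context on the reset flag added in the diff above, here's an illustrative aggregator-level sketch (simplified field and method names, not the SDK's actual histogram handle): the snapshot copies the current state, and the counts are cleared only when the reader's temporality requires a fresh delta interval.

```java
import java.util.Arrays;

// Illustrative histogram handle: accumulate, then reset only when asked.
class HistogramHandleSketch {
  private final double[] boundaries;
  private final long[] counts;

  HistogramHandleSketch(double... boundaries) {
    this.boundaries = boundaries;
    this.counts = new long[boundaries.length + 1];
  }

  synchronized void record(double value) {
    int bucket = 0;
    while (bucket < boundaries.length && value > boundaries[bucket]) {
      bucket++;
    }
    counts[bucket]++;
  }

  synchronized long[] accumulateThenMaybeReset(boolean reset) {
    long[] snapshot = Arrays.copyOf(counts, counts.length);
    if (reset) {
      // Delta readers start over; cumulative readers keep counting in place.
      Arrays.fill(counts, 0);
    }
    return snapshot;
  }
}
```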


codecov bot commented Jan 20, 2023

Codecov Report

Base: 91.11% // Head: 90.96% // Decreases project coverage by -0.15% ⚠️

Coverage data is based on head (a3da40f) compared to base (a3ac819).
Patch coverage: 92.64% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5142      +/-   ##
============================================
- Coverage     91.11%   90.96%   -0.15%     
+ Complexity     4903     4871      -32     
============================================
  Files           549      545       -4     
  Lines         14501    14465      -36     
  Branches       1392     1390       -2     
============================================
- Hits          13212    13158      -54     
- Misses          890      906      +16     
- Partials        399      401       +2     
Impacted Files Coverage Δ
...dk/metrics/internal/aggregator/DropAggregator.java 44.44% <ø> (+4.44%) ⬆️
...sdk/metrics/internal/state/EmptyMetricStorage.java 0.00% <ø> (ø)
...etry/sdk/metrics/internal/state/MetricStorage.java 0.00% <ø> (ø)
...internal/aggregator/DoubleLastValueAggregator.java 91.30% <33.33%> (-8.70%) ⬇️
...s/internal/aggregator/LongLastValueAggregator.java 82.60% <33.33%> (-8.31%) ⬇️
...nternal/state/DefaultSynchronousMetricStorage.java 79.72% <92.85%> (+1.46%) ⬆️
...ry/sdk/metrics/internal/aggregator/Aggregator.java 87.50% <100.00%> (ø)
.../metrics/internal/aggregator/AggregatorHandle.java 93.33% <100.00%> (+0.22%) ⬆️
...gator/DoubleExplicitBucketHistogramAggregator.java 94.23% <100.00%> (-5.77%) ⬇️
...gregator/DoubleExponentialHistogramAggregator.java 98.73% <100.00%> (+2.06%) ⬆️
... and 35 more


@jack-berg jack-berg marked this pull request as ready for review January 23, 2023 21:25
@jack-berg jack-berg requested a review from a team January 23, 2023 21:25
@@ -118,7 +118,7 @@ void staleMetricsDropped_synchronousInstrument() {
      (Consumer<SumData<LongPointData>>)
          sumPointData ->
              assertThat(sumPointData.getPoints().size())
-                 .isEqualTo(10))));
+                 .isEqualTo(2000))));
A contributor commented:

Not clear to me... is this a significant behavioral change due to this refactoring?

The contributor followed up:

What I mean by that is... is this change in behavior going to be a breaking change for anyone? Seems like suddenly getting 20x the number of points could be significant to some backends.

jack-berg (Member, Author) replied:

The behavior has changed, but the current behavior is odd and the new behavior is slightly less odd, so I think it's defensible.

Currently, if the temporality is cumulative, we:

  • Enforce the cardinality limit once when deltas are accumulated in the synchronous storage, and
  • Enforce it again when those deltas are merged into the cumulative state that TemporalMetricStorage holds onto.

So there are two rounds of enforcement of the cardinality limit, and the behavior isn't very intuitive or expected.

The new behavior is much simpler: limit the storage to 2000 distinct attribute sets and drop all measurements that exceed that. This makes cardinality enforcement behave the same for cumulative and delta. I don't think this will be surprising or breaking for any users, since the change only reduces cardinality after the limit has already been hit; i.e. you'll see a potential reduction in cardinality only once you've hit the cardinality limit.
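
An illustrative guard for that simpler enforcement (the constant, class, and method names are stand-ins, not the SDK's exact implementation): once the storage holds the maximum number of distinct attribute sets, measurements for new attribute sets are dropped.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the simpler cardinality enforcement described above.
class CardinalityGuardSketch {
  static final int MAX_CARDINALITY = 2000;

  private final Map<String, Double> sums = new ConcurrentHashMap<>();

  void record(String attributes, double value) {
    if (!sums.containsKey(attributes) && sums.size() >= MAX_CARDINALITY) {
      // Over the limit: drop the measurement instead of creating a new series.
      return;
    }
    sums.merge(attributes, value, Double::sum);
  }
}
```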

There's some active work in the spec which would change this behavior further. The most likely outcome is: if a configurable cardinality limit is exceeded, log a warning and emit a single point with a static attribute key/value which aggregates measurements across all dimensions.
