Refactor aggregator and exporters #2295

Member

@cijothomas cijothomas commented Aug 31, 2021

  1. Introduce Metric and MetricPoint, where Metric is a stream of aggregated metrics containing up to N MetricPoints.
    Each instrument results in one Metric.
    (Once views are in, each instrument may result in more than one Metric. The view config will also be "hard-wired" at instrument creation time, so the hot path is a simple Metric.Update() with no view lookups.)

  2. Each Metric pre-allocates MetricPoint[2000] to store its time series. The Dictionary is still used for lookup, to obtain the index within this array.
    The SDK caps the number of Metrics at 1000 and the number of MetricPoints within each Metric at 2000. These are just hardcoded numbers
    for now. They need to be revisited to pick reasonable defaults, as we don't anticipate exposing them to the user (unless the spec asks for it, but that would likely be a separate discussion).

  3. Exporters get Batch, reusing Batch from trace/log. This might be reconsidered in favor of a simple IEnumerable.

  4. Metric exposes GetDataPoints(), returning BatchMetricPoint. BatchMetricPoint is newly introduced as a way to iterate through the MetricPoint array; MetricPoint is a struct and hence cannot reuse Batch. This needs to be revisited based on 5 below.

  5. MetricPoint is now a struct storing all types of Aggregator output: Sum, Gauge, Histogram. (Might save some cost by using a union approach.)
    There is a cost involved in "copying" the struct when iterating. Will explore better ways to handle this.

  6. All exporters are modified to use the new Metric/MetricPoint structure. This fixes the Prometheus exporter issue and optimizes OTLP as well. The histogram is also modelled as per OTLP, which is more efficient. (Also fixed the bound issue reported by Alan.)
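
As a rough illustration of item 2, the pre-allocated array with a Dictionary used only for index lookup could look something like the sketch below. The type names (`PointSketch`, `MetricSketch`), the string key, and the drop-on-overflow behaviour are assumptions made for the example, not the code in this PR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-ins for Metric/MetricPoint; not the SDK's actual types.
public struct PointSketch
{
    public string Key; // flattened tag set, e.g. "color=red"
    public long Sum;
}

public sealed class MetricSketch
{
    private const int MaxPoints = 2000; // hardcoded cap mentioned in the description

    private readonly PointSketch[] points = new PointSketch[MaxPoints];
    private readonly Dictionary<string, int> indexByKey = new Dictionary<string, int>();
    private readonly object sync = new object();
    private int count;

    public int Count
    {
        get { lock (this.sync) { return this.count; } }
    }

    // Returns false when the cap is reached and the measurement is dropped (an assumption).
    public bool Update(string key, long value)
    {
        lock (this.sync)
        {
            if (!this.indexByKey.TryGetValue(key, out int index))
            {
                if (this.count == MaxPoints)
                {
                    return false;
                }

                index = this.count++;
                this.indexByKey[key] = index;
                this.points[index].Key = key;
            }

            // Indexing an array yields the element itself, so the struct is updated in place.
            this.points[index].Sum += value;
            return true;
        }
    }

    public long GetSum(string key)
    {
        lock (this.sync)
        {
            return this.indexByKey.TryGetValue(key, out int index) ? this.points[index].Sum : 0;
        }
    }
}
```

The dictionary only maps a tag set to a slot; the aggregated state itself lives in the array, so no per-time-series object is allocated after construction.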

// Upcoming in separate PRs

  1. Explore whether ConcurrentDictionary could be used instead of Dictionary with locks.
  2. Sorting of tag keys/values is a major cost on the hot path. It might be good to store the first-seen order in the Dictionary along with the sorted order, so that if the user keeps reusing the same order, we can avoid the sort cost. (This costs additional memory, but is likely justifiable.)
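
The idea in item 2 could be sketched roughly as follows. `TagOrderCacheSketch`, the string-flattened keys, and the `SortCount` counter are all hypothetical and exist only to make the fast path visible; a real hot path would avoid the string allocations and would need synchronization.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch: remember the caller's tag order so repeat calls skip the sort.
// Not thread-safe, and the string keys are for illustration only.
public sealed class TagOrderCacheSketch
{
    private readonly Dictionary<string, int> indexByKey = new Dictionary<string, int>();
    private int nextIndex;

    // Number of times the slow (sorting) path ran; for demonstration only.
    public int SortCount { get; private set; }

    public int Lookup(KeyValuePair<string, string>[] tags)
    {
        string asGiven = Flatten(tags);
        if (this.indexByKey.TryGetValue(asGiven, out int index))
        {
            return index; // fast path: this exact order was seen before, no sort needed
        }

        var sorted = (KeyValuePair<string, string>[])tags.Clone();
        Array.Sort(sorted, (a, b) => string.CompareOrdinal(a.Key, b.Key));
        this.SortCount++;

        string canonical = Flatten(sorted);
        if (!this.indexByKey.TryGetValue(canonical, out index))
        {
            index = this.nextIndex++;
            this.indexByKey[canonical] = index;
        }

        // Extra entry (extra memory) so the same caller order hits the fast path next time.
        this.indexByKey[asGiven] = index;
        return index;
    }

    private static string Flatten(KeyValuePair<string, string>[] tags)
    {
        var sb = new StringBuilder();
        foreach (var tag in tags)
        {
            sb.Append(tag.Key).Append('=').Append(tag.Value).Append(';');
        }

        return sb.ToString();
    }
}
```

Both orderings of the same tag set resolve to the same index; only the first call with a new, unsorted ordering pays for the sort.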

It's a big PR; I can split it into separate steps if needed. Or we can keep it as is and address key issues in separate PRs.

codecov bot commented Sep 1, 2021

Codecov Report

Merging #2295 (fbc8992) into metrics (94c6e56) will increase coverage by 4.31%.
The diff coverage is 59.45%.

@@             Coverage Diff             @@
##           metrics    #2295      +/-   ##
===========================================
+ Coverage    74.58%   78.90%   +4.31%     
===========================================
  Files          218      235      +17     
  Lines         6957     7476     +519     
===========================================
+ Hits          5189     5899     +710     
+ Misses        1768     1577     -191     
Impacted Files Coverage Δ
...InMemory/InMemoryExporterMetricHelperExtensions.cs 0.00% <0.00%> (ø)
...metry.Exporter.InMemory/InMemoryExporterOptions.cs 0.00% <0.00%> (ø)
...etryProtocol/OtlpMetricExporterHelperExtensions.cs 0.00% <0.00%> (ø)
src/OpenTelemetry/Logs/LogRecord.cs 93.84% <ø> (ø)
...enTelemetry/Logs/OpenTelemetryLoggingExtensions.cs 85.71% <ø> (ø)
src/OpenTelemetry/Metrics/DataPoint/DataValue.cs 0.00% <ø> (ø)
...elemetry/Metrics/MeterProviderBuilderExtensions.cs 77.77% <ø> (+77.77%) ⬆️
...Metrics/MetricAggregators/GaugeMetricAggregator.cs 0.00% <0.00%> (ø)
...ics/MetricAggregators/SumMetricAggregatorDouble.cs 0.00% <0.00%> (ø)
...trics/MetricAggregators/SumMetricAggregatorLong.cs 0.00% <0.00%> (ø)
... and 94 more

@cijothomas cijothomas marked this pull request as ready for review September 1, 2021 20:55
@cijothomas cijothomas requested a review from a team September 1, 2021 20:55
@@ -32,7 +32,7 @@ public ConsoleMetricExporter(ConsoleExporterOptions options)
         {
         }

-        public override ExportResult Export(in Batch<MetricItem> batch)
+        public override ExportResult Export(in Batch<Metric> batch)
Member

@reyang reyang Sep 1, 2021

I think (1) the metrics exporter will always run on a separate thread (unlike a trace/log exporter, which might run on the hot path, e.g. with SimpleExportingProcessor), and (2) the metrics exporter will be triggered at a much lower frequency (unlike a trace/log exporter, which might be triggered for every single event), so it might actually be easier to define this as IEnumerable<Metric>.

I understand that there might be a desire to have ConsoleExporter<T> which covers logs/metrics/traces (for the sake of consistency).

@CodeBlanch do you have a strong opinion on this? (Look at the change in src/OpenTelemetry/Batch.cs; do you think the extra if-condition could slow down traces/logs?)

Member Author

Yes, agreed. Let's make this simpler with just IEnumerable, so we don't affect the other signals' hot path.

Member

@CodeBlanch CodeBlanch Sep 2, 2021

Could we do IList<T> instead of IEnumerable<T>? With that we could at least do a for (int i = 0; i < list.Count; i++) { var item = list[i]; } type of loop that doesn't allocate an enumerator. Or we could even pass the array directly. Array doesn't have a struct enumerator but AFAIK the JIT is smart enough to re-write a foreach loop on an array to be allocation-free.
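
To make the difference concrete, here is a small sketch of the two loop shapes being compared. `ExportLoopSketch` and its methods are made-up names for illustration; whether an enumerator is actually allocated depends on the concrete collection and runtime, so treat the comments as the typical case rather than a guarantee.

```csharp
using System;
using System.Collections.Generic;

public static class ExportLoopSketch
{
    // Index-based loop over IList<T>: no enumerator is requested at all.
    public static long SumByIndex(IList<long> values)
    {
        long total = 0;
        for (int i = 0; i < values.Count; i++)
        {
            total += values[i];
        }

        return total;
    }

    // foreach through the interface calls IEnumerable<T>.GetEnumerator(), which
    // typically allocates (for List<T>, its struct enumerator gets boxed).
    public static long SumByEnumerator(IEnumerable<long> values)
    {
        long total = 0;
        foreach (long value in values)
        {
            total += value;
        }

        return total;
    }
}
```

Arrays implement `IList<T>`, so `SumByIndex(new long[] { 4, 5 })` compiles as well, although each access then goes through an interface call.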

Member

I don't know whether we will be able to tell the count of metrics. If we can, IList definitely works better.

I guess an array won't work since it assumes a certain memory layout. For example, if we pre-allocate memory and keep reusing it, we would need to compact the pre-allocated aggregation buffer before exporting (and that seems expensive).

Member

We could also pass a ReadOnlySpan over the array?


- if (circularBuffer == null)
+ if (circularBuffer == null && metrics == null)
Member

@reyang reyang Sep 1, 2021

I'm slightly concerned about the extra perf cost. It might be just 2 nanoseconds, but it is added to every single export call. https://github.com/open-telemetry/opentelemetry-dotnet/pull/2295/files#r700611706

We probably need some perf numbers.

/// <summary>
/// Calculate SUM from incoming delta measurements.
/// </summary>
DoubleSumIncomingDelta = 2,
Member

@reyang reyang Sep 1, 2021

Have we considered making this a Flags enum?

e.g.

  • first four bits:
    • 0001: Sum
    • 0010: Gauge
    • 0100: Histogram
    • anything else: Invalid
  • next two bits:
    • 00: Invalid
    • 01: Delta
    • 10: Cumulative
    • 11: Invalid
  • next four bits:
    • 0001: int64 (long)
    • 0010: double
    • anything else: Invalid
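
One possible reading of the layout above, sketched as a bit-packed enum. The names (`AggregationTypeSketch`, `AggregationBits`) are hypothetical, and placing the "first four bits" in the lowest bits is an assumption; the comment does not say which end they start from.

```csharp
using System;

// Hypothetical bit-packed layout; not the enum used in this PR.
[Flags]
public enum AggregationTypeSketch : ushort
{
    Invalid = 0,

    // bits 0-3: aggregation kind
    Sum = 0b0001,
    Gauge = 0b0010,
    Histogram = 0b0100,

    // bits 4-5: temporality of the incoming measurement
    Delta = 0b01 << 4,
    Cumulative = 0b10 << 4,

    // bits 6-9: value type
    Long = 0b0001 << 6,
    Double = 0b0010 << 6,
}

public static class AggregationBits
{
    private const AggregationTypeSketch KindMask = (AggregationTypeSketch)0b1111;
    private const AggregationTypeSketch TemporalityMask = (AggregationTypeSketch)(0b11 << 4);
    private const AggregationTypeSketch ValueTypeMask = (AggregationTypeSketch)(0b1111 << 6);

    public static AggregationTypeSketch Kind(AggregationTypeSketch t) => t & KindMask;

    public static AggregationTypeSketch Temporality(AggregationTypeSketch t) => t & TemporalityMask;

    public static AggregationTypeSketch ValueType(AggregationTypeSketch t) => t & ValueTypeMask;

    // Valid only when exactly one allowed value is set in each field.
    public static bool IsValid(AggregationTypeSketch t)
    {
        var kind = Kind(t);
        var temporality = Temporality(t);
        var valueType = ValueType(t);

        bool kindOk = kind == AggregationTypeSketch.Sum
            || kind == AggregationTypeSketch.Gauge
            || kind == AggregationTypeSketch.Histogram;
        bool temporalityOk = temporality == AggregationTypeSketch.Delta
            || temporality == AggregationTypeSketch.Cumulative;
        bool valueTypeOk = valueType == AggregationTypeSketch.Long
            || valueType == AggregationTypeSketch.Double;

        return kindOk && temporalityOk && valueTypeOk;
    }
}
```

With this layout a value such as `Sum | Delta | Double` can be tested with a single mask instead of comparing against several enum members.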

Member Author

Yes. This is an internal-only enum, so this micro-optimization can be added transparently in the future.

Member

reyang commented Sep 1, 2021

I don't know how we should proceed on this big PR, and want to get some input from @alanwest @CodeBlanch @utpilla:

  1. I hope that we avoid merging something that touches the stable code and later removing it, since that makes the git blame and history very dirty. (E.g. if we decide not to touch the Batch.cs file, I think we SHOULD NOT leave a change history on it.)
  2. I think the general movement to reduce unnecessary memory allocation is great, and we should get the skeleton in place so other folks can help to divide & conquer.
  3. We can also take the costly approach of splitting this into a dozen PRs (like Refactor exporter - step 1 #1078). I don't know how much extra cost that would introduce (@cijothomas do you have a wild guess?), and it doesn't seem to address 1).

I guess one possible way is that we merge this PR to the metrics branch, perform clean-up and refactoring there, and squash merge it back to main when we have good confidence?

@cijothomas
Member Author

I agree it's best to leave Batch<> untouched in this PR, given it's unlikely to be the final stage we want.
Yes, we can use the metrics branch (we kept the metrics branch around for any major work, and this fits that bill). This would be easier than splitting.

Member

reyang commented Sep 1, 2021

I'll approve the PR (knowing that there are multiple things that we can / probably should refactor / improve) if you could update it to target metrics branch (or whatever feature branch).

foreach (var processor in this.metricProcessors)
{
    processor.SetGetMetricFunction(this.Collect);
    processor.SetParentProvider(this);
    temporality = processor.GetAggregationTemporality();
Member

@alanwest alanwest Sep 1, 2021

Kinda outside the scope of this PR, but would it make sense to do away with the list of MetricProcessors here, since there can only be one? At first glance this code made me think the last processor would dictate the temporality, but then I remembered:

internal MeterProviderBuilderSdk AddMetricProcessor(MetricProcessor processor)
{
    if (this.MetricProcessors.Count >= 1)
    {
        throw new InvalidOperationException("Only one MetricProcessor is allowed.");
    }

Member Author

Yes, it's planned for the next milestone, where we match the spec. (There is no MetricProcessor anymore :))
https://github.com/open-telemetry/opentelemetry-dotnet/milestone/27

Member

alanwest commented Sep 1, 2021

I support approving and targeting the metrics branch. @cijothomas, you mentioned doing another alpha release this Friday. Would we do the release from the metrics branch in this case?

/// <summary>
/// Calculate SUM from incoming delta measurements.
/// </summary>
LongSumIncomingDelta = 0,
Member

What does Incoming mean in this context?

Member Author

Incoming = Measurement, i.e. the value reported by the user through the counter API.

For a sync counter, the measurements reported by the user are deltas.
For an async counter, the measurements reported by the user are cumulative.
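
The distinction can be sketched as two update paths on the same running sum. `LongSumSketch` is a hypothetical type for illustration, not the aggregator in this PR: a delta is added to the stored value, while a cumulative observation replaces it.

```csharp
using System;
using System.Threading;

// Hypothetical aggregator sketch; not the SDK's implementation.
public sealed class LongSumSketch
{
    private long sum;

    public long Sum => Interlocked.Read(ref this.sum);

    // Sync counter: each measurement is a delta, so add it to the running sum.
    public void UpdateFromDelta(long delta) => Interlocked.Add(ref this.sum, delta);

    // Async counter: each observation is already the running total, so overwrite.
    public void UpdateFromCumulative(long total) => Interlocked.Exchange(ref this.sum, total);
}
```

Either way the stored value is a running total; which temporality is exported is a separate decision.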

/// <summary>
/// Histogram.
/// </summary>
Histogram = 6,
Member

@alanwest alanwest Sep 2, 2021

Will there be the need for HistogramDelta and HistogramCumulative?

My 🧠 is still churning to grok what's going on here... so it looks like LongSumIncomingDelta vs. LongSumIncomingCumulative, for example, really just speaks to whether the instrument is sync (delta) vs. async (cumulative), not so much the resultant aggregation temporality that is used?

Member Author

You are right. This class simply refers to the types of aggregation we need to do internally, and says nothing about the output temporality!

initValue = this.doubleVal;
newValue = initValue + number;
}
while (initValue != Interlocked.CompareExchange(ref this.doubleVal, newValue, initValue));
Member

Thanks! 😄 I learned something here... Interlocked.Add equivalent for double
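
For readers who have not seen the pattern, a self-contained version of the compare-and-swap loop might look like this. `DoubleSumSketch` is a hypothetical wrapper written for illustration; only `Interlocked.CompareExchange` and `Volatile.Read` are real .NET APIs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper around the CAS loop shown in the diff.
public sealed class DoubleSumSketch
{
    private double value;

    public double Value => Volatile.Read(ref this.value);

    public void Add(double number)
    {
        double initValue, newValue;
        do
        {
            // Snapshot the current value and compute the desired result.
            initValue = Volatile.Read(ref this.value);
            newValue = initValue + number;
        }
        // CompareExchange returns the value it found; if that is not our snapshot,
        // another thread won the race and the loop retries with a fresh snapshot.
        while (initValue != Interlocked.CompareExchange(ref this.value, newValue, initValue));
    }
}
```

A quick way to exercise it is `Parallel.For(0, 10000, _ => sum.Add(1.0))` and then reading `sum.Value`; no update is lost even though no lock is taken.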

this.Description = instrument.Description;
this.Unit = instrument.Unit;
this.Meter = instrument.Meter;
AggregationType aggType = default;
Member

@alanwest alanwest Sep 2, 2021

Probably for another PR, but I assume we might want some logging in the event none of the conditions match the instrument like Counter<Decimal>, for example. Or is this already prevented by .NET?

Member Author

I'll check again, but I think .NET already restricts the types for

@cijothomas cijothomas changed the base branch from main to metrics September 2, 2021 02:20
@cijothomas
Member Author

The next release is tied to the exporter/aggregator improvement. That milestone (alpha3) will not happen until these changes are brought to main. Hopefully I can quickly address the issues and take it to main soon.

I changed the PR to target metrics branch.

@cijothomas cijothomas changed the title from "WIP - Refactor aggregator and exporters" to "Refactor aggregator and exporters" Sep 2, 2021
Member

@reyang reyang left a comment

LGTM based on this discussion.

@cijothomas
Member Author

Merging to the metrics branch. I will send follow-ups to the metrics branch, and do a merge to main once all concerns are addressed. The plan is for the alpha3 release to include the fixes from this PR (Prometheus fix, OTLP histogram fix).

@cijothomas cijothomas merged commit 91cde47 into open-telemetry:metrics Sep 2, 2021
@cijothomas cijothomas deleted the cijothomas/metricexporter_refactor branch September 2, 2021 17:01
this.circularBuffer = circularBuffer ?? throw new ArgumentNullException(nameof(circularBuffer));
this.targetCount = circularBuffer.RemovedCount + Math.Min(maxSize, circularBuffer.Count);
}

internal Batch(T[] metrics, int maxSize)
{
Debug.Assert(maxSize > 0, $"{nameof(maxSize)} should be a positive number.");
Contributor

Do we still need some runtime check on maxSize, as Debug.Assert doesn't show up in release build?
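
For context, `Debug.Assert` is marked `[Conditional("DEBUG")]`, so calls to it are omitted from release builds. A runtime guard that survives in release could look like the sketch below; `GuardSketch.ValidateMaxSize` is a hypothetical helper, not something in this PR.

```csharp
using System;

public static class GuardSketch
{
    // Unlike Debug.Assert, this check is still present in release builds.
    public static int ValidateMaxSize(int maxSize)
    {
        if (maxSize <= 0)
        {
            throw new ArgumentOutOfRangeException(
                nameof(maxSize),
                maxSize,
                "maxSize should be a positive number.");
        }

        return maxSize;
    }
}
```

Whether the extra branch is worth it for an internal constructor is a judgment call, since an internal caller can be audited and the assert documents the contract during development.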

@cijothomas
Member Author

  1. Introduce Metric and MetricPoint, where Metric is a stream of aggregated metrics containing up to N MetricPoints.
    Each instrument results in one Metric.
    (Once views are in, each instrument may result in more than one Metric. The view config will also be "hard-wired" at instrument creation time, so the hot path is a simple Metric.Update() with no view lookups.)
  2. Each Metric pre-allocates MetricPoint[2000] to store its time series. The Dictionary is still used for lookup, to obtain the index within this array.
    The SDK caps the number of Metrics at 1000 and the number of MetricPoints within each Metric at 2000. These are just hardcoded numbers
    for now. They need to be revisited to pick reasonable defaults, as we don't anticipate exposing them to the user (unless the spec asks for it, but that would likely be a separate discussion).
  3. Exporters get Batch, reusing Batch from trace/log. This might be reconsidered in favor of a simple IEnumerable. Update: We are going to reuse Batch; this is being worked on in this PR: https://github.com/open-telemetry/opentelemetry-dotnet/pull/2327/files
  4. Metric exposes GetDataPoints(), returning BatchMetricPoint. BatchMetricPoint is newly introduced as a way to iterate through the MetricPoint array; MetricPoint is a struct and hence cannot reuse Batch. This needs to be revisited based on 5 below. Update: We are continuing with this approach for now. See below as well.
  5. MetricPoint is now a struct storing all types of Aggregator output: Sum, Gauge, Histogram. (Might save some cost by using a union approach.)
    There is a cost involved in "copying" the struct when iterating. Will explore better ways to handle this. Update: The copy perf cost is fixed in "Modify MetricPoint to avoid copy" #2321. The union approach is not yet done and will be revisited in a later release.
  6. All exporters are modified to use the new Metric/MetricPoint structure. This fixes the Prometheus exporter issue and optimizes OTLP as well. The histogram is also modelled as per OTLP, which is more efficient. (Also fixed the bound issue reported by Alan.)

// Upcoming in separate PRs

  1. Explore whether ConcurrentDictionary could be used instead of Dictionary with locks.
  2. Sorting of tag keys/values is a major cost on the hot path. It might be good to store the first-seen order in the Dictionary along with the sorted order, so that if the user keeps reusing the same order, we can avoid the sort cost. (This costs additional memory, but is likely justifiable.)

It's a big PR; I can split it into separate steps if needed. Or we can keep it as is and address key issues in separate PRs.
