Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify MetricPoint to avoid copy #2321

Merged

Conversation

cijothomas
Copy link
Member

Alternate to https://github.com/open-telemetry/opentelemetry-dotnet/pull/2320/files

Changes

Please provide a brief description of the changes here.

For significant contributions please make sure you have completed the following items:

  • CHANGELOG.md updated for non-trivial changes
  • Design discussion issue #
  • Changes in public API reviewed

@cijothomas cijothomas requested a review from a team September 8, 2021 19:58
@codecov
Copy link

codecov bot commented Sep 8, 2021

Codecov Report

Merging #2321 (0aa9837) into metrics (4afa474) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           metrics    #2321      +/-   ##
===========================================
+ Coverage    78.08%   78.09%   +0.01%     
===========================================
  Files          241      241              
  Lines         7619     7614       -5     
===========================================
- Hits          5949     5946       -3     
+ Misses        1670     1668       -2     
Impacted Files Coverage Δ
src/OpenTelemetry/Metrics/BatchMetricPoint.cs 83.33% <100.00%> (-2.88%) ⬇️
...Zipkin/Implementation/ZipkinExporterEventSource.cs 63.63% <0.00%> (-9.10%) ⬇️
...emetry.Api/Internal/OpenTelemetryApiEventSource.cs 82.35% <0.00%> (+8.82%) ⬆️

{
get
{
return ref this.metricsPoints[this.index];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this because, if you think about it, ref MetricPoint is essentially a pointer (think &MetricPoint) which is what we were trying to do with that MetricPointPointer struct on the other PR. What I don't know is if using this enumerator in a foreach is the compiler smart enough to use Current as a ref and not make a local copy? Need to look at compiled MSIL, possibly JIT output in benchmark, to see. I can take that on if you would like.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
I tried benchmark (https://github.com/open-telemetry/opentelemetry-dotnet/pull/2322/files) and the ref approach (this PR) reduced the mean time from ~100us to ~50us. (Must be the savings from copying, but yes we should confirm this)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I threw together a quick benchmark over just the bits (more or less) of the pattern we're talking about.

Benchmarks
    [DisassemblyDiagnoser(printSource: true)]
    [MemoryDiagnoser]
    public class RefEnumeratorBenchmark
    {
        private readonly MetricPoint[] storage = new MetricPoint[100];

        [Benchmark]
        public void Foreach()
        {
            foreach (MetricPoint mp in new Enumeration(this.storage))
            {
                var dv = mp.DoubleValue;
            }
        }

        [Benchmark]
        public void ForeachWithRef()
        {
            foreach (ref MetricPoint mp in new Enumeration(this.storage))
            {
                var dv = mp.DoubleValue;
            }
        }

        [Benchmark]
        public void Unrolled()
        {
            Enumeration.Enumerator e = new Enumeration(this.storage).GetEnumerator();
            while (e.MoveNext())
            {
                ref MetricPoint mp = ref e.Current;

                var dv = mp.DoubleValue;
            }
        }

        public readonly struct Enumeration
        {
            private readonly MetricPoint[] storage;

            internal Enumeration(MetricPoint[] storage)
            {
                this.storage = storage;
            }

            public Enumerator GetEnumerator() => new Enumerator(this.storage);

            public struct Enumerator
            {
                private readonly MetricPoint[] storage;
                private int index;

                internal Enumerator(MetricPoint[] storage)
                {
                    this.storage = storage;
                    this.index = 0;
                }

                public ref MetricPoint Current => ref this.storage[this.index - 1];

                public bool MoveNext()
                {
                    while (this.index++ < this.storage.Length)
                    {
                        return true;
                    }

                    return false;
                }
            }
        }
    }

Some interesting results.

Method Mean Error StdDev Median Code Size Allocated
Foreach 43.19 ns 1.319 ns 3.786 ns 42.15 ns 41 B -
ForeachWithRef 68.13 ns 1.319 ns 1.295 ns 67.79 ns 59 B -
Unrolled 69.40 ns 1.586 ns 4.524 ns 67.74 ns 59 B -

ForeachWithRef and Unrolled seem to be slower, which I can't explain.

Foreach does:

      IL_0016: ldloca.s     V_0
      IL_0018: call         instance valuetype [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint& Benchmarks.RefEnumeratorBenchmark/Enumeration/Enumerator::get_Current()
      IL_001d: ldobj        [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint
      IL_0022: stloc.2      // mp

      IL_0023: ldloca.s     mp
      IL_0025: call         instance float64 [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint::get_DoubleValue()
      IL_002a: pop

ldobj is making a copy of mp (MetricPoint) which should be much slower.

Where ForeachWithRef and Unrolled do:

      IL_0016: ldloca.s     e
      IL_0018: call         instance valuetype [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint& Benchmarks.RefEnumeratorBenchmark/Enumeration/Enumerator::get_Current()

      IL_001d: call         instance float64 [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint::get_DoubleValue()
      IL_0022: pop

Which is using the address of mp on the stack.

My guess would be the ref Current in the enumerator throws off some rule in the JIT optimizing the foreach but we probably need some help from runtime team to confirm.

Here's the disassembled output. I used .NET 6 because it supposedly has optimizations for value types but .net 5.0 & 3.1 performed the same.

Disassembled .NET 6 code

.NET 6.0.0 (6.0.21.45706), X64 RyuJIT

; Benchmarks.RefEnumeratorBenchmark.Foreach()
       sub       rsp,28
       mov       rax,[rcx+8]
       xor       edx,edx
       jmp       short M00_L01
M00_L00:
       lea       edx,[rcx+0FFFF]
       cmp       edx,[rax+8]
       jae       short M00_L02
       mov       edx,ecx
M00_L01:
       lea       ecx,[rdx+1]
       cmp       [rax+8],edx
       jg        short M00_L00
       add       rsp,28
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 41

.NET 6.0.0 (6.0.21.45706), X64 RyuJIT

; Benchmarks.RefEnumeratorBenchmark.ForeachWithRef()
       sub       rsp,28
       mov       rax,[rcx+8]
       xor       edx,edx
       jmp       short M00_L01
M00_L00:
       lea       edx,[rcx+0FFFF]
       cmp       edx,[rax+8]
       jae       short M00_L02
       movsxd    rdx,edx
       imul      rdx,88
       lea       rdx,[rax+rdx+10]
       mov       edx,[rdx+58]
       mov       edx,ecx
M00_L01:
       lea       ecx,[rdx+1]
       cmp       [rax+8],edx
       jg        short M00_L00
       add       rsp,28
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 59

.NET 6.0.0 (6.0.21.45706), X64 RyuJIT

; Benchmarks.RefEnumeratorBenchmark.Unrolled()
       sub       rsp,28
       mov       rax,[rcx+8]
       xor       edx,edx
       jmp       short M00_L01
M00_L00:
       lea       edx,[rcx+0FFFF]
       cmp       edx,[rax+8]
       jae       short M00_L02
       movsxd    rdx,edx
       imul      rdx,88
       lea       rdx,[rax+rdx+10]
       mov       edx,[rdx+58]
       mov       edx,ecx
M00_L01:
       lea       ecx,[rdx+1]
       cmp       [rax+8],edx
       jg        short M00_L00
       add       rsp,28
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 59

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GrabYourPitchforks @stephentoub Hey sorry for bothering you guys, looking for some expert guidance on the code above. Could either of you assist or tag in someone who has an extra cycle or two?

What we're trying to vet out: Good idea or bad idea to use ref [struct] Current in an enumerator?

Seems to work but seeing some surprising benchmarks.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JIT is sometimes detecting that the benchmark code posted at #2321 (comment) isn't actually doing anything other than looping, and it's optimizing away the dereference. For example, the Benchmarks.RefEnumeratorBenchmark.Foreach benchmark only ever loops from i = 0 to arr.Length - 1, never attempting to dereference any element from the array. In ForeachWithRef, it is dereferencing the value, but it's immediately throwing away the value once the read is complete. I suspect you're right in that it's simply a missing optimization in the JIT: in theory it could detect that the dereference is a no-op (as is done in the non-ref case) and elide it entirely.

To defeat this optimization and get closer to what you're trying to measure for real, you should do something with the result. Something like:

[Benchmark]
public double MyBenchmark()
{
    double sum = 0;
    foreach (var entry in new Enumerator(/* ... */))
    {
        sum += entry.DoubleValue;
    }
    return sum;
}

This will force the JIT to dereference at least the DoubleValue field for all entries. But the JIT is still free to detect that you're not touching any other part of the entries, so it could still attempt to keep the number of dereferences as minimal as possible. (This is generally a good thing!)

I think the best benchmark would be one that performs a realistic unit of work within each loop. At least that would have the effect of forcing the JIT to generate code that matches what your users will observe in the real world.

Hope this helps!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cijothomas OK updated the benchmarks using @GrabYourPitchforks suggestions (thank you!).

Benchmarks
    [DisassemblyDiagnoser(printSource: true)]
    [MemoryDiagnoser]
    public class RefEnumeratorBenchmark
    {
        private readonly MetricPoint[] storage = new MetricPoint[100];

        [Benchmark]
        public double Foreach()
        {
            double sum = 0;
            foreach (MetricPoint mp in new Enumeration(this.storage))
            {
                sum += mp.DoubleValue;
            }

            return sum;
        }

        [Benchmark]
        public double ForeachWithRef()
        {
            double sum = 0;
            foreach (ref MetricPoint mp in new Enumeration(this.storage))
            {
                sum += mp.DoubleValue;
            }

            return sum;
        }

        [Benchmark]
        public double Unrolled()
        {
            double sum = 0;
            Enumeration.Enumerator e = new Enumeration(this.storage).GetEnumerator();
            while (e.MoveNext())
            {
                ref MetricPoint mp = ref e.Current;

                sum += mp.DoubleValue;
            }

            return sum;
        }

        public readonly struct Enumeration
        {
            private readonly MetricPoint[] storage;

            internal Enumeration(MetricPoint[] storage)
            {
                this.storage = storage;
            }

            public Enumerator GetEnumerator() => new Enumerator(this.storage);

            public struct Enumerator
            {
                private readonly MetricPoint[] storage;
                private int index;

                internal Enumerator(MetricPoint[] storage)
                {
                    this.storage = storage;
                    this.index = 0;
                }

                public ref MetricPoint Current => ref this.storage[this.index - 1];

                public bool MoveNext()
                {
                    while (this.index++ < this.storage.Length)
                    {
                        return true;
                    }

                    return false;
                }
            }
        }
    }
Method Mean Error StdDev Median Code Size Allocated
Foreach 1,329.33 ns 28.373 ns 81.862 ns 1,304.20 ns 193 B -
ForeachWithRef 94.52 ns 1.686 ns 1.942 ns 93.81 ns 68 B -
Unrolled 93.08 ns 1.231 ns 1.151 ns 93.27 ns 68 B -

This is showing more what I was expecting to see.

I like this pattern with the ref MetricPoint Current enumerator but we'll need to take care that exporters properly loop (eg foreach (ref MetricPoint metricPoint in metric.GetMetricPoints())) otherwise the loop will make a copy of each struct and be slooooow. I don't think that's a deal-breaker?

Disassembled .NET 5 code

.NET 5.0.9 (5.0.921.35908), X64 RyuJIT

; Benchmarks.RefEnumeratorBenchmark.Foreach()
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,0C8
       vzeroupper
       vmovaps   [rsp+0B0],xmm6
       xor       eax,eax
       mov       [rsp+28],rax
       vxorps    xmm4,xmm4,xmm4
       vmovdqa   xmmword ptr [rsp+30],xmm4
       vmovdqa   xmmword ptr [rsp+40],xmm4
       mov       rax,0FFFFFFFFFFA0
M00_L00:
       vmovdqa   xmmword ptr [rsp+rax+0B0],xmm4
       vmovdqa   xmmword ptr [rsp+rax+0C0],xmm4
       vmovdqa   xmmword ptr [rsp+rax+0D0],xmm4
       add       rax,30
       jne       short M00_L00
       vxorps    xmm6,xmm6,xmm6
       mov       rbx,[rcx+8]
       xor       ebp,ebp
       jmp       short M00_L02
M00_L01:
       mov       [rsp+24],ecx
       lea       eax,[rcx+0FFFF]
       cmp       eax,[rbx+8]
       jae       short M00_L03
       movsxd    rax,eax
       imul      rax,88
       lea       rsi,[rbx+rax+10]
       lea       rdi,[rsp+28]
       mov       ecx,11
       rep movsq
       vaddsd    xmm6,xmm6,qword ptr [rsp+80]
       mov       ebp,[rsp+24]
M00_L02:
       lea       ecx,[rbp+1]
       cmp       [rbx+8],ebp
       jg        short M00_L01
       vmovaps   xmm0,xmm6
       vmovaps   xmm6,[rsp+0B0]
       add       rsp,0C8
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       ret
M00_L03:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 193

.NET 5.0.9 (5.0.921.35908), X64 RyuJIT

; Benchmarks.RefEnumeratorBenchmark.ForeachWithRef()
       sub       rsp,28
       vzeroupper
       vxorps    xmm0,xmm0,xmm0
       mov       rax,[rcx+8]
       xor       edx,edx
       jmp       short M00_L01
M00_L00:
       lea       edx,[rcx+0FFFF]
       cmp       edx,[rax+8]
       jae       short M00_L02
       movsxd    rdx,edx
       imul      rdx,88
       lea       rdx,[rax+rdx+10]
       vaddsd    xmm0,xmm0,qword ptr [rdx+58]
       mov       edx,ecx
M00_L01:
       lea       ecx,[rdx+1]
       cmp       [rax+8],edx
       jg        short M00_L00
       add       rsp,28
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 68

.NET 5.0.9 (5.0.921.35908), X64 RyuJIT

; Benchmarks.RefEnumeratorBenchmark.Unrolled()
       sub       rsp,28
       vzeroupper
       vxorps    xmm0,xmm0,xmm0
       mov       rax,[rcx+8]
       xor       edx,edx
       jmp       short M00_L01
M00_L00:
       lea       edx,[rcx+0FFFF]
       cmp       edx,[rax+8]
       jae       short M00_L02
       movsxd    rdx,edx
       imul      rdx,88
       lea       rdx,[rax+rdx+10]
       vaddsd    xmm0,xmm0,qword ptr [rdx+58]
       mov       edx,ecx
M00_L01:
       lea       ecx,[rdx+1]
       cmp       [rax+8],edx
       jg        short M00_L00
       add       rsp,28
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 68

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the insights here! I added similar benchmark, part of the metricbenchmarks demo-ing this as well.

@CodeBlanch I think asking export authors to use ref is fine.

  1. Only people writing exporters need to be aware of this. We can highlight it in similar section like https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/trace/extending-the-sdk#exporter
  2. Even if Export writers dont follow this guidance, they will still get correct output. Only perf is affected.

@alanwest @reyang Please share your thoughts here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merging to unblock releases. WIll address any comments via future PRs.

@cijothomas cijothomas merged commit cb8f686 into open-telemetry:metrics Sep 10, 2021
@cijothomas cijothomas deleted the cijothomas/metricpoint_avoidcopy2 branch September 10, 2021 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants