Modify MetricPoint to avoid copy #2321

CodeBlanch · 2021-09-08T20:59:08Z

I like this because, if you think about it, ref MetricPoint is essentially a pointer (think &MetricPoint) which is what we were trying to do with that MetricPointPointer struct on the other PR. What I don't know is if using this enumerator in a foreach is the compiler smart enough to use Current as a ref and not make a local copy? Need to look at compiled MSIL, possibly JIT output in benchmark, to see. I can take that on if you would like.

Thanks!
I tried benchmark (https://github.com/open-telemetry/opentelemetry-dotnet/pull/2322/files) and the ref approach (this PR) reduced the mean time from ~100us to ~50us. (Must be the savings from copying, but yes we should confirm this)

I threw together a quick benchmark over just the bits (more or less) of the pattern we're talking about.

Benchmarks

[DisassemblyDiagnoser(printSource: true)] [MemoryDiagnoser] public class RefEnumeratorBenchmark { private readonly MetricPoint[] storage = new MetricPoint[100]; [Benchmark] public void Foreach() { foreach (MetricPoint mp in new Enumeration(this.storage)) { var dv = mp.DoubleValue; } } [Benchmark] public void ForeachWithRef() { foreach (ref MetricPoint mp in new Enumeration(this.storage)) { var dv = mp.DoubleValue; } } [Benchmark] public void Unrolled() { Enumeration.Enumerator e = new Enumeration(this.storage).GetEnumerator(); while (e.MoveNext()) { ref MetricPoint mp = ref e.Current; var dv = mp.DoubleValue; } } public readonly struct Enumeration { private readonly MetricPoint[] storage; internal Enumeration(MetricPoint[] storage) { this.storage = storage; } public Enumerator GetEnumerator() => new Enumerator(this.storage); public struct Enumerator { private readonly MetricPoint[] storage; private int index; internal Enumerator(MetricPoint[] storage) { this.storage = storage; this.index = 0; } public ref MetricPoint Current => ref this.storage[this.index - 1]; public bool MoveNext() { while (this.index++ < this.storage.Length) { return true; } return false; } } } }

Some interesting results.

Method Mean Error StdDev Median Code Size Allocated

Foreach 43.19 ns 1.319 ns 3.786 ns 42.15 ns 41 B -

ForeachWithRef 68.13 ns 1.319 ns 1.295 ns 67.79 ns 59 B -

Unrolled 69.40 ns 1.586 ns 4.524 ns 67.74 ns 59 B -

ForeachWithRef and Unrolled seem to be slower, which I can't explain.

Foreach does:

IL_0016: ldloca.s V_0 IL_0018: call instance valuetype [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint& Benchmarks.RefEnumeratorBenchmark/Enumeration/Enumerator::get_Current() IL_001d: ldobj [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint IL_0022: stloc.2 // mp IL_0023: ldloca.s mp IL_0025: call instance float64 [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint::get_DoubleValue() IL_002a: pop

ldobj is making a copy of mp (MetricPoint) which should be much slower.

Where ForeachWithRef and Unrolled do:

IL_0016: ldloca.s e IL_0018: call instance valuetype [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint& Benchmarks.RefEnumeratorBenchmark/Enumeration/Enumerator::get_Current() IL_001d: call instance float64 [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint::get_DoubleValue() IL_0022: pop

Which is using the address of mp on the stack.

My guess would be the ref Current in the enumerator throws off some rule in the JIT optimizing the foreach but we probably need some help from runtime team to confirm.

Here's the disassembled output. I used .NET 6 because it supposedly has optimizations for value types but .net 5.0 & 3.1 performed the same.

Disassembled .NET 6 code

.NET 6.0.0 (6.0.21.45706), X64 RyuJIT

; Benchmarks.RefEnumeratorBenchmark.Foreach() sub rsp,28 mov rax,[rcx+8] xor edx,edx jmp short M00_L01 M00_L00: lea edx,[rcx+0FFFF] cmp edx,[rax+8] jae short M00_L02 mov edx,ecx M00_L01: lea ecx,[rdx+1] cmp [rax+8],edx jg short M00_L00 add rsp,28 ret M00_L02: call CORINFO_HELP_RNGCHKFAIL int 3 ; Total bytes of code 41

.NET 6.0.0 (6.0.21.45706), X64 RyuJIT

; Benchmarks.RefEnumeratorBenchmark.ForeachWithRef() sub rsp,28 mov rax,[rcx+8] xor edx,edx jmp short M00_L01 M00_L00: lea edx,[rcx+0FFFF] cmp edx,[rax+8] jae short M00_L02 movsxd rdx,edx imul rdx,88 lea rdx,[rax+rdx+10] mov edx,[rdx+58] mov edx,ecx M00_L01: lea ecx,[rdx+1] cmp [rax+8],edx jg short M00_L00 add rsp,28 ret M00_L02: call CORINFO_HELP_RNGCHKFAIL int 3 ; Total bytes of code 59

.NET 6.0.0 (6.0.21.45706), X64 RyuJIT

; Benchmarks.RefEnumeratorBenchmark.Unrolled() sub rsp,28 mov rax,[rcx+8] xor edx,edx jmp short M00_L01 M00_L00: lea edx,[rcx+0FFFF] cmp edx,[rax+8] jae short M00_L02 movsxd rdx,edx imul rdx,88 lea rdx,[rax+rdx+10] mov edx,[rdx+58] mov edx,ecx M00_L01: lea ecx,[rdx+1] cmp [rax+8],edx jg short M00_L00 add rsp,28 ret M00_L02: call CORINFO_HELP_RNGCHKFAIL int 3 ; Total bytes of code 59

@GrabYourPitchforks @stephentoub Hey sorry for bothering you guys, looking for some expert guidance on the code above. Could either of you assist or tag in someone who has an extra cycle or two?

What we're trying to vet out: Good idea or bad idea to use ref [struct] Current in an enumerator?

Seems to work but seeing some surprising benchmarks.

JIT is sometimes detecting that the benchmark code posted at #2321 (comment) isn't actually doing anything other than looping, and it's optimizing away the dereference. For example, the Benchmarks.RefEnumeratorBenchmark.Foreach benchmark only ever loops from i = 0 to arr.Length - 1, never attempting to dereference any element from the array. In ForeachWithRef, it is dereferencing the value, but it's immediately throwing away the value once the read is complete. I suspect you're right in that it's simply a missing optimization in the JIT: in theory it could detect that the dereference is a no-op (as is done in the non-ref case) and elide it entirely.

To defeat this optimization and get closer to what you're trying to measure for real, you should do something with the result. Something like:

[Benchmark] public double MyBenchmark() { double sum = 0; foreach (var entry in new Enumerator(/* ... */)) { sum += entry.DoubleValue; } return sum; }

This will force the JIT to dereference at least the DoubleValue field for all entries. But the JIT is still free to detect that you're not touching any other part of the entries, so it could still attempt to keep the number of dereferences as minimal as possible. (This is generally a good thing!)

I think the best benchmark would be one that performs a realistic unit of work within each loop. At least that would have the effect of forcing the JIT to generate code that matches what your users will observe in the real world.

Hope this helps!

@cijothomas OK updated the benchmarks using @GrabYourPitchforks suggestions (thank you!).

Benchmarks

[DisassemblyDiagnoser(printSource: true)] [MemoryDiagnoser] public class RefEnumeratorBenchmark { private readonly MetricPoint[] storage = new MetricPoint[100]; [Benchmark] public double Foreach() { double sum = 0; foreach (MetricPoint mp in new Enumeration(this.storage)) { sum += mp.DoubleValue; } return sum; } [Benchmark] public double ForeachWithRef() { double sum = 0; foreach (ref MetricPoint mp in new Enumeration(this.storage)) { sum += mp.DoubleValue; } return sum; } [Benchmark] public double Unrolled() { double sum = 0; Enumeration.Enumerator e = new Enumeration(this.storage).GetEnumerator(); while (e.MoveNext()) { ref MetricPoint mp = ref e.Current; sum += mp.DoubleValue; } return sum; } public readonly struct Enumeration { private readonly MetricPoint[] storage; internal Enumeration(MetricPoint[] storage) { this.storage = storage; } public Enumerator GetEnumerator() => new Enumerator(this.storage); public struct Enumerator { private readonly MetricPoint[] storage; private int index; internal Enumerator(MetricPoint[] storage) { this.storage = storage; this.index = 0; } public ref MetricPoint Current => ref this.storage[this.index - 1]; public bool MoveNext() { while (this.index++ < this.storage.Length) { return true; } return false; } } } }

Method Mean Error StdDev Median Code Size Allocated

Foreach 1,329.33 ns 28.373 ns 81.862 ns 1,304.20 ns 193 B -

ForeachWithRef 94.52 ns 1.686 ns 1.942 ns 93.81 ns 68 B -

Unrolled 93.08 ns 1.231 ns 1.151 ns 93.27 ns 68 B -

This is showing more what I was expecting to see.

I like this pattern with the ref MetricPoint Current enumerator but we'll need to take care that exporters properly loop (eg foreach (ref MetricPoint metricPoint in metric.GetMetricPoints())) otherwise the loop will make a copy of each struct and be slooooow. I don't think that's a deal-breaker?

Disassembled .NET 5 code

.NET 5.0.9 (5.0.921.35908), X64 RyuJIT

; Benchmarks.RefEnumeratorBenchmark.Foreach() push rdi push rsi push rbp push rbx sub rsp,0C8 vzeroupper vmovaps [rsp+0B0],xmm6 xor eax,eax mov [rsp+28],rax vxorps xmm4,xmm4,xmm4 vmovdqa xmmword ptr [rsp+30],xmm4 vmovdqa xmmword ptr [rsp+40],xmm4 mov rax,0FFFFFFFFFFA0 M00_L00: vmovdqa xmmword ptr [rsp+rax+0B0],xmm4 vmovdqa xmmword ptr [rsp+rax+0C0],xmm4 vmovdqa xmmword ptr [rsp+rax+0D0],xmm4 add rax,30 jne short M00_L00 vxorps xmm6,xmm6,xmm6 mov rbx,[rcx+8] xor ebp,ebp jmp short M00_L02 M00_L01: mov [rsp+24],ecx lea eax,[rcx+0FFFF] cmp eax,[rbx+8] jae short M00_L03 movsxd rax,eax imul rax,88 lea rsi,[rbx+rax+10] lea rdi,[rsp+28] mov ecx,11 rep movsq vaddsd xmm6,xmm6,qword ptr [rsp+80] mov ebp,[rsp+24] M00_L02: lea ecx,[rbp+1] cmp [rbx+8],ebp jg short M00_L01 vmovaps xmm0,xmm6 vmovaps xmm6,[rsp+0B0] add rsp,0C8 pop rbx pop rbp pop rsi pop rdi ret M00_L03: call CORINFO_HELP_RNGCHKFAIL int 3 ; Total bytes of code 193

.NET 5.0.9 (5.0.921.35908), X64 RyuJIT

; Benchmarks.RefEnumeratorBenchmark.ForeachWithRef() sub rsp,28 vzeroupper vxorps xmm0,xmm0,xmm0 mov rax,[rcx+8] xor edx,edx jmp short M00_L01 M00_L00: lea edx,[rcx+0FFFF] cmp edx,[rax+8] jae short M00_L02 movsxd rdx,edx imul rdx,88 lea rdx,[rax+rdx+10] vaddsd xmm0,xmm0,qword ptr [rdx+58] mov edx,ecx M00_L01: lea ecx,[rdx+1] cmp [rax+8],edx jg short M00_L00 add rsp,28 ret M00_L02: call CORINFO_HELP_RNGCHKFAIL int 3 ; Total bytes of code 68

.NET 5.0.9 (5.0.921.35908), X64 RyuJIT

; Benchmarks.RefEnumeratorBenchmark.Unrolled() sub rsp,28 vzeroupper vxorps xmm0,xmm0,xmm0 mov rax,[rcx+8] xor edx,edx jmp short M00_L01 M00_L00: lea edx,[rcx+0FFFF] cmp edx,[rax+8] jae short M00_L02 movsxd rdx,edx imul rdx,88 lea rdx,[rax+rdx+10] vaddsd xmm0,xmm0,qword ptr [rdx+58] mov edx,ecx M00_L01: lea ecx,[rdx+1] cmp [rax+8],edx jg short M00_L00 add rsp,28 ret M00_L02: call CORINFO_HELP_RNGCHKFAIL int 3 ; Total bytes of code 68

Thanks for the insights here! I added similar benchmark, part of the metricbenchmarks demo-ing this as well.

@CodeBlanch I think asking export authors to use ref is fine.

Only people writing exporters need to be aware of this. We can highlight it in similar section like https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/trace/extending-the-sdk#exporter

Even if Export writers dont follow this guidance, they will still get correct output. Only perf is affected.

@alanwest @reyang Please share your thoughts here.

merging to unblock releases. WIll address any comments via future PRs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modify MetricPoint to avoid copy #2321

Diff view

Diff view

There are no files selected for viewing

CodeBlanch Sep 8, 2021

cijothomas Sep 8, 2021

CodeBlanch Sep 8, 2021

CodeBlanch Sep 9, 2021

GrabYourPitchforks Sep 9, 2021

CodeBlanch Sep 9, 2021

cijothomas Sep 10, 2021

cijothomas Sep 10, 2021

Modify MetricPoint to avoid copy #2321

Modify MetricPoint to avoid copy #2321

Diff view

Diff view

There are no files selected for viewing

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

.NET 6.0.0 (6.0.21.45706), X64 RyuJIT

.NET 6.0.0 (6.0.21.45706), X64 RyuJIT

.NET 6.0.0 (6.0.21.45706), X64 RyuJIT

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

.NET 5.0.9 (5.0.921.35908), X64 RyuJIT

.NET 5.0.9 (5.0.921.35908), X64 RyuJIT

.NET 5.0.9 (5.0.921.35908), X64 RyuJIT

Choose a reason for hiding this comment

Choose a reason for hiding this comment