Modify MetricPoint to avoid copy #2321
Conversation
Codecov Report
@@ Coverage Diff @@
## metrics #2321 +/- ##
===========================================
+ Coverage 78.08% 78.09% +0.01%
===========================================
Files 241 241
Lines 7619 7614 -5
===========================================
- Hits 5949 5946 -3
+ Misses 1670 1668 -2
{
    get
    {
        return ref this.metricsPoints[this.index];
    }
}
I like this because, if you think about it, ref MetricPoint is essentially a pointer (think &MetricPoint), which is what we were trying to do with that MetricPointPointer struct on the other PR. What I don't know is whether, when this enumerator is used in a foreach, the compiler is smart enough to use Current as a ref and not make a local copy. We need to look at the compiled MSIL, and possibly the JIT output in a benchmark, to see. I can take that on if you would like.
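For reference, a rough sketch of how the compiler lowers the two foreach forms over a pattern-based enumerator like this one (enumeration and the local names here are illustrative, not the actual generated code):

// foreach (MetricPoint mp in enumeration) { ... }
var e1 = enumeration.GetEnumerator();
while (e1.MoveNext())
{
    MetricPoint mp = e1.Current; // Current returns a ref, but the non-ref local copies the whole struct
    // ...
}

// foreach (ref MetricPoint mp in enumeration) { ... }
var e2 = enumeration.GetEnumerator();
while (e2.MoveNext())
{
    ref MetricPoint mp = ref e2.Current; // ref local aliases the array element; no copy
    // ...
}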
Thanks!
I tried a benchmark (https://github.com/open-telemetry/opentelemetry-dotnet/pull/2322/files) and the ref approach (this PR) reduced the mean time from ~100us to ~50us. (That must be the savings from avoiding the copy, but yes, we should confirm this.)
I threw together a quick benchmark over just the bits (more or less) of the pattern we're talking about.
Benchmarks
[DisassemblyDiagnoser(printSource: true)]
[MemoryDiagnoser]
public class RefEnumeratorBenchmark
{
    private readonly MetricPoint[] storage = new MetricPoint[100];

    [Benchmark]
    public void Foreach()
    {
        foreach (MetricPoint mp in new Enumeration(this.storage))
        {
            var dv = mp.DoubleValue;
        }
    }

    [Benchmark]
    public void ForeachWithRef()
    {
        foreach (ref MetricPoint mp in new Enumeration(this.storage))
        {
            var dv = mp.DoubleValue;
        }
    }

    [Benchmark]
    public void Unrolled()
    {
        Enumeration.Enumerator e = new Enumeration(this.storage).GetEnumerator();
        while (e.MoveNext())
        {
            ref MetricPoint mp = ref e.Current;
            var dv = mp.DoubleValue;
        }
    }

    public readonly struct Enumeration
    {
        private readonly MetricPoint[] storage;

        internal Enumeration(MetricPoint[] storage)
        {
            this.storage = storage;
        }

        public Enumerator GetEnumerator() => new Enumerator(this.storage);

        public struct Enumerator
        {
            private readonly MetricPoint[] storage;
            private int index;

            internal Enumerator(MetricPoint[] storage)
            {
                this.storage = storage;
                this.index = 0;
            }
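            // Note: MoveNext advances this.index before Current is read, so Current
            // returns a ref to the element at this.index - 1.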
            public ref MetricPoint Current => ref this.storage[this.index - 1];

            public bool MoveNext()
            {
                while (this.index++ < this.storage.Length)
                {
                    return true;
                }

                return false;
            }
        }
    }
}
Some interesting results.
Method | Mean | Error | StdDev | Median | Code Size | Allocated |
---|---|---|---|---|---|---|
Foreach | 43.19 ns | 1.319 ns | 3.786 ns | 42.15 ns | 41 B | - |
ForeachWithRef | 68.13 ns | 1.319 ns | 1.295 ns | 67.79 ns | 59 B | - |
Unrolled | 69.40 ns | 1.586 ns | 4.524 ns | 67.74 ns | 59 B | - |
ForeachWithRef and Unrolled seem to be slower, which I can't explain. Foreach does:
IL_0016: ldloca.s V_0
IL_0018: call instance valuetype [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint& Benchmarks.RefEnumeratorBenchmark/Enumeration/Enumerator::get_Current()
IL_001d: ldobj [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint
IL_0022: stloc.2 // mp
IL_0023: ldloca.s mp
IL_0025: call instance float64 [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint::get_DoubleValue()
IL_002a: pop
ldobj is making a copy of mp (MetricPoint), which should be much slower. Whereas ForeachWithRef and Unrolled do:
IL_0016: ldloca.s e
IL_0018: call instance valuetype [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint& Benchmarks.RefEnumeratorBenchmark/Enumeration/Enumerator::get_Current()
IL_001d: call instance float64 [OpenTelemetry]OpenTelemetry.Metrics.MetricPoint::get_DoubleValue()
IL_0022: pop
This uses the address of mp on the stack instead of a copy.
My guess would be that the ref Current in the enumerator throws off some rule in the JIT optimizing the foreach, but we probably need some help from the runtime team to confirm.
Here's the disassembled output. I used .NET 6 because it supposedly has optimizations for value types, but .NET 5.0 and 3.1 performed the same.
Disassembled .NET 6 code
.NET 6.0.0 (6.0.21.45706), X64 RyuJIT
; Benchmarks.RefEnumeratorBenchmark.Foreach()
sub rsp,28
mov rax,[rcx+8]
xor edx,edx
jmp short M00_L01
M00_L00:
lea edx,[rcx+0FFFF]
cmp edx,[rax+8]
jae short M00_L02
mov edx,ecx
M00_L01:
lea ecx,[rdx+1]
cmp [rax+8],edx
jg short M00_L00
add rsp,28
ret
M00_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 41
.NET 6.0.0 (6.0.21.45706), X64 RyuJIT
; Benchmarks.RefEnumeratorBenchmark.ForeachWithRef()
sub rsp,28
mov rax,[rcx+8]
xor edx,edx
jmp short M00_L01
M00_L00:
lea edx,[rcx+0FFFF]
cmp edx,[rax+8]
jae short M00_L02
movsxd rdx,edx
imul rdx,88
lea rdx,[rax+rdx+10]
mov edx,[rdx+58]
mov edx,ecx
M00_L01:
lea ecx,[rdx+1]
cmp [rax+8],edx
jg short M00_L00
add rsp,28
ret
M00_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 59
.NET 6.0.0 (6.0.21.45706), X64 RyuJIT
; Benchmarks.RefEnumeratorBenchmark.Unrolled()
sub rsp,28
mov rax,[rcx+8]
xor edx,edx
jmp short M00_L01
M00_L00:
lea edx,[rcx+0FFFF]
cmp edx,[rax+8]
jae short M00_L02
movsxd rdx,edx
imul rdx,88
lea rdx,[rax+rdx+10]
mov edx,[rdx+58]
mov edx,ecx
M00_L01:
lea ecx,[rdx+1]
cmp [rax+8],edx
jg short M00_L00
add rsp,28
ret
M00_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 59
@GrabYourPitchforks @stephentoub Hey, sorry for bothering you guys; we're looking for some expert guidance on the code above. Could either of you assist or tag in someone who has an extra cycle or two?
What we're trying to vet out: is it a good idea or a bad idea to use a ref [struct] Current in an enumerator?
It seems to work, but we're seeing some surprising benchmarks.
The JIT is sometimes detecting that the benchmark code posted at #2321 (comment) isn't actually doing anything other than looping, and it's optimizing away the dereference. For example, the Benchmarks.RefEnumeratorBenchmark.Foreach benchmark only ever loops from i = 0 to arr.Length - 1, never attempting to dereference any element from the array. In ForeachWithRef, it is dereferencing the value, but it's immediately throwing away the value once the read is complete. I suspect you're right in that it's simply a missing optimization in the JIT: in theory it could detect that the dereference is a no-op (as is done in the non-ref case) and elide it entirely.
To defeat this optimization and get closer to what you're trying to measure for real, you should do something with the result. Something like:
[Benchmark]
public double MyBenchmark()
{
    double sum = 0;
    foreach (var entry in new Enumerator(/* ... */))
    {
        sum += entry.DoubleValue;
    }
    return sum;
}
This will force the JIT to dereference at least the DoubleValue field for all entries. But the JIT is still free to detect that you're not touching any other part of the entries, so it could still attempt to keep the number of dereferences as minimal as possible. (This is generally a good thing!)
I think the best benchmark would be one that performs a realistic unit of work within each loop. At least that would have the effect of forcing the JIT to generate code that matches what your users will observe in the real world.
Hope this helps!
@cijothomas OK, updated the benchmarks using @GrabYourPitchforks' suggestions (thank you!).
Benchmarks
[DisassemblyDiagnoser(printSource: true)]
[MemoryDiagnoser]
public class RefEnumeratorBenchmark
{
    private readonly MetricPoint[] storage = new MetricPoint[100];

    [Benchmark]
    public double Foreach()
    {
        double sum = 0;
        foreach (MetricPoint mp in new Enumeration(this.storage))
        {
            sum += mp.DoubleValue;
        }

        return sum;
    }

    [Benchmark]
    public double ForeachWithRef()
    {
        double sum = 0;
        foreach (ref MetricPoint mp in new Enumeration(this.storage))
        {
            sum += mp.DoubleValue;
        }

        return sum;
    }

    [Benchmark]
    public double Unrolled()
    {
        double sum = 0;
        Enumeration.Enumerator e = new Enumeration(this.storage).GetEnumerator();
        while (e.MoveNext())
        {
            ref MetricPoint mp = ref e.Current;
            sum += mp.DoubleValue;
        }

        return sum;
    }

    public readonly struct Enumeration
    {
        private readonly MetricPoint[] storage;

        internal Enumeration(MetricPoint[] storage)
        {
            this.storage = storage;
        }

        public Enumerator GetEnumerator() => new Enumerator(this.storage);

        public struct Enumerator
        {
            private readonly MetricPoint[] storage;
            private int index;

            internal Enumerator(MetricPoint[] storage)
            {
                this.storage = storage;
                this.index = 0;
            }

            public ref MetricPoint Current => ref this.storage[this.index - 1];

            public bool MoveNext()
            {
                while (this.index++ < this.storage.Length)
                {
                    return true;
                }

                return false;
            }
        }
    }
}
Method | Mean | Error | StdDev | Median | Code Size | Allocated |
---|---|---|---|---|---|---|
Foreach | 1,329.33 ns | 28.373 ns | 81.862 ns | 1,304.20 ns | 193 B | - |
ForeachWithRef | 94.52 ns | 1.686 ns | 1.942 ns | 93.81 ns | 68 B | - |
Unrolled | 93.08 ns | 1.231 ns | 1.151 ns | 93.27 ns | 68 B | - |
This is showing more of what I was expecting to see.
I like this pattern with the ref MetricPoint Current enumerator, but we'll need to take care that exporters properly loop (e.g. foreach (ref MetricPoint metricPoint in metric.GetMetricPoints())), otherwise the loop will make a copy of each struct and be slow. I don't think that's a deal-breaker?
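As a rough sketch of the difference for exporter authors, reusing the Enumeration type from the benchmark above (Sum and SumWithCopies are illustrative helper names; in the SDK the equivalent enumerable would come from metric.GetMetricPoints()):

// Recommended shape: ref iteration variable, no per-point copy of the large MetricPoint struct.
public static double Sum(MetricPoint[] storage)
{
    double sum = 0;
    foreach (ref MetricPoint metricPoint in new Enumeration(storage))
    {
        sum += metricPoint.DoubleValue;
    }

    return sum;
}

// Still produces correct results, but copies each MetricPoint struct into the loop
// variable, which is the slow path the benchmarks above measure.
public static double SumWithCopies(MetricPoint[] storage)
{
    double sum = 0;
    foreach (MetricPoint metricPoint in new Enumeration(storage))
    {
        sum += metricPoint.DoubleValue;
    }

    return sum;
}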
Disassembled .NET 5 code
.NET 5.0.9 (5.0.921.35908), X64 RyuJIT
; Benchmarks.RefEnumeratorBenchmark.Foreach()
push rdi
push rsi
push rbp
push rbx
sub rsp,0C8
vzeroupper
vmovaps [rsp+0B0],xmm6
xor eax,eax
mov [rsp+28],rax
vxorps xmm4,xmm4,xmm4
vmovdqa xmmword ptr [rsp+30],xmm4
vmovdqa xmmword ptr [rsp+40],xmm4
mov rax,0FFFFFFFFFFA0
M00_L00:
vmovdqa xmmword ptr [rsp+rax+0B0],xmm4
vmovdqa xmmword ptr [rsp+rax+0C0],xmm4
vmovdqa xmmword ptr [rsp+rax+0D0],xmm4
add rax,30
jne short M00_L00
vxorps xmm6,xmm6,xmm6
mov rbx,[rcx+8]
xor ebp,ebp
jmp short M00_L02
M00_L01:
mov [rsp+24],ecx
lea eax,[rcx+0FFFF]
cmp eax,[rbx+8]
jae short M00_L03
movsxd rax,eax
imul rax,88
lea rsi,[rbx+rax+10]
lea rdi,[rsp+28]
mov ecx,11
rep movsq
vaddsd xmm6,xmm6,qword ptr [rsp+80]
mov ebp,[rsp+24]
M00_L02:
lea ecx,[rbp+1]
cmp [rbx+8],ebp
jg short M00_L01
vmovaps xmm0,xmm6
vmovaps xmm6,[rsp+0B0]
add rsp,0C8
pop rbx
pop rbp
pop rsi
pop rdi
ret
M00_L03:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 193
.NET 5.0.9 (5.0.921.35908), X64 RyuJIT
; Benchmarks.RefEnumeratorBenchmark.ForeachWithRef()
sub rsp,28
vzeroupper
vxorps xmm0,xmm0,xmm0
mov rax,[rcx+8]
xor edx,edx
jmp short M00_L01
M00_L00:
lea edx,[rcx+0FFFF]
cmp edx,[rax+8]
jae short M00_L02
movsxd rdx,edx
imul rdx,88
lea rdx,[rax+rdx+10]
vaddsd xmm0,xmm0,qword ptr [rdx+58]
mov edx,ecx
M00_L01:
lea ecx,[rdx+1]
cmp [rax+8],edx
jg short M00_L00
add rsp,28
ret
M00_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 68
.NET 5.0.9 (5.0.921.35908), X64 RyuJIT
; Benchmarks.RefEnumeratorBenchmark.Unrolled()
sub rsp,28
vzeroupper
vxorps xmm0,xmm0,xmm0
mov rax,[rcx+8]
xor edx,edx
jmp short M00_L01
M00_L00:
lea edx,[rcx+0FFFF]
cmp edx,[rax+8]
jae short M00_L02
movsxd rdx,edx
imul rdx,88
lea rdx,[rax+rdx+10]
vaddsd xmm0,xmm0,qword ptr [rdx+58]
mov edx,ecx
M00_L01:
lea ecx,[rdx+1]
cmp [rax+8],edx
jg short M00_L00
add rsp,28
ret
M00_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 68
Thanks for the insights here! I added a similar benchmark as part of the metrics benchmarks, demoing this as well.
@CodeBlanch I think asking exporter authors to use ref is fine.
- Only people writing exporters need to be aware of this. We can highlight it in a section similar to https://github.com/open-telemetry/opentelemetry-dotnet/tree/main/docs/trace/extending-the-sdk#exporter
- Even if exporter writers don't follow this guidance, they will still get correct output; only perf is affected.
Merging to unblock releases. Will address any comments via future PRs.
Alternate to https://github.com/open-telemetry/opentelemetry-dotnet/pull/2320/files
Changes
Please provide a brief description of the changes here.
For significant contributions please make sure you have completed the following items:
- CHANGELOG.md updated for non-trivial changes