System.Collections.Sort<BigStruct>.LinqQuery has regressed on all configs except Windows 64 bit #66776
Comments
Tagging subscribers to this area: @dotnet/area-system-collections
Issue Details
This regression seems to be specific to all configs except Windows 64 bit.
Repro:
git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f net6.0 net7.0 --filter 'System.Collections.Sort<BigStruct>.LinqQuery'
The diff points to #55604 (cc @alexcovington) and #59287 (cc @AndyAyersMS)
|
#59287 is locked so doesn't get cross linked. That seems unfortunate. That change should have purely impacted jit diagnostics, so it's unlikely to have caused regressions. |
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
|
Digging through it, it looks like we expected this to be resolved; see dotnet/perf-autofiling-issues#1501 (comment). But that only fixed the issue on Windows; Ubuntu did not benefit. So we still have a regression. (Windows is slightly worse off too.) |
Looks like this is still unassigned. I'll take it for now. |
Can reproduce running locally (via wsl2)
|
@adamsitnik is it expected that with |
From the above I can get a crude profile of sorts. But not sure it is helping me spot which method(s) have regressed. |
In case of EventPipe we just get different CPU samples (events emitted by the .NET Runtime, not the OS). In PerfView you need to open the "Thread Time" view (not "CPU Stacks" like usual): Or you can take the
and open it with speedscope |
Still didn't find that very helpful. But here's perf (via WSL2) on the two:
If this is credible then the issue is in this bit of code.
;; 6.0
; Assembly listing for method GenericComparer`1:Compare(BigStruct,BigStruct):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; Tier-1 compilation
; optimized code
; rbp based frame
; partially interruptible
; No PGO data
; 1 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def
; V01 arg1 [V01,T03] ( 2, 1.36) struct (32) [rbp+10H] do-not-enreg[SF] ld-addr-op single-def
; V02 arg2 [V02,T04] ( 1, 1 ) struct (32) [rbp+30H] do-not-enreg[SB] single-def
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V04 tmp1 [V04,T01] ( 2, 4 ) struct (32) [rbp-20H] do-not-enreg[SFB] "Inlining Arg"
; V05 tmp2 [V05,T02] ( 4, 1.50) int -> rax "Inline return value spill temp"
; V06 tmp3 [V06,T00] ( 3, 4.71) int -> rax "Inlining Arg"
;
; Lcl frame size = 32
G_M25642_IG01: ;; offset=0000H
55 push rbp
4883EC20 sub rsp, 32
C5F877 vzeroupper
488D6C2420 lea rbp, [rsp+20H]
;; bbWeight=1 PerfScore 2.75
G_M25642_IG02: ;; offset=000DH
C5FA6F4530 vmovdqu xmm0, xmmword ptr [rbp+30H]
C5FA7F45E0 vmovdqu xmmword ptr [rbp-20H], xmm0
C5FA6F4540 vmovdqu xmm0, xmmword ptr [rbp+40H]
C5FA7F45F0 vmovdqu xmmword ptr [rbp-10H], xmm0
8B45EC mov eax, dword ptr [rbp-14H]
39451C cmp dword ptr [rbp+1CH], eax
7C14 jl SHORT G_M25642_IG07
;; bbWeight=1 PerfScore 7.00
G_M25642_IG03: ;; offset=0029H
39451C cmp dword ptr [rbp+1CH], eax
7F08 jg SHORT G_M25642_IG06
;; bbWeight=0.36 PerfScore 0.71
G_M25642_IG04: ;; offset=002EH
33C0 xor eax, eax
;; bbWeight=0.26 PerfScore 0.06
G_M25642_IG05: ;; offset=0030H
4883C420 add rsp, 32
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 1.75
G_M25642_IG06: ;; offset=0036H
B801000000 mov eax, 1
EBF3 jmp SHORT G_M25642_IG05
;; bbWeight=0.10 PerfScore 0.22
G_M25642_IG07: ;; offset=003DH
B8FFFFFFFF mov eax, -1
EBEC jmp SHORT G_M25642_IG05
;; bbWeight=0.14 PerfScore 0.32
versus
;; 7.0
; Assembly listing for method GenericComparer`1:Compare(BigStruct,BigStruct):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; Tier-1 compilation
; optimized code
; rbp based frame
; partially interruptible
; No PGO data
; 1 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd single-def
; V01 arg1 [V01,T03] ( 2, 1.35) struct (32) [rbp+10H] do-not-enreg[SF] ld-addr-op single-def
; V02 arg2 [V02,T04] ( 1, 1 ) struct (32) [rbp+30H] do-not-enreg[S] single-def
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V04 tmp1 [V04,T01] ( 2, 4 ) struct (32) [rbp-20H] do-not-enreg[SF] "Inlining Arg"
; V05 tmp2 [V05,T02] ( 4, 1.50) int -> rax "Inline return value spill temp"
; V06 tmp3 [V06,T00] ( 3, 4.70) int -> rax "Inlining Arg"
;
; Lcl frame size = 32
G_M25642_IG01: ;; offset=0000H
55 push rbp
4883EC20 sub rsp, 32
C5F877 vzeroupper
488D6C2420 lea rbp, [rsp+20H]
;; size=13 bbWeight=1 PerfScore 2.75
G_M25642_IG02: ;; offset=000DH
C5FE6F4530 vmovdqu ymm0, ymmword ptr[rbp+30H]
C5FE7F45E0 vmovdqu ymmword ptr[rbp-20H], ymm0
8B45EC mov eax, dword ptr [rbp-14H]
39451C cmp dword ptr [rbp+1CH], eax
7C17 jl SHORT G_M25642_IG07
;; size=18 bbWeight=1 PerfScore 9.00
G_M25642_IG03: ;; offset=001FH
39451C cmp dword ptr [rbp+1CH], eax
7F0B jg SHORT G_M25642_IG06
;; size=5 bbWeight=0.35 PerfScore 1.06
G_M25642_IG04: ;; offset=0024H
33C0 xor eax, eax
;; size=2 bbWeight=0.25 PerfScore 0.06
G_M25642_IG05: ;; offset=0026H
C5F877 vzeroupper
4883C420 add rsp, 32
5D pop rbp
C3 ret
;; size=9 bbWeight=1 PerfScore 2.75
G_M25642_IG06: ;; offset=002FH
B801000000 mov eax, 1
EBF0 jmp SHORT G_M25642_IG05
;; size=7 bbWeight=0.10 PerfScore 0.22
G_M25642_IG07: ;; offset=0036H
B8FFFFFFFF mov eax, -1
EBE9 jmp SHORT G_M25642_IG05
;; size=7 bbWeight=0.15 PerfScore 0.33 |
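To make the difference between the two listings easier to see, here is a rough sketch that totals the per-block PerfScore figures copied from the listings above. It assumes the printed per-block values are already weighted by bbWeight (the numbers suggest this, but it is an assumption), and PerfScore is only the JIT's static estimate, not a measurement. The names `PERF_SCORES`, `total_perf_score`, `totals` and `delta_by_block` are made up for this illustration.

```python
# Per-block PerfScore values copied from the listings above (IG01..IG07).
# Assumption: each printed value is already multiplied by its bbWeight.
PERF_SCORES = {
    "6.0": [2.75, 7.00, 0.71, 0.06, 1.75, 0.22, 0.32],
    "7.0": [2.75, 9.00, 1.06, 0.06, 2.75, 0.22, 0.33],
}


def total_perf_score(blocks):
    """Sum the per-block estimates, rounded to the precision of the listing."""
    return round(sum(blocks), 2)


# Method-level totals for each runtime version.
totals = {version: total_perf_score(blocks) for version, blocks in PERF_SCORES.items()}

# Per-block difference (7.0 minus 6.0), in IG01..IG07 order.
delta_by_block = [
    round(b7 - b6, 2) for b6, b7 in zip(PERF_SCORES["6.0"], PERF_SCORES["7.0"])
]

print(totals)
print(delta_by_block)
```

Most of the estimated difference sits in IG02 (the struct copy, now done through YMM) and IG05 (the epilog, which gains a second vzeroupper in 7.0).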
Note that with AVX/AVX2 disabled, 6.0 and 7.0 match in perf (and both match 6.0 with AVX enabled).
BenchmarkDotNet=v0.13.1.1823-nightly, OS=ubuntu 20.04
EnvironmentVariables=COMPlus_EnableAVX2=0,COMPlus_EnableAVX=0 PowerPlanMode=00000000-0000-0000-0000-000000000000 InvocationCount=5000
Going to modify the jit so I can do this per-method and see if just disabling AVX for the comparer explains the perf loss. |
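For anyone who wants to rerun the repro with AVX off, a minimal sketch of setting the two environment variables quoted above for a child process follows. This is process-wide, not the per-method switch described in the comment, and it assumes the benchmark harness's child processes inherit the environment. `avx_disabled_env` and `REPRO_CMD` are hypothetical names used only here.

```python
import os


def avx_disabled_env(base=None):
    """Return a copy of the environment with the two AVX knobs set to 0.

    The variable names are the ones shown in the BenchmarkDotNet header above.
    """
    env = dict(os.environ if base is None else base)
    env["COMPlus_EnableAVX"] = "0"
    env["COMPlus_EnableAVX2"] = "0"
    return env


# The repro command from the issue description, split into an argv list.
REPRO_CMD = [
    "python3",
    "./performance/scripts/benchmarks_ci.py",
    "-f", "net6.0", "net7.0",
    "--filter", "System.Collections.Sort<BigStruct>.LinqQuery",
]

# To actually launch it (not done here):
#   import subprocess
#   subprocess.run(REPRO_CMD, env=avx_disabled_env(), check=True)
```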
Looks like the regression comes from the use of YMM registers in the two hottest methods above
In both cases there is a YMM store closely followed by a narrower load:
;; Compare
C5FE7F45E0 vmovdqu ymmword ptr[rbp-20H], ymm0
8B45EC mov eax, dword ptr [rbp-14H]
;; CompareAnyKeys
C5FE7F45C8 vmovdqu ymmword ptr[rbp-38H], ymm0
C5FA6F45C8 vmovdqu xmm0, qword ptr [rbp-38H] |
On Windows, there is similar codegen in
;; (windows) Compare
C5FE7F442408 vmovdqu ymmword ptr[rsp+08H], ymm0
8B442414 mov eax, dword ptr [rsp+14H]
Despite this, perf on Windows generally seems better (around 53us). Note the store above is misaligned (as is the store in linux's Also note that in
|
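The geometry of these store/load pairs can be checked with some simple arithmetic. The sketch below takes the displacements from the snippets above and reports where each narrow load falls inside the preceding 32-byte store, plus whether a store would straddle a 64-byte cache line for a given base address. The base addresses are hypothetical, the CompareAnyKeys load is assumed to be 16 bytes (XMM width), and whether the hardware can forward the store to the load is microarchitecture-dependent, so this says nothing about that directly. All names (`Access`, `PAIRS`, and the helpers) are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    offset: int  # signed displacement from the base register (rbp or rsp)
    size: int    # access width in bytes

    @property
    def end(self):
        return self.offset + self.size


def contained(load, store):
    """True if every byte read by the load was written by the store."""
    return store.offset <= load.offset and load.end <= store.end


def offset_within(load, store):
    """Byte offset of the load's first byte inside the store."""
    return load.offset - store.offset


def crosses(base_addr, access, boundary=64):
    """True if the access straddles a `boundary`-byte line for this base."""
    start = base_addr + access.offset
    return start // boundary != (start + access.size - 1) // boundary


# (store, load) pairs with displacements taken from the snippets above.
PAIRS = {
    "Compare (Linux)": (Access(-0x20, 32), Access(-0x14, 4)),
    # Assumption: the load is treated as a full 16-byte XMM load.
    "CompareAnyKeys (Linux)": (Access(-0x38, 32), Access(-0x38, 16)),
    "Compare (Windows)": (Access(0x08, 32), Access(0x14, 4)),
}

for name, (store, load) in PAIRS.items():
    print(name, contained(load, store), offset_within(load, store))
```

In all three pairs the load lies fully inside the store, so the interesting variables are the load's offset within the store and the store's alignment, which depends on the actual frame address at run time.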
Verified this is mitigated with the preliminary changes from #73719. This is beyond the scope of what we can fix for .net7, so I think we're going to have to live with this regression.
|
This should be fixed by #74384. |
[Figure: Ubuntu historical results]
[Figure: Windows historical results]