-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove more unnecessary scenarios from HWIntrinsic test templates and fix timeout/failure #85026
Conversation
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics Issue DetailsFollow up to #85008, this removes the unnecessary load scenarios from the rest of the hwintrinsic test templates. It also removes the ClsVar scenarios as they weren't actually testing CLS_VAR scenarios. That would require passing in something recognizable as a GT_CNS_VEC or similar, which would require a significantly more complex template or static readonly variables + tiering to kick in. It also removes the ClassLclFld scenario as there is no real difference between Finally it divides the original
|
CC. @BruceForstall, @dotnet/avx512-contrib, @dotnet/jit-contrib Also CC. @markples, @trylek for the striping change. We're still seeing timeouts and long run times for the Avx512 stress job, regardless of striping count. It's possible that is related to #84967 and the instruction that is being incorrectly encoded; however it seems unique to Windows runs at the moment (possibly an ABI difference causing it...). |
084e344
to
3a36b4a
Compare
Figured out the timeout issue. Turns out crossgen was only handling the This meant that the |
@@ -801,7 +801,7 @@ HARDWARE_INTRINSIC(AVX512F, ConvertToVector128UInt16, | |||
HARDWARE_INTRINSIC(AVX512F, ConvertToVector128UInt32, -1, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vpmovqd, INS_vpmovqd, INS_invalid, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen) | |||
HARDWARE_INTRINSIC(AVX512F, ConvertToVector256Int16, 64, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vpmovdw, INS_vpmovdw, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen) | |||
HARDWARE_INTRINSIC(AVX512F, ConvertToVector256Int32, 64, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vpmovqd, INS_vpmovqd, INS_invalid, INS_cvtpd2dq}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen) | |||
HARDWARE_INTRINSIC(AVX512F, ConvertToVector128Int32WithTruncation, 64, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cvttpd2dq}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we don't need ConvertToVector128Int32WithTruncation
at all for AVX512F?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no such API for Avx512F
You have Sse2.ConvertToVector128Int32WithTruncation
which handles Vector128<float>
to Vector128<int>
(4x32 to 4x32) and Avx.ConvertToVector128Int32WithTruncation
which handles Vector256<double>
to Vector128<int>
(4x64 to 4x32)
There is then no need for any Avx512F API since we don't have any 4x128
to 4x32
like scenario. This was jsut a typo that should've been Vector256Int32
since it handles Vector512<double>
to Vector256<int>
(8x64 to 8x32)
Interestingly we didn't find out in the initial PR CI? |
It was dependent on a couple few factors, such as whether or not the R2R assembly is used. It was unclear why Linux nor Windows x86 failed and why it was only showing up on Windows x64 in CI. |
@@ -1786,12 +1786,22 @@ void CodeGen::genAvxFamilyIntrinsic(GenTreeHWIntrinsic* node) | |||
break; | |||
} | |||
|
|||
case NI_AVX512F_ConvertToVector256Int32: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was this moved up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with the other places that specially handle this one path and to avoid the cost of the varTypeIsFloating
check for the many intrinsics where it can't be true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Merging after talking with @BruceForstall. The only jobs that timed out are the arm32 replay jobs, which won't be impacted by this PR as we're only touching xarch specific files/paths. The reply job did pass earlier today before the minor test fix went in. |
Follow up to #85008, this removes the unnecessary load scenarios from the rest of the hwintrinsic test templates.
It also removes the ClsVar scenarios as they weren't actually testing CLS_VAR scenarios. That would require passing in something recognizable as a GT_CNS_VEC or similar, which would require a significantly more complex template or static readonly variables + tiering to kick in.
It also removes the ClassLclFld scenario as there is no real difference between
local._fld
andthis._fld
, the latter being tested byClassFld
already.Finally it divides the original
20
striping used in theHardwareIntrinsics_r/ro
projects into the newHardwareIntrinsics_*_r/ro
projects based on the original test count vs the new test count:8
stripes8
stripes2
stripes2
stripes2
stripesThis also resolves a timeout issue that was cropping up and in turn fixes a downstream failure that was being hidden. It is unclear why this wasn't surfacing on the Linux or Windows x86 legs.