Optimize scalar conversions with AVX512 #84384

Merged 41 commits on Jul 16, 2023
Changes from 37 commits
Commits
7d764be
fixing the JITDbl2Ulng helper function. The new AVX512 instruction vc…
khushal1996 May 9, 2023
f50408b
Making changes to the library test case expected output based on the …
khushal1996 May 10, 2023
f018095
Fixing the JITDbl2Ulng helper function. Also making sure that we are …
khushal1996 May 12, 2023
ffe97cd
reverting jitformat
khushal1996 May 12, 2023
a8ee861
Adding a truncate function to the Dbl2Ulng helper to make sure we avo…
khushal1996 May 15, 2023
bbd8a8b
Adding code to handle vectorized conversion for float/double to/from …
khushal1996 May 16, 2023
a21a077
reverting changes for float to ulong
khushal1996 May 16, 2023
1e3415a
enabling float to ulong conversion
khushal1996 May 16, 2023
c788c67
Making change to set w1 bit for evex
khushal1996 May 17, 2023
fbb2a90
merging with main. Picking up hwintrinsiclistxarh from main
khushal1996 May 18, 2023
9fece01
jit format
khushal1996 May 18, 2023
b40cd8e
Splitting vcvttss2usi to vcvttss2usi32 and vcvttss2usi64. Also adding…
khushal1996 May 18, 2023
710026e
undoing jitformat changes due to merge error
khushal1996 May 18, 2023
75e6acf
removing unused code and correcting throughput and latency informatio…
khushal1996 May 19, 2023
e15be4b
correcting throughput and latency for vcvttss2usi32 and placing it wi…
khushal1996 May 19, 2023
10e2876
formatting
khushal1996 May 19, 2023
9463173
formatting
khushal1996 May 19, 2023
4f7bb67
updating comments
khushal1996 May 22, 2023
a99725c
updating code for github comments. Using compIsaSupportedDebugOnly fo…
khushal1996 May 24, 2023
44390b2
reverting to original checks for ISA supported Debug only because the…
khushal1996 May 24, 2023
2f20ef3
running jitformat
khushal1996 May 24, 2023
b7dff8a
running jitformat
khushal1996 May 25, 2023
9622f78
combine the 2 nodes GT_CAST(GT_CAST(TYP_ULONG, TYP_DOUBLE), TYP_FLOAT…
khushal1996 Jun 17, 2023
d3b542f
merging with main and updating hwintrinsiclistxarch to take into cons…
khushal1996 Jun 18, 2023
8343e18
Changing noway_assert to assert to make sure compOpportunisticallyDep…
khushal1996 Jun 19, 2023
e456763
running jitformat
khushal1996 Jun 19, 2023
fdb28c6
Changing compOpportunisticallyDependsOn to compIsaSupportedDebugOnly …
khushal1996 Jun 20, 2023
e9ff179
Making code review changes. Moving around the comOpportunisticallyDep…
khushal1996 Jun 22, 2023
db2a0cb
FCALL_CONTRACT should be only used on FCalls itself
khushal1996 Jun 23, 2023
167b563
Making paralle changes to JITHelper in MathHelper for native AOT
khushal1996 Jun 23, 2023
b02a96c
resolving regression issues
khushal1996 Jun 23, 2023
fc0d127
Rolling back changes for double/float -> ulong
khushal1996 Jun 30, 2023
9b56b86
Rolling back changes for double/float -> ulong
khushal1996 Jun 30, 2023
930c473
Reverting ouf_or_range_fp_conversion to original version
khushal1996 Jun 30, 2023
b2ae110
Reverting ouf_or_range_fp_conversion to original version
khushal1996 Jun 30, 2023
0439e28
Reverting jithelpers.cpp to original versino
khushal1996 Jun 30, 2023
2166ae5
Reverting jithelpers.cpp to original version
khushal1996 Jun 30, 2023
e2a6029
Changind comments, reverting asserts, skipping to change node for cast
khushal1996 Jul 5, 2023
715fc7e
addressing review comments
khushal1996 Jul 14, 2023
dc6e41a
Update src/coreclr/jit/morph.cpp
tannergooding Jul 15, 2023
b1a31aa
Merge remote-tracking branch 'dotnet/main' into avx512-scalar-convert…
tannergooding Jul 16, 2023
16 changes: 14 additions & 2 deletions src/coreclr/jit/codegenxarch.cpp
@@ -7336,7 +7336,19 @@ void CodeGen::genIntToFloatCast(GenTree* treeNode)
// Also we don't expect to see uint32 -> float/double and uint64 -> float conversions
// here since they should have been lowered appropriately.
noway_assert(srcType != TYP_UINT);
-noway_assert((srcType != TYP_ULONG) || (dstType != TYP_FLOAT));
+assert((srcType != TYP_ULONG) || (dstType != TYP_FLOAT) ||
+compiler->compIsaSupportedDebugOnly(InstructionSet_AVX512F));
+
+if ((srcType == TYP_ULONG) && varTypeIsFloating(dstType) &&
+compiler->compOpportunisticallyDependsOn(InstructionSet_AVX512F))
+{
+assert(compiler->compIsaSupportedDebugOnly(InstructionSet_AVX512F));
+genConsumeOperands(treeNode->AsOp());
+instruction ins = ins_FloatConv(dstType, srcType, emitTypeSize(srcType));
+GetEmitter()->emitInsBinary(ins, emitTypeSize(srcType), treeNode, op1);
+genProduceReg(treeNode);
+return;
+}

// To convert int to a float/double, cvtsi2ss/sd SSE2 instruction is used
// which does a partial write to lower 4/8 bytes of xmm register keeping the other
@@ -7450,7 +7462,7 @@ void CodeGen::genFloatToIntCast(GenTree* treeNode)

// We shouldn't be seeing uint64 here as it should have been converted
// into a helper call by either front-end or lowering phase.
-noway_assert(!varTypeIsUnsigned(dstType) || (dstSize != EA_ATTR(genTypeSize(TYP_LONG))));
+assert(!varTypeIsUnsigned(dstType) || (dstSize != EA_ATTR(genTypeSize(TYP_LONG))));

// If the dstType is TYP_UINT, we have 32-bits to encode the
// float number. Any of 33rd or above bits can be the sign bit.
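
For context on the `genIntToFloatCast` hunk above: without an unsigned conversion instruction such as `vcvtusi2sd`, a `ulong` source generally has to be routed through a signed 64-bit conversion. The following is a minimal C++ sketch of one well-known way to do that. It is not the JIT's actual instruction sequence, and `UInt64ToDoubleViaSigned` is a hypothetical name introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the PR): converts an unsigned 64-bit value to
// double using only signed 64-bit -> double conversions, which is the kind of
// work a single unsigned conversion instruction makes unnecessary.
double UInt64ToDoubleViaSigned(std::uint64_t value)
{
    if ((value >> 63) == 0)
    {
        // The top bit is clear, so the value fits in a signed 64-bit integer.
        return static_cast<double>(static_cast<std::int64_t>(value));
    }

    // The top bit is set: halve the value, folding the dropped low bit back in
    // as a "sticky" bit so that the final rounding is unaffected.
    std::uint64_t halved = (value >> 1) | (value & 1);
    assert((halved >> 63) == 0);

    double half = static_cast<double>(static_cast<std::int64_t>(halved));

    // Doubling is exact for these magnitudes (it only bumps the exponent).
    return half + half;
}
```

The branch and the extra arithmetic are the overhead that a direct unsigned conversion avoids.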
29 changes: 22 additions & 7 deletions src/coreclr/jit/emitxarch.cpp
@@ -1399,7 +1399,6 @@ bool emitter::TakesRexWPrefix(const instrDesc* id) const
case INS_vcvtsd2usi:
case INS_vcvtss2usi:
case INS_vcvttsd2usi:
-case INS_vcvttss2usi:
{
if (attr == EA_8BYTE)
{
@@ -2623,7 +2622,8 @@ bool emitter::emitInsCanOnlyWriteSSE2OrAVXReg(instrDesc* id)
case INS_vcvtsd2usi:
case INS_vcvtss2usi:
case INS_vcvttsd2usi:
-case INS_vcvttss2usi:
+case INS_vcvttss2usi32:
+case INS_vcvttss2usi64:
{
// These SSE instructions write to a general purpose integer register.
return false;
@@ -11435,12 +11435,18 @@ void emitter::emitDispIns(
case INS_vcvtsd2usi:
case INS_vcvtss2usi:
case INS_vcvttsd2usi:
-case INS_vcvttss2usi:
{
printf(" %s, %s", emitRegName(id->idReg1(), attr), emitRegName(id->idReg2(), EA_16BYTE));
break;
}

+case INS_vcvttss2usi32:
+case INS_vcvttss2usi64:
+{
+printf(" %s, %s", emitRegName(id->idReg1(), attr), emitRegName(id->idReg2(), EA_4BYTE));
+break;
+}
+
#ifdef TARGET_AMD64
case INS_movsxd:
{
@@ -18595,23 +18601,32 @@ emitter::insExecutionCharacteristics emitter::getInsExecutionCharacteristics(ins
case INS_cvtsi2sd64:
case INS_cvtsi2ss64:
case INS_vcvtsd2usi:
-case INS_vcvttsd2usi:
-case INS_vcvtusi2sd32:
-case INS_vcvtusi2sd64:
case INS_vcvtusi2ss32:
case INS_vcvtusi2ss64:
+case INS_vcvttsd2usi:
+case INS_vcvttss2usi32:
result.insThroughput = PERFSCORE_THROUGHPUT_1C;
result.insLatency += PERFSCORE_LATENCY_7C;
break;

+case INS_vcvtusi2sd64:
+case INS_vcvtusi2sd32:
+result.insThroughput = PERFSCORE_THROUGHPUT_1C;
+result.insLatency += PERFSCORE_LATENCY_5C;
+break;
+
case INS_cvttss2si:
case INS_cvtss2si:
case INS_vcvtss2usi:
-case INS_vcvttss2usi:
result.insThroughput = PERFSCORE_THROUGHPUT_1C;
result.insLatency += opSize == EA_8BYTE ? PERFSCORE_LATENCY_8C : PERFSCORE_LATENCY_7C;
break;

+case INS_vcvttss2usi64:
+result.insThroughput = PERFSCORE_THROUGHPUT_1C;
+result.insLatency += PERFSCORE_LATENCY_8C;
+break;
+
case INS_cvtss2sd:
result.insThroughput = PERFSCORE_THROUGHPUT_1C;
result.insLatency += PERFSCORE_LATENCY_5C;
4 changes: 2 additions & 2 deletions src/coreclr/jit/hwintrinsiclistxarch.h
@@ -845,7 +845,7 @@ HARDWARE_INTRINSIC(AVX512F, CompareNotEqual,
HARDWARE_INTRINSIC(AVX512F, ConvertScalarToVector128Double, 16, 2, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvtusi2sd32, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromSecondArg|HW_Flag_CopyUpperBits)
HARDWARE_INTRINSIC(AVX512F, ConvertScalarToVector128Single, 16, 2, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvtusi2ss32, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromSecondArg|HW_Flag_CopyUpperBits)
HARDWARE_INTRINSIC(AVX512F, ConvertToUInt32, 16, 1, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvtss2usi, INS_vcvtsd2usi}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen)
-HARDWARE_INTRINSIC(AVX512F, ConvertToUInt32WithTruncation, 16, 1, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvttss2usi, INS_vcvttsd2usi}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen)
+HARDWARE_INTRINSIC(AVX512F, ConvertToUInt32WithTruncation, 16, 1, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvttss2usi32, INS_vcvttsd2usi}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen)
HARDWARE_INTRINSIC(AVX512F, ConvertToVector128Byte, 64, 1, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vpmovdb, INS_vpmovdb, INS_vpmovqb, INS_vpmovqb, INS_invalid, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen)
HARDWARE_INTRINSIC(AVX512F, ConvertToVector128ByteWithSaturation, 64, 1, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vpmovusdb, INS_invalid, INS_vpmovusqb, INS_invalid, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen)
HARDWARE_INTRINSIC(AVX512F, ConvertToVector128Int16, 64, 1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vpmovqw, INS_vpmovqw, INS_invalid, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen)
@@ -1002,7 +1002,7 @@ HARDWARE_INTRINSIC(AVX512F_VL, TernaryLogic,
HARDWARE_INTRINSIC(AVX512F_X64, ConvertScalarToVector128Double, 16, 2, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvtusi2sd64, INS_invalid, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromSecondArg|HW_Flag_CopyUpperBits|HW_Flag_SpecialCodeGen)
HARDWARE_INTRINSIC(AVX512F_X64, ConvertScalarToVector128Single, 16, 2, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvtusi2ss64, INS_invalid, INS_invalid}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromSecondArg|HW_Flag_CopyUpperBits|HW_Flag_SpecialCodeGen)
HARDWARE_INTRINSIC(AVX512F_X64, ConvertToUInt64, 16, 1, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvtss2usi, INS_vcvtsd2usi}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen)
-HARDWARE_INTRINSIC(AVX512F_X64, ConvertToUInt64WithTruncation, 16, 1, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvttss2usi, INS_vcvttsd2usi}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen)
+HARDWARE_INTRINSIC(AVX512F_X64, ConvertToUInt64WithTruncation, 16, 1, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_vcvttss2usi64, INS_vcvttsd2usi}, HW_Category_SIMDScalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen)

// ***************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
// ISA Function name SIMD size NumArg EncodesExtraTypeArg Instructions Category Flags
12 changes: 12 additions & 0 deletions src/coreclr/jit/importer.cpp
@@ -7883,6 +7883,18 @@ void Compiler::impImportBlockCode(BasicBlock* block)
|| (impStackTop().val->TypeGet() == TYP_BYREF)
#endif
;
+#ifdef TARGET_AMD64
+// If AVX512 is present and we are not checking for overflow, we do not need
+// a large node. In this case, we will not fallback to a helper function but
+// will use the intrinsic instead. This is done for all long/ulong to floating
+// point conversions. Hence setting the callNode to false to
+// avoid generating a large node.
+if (callNode && !ovfl && varTypeIsLong(impStackTop().val) &&
+compOpportunisticallyDependsOn(InstructionSet_AVX512F))
+{
+callNode = false;
+}
+#endif // TARGET_AMD64
}
else
{
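
The condition added to `impImportBlockCode` above can be read as a small predicate over four facts. The sketch below restates it in standalone C++ under that reading. `StillNeedsHelperCall` and its parameter names are hypothetical stand-ins for `callNode`, `ovfl`, `varTypeIsLong(...)` and the `InstructionSet_AVX512F` check; it is not JIT code.

```cpp
// Hypothetical restatement of the importer check: given the decision made so
// far (callNode), report whether the cast still needs the helper-call shape.
bool StillNeedsHelperCall(bool callNode, bool checksOverflow, bool sourceIsLong, bool hasAvx512F)
{
    // With AVX512F, a non-overflow-checked long/ulong source can be converted
    // with an instruction, so the helper call is dropped.
    if (callNode && !checksOverflow && sourceIsLong && hasAvx512F)
    {
        return false;
    }

    // In every other case the earlier decision is left untouched.
    return callNode;
}
```

Note that overflow-checked casts keep their original treatment even when AVX512F is available.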
13 changes: 13 additions & 0 deletions src/coreclr/jit/instr.cpp
@@ -2281,6 +2281,8 @@ instruction CodeGen::ins_MathOp(genTreeOps oper, var_types type)
instruction CodeGen::ins_FloatConv(var_types to, var_types from, emitAttr attr)
{
// AVX: For now we support only conversion from Int/Long -> float
+// AVX512: Supports following conversions
+// srcType = ulong castToType = double/float

switch (from)
{
@@ -2350,6 +2352,17 @@ instruction CodeGen::ins_FloatConv(var_types to, var_types from, emitAttr attr)
}
break;

+case TYP_ULONG:
+switch (to)
+{
+case TYP_DOUBLE:
+return INS_vcvtusi2sd64;
+case TYP_FLOAT:
+return INS_vcvtusi2ss64;
+default:
+unreached();
+}
+
default:
unreached();
}
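
The new `TYP_ULONG` case in `ins_FloatConv` is a two-level lookup: source type first, then destination type. A self-contained sketch of that selection follows. The `VarType` and `Instruction` enums and the function name are hypothetical and only mirror the identifiers in the hunk above; the real `var_types` and `instruction` enums are much larger.

```cpp
// Hypothetical, reduced stand-ins for the JIT's var_types and instruction enums.
enum class VarType { Int, Long, ULong, Float, Double };
enum class Instruction { Invalid, vcvtusi2sd64, vcvtusi2ss64 };

// Picks the 64-bit unsigned-integer-to-floating-point instruction for a
// ulong source, or Invalid when the pair is not covered by this sketch.
Instruction SelectUnsignedLongConversion(VarType from, VarType to)
{
    if (from != VarType::ULong)
    {
        return Instruction::Invalid;
    }

    switch (to)
    {
        case VarType::Double:
            return Instruction::vcvtusi2sd64;
        case VarType::Float:
            return Instruction::vcvtusi2ss64;
        default:
            return Instruction::Invalid;
    }
}
```

The sketch returns `Invalid` where the JIT code calls `unreached()`, since a standalone example has no equivalent of that macro.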
3 changes: 2 additions & 1 deletion src/coreclr/jit/instrsxarch.h
@@ -637,7 +637,8 @@ INST3(vcvtss2usi, "cvtss2usi", IUM_WR, BAD_CODE, BAD_
INST3(vcvttpd2udq, "cvttpd2udq", IUM_WR, BAD_CODE, BAD_CODE, PCKFLT(0x78), INS_TT_FULL, Input_64Bit | REX_W1 | Encoding_EVEX) // cvt w/ truncation packed doubles to unsigned DWORDs
INST3(vcvttps2udq, "cvttps2udq", IUM_WR, BAD_CODE, BAD_CODE, PCKFLT(0x78), INS_TT_FULL, Input_32Bit | REX_W0 | Encoding_EVEX) // cvt w/ truncation packed singles to unsigned DWORDs
INST3(vcvttsd2usi, "cvttsd2usi", IUM_WR, BAD_CODE, BAD_CODE, SSEDBL(0x78), INS_TT_TUPLE1_FIXED, Input_64Bit | REX_WX | Encoding_EVEX) // cvt w/ truncation scalar double to unsigned DWORD/QWORD
-INST3(vcvttss2usi, "cvttss2usi", IUM_WR, BAD_CODE, BAD_CODE, SSEFLT(0x78), INS_TT_TUPLE1_FIXED, Input_32Bit | REX_WX | Encoding_EVEX) // cvt w/ truncation scalar single to unsigned DWORD/QWORD
+INST3(vcvttss2usi32, "cvttss2usi", IUM_WR, BAD_CODE, BAD_CODE, SSEFLT(0x78), INS_TT_TUPLE1_FIXED, Input_32Bit | REX_W0 | Encoding_EVEX) // cvt w/ truncation scalar single to unsigned DWORD/QWORD
+INST3(vcvttss2usi64, "cvttss2usi", IUM_WR, BAD_CODE, BAD_CODE, SSEFLT(0x78), INS_TT_TUPLE1_FIXED, Input_32Bit | REX_W1 | Encoding_EVEX) // cvt w/ truncation scalar single to unsigned DWORD/QWORD
INST3(vcvtudq2pd, "cvtudq2pd", IUM_WR, BAD_CODE, BAD_CODE, SSEFLT(0x7A), INS_TT_HALF, Input_32Bit | REX_W0 | Encoding_EVEX) // cvt packed unsigned DWORDs to doubles
INST3(vcvtudq2ps, "cvtudq2ps", IUM_WR, BAD_CODE, BAD_CODE, SSEDBL(0x7A), INS_TT_FULL, Input_32Bit | REX_W0 | Encoding_EVEX) // cvt packed unsigned DWORDs to singles
INST3(vcvtusi2sd32, "cvtusi2sd", IUM_WR, BAD_CODE, BAD_CODE, SSEDBL(0x7B), INS_TT_TUPLE1_SCALAR, Input_32Bit | REX_W0 | Encoding_EVEX | INS_Flags_IsDstDstSrcAVXInstruction) // cvt scalar unsigned DWORD to double
6 changes: 3 additions & 3 deletions src/coreclr/jit/lowerxarch.cpp
@@ -802,16 +802,16 @@ void Lowering::LowerCast(GenTree* tree)
// Reason: ulong -> float = ulong -> double -> float
if (varTypeIsFloating(srcType))
{
-noway_assert(!tree->gtOverflow());
-noway_assert(castToType != TYP_ULONG);
+assert(!tree->gtOverflow());
+assert(castToType != TYP_ULONG);
}
else if (srcType == TYP_UINT)
{
noway_assert(!varTypeIsFloating(castToType));
}
else if (srcType == TYP_ULONG)
{
-noway_assert(castToType != TYP_FLOAT);
+assert(castToType != TYP_FLOAT || comp->compIsaSupportedDebugOnly(InstructionSet_AVX512F));
}

// Case of src is a small type and dst is a floating point type.
35 changes: 34 additions & 1 deletion src/coreclr/jit/morph.cpp
@@ -293,6 +293,39 @@ GenTree* Compiler::fgMorphExpandCast(GenTreeCast* tree)
var_types dstType = tree->CastToType();
unsigned dstSize = genTypeSize(dstType);

+#if defined(TARGET_AMD64)
+// If AVX512 is present, we have intrinsic available to convert
+// ulong directly to float. Hence, we need to combine the 2 nodes
+// GT_CAST(GT_CAST(TYP_ULONG, TYP_DOUBLE), TYP_FLOAT) into a single
+// node i.e. GT_CAST(TYP_ULONG, TYP_FLOAT). At this point, we already
+// have the 2 GT_CAST nodes in the tree and we are combining them below.
+if (oper->OperIs(GT_CAST))
+{
+GenTreeCast* innerCast = static_cast<GenTreeCast*>(oper);
+
+if (innerCast->IsUnsigned())
+{
+GenTree* innerOper = innerCast->CastOp();
+var_types innerSrcType = genActualType(innerOper);
+var_types innerDstType = innerCast->CastToType();
+unsigned innerDstSize = genTypeSize(innerDstType);
+innerSrcType = varTypeToUnsigned(innerSrcType);
+
+// Check if we are going from ulong->double->float
+if (innerSrcType == TYP_ULONG && innerDstType == TYP_DOUBLE && dstType == TYP_FLOAT)
+{
+if (compOpportunisticallyDependsOn(InstructionSet_AVX512F))
+{
+// One optimized (combined) cast here
+tree = gtNewCastNode(TYP_ULONG, innerOper, true, TYP_FLOAT);
+tree->gtType = TYP_FLOAT;
+return fgMorphTree(tree);
+}
+}
+}
+}
+#endif // TARGET_AMD64
// See if the cast has to be done in two steps. R -> I
if (varTypeIsFloating(srcType) && varTypeIsIntegral(dstType))
{
@@ -449,7 +482,7 @@ GenTree* Compiler::fgMorphExpandCast(GenTreeCast* tree)
{
srcType = varTypeToUnsigned(srcType);

-if (srcType == TYP_ULONG)
+if (srcType == TYP_ULONG && !compOpportunisticallyDependsOn(InstructionSet_AVX512F))
{
if (dstType == TYP_FLOAT)
{
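
The `fgMorphExpandCast` hunks above fold `ulong -> double -> float` into a single `ulong -> float` cast when AVX512F is available. One general property to keep in mind when comparing the two shapes is that rounding twice can differ from rounding once. The sketch below is plain C++, not JIT code, and all names in it are hypothetical. It assumes IEEE 754 `float` and `double`, round-to-nearest-even, and a compiler whose integer-to-floating-point conversions are correctly rounded.

```cpp
#include <cstdint>

// Hypothetical helper: mirrors the two-node shape, ulong -> double -> float.
float ConvertInTwoSteps(std::uint64_t value)
{
    double wide = static_cast<double>(value); // first rounding (53-bit significand)
    return static_cast<float>(wide);          // second rounding (24-bit significand)
}

// Hypothetical helper: mirrors the combined shape, ulong -> float.
float ConvertInOneStep(std::uint64_t value)
{
    return static_cast<float>(value); // single rounding (24-bit significand)
}

// 2^63 + 2^39 + 1: just above the midpoint between two adjacent floats.
// Rounding to double first discards the trailing 1 and lands exactly on the
// midpoint, so the second rounding resolves the tie toward the even neighbor.
constexpr std::uint64_t kDoubleRoundingInput = (1ULL << 63) + (1ULL << 39) + 1;
```

Under the stated assumptions, `ConvertInTwoSteps(kDoubleRoundingInput)` yields 2^63 while `ConvertInOneStep(kDoubleRoundingInput)` yields 2^63 + 2^40, so the two shapes are not always bit-identical.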
Expand Down