Intel hardware intrinsic API change #27584

fiigii · 2018-10-09T18:39:36Z

According to the second hw intrinsic API review meeting (https://github.com/dotnet/apireviews/tree/master/2018/Hardware-Intrinsics-Intel), we need to change some intrinsic APIs and their implementations:

[Leave for the next version] Design SSE4.2 STTNI intrinsic
- issue https://github.com/dotnet/corefx/issues/30373
- PR [No Merge]Adding SSE4.2 STTNI intrinsic APIs coreclr#19958
[Done] 64-bit only intrinsics
- issue https://github.com/dotnet/coreclr/issues/18744
- PR Expose 64-bit only hardware intrinsic in nested classes coreclr#20146
[Done] Improve StaticCast
- PR Move the various helper intrinsics to be implemented on the S.R.Intrinsics.Vector types coreclr#20147
- solution
1. explode frequently used intrinsics onto more base types (PR: Add all integer overloads for AlignRight/BlendVariable and unsigned overloads for MultiplyLow coreclr#19420)
2. move StaticCast and some other helpers into the vector classes, and rename StaticCast to As.
New intrinsic requests
- issues https://github.com/dotnet/coreclr/issues/19071, https://github.com/dotnet/coreclr/issues/19271, https://github.com/dotnet/coreclr/issues/18459, https://github.com/dotnet/corefx/issues/32075
- waiting for concrete API proposals and reviews
[DONE] Missing and incorrect APIs
- PRs Add pointer overloads for Avx2.BroadcastScalarToVector128 coreclr#20055, Fix inconsistent Intel hardware intrinsic APIs coreclr#19949
[DONE] Remove generic Intel hardware intrinsic and explode them on all the supported types
- issue https://github.com/dotnet/coreclr/issues/20057


fiigii · 2018-10-09T18:53:03Z

So far, the first two items (SSE4.2 STTNI intrinsics and 64-bit only intrinsics) need more discussion on naming.

[No Merge]Adding SSE4.2 STTNI intrinsic APIs coreclr#19958 shows the current progress of designing SSE4.2 STTNI intrinsic APIs.
Expose 64-bit only hardware intrinsic in nested classes coreclr#20146 shows the 64-bit only intrinsic APIs exposed in nested classes, currently named X64. That will make the 64-bit checks more explicit.

if (Popcnt.X64.IsSupported)
{
     var res = Popcnt.X64.Popcnt(longVal);
     ...
}

We think this is better than if (Popcnt.IsSupported && Environment.Is64BitProcess).
However, the name X64 of the nested class may be confusing sometimes because we have used X86 to represent the whole ISA family (32-bit and 64-bit both), so a name like Only64Bit may be better to express the subset.
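To make the comparison concrete, here is a minimal sketch of how a caller might guard a 64-bit-only intrinsic behind the nested-class check and fall back to software otherwise. It assumes the API as it later shipped in .NET Core 3.0, where the method is spelled Popcnt.X64.PopCount and takes a ulong (the snippet above uses an earlier spelling). The BitCounting class and the SoftwarePopCount helper are hypothetical names introduced only for illustration.

```csharp
using System.Runtime.Intrinsics.X86;

public static class BitCounting
{
    // Counts the set bits of a 64-bit value, using the hardware
    // instruction only when the 64-bit-only nested class reports support.
    public static int PopCount64(ulong value)
    {
        if (Popcnt.X64.IsSupported)
        {
            // Assumption: shipped .NET Core 3.0 signature, ulong PopCount(ulong).
            return (int)Popcnt.X64.PopCount(value);
        }

        return SoftwarePopCount(value);
    }

    // Hypothetical portable fallback: clears the lowest set bit per iteration.
    public static int SoftwarePopCount(ulong value)
    {
        int count = 0;
        while (value != 0)
        {
            value &= value - 1;
            count++;
        }
        return count;
    }
}
```

The single Popcnt.X64.IsSupported test covers both conditions that Popcnt.IsSupported && Environment.Is64BitProcess would otherwise spell out separately.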

fiigii · 2018-10-09T18:55:26Z

cc @CarolEidt @tannergooding @eerhardt @jkotas @stephentoub

tannergooding · 2018-10-09T18:58:17Z

I think just X64 will be self-explanatory, especially given the target audience for these APIs.

4creators · 2018-10-09T22:50:05Z

> I think just X64 will be self-explanatory, especially given the target audience for these APIs.

Our personal opinions are heavily biased, but the fact that 3 discussion participants have already identified X64 as potentially ambiguous should prevail. We should not look for consensus on this issue but try to avoid introducing naming that could be confusing for a very visible group of users.

The point is that 3 discussion participants make up 3/5 of the whole group, which should be alarming.

tannergooding · 2018-10-10T07:41:42Z

X64 is no more ambiguous than the existing X86 namespace. It is short, concise, and is almost universally understood to mean the 64-bit version of the x86 instruction set.

It is also already used elsewhere in .NET, alongside X86 in many cases, to differentiate in the same manner as we are trying to here (see https://source.dot.net/#q=X64 and https://source.dot.net/#q=X86) and in Roslyn (see http://source.roslyn.io/#q=X64 and http://source.roslyn.io/#q=X86) and other Microsoft products (the list could go on).

fiigii · 2018-10-17T18:26:07Z

@eerhardt @jkotas Do you have any suggestions for the name of the 64-bit-only classes?

tannergooding · 2018-10-17T18:29:59Z

I've requested another API review be scheduled to cover this and a few other questions. cc. @terrajobst

tannergooding · 2018-11-05T23:02:39Z

For 3, the PR is here: dotnet/coreclr#20147

The current API in the PR is:

public static class Vector64
{
    public static unsafe Vector64<byte> Create(byte value);
    public static unsafe Vector64<double> Create(double value);
    public static unsafe Vector64<short> Create(short value);
    public static unsafe Vector64<int> Create(int value);
    public static unsafe Vector64<long> Create(long value);
    public static unsafe Vector64<sbyte> Create(sbyte value);
    public static unsafe Vector64<float> Create(float value);
    public static unsafe Vector64<ushort> Create(ushort value);
    public static unsafe Vector64<uint> Create(uint value);
    public static unsafe Vector64<ulong> Create(ulong value);
    
    public static unsafe Vector64<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7);
    public static unsafe Vector64<short> Create(short e0, short e1, short e2, short e3);
    public static unsafe Vector64<int> Create(int e0, int e1);
    public static unsafe Vector64<sbyte> Create(sbyte e0, sbyte e1, sbyte e2, sbyte e3, sbyte e4, sbyte e5, sbyte e6, sbyte e7);
    public static unsafe Vector64<float> Create(float e0, float e1);
    public static unsafe Vector64<ushort> Create(ushort e0, ushort e1, ushort e2, ushort e3);
    public static unsafe Vector64<uint> Create(uint e0, uint e1);

    public static unsafe Vector64<byte> CreateScalar(byte value);
    public static unsafe Vector64<double> CreateScalar(double value);
    public static unsafe Vector64<short> CreateScalar(short value);
    public static unsafe Vector64<int> CreateScalar(int value);
    public static unsafe Vector64<long> CreateScalar(long value);
    public static unsafe Vector64<sbyte> CreateScalar(sbyte value);
    public static unsafe Vector64<float> CreateScalar(float value);
    public static unsafe Vector64<ushort> CreateScalar(ushort value);
    public static unsafe Vector64<uint> CreateScalar(uint value);
    public static unsafe Vector64<ulong> CreateScalar(ulong value);
}

public partial struct Vector64<T>
{
    public static Vector64<T> Zero { get; }

    public Vector64<byte> AsByte();
    public Vector64<double> AsDouble();
    public Vector64<short> AsInt16();
    public Vector64<int> AsInt32();
    public Vector64<long> AsInt64();
    public Vector64<sbyte> AsSByte();
    public Vector64<float> AsSingle();
    public Vector64<ushort> AsUInt16();
    public Vector64<uint> AsUInt32();
    public Vector64<ulong> AsUInt64();

    public T AsScalar();

    public T GetElement(int index);
    public void SetElement(int index, T value);
}

public static class Vector128
{
    public static unsafe Vector128<byte> Create(byte value);
    public static unsafe Vector128<double> Create(double value);
    public static unsafe Vector128<short> Create(short value);
    public static unsafe Vector128<int> Create(int value);
    public static unsafe Vector128<long> Create(long value);
    public static unsafe Vector128<sbyte> Create(sbyte value);
    public static unsafe Vector128<float> Create(float value);
    public static unsafe Vector128<ushort> Create(ushort value);
    public static unsafe Vector128<uint> Create(uint value);
    public static unsafe Vector128<ulong> Create(ulong value);
    
    public static unsafe Vector128<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7, byte e8, byte e9, byte e10, byte e11, byte e12, byte e13, byte e14, byte e15);
    public static unsafe Vector128<double> Create(double e0, double e1);
    public static unsafe Vector128<short> Create(short e0, short e1, short e2, short e3, short e4, short e5, short e6, short e7);
    public static unsafe Vector128<int> Create(int e0, int e1, int e2, int e3);
    public static unsafe Vector128<long> Create(long e0, long e1);
    public static unsafe Vector128<sbyte> Create(sbyte e0, sbyte e1, sbyte e2, sbyte e3, sbyte e4, sbyte e5, sbyte e6, sbyte e7, sbyte e8, sbyte e9, sbyte e10, sbyte e11, sbyte e12, sbyte e13, sbyte e14, sbyte e15);
    public static unsafe Vector128<float> Create(float e0, float e1, float e2, float e3);
    public static unsafe Vector128<ushort> Create(ushort e0, ushort e1, ushort e2, ushort e3, ushort e4, ushort e5, ushort e6, ushort e7);
    public static unsafe Vector128<uint> Create(uint e0, uint e1, uint e2, uint e3);
    public static unsafe Vector128<ulong> Create(ulong e0, ulong e1);
    
    public static unsafe Vector128<T> Create<T>(Vector64<T> lower, Vector64<T> upper);

    public static unsafe Vector128<byte> CreateScalar(byte value);
    public static unsafe Vector128<double> CreateScalar(double value);
    public static unsafe Vector128<short> CreateScalar(short value);
    public static unsafe Vector128<int> CreateScalar(int value);
    public static unsafe Vector128<long> CreateScalar(long value);
    public static unsafe Vector128<sbyte> CreateScalar(sbyte value);
    public static unsafe Vector128<float> CreateScalar(float value);
    public static unsafe Vector128<ushort> CreateScalar(ushort value);
    public static unsafe Vector128<uint> CreateScalar(uint value);
    public static unsafe Vector128<ulong> CreateScalar(ulong value);

    public static unsafe Vector128<T> CreateScalar<T>(Vector64<T> value);
}

public partial struct Vector128<T>
{
    public static Vector128<T> Zero { get; }

    public Vector128<byte> AsByte();
    public Vector128<double> AsDouble();
    public Vector128<short> AsInt16();
    public Vector128<int> AsInt32();
    public Vector128<long> AsInt64();
    public Vector128<sbyte> AsSByte();
    public Vector128<float> AsSingle();
    public Vector128<ushort> AsUInt16();
    public Vector128<uint> AsUInt32();
    public Vector128<ulong> AsUInt64();

    public T AsScalar();

    public T GetElement(int index);
    public void SetElement(int index, T value);

    public Vector64<T> GetLower();
    public Vector64<T> GetUpper();
}

public static class Vector256
{
    public static unsafe Vector256<byte> Create(byte value);
    public static unsafe Vector256<double> Create(double value);
    public static unsafe Vector256<short> Create(short value);
    public static unsafe Vector256<int> Create(int value);
    public static unsafe Vector256<long> Create(long value);
    public static unsafe Vector256<sbyte> Create(sbyte value);
    public static unsafe Vector256<float> Create(float value);
    public static unsafe Vector256<ushort> Create(ushort value);
    public static unsafe Vector256<uint> Create(uint value);
    public static unsafe Vector256<ulong> Create(ulong value);
    
    public static unsafe Vector256<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7, byte e8, byte e9, byte e10, byte e11, byte e12, byte e13, byte e14, byte e15, byte e16, byte e17, byte e18, byte e19, byte e20, byte e21, byte e22, byte e23, byte e24, byte e25, byte e26, byte e27, byte e28, byte e29, byte e30, byte e31);
    public static unsafe Vector256<double> Create(double e0, double e1, double e2, double e3);
    public static unsafe Vector256<short> Create(short e0, short e1, short e2, short e3, short e4, short e5, short e6, short e7, short e8, short e9, short e10, short e11, short e12, short e13, short e14, short e15);
    public static unsafe Vector256<int> Create(int e0, int e1, int e2, int e3, int e4, int e5, int e6, int e7);
    public static unsafe Vector256<long> Create(long e0, long e1, long e2, long e3);
    public static unsafe Vector256<sbyte> Create(sbyte e0, sbyte e1, sbyte e2, sbyte e3, sbyte e4, sbyte e5, sbyte e6, sbyte e7, sbyte e8, sbyte e9, sbyte e10, sbyte e11, sbyte e12, sbyte e13, sbyte e14, sbyte e15, sbyte e16, sbyte e17, sbyte e18, sbyte e19, sbyte e20, sbyte e21, sbyte e22, sbyte e23, sbyte e24, sbyte e25, sbyte e26, sbyte e27, sbyte e28, sbyte e29, sbyte e30, sbyte e31);
    public static unsafe Vector256<float> Create(float e0, float e1, float e2, float e3, float e4, float e5, float e6, float e7);
    public static unsafe Vector256<ushort> Create(ushort e0, ushort e1, ushort e2, ushort e3, ushort e4, ushort e5, ushort e6, ushort e7, ushort e8, ushort e9, ushort e10, ushort e11, ushort e12, ushort e13, ushort e14, ushort e15);
    public static unsafe Vector256<uint> Create(uint e0, uint e1, uint e2, uint e3, uint e4, uint e5, uint e6, uint e7);
    public static unsafe Vector256<ulong> Create(ulong e0, ulong e1, ulong e2, ulong e3);
    
    public static unsafe Vector256<T> Create<T>(Vector128<T> lower, Vector128<T> upper);

    public static unsafe Vector256<byte> CreateScalar(byte value);
    public static unsafe Vector256<double> CreateScalar(double value);
    public static unsafe Vector256<short> CreateScalar(short value);
    public static unsafe Vector256<int> CreateScalar(int value);
    public static unsafe Vector256<long> CreateScalar(long value);
    public static unsafe Vector256<sbyte> CreateScalar(sbyte value);
    public static unsafe Vector256<float> CreateScalar(float value);
    public static unsafe Vector256<ushort> CreateScalar(ushort value);
    public static unsafe Vector256<uint> CreateScalar(uint value);
    public static unsafe Vector256<ulong> CreateScalar(ulong value);

    public static unsafe Vector256<T> CreateScalar<T>(Vector128<T> value);
}

public partial struct Vector256<T>
{
    public static Vector256<T> Zero { get; }

    public Vector256<byte> AsByte();
    public Vector256<double> AsDouble();
    public Vector256<short> AsInt16();
    public Vector256<int> AsInt32();
    public Vector256<long> AsInt64();
    public Vector256<sbyte> AsSByte();
    public Vector256<float> AsSingle();
    public Vector256<ushort> AsUInt16();
    public Vector256<uint> AsUInt32();
    public Vector256<ulong> AsUInt64();

    public T AsScalar();

    public T GetElement(int index);
    public void SetElement(int index, T value);

    public Vector128<T> GetLower();
    public Vector128<T> GetUpper();
}

The open questions are:

Should we also expose a generic As<T, U>() API?
- Previously asked, was said users can use Unsafe.As<TFrom, TTo>
Is ordering parameters from least to greatest (e0, e1, ..., e15) okay?
- The current ones use the reverse (e15, e14, ..., e0) to match the native x86 side
- The native intrinsics also expose _mm_setr methods that match the least to greatest ordering
Should we explicitly throw for unsupported T, or only when T won't fit in the Vector128?
- Current PR throws for unsupported T
What should we name the "unsafe" CreateScalar overloads which leave the upper bits uninitialized (https://github.com/dotnet/corefx/issues/32834)?
Should the ExtendTo APIs be exposed as static CreateScalar or explicit ZeroExtend methods on the instance?
tannergooding · 2018-11-05T23:04:46Z

For 2, the biggest question is the name. Is X64 sufficient, or do we want something more "explicit"...

We also want to confirm that the nested X64 class should inherit from the next lowest X64 class. That is:

Bmi1.X64 -> Object
Bmi2.X64 -> Object
Lzcnt.X64 -> Object
Popcnt.X64 -> Sse42.X64 -> Sse41.X64 -> Sse2.X64 -> Sse.X64 -> Object
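As a rough illustration of the inheritance shape being asked about, the sketch below uses stand-in classes rather than the real intrinsic types. SseSketch, Sse2Sketch, and Describe are hypothetical names; only the nesting and base-class pattern is the point.

```csharp
// Stand-in for a base ISA class with a nested 64-bit-only class.
public abstract class SseSketch
{
    public abstract class X64
    {
        // Stand-in for a 64-bit-only intrinsic declared at the Sse level.
        public static string Describe() => "Sse.X64";
    }
}

// Stand-in for the next ISA level: the outer class derives from the lower
// ISA, and its nested X64 derives from the lower ISA's nested X64.
public abstract class Sse2Sketch : SseSketch
{
    public new abstract class X64 : SseSketch.X64
    {
    }
}
```

Because static members are reachable through a derived class name in C#, Sse2Sketch.X64.Describe() resolves to the member declared on SseSketch.X64, which is the convenience the chained hierarchy would give callers.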

fiigii · 2018-11-05T23:07:59Z

We will have an API review meeting on 11/6 to finalize the Intel hardware intrinsic APIs. There are three topics we need to go through in this review.

1. To review the current design of the 64-bit only intrinsics and decide the nested class name (e.g., X64, X64Only, etc.)
2. To dive into the SSE4.2 STTNI API design
- In the last review, we decided to have 3 enums: StringComparisonMode, IndexStringComparisonMode, and MaskStringComparisonMode. They have some overlapping values and a few unique values to provide "safer" semantics for different environments (return bool, indexes, or mask). The names of these ComparisonMode values match the manual and their C/C++ counterparts (e.g., _SIDD_MASKED_NEGATIVE_POLARITY in C++ for MaskedNegativePolarity in C#).
- For intrinsic function names, we decided to encode result flags into more understandable function names. For example, _mm_cmpistra (C and Z flag not set) would be named CompareNoMatchAndRightNotTerminated to reflect the instruction semantics (too verbose?).
3. To review the redesigned helper intrinsic APIs in the vector classes.
- The redesigned helper intrinsics do not provide a generic cast intrinsic (i.e., As<T>) and let users write their own via Unsafe.As<TFrom, TTo>.
- In my opinion, it is better to provide the generic cast intrinsic in the vector classes, which makes 1) the feature more self-contained and 2) perf more stable with tiered JIT (Tiered0 does not inline Unsafe.As).
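For reference, a user-written generic cast along the lines described in the first bullet might look like the sketch below. It assumes Unsafe.As<TFrom, TTo>(ref TFrom) from System.Runtime.CompilerServices and the Vector128 APIs as shipped in .NET Core 3.0. VectorCast and Reinterpret are hypothetical names, and the helper does no validation of T.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class VectorCast
{
    // Hypothetical user-side helper: reinterprets the 128 bits of a vector
    // as another element type without converting the values.
    public static Vector128<TTo> Reinterpret<TFrom, TTo>(Vector128<TFrom> vector)
        where TFrom : struct
        where TTo : struct
    {
        // Unsafe.As returns a ref to the same storage viewed as the target
        // type; returning it by value copies the bits out.
        return Unsafe.As<Vector128<TFrom>, Vector128<TTo>>(ref vector);
    }
}
```

For example, VectorCast.Reinterpret<float, int>(Vector128.Create(1.0f)) yields lanes holding the raw IEEE 754 bit pattern of 1.0f, the same result the non-generic AsInt32() helper gives.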

fiigii closed this as completed Jan 11, 2019

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 3.0 milestone Jan 31, 2020

ghost locked as resolved and limited conversation to collaborators Dec 15, 2020
