Intel hardware intrinsic API change #27584

Closed
fiigii opened this issue Oct 9, 2018 · 10 comments

@fiigii
Contributor

fiigii commented Oct 9, 2018

According to the second hardware intrinsic API review meeting (https://github.com/dotnet/apireviews/tree/master/2018/Hardware-Intrinsics-Intel), we need to change some intrinsic APIs and their implementations:

  1. [Leave for the next version] Design SSE4.2 STTNI intrinsics
  2. [Done] 64-bit only intrinsics
  3. [Done] Improve StaticCast
    1. Explode frequently used intrinsics onto more base types: "Add all integer overloads for AlignRight/BlendVariable and unsigned overloads for MultiplyLow" (coreclr#19420)
    2. Move StaticCast and some other helpers into the vector classes, and rename StaticCast to As.
  4. New intrinsic requests
  5. [Done] Missing and incorrect APIs
  6. [Done] Remove the generic Intel hardware intrinsics and explode them onto all the supported types
@fiigii
Contributor Author

fiigii commented Oct 9, 2018

So far, the first two items (SSE4.2 STTNI intrinsics and 64-bit only intrinsics) need more discussion about naming.

  1. "[No Merge] Adding SSE4.2 STTNI intrinsic APIs" (coreclr#19958) shows the current progress of the SSE4.2 STTNI intrinsic API design.
  2. "Expose 64-bit only hardware intrinsic in nested classes" (coreclr#20146) exposes the 64-bit only intrinsic APIs in nested classes, currently named X64. That makes the 64-bit checks more explicit:
if (Popcnt.X64.IsSupported)
{
     var res = Popcnt.X64.PopCount(longVal);
     ...
}

We think this is better than if (Popcnt.IsSupported && Environment.Is64BitProcess).
However, X64 may be a confusing name for the nested class, because we already use X86 to represent the whole ISA family (both 32-bit and 64-bit). A name like Only64Bit may express the subset better.
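
For readers who want to see the nested-class check in context, here is a minimal sketch. It assumes the nested class is named X64 and the method PopCount, as in the released System.Runtime.Intrinsics.X86 API (the name was still under discussion in this thread). CountBits and the software fallback are illustrative helpers, not part of the proposal.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

public static class PopCountExample
{
    // Counts the set bits in a 64-bit value, preferring the 64-bit-only intrinsic.
    public static int CountBits(ulong value)
    {
        if (Popcnt.X64.IsSupported)
        {
            // One check covers both "POPCNT is available" and "64-bit process".
            return (int)Popcnt.X64.PopCount(value);
        }

        if (Popcnt.IsSupported)
        {
            // 32-bit process with POPCNT: count each half separately.
            return (int)(Popcnt.PopCount((uint)value) + Popcnt.PopCount((uint)(value >> 32)));
        }

        // Software fallback: clear the lowest set bit until none remain.
        int count = 0;
        while (value != 0)
        {
            value &= value - 1;
            count++;
        }
        return count;
    }
}
```

All three paths return the same result; only the first one depends on the nested class.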

@fiigii
Contributor Author

fiigii commented Oct 9, 2018

@tannergooding
Member

I think just X64 will be self-explanatory, especially given the target audience for these APIs.

@4creators
Contributor

4creators commented Oct 9, 2018

> I think just X64 will be self-explanatory, especially given the target audience for these APIs.

Our personal opinions are heavily biased, but the fact that 3 discussion participants have already identified X64 as potentially ambiguous should prevail. We should not look for consensus on this issue, but rather try to avoid introducing a name that could confuse a very visible group of users.

The point is that those 3 participants make up 3/5 of the whole group, which should be alarming.

@tannergooding
Member

X64 is no more ambiguous than the existing X86 namespace. It is short, concise, and is almost universally understood to mean the 64-bit version of the x86 instruction set.

It is also already used elsewhere in .NET, alongside X86 in many cases, to differentiate in the same manner as we are trying to here (see https://source.dot.net/#q=X64 and https://source.dot.net/#q=X86) and in Roslyn (see http://source.roslyn.io/#q=X64 and http://source.roslyn.io/#q=X86) and other Microsoft products (the list could go on).

@fiigii
Contributor Author

fiigii commented Oct 17, 2018

@eerhardt @jkotas Do you have any suggestions for the name of the 64-bit only classes?

@tannergooding
Member

I've requested another API review be scheduled to cover this and a few other questions. cc. @terrajobst

@tannergooding
Member

For 3, the PR is here: dotnet/coreclr#20147

The current API in the PR is:

public static class Vector64
{
    public static unsafe Vector64<byte> Create(byte value);
    public static unsafe Vector64<double> Create(double value);
    public static unsafe Vector64<short> Create(short value);
    public static unsafe Vector64<int> Create(int value);
    public static unsafe Vector64<long> Create(long value);
    public static unsafe Vector64<sbyte> Create(sbyte value);
    public static unsafe Vector64<float> Create(float value);
    public static unsafe Vector64<ushort> Create(ushort value);
    public static unsafe Vector64<uint> Create(uint value);
    public static unsafe Vector64<ulong> Create(ulong value);
    
    public static unsafe Vector64<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7);
    public static unsafe Vector64<short> Create(short e0, short e1, short e2, short e3);
    public static unsafe Vector64<int> Create(int e0, int e1);
    public static unsafe Vector64<sbyte> Create(sbyte e0, sbyte e1, sbyte e2, sbyte e3, sbyte e4, sbyte e5, sbyte e6, sbyte e7);
    public static unsafe Vector64<float> Create(float e0, float e1);
    public static unsafe Vector64<ushort> Create(ushort e0, ushort e1, ushort e2, ushort e3);
    public static unsafe Vector64<uint> Create(uint e0, uint e1);

    public static unsafe Vector64<byte> CreateScalar(byte value);
    public static unsafe Vector64<double> CreateScalar(double value);
    public static unsafe Vector64<short> CreateScalar(short value);
    public static unsafe Vector64<int> CreateScalar(int value);
    public static unsafe Vector64<long> CreateScalar(long value);
    public static unsafe Vector64<sbyte> CreateScalar(sbyte value);
    public static unsafe Vector64<float> CreateScalar(float value);
    public static unsafe Vector64<ushort> CreateScalar(ushort value);
    public static unsafe Vector64<uint> CreateScalar(uint value);
    public static unsafe Vector64<ulong> CreateScalar(ulong value);
}

public partial struct Vector64<T>
{
    public static Vector64<T> Zero { get; }

    public Vector64<byte> AsByte();
    public Vector64<double> AsDouble();
    public Vector64<short> AsInt16();
    public Vector64<int> AsInt32();
    public Vector64<long> AsInt64();
    public Vector64<sbyte> AsSByte();
    public Vector64<float> AsSingle();
    public Vector64<ushort> AsUInt16();
    public Vector64<uint> AsUInt32();
    public Vector64<ulong> AsUInt64();

    public T AsScalar();

    public T GetElement(int index);
    public void SetElement(int index, T value);
}

public static class Vector128
{
    public static unsafe Vector128<byte> Create(byte value);
    public static unsafe Vector128<double> Create(double value);
    public static unsafe Vector128<short> Create(short value);
    public static unsafe Vector128<int> Create(int value);
    public static unsafe Vector128<long> Create(long value);
    public static unsafe Vector128<sbyte> Create(sbyte value);
    public static unsafe Vector128<float> Create(float value);
    public static unsafe Vector128<ushort> Create(ushort value);
    public static unsafe Vector128<uint> Create(uint value);
    public static unsafe Vector128<ulong> Create(ulong value);
    
    public static unsafe Vector128<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7, byte e8, byte e9, byte e10, byte e11, byte e12, byte e13, byte e14, byte e15);
    public static unsafe Vector128<double> Create(double e0, double e1);
    public static unsafe Vector128<short> Create(short e0, short e1, short e2, short e3, short e4, short e5, short e6, short e7);
    public static unsafe Vector128<int> Create(int e0, int e1, int e2, int e3);
    public static unsafe Vector128<long> Create(long e0, long e1);
    public static unsafe Vector128<sbyte> Create(sbyte e0, sbyte e1, sbyte e2, sbyte e3, sbyte e4, sbyte e5, sbyte e6, sbyte e7, sbyte e8, sbyte e9, sbyte e10, sbyte e11, sbyte e12, sbyte e13, sbyte e14, sbyte e15);
    public static unsafe Vector128<float> Create(float e0, float e1, float e2, float e3);
    public static unsafe Vector128<ushort> Create(ushort e0, ushort e1, ushort e2, ushort e3, ushort e4, ushort e5, ushort e6, ushort e7);
    public static unsafe Vector128<uint> Create(uint e0, uint e1, uint e2, uint e3);
    public static unsafe Vector128<ulong> Create(ulong e0, ulong e1);
    
    public static unsafe Vector128<T> Create<T>(Vector64<T> lower, Vector64<T> upper);

    public static unsafe Vector128<byte> CreateScalar(byte value);
    public static unsafe Vector128<double> CreateScalar(double value);
    public static unsafe Vector128<short> CreateScalar(short value);
    public static unsafe Vector128<int> CreateScalar(int value);
    public static unsafe Vector128<long> CreateScalar(long value);
    public static unsafe Vector128<sbyte> CreateScalar(sbyte value);
    public static unsafe Vector128<float> CreateScalar(float value);
    public static unsafe Vector128<ushort> CreateScalar(ushort value);
    public static unsafe Vector128<uint> CreateScalar(uint value);
    public static unsafe Vector128<ulong> CreateScalar(ulong value);

    public static unsafe Vector128<T> CreateScalar<T>(Vector64<T> value);
}

public partial struct Vector128<T>
{
    public static Vector128<T> Zero { get; }

    public Vector128<byte> AsByte();
    public Vector128<double> AsDouble();
    public Vector128<short> AsInt16();
    public Vector128<int> AsInt32();
    public Vector128<long> AsInt64();
    public Vector128<sbyte> AsSByte();
    public Vector128<float> AsSingle();
    public Vector128<ushort> AsUInt16();
    public Vector128<uint> AsUInt32();
    public Vector128<ulong> AsUInt64();

    public T AsScalar();

    public T GetElement(int index);
    public void SetElement(int index, T value);

    public Vector64<T> GetLower();
    public Vector64<T> GetUpper();
}

public static class Vector256
{
    public static unsafe Vector256<byte> Create(byte value);
    public static unsafe Vector256<double> Create(double value);
    public static unsafe Vector256<short> Create(short value);
    public static unsafe Vector256<int> Create(int value);
    public static unsafe Vector256<long> Create(long value);
    public static unsafe Vector256<sbyte> Create(sbyte value);
    public static unsafe Vector256<float> Create(float value);
    public static unsafe Vector256<ushort> Create(ushort value);
    public static unsafe Vector256<uint> Create(uint value);
    public static unsafe Vector256<ulong> Create(ulong value);
    
    public static unsafe Vector256<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7, byte e8, byte e9, byte e10, byte e11, byte e12, byte e13, byte e14, byte e15, byte e16, byte e17, byte e18, byte e19, byte e20, byte e21, byte e22, byte e23, byte e24, byte e25, byte e26, byte e27, byte e28, byte e29, byte e30, byte e31);
    public static unsafe Vector256<double> Create(double e0, double e1, double e2, double e3);
    public static unsafe Vector256<short> Create(short e0, short e1, short e2, short e3, short e4, short e5, short e6, short e7, short e8, short e9, short e10, short e11, short e12, short e13, short e14, short e15);
    public static unsafe Vector256<int> Create(int e0, int e1, int e2, int e3, int e4, int e5, int e6, int e7);
    public static unsafe Vector256<long> Create(long e0, long e1, long e2, long e3);
    public static unsafe Vector256<sbyte> Create(sbyte e0, sbyte e1, sbyte e2, sbyte e3, sbyte e4, sbyte e5, sbyte e6, sbyte e7, sbyte e8, sbyte e9, sbyte e10, sbyte e11, sbyte e12, sbyte e13, sbyte e14, sbyte e15, sbyte e16, sbyte e17, sbyte e18, sbyte e19, sbyte e20, sbyte e21, sbyte e22, sbyte e23, sbyte e24, sbyte e25, sbyte e26, sbyte e27, sbyte e28, sbyte e29, sbyte e30, sbyte e31);
    public static unsafe Vector256<float> Create(float e0, float e1, float e2, float e3, float e4, float e5, float e6, float e7);
    public static unsafe Vector256<ushort> Create(ushort e0, ushort e1, ushort e2, ushort e3, ushort e4, ushort e5, ushort e6, ushort e7, ushort e8, ushort e9, ushort e10, ushort e11, ushort e12, ushort e13, ushort e14, ushort e15);
    public static unsafe Vector256<uint> Create(uint e0, uint e1, uint e2, uint e3, uint e4, uint e5, uint e6, uint e7);
    public static unsafe Vector256<ulong> Create(ulong e0, ulong e1, ulong e2, ulong e3);
    
    public static unsafe Vector256<T> Create<T>(Vector128<T> lower, Vector128<T> upper);

    public static unsafe Vector256<byte> CreateScalar(byte value);
    public static unsafe Vector256<double> CreateScalar(double value);
    public static unsafe Vector256<short> CreateScalar(short value);
    public static unsafe Vector256<int> CreateScalar(int value);
    public static unsafe Vector256<long> CreateScalar(long value);
    public static unsafe Vector256<sbyte> CreateScalar(sbyte value);
    public static unsafe Vector256<float> CreateScalar(float value);
    public static unsafe Vector256<ushort> CreateScalar(ushort value);
    public static unsafe Vector256<uint> CreateScalar(uint value);
    public static unsafe Vector256<ulong> CreateScalar(ulong value);

    public static unsafe Vector256<T> CreateScalar<T>(Vector128<T> value);
}

public partial struct Vector256<T>
{
    public static Vector256<T> Zero { get; }

    public Vector256<byte> AsByte();
    public Vector256<double> AsDouble();
    public Vector256<short> AsInt16();
    public Vector256<int> AsInt32();
    public Vector256<long> AsInt64();
    public Vector256<sbyte> AsSByte();
    public Vector256<float> AsSingle();
    public Vector256<ushort> AsUInt16();
    public Vector256<uint> AsUInt32();
    public Vector256<ulong> AsUInt64();

    public T AsScalar();

    public T GetElement(int index);
    public void SetElement(int index, T value);

    public Vector128<T> GetLower();
    public Vector128<T> GetUpper();
}

The open questions are:

  • Should we also expose a generic As<T, U>() API?
    • This was asked previously; the answer was that users can use Unsafe.As<TFrom, TTo>
  • Is ordering parameters from least to greatest (e0, e1, ..., e15) okay?
    • The current ones use the reverse (e15, e14, ..., e0) to match the native x86 side
    • The native intrinsics also expose _mm_setr methods that match the least to greatest ordering
  • Should we explicitly throw for unsupported T, or only when T won't fit in the Vector128?
    • Current PR throws for unsupported T
  • What should we name the "unsafe" CreateScalar overloads which leave the upper bits uninitialized (https://github.com/dotnet/corefx/issues/32834)?
  • Should the ExtendTo APIs be exposed as static CreateScalar or explicit ZeroExtend methods on the instance?
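
To make the ordering and CreateScalar questions concrete, here is a small sketch written against the API shape that eventually shipped in System.Runtime.Intrinsics, which may differ from the PR listing above (for example, GetElement is an extension method there). It assumes the least-to-greatest ordering (e0 first) and a CreateScalar that zeroes the upper elements. The class and method names are illustrative only.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class VectorCreateExample
{
    // Create(e0, e1, e2, e3): the first argument lands in element 0,
    // i.e. the same ordering as the native _mm_setr_* helpers.
    public static int[] ElementsInOrder()
    {
        Vector128<int> v = Vector128.Create(10, 11, 12, 13);
        return new[] { v.GetElement(0), v.GetElement(1), v.GetElement(2), v.GetElement(3) };
    }

    // CreateScalar(value): element 0 holds the value, the remaining elements are zero.
    public static int[] ScalarElements()
    {
        Vector128<int> s = Vector128.CreateScalar(7);
        return new[] { s.GetElement(0), s.GetElement(1), s.GetElement(2), s.GetElement(3) };
    }
}
```

With the reverse (e15, ..., e0) ordering, the first method would instead read back 13, 12, 11, 10, which is the source of the confusion this question is trying to avoid.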

@tannergooding
Member

For 2, the biggest question is the name. Is X64 sufficient, or do we want something more "explicit"...

We also want to confirm that the nested X64 class should inherit from the next lowest X64 class. That is:

  • Bmi1.X64 -> Object
  • Bmi2.X64 -> Object
  • Lzcnt.X64 -> Object
  • Popcnt.X64 -> Sse42.X64 -> Sse41.X64 -> Sse2.X64 -> Sse.X64 -> Object
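
One way to inspect such a hierarchy is through reflection. The sketch below assumes the nested classes are named X64 and that Popcnt.X64 derives from Sse42.X64 as listed above. The intermediate base types can differ between runtime versions, so the chain is only walked, not hard-coded. The helper names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.Intrinsics.X86;

public static class X64HierarchyExample
{
    // Walks the base-type chain of a type, most-derived first, ending at System.Object.
    public static List<string> BaseChain(Type type)
    {
        var chain = new List<string>();
        for (Type t = type; t != null; t = t.BaseType)
        {
            chain.Add(t.FullName);
        }
        return chain;
    }

    // True when Popcnt.X64 inherits (directly or indirectly) from Sse42.X64.
    public static bool PopcntX64DerivesFromSse42X64() =>
        typeof(Sse42.X64).IsAssignableFrom(typeof(Popcnt.X64));
}
```

The practical effect of the inheritance is that static members declared on a lower X64 class are also reachable through the more derived one.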

@fiigii
Contributor Author

fiigii commented Nov 5, 2018

We will have an API review meeting on 11/6 to finalize the Intel hardware intrinsic APIs. There are three topics we need to go through in this review.

  1. To review the current design of the 64-bit only intrinsics and decide the nested class name (e.g., X64, X64Only, etc.)
  2. To dive into the SSE4.2 STTNI API design
    • In the last review, we decided to have 3 enums: StringComparisonMode, IndexStringComparisonMode, and MaskStringComparisonMode. They share some values and each has a few unique ones, to provide "safer" semantics for the different result kinds (bool, index, or mask). The names of these ComparisonMode values match the manual and their C/C++ counterparts (e.g., _SIDD_MASKED_NEGATIVE_POLARITY in C++ for MaskedNegativePolarity in C#).
    • For intrinsic function names, we decided to encode the result flags into more understandable function names. For example, _mm_cmpistra (C and Z flags not set) would be named CompareNoMatchAndRightNotTerminated to reflect the instruction semantics (too verbose?).
  3. To review the redesigned helper intrinsic APIs in the vector classes.
    • The redesigned helper intrinsics do not provide a generic cast intrinsic (i.e., As<T>) and let users write their own via Unsafe.As<TFrom, TTo>.
    • In my opinion, it is better to provide the generic cast intrinsic in the vector classes, which makes 1) the feature more self-contained and 2) perf more stable with the tiered JIT (Tier 0 does not inline Unsafe.As).
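
The three ways of reinterpreting a vector discussed in topic 3 can be compared side by side. This is a sketch against the API that eventually shipped in System.Runtime.Intrinsics (AsInt32 and the generic As are extension methods there), not the exact shape in the PR. ReinterpretExample and its methods are illustrative names.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class ReinterpretExample
{
    // Typed helper on the vector: reinterprets the bits without converting values.
    public static int ViaAsInt32(float value) =>
        Vector128.Create(value).AsInt32().GetElement(0);

    // Generic cast helper on the vector, the option argued for in this comment.
    public static int ViaGenericAs(float value) =>
        Vector128.Create(value).As<float, int>().GetElement(0);

    // User-written alternative: reinterpret the whole struct through Unsafe.As.
    public static int ViaUnsafeAs(float value)
    {
        Vector128<float> v = Vector128.Create(value);
        return Unsafe.As<Vector128<float>, Vector128<int>>(ref v).GetElement(0);
    }
}
```

All three return the raw IEEE 754 bit pattern of the input, so 1.0f reads back as 0x3F800000. The difference is only in where the cast lives and how the JIT treats it.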

@fiigii fiigii closed this as completed Jan 11, 2019
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 15, 2020