
API Proposal: Arm64 Simd Insert and Extract elements #24588

Closed
sdmaclea opened this issue Jan 5, 2018 · 33 comments · Fixed by #35030
Labels
api-approved API was approved in API review, it can be implemented arch-arm64 area-System.Runtime.Intrinsics
@sdmaclea
Contributor

sdmaclea commented Jan 5, 2018

@eerhardt @CarolEidt @RussKeldorph

Target Framework netcoreapp2.1

namespace System.Runtime.Intrinsics.Arm.Arm64
{
    /// <summary>
    /// This class provides access to the Arm64 AdvSIMD intrinsics
    ///
    /// Arm64 CPUs indicate support for this feature by setting
    /// ID_AA64PFR0_EL1.AdvSIMD == 0 or better.
    /// </summary>
    public static class Simd
    {
        public static bool IsSupported { get { throw null; } }

        /// <summary>
        /// Vector extract item
        ///
        /// result = vector[index]
        ///
        /// Note: In order to be inlined, index must be a JIT-time constant expression which can be used to
        /// populate the literal immediate field. Use of a non-constant will result in generation of a switch table.
        ///
        /// Corresponds to vector forms of ARM64 MOV
        /// </summary>
        public static T Extract<T>(Vector64<T>  vector, byte index) where T : struct { throw null; }
        public static T Extract<T>(Vector128<T> vector, byte index) where T : struct { throw null; }
        
        /// <summary>
        /// Vector insert item
        ///
        /// result = vector;
        /// result[index] = data;
        ///
        /// Note: In order to be inlined, index must be a JIT-time constant expression which can be used to
        /// populate the literal immediate field. Use of a non-constant will result in generation of a switch table.
        ///
        /// Corresponds to vector forms of ARM64 INS
        /// </summary>
        public static Vector64<T>  Insert<T>(Vector64<T>  vector, byte index, T data) where T : struct { throw null; }
        public static Vector128<T> Insert<T>(Vector128<T> vector, byte index, T data) where T : struct { throw null; }
    }
}

Argument order and parameter names should be consistent between X86 and ARM64; they currently are not.
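For readers less familiar with the contract described in the XML doc comments above, the element-wise semantics can be sketched as a toy Python model (this is a hypothetical illustration of the behavior, not the actual implementation):

```python
def extract(vector, index):
    """Model of Simd.Extract: result = vector[index]."""
    return vector[index]

def insert(vector, index, data):
    """Model of Simd.Insert: copy the vector, then overwrite one lane."""
    result = list(vector)
    result[index] = data
    return result

v = [10, 20, 30, 40]            # a Vector128<int> with four lanes
assert extract(v, 2) == 30
assert insert(v, 1, 99) == [10, 99, 30, 40]
assert v == [10, 20, 30, 40]    # Insert returns a new vector; the input is unchanged
```

Note that, like the proposed API, the model is non-mutating: Insert produces a new vector value rather than writing through to its argument.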

@jkotas
Member

jkotas commented Jan 5, 2018

Same methods are called Insert* and Extract* in Intel intrinsics.

@sdmaclea
Contributor Author

sdmaclea commented Jan 5, 2018

Updated for generics and to match XARCH naming conventions.

@sdmaclea sdmaclea changed the title API Proposal: Arm64 Simd GetItem & SetItem API Proposal: Arm64 Simd Insert and Extract elements Jan 5, 2018
@eerhardt
Member

eerhardt commented Jan 9, 2018

public static Vector64<T> Insert<T>(Vector64<T> left, byte index, T right) where T : struct { throw null; }

(nit) Are left and right good names for the parameters here? Maybe vector and value?

public static Vector64<T> Insert<T>(Vector64<T> left, byte index1, Vector64<T> right, byte index2) where T : struct { throw null; }

Should index1 and index2 be renamed to leftIndex and rightIndex?

@sdmaclea
Contributor Author

sdmaclea commented Jan 9, 2018

@eerhardt Done

@tannergooding
Member

For reference, x86 is using value and data, rather than vector and value. I don't have a preference either way, but it might be nice for them to be named consistently.

@eerhardt
Member

eerhardt commented Jan 9, 2018

Good call - I don't have a preference either way, either. Consistency is my preference. I just wanted something better than left and right.

@tannergooding
Member

I don't have a preference either way, either. Consistency is my preference. I just wanted something better than left and right.

Agreed.

@sdmaclea
Contributor Author

sdmaclea commented Jan 9, 2018

For reference, x86 is using value and data, rather than vector and value

@tannergooding Like this?

Extract(value, valueIndex);
Insert(value, valueIndex, data);
Insert(value, valueIndex, data, dataIndex);

@tannergooding
Member

I don't think x86 has equivalents for Insert(value, valueIndex, data, dataIndex) (shuffle and unpack are generally used for these). But yes, there are basically the following signatures:

Vector256<T> Insert<T>(Vector256<T> value, T data, byte index)
Vector256<T> Insert<T>(Vector256<T> value, Vector128<T> data, byte index)
Vector256<T> Insert<T>(Vector256<T> value, T* address, byte index)

Like @eerhardt, I don't have a preference whether we use value/data or vector/value.

CC @fiigii

@4creators
Contributor

Like @eerhardt, I don't have a preference whether we use value/data or vector/value.

IMHO vector/value would better describe the parameters.

@sdmaclea
Contributor Author

Consistency is my preference

What scope do you intend for that statement? Methods with similar usage? All methods?

For simple binary operands, left and right are simple and can be consistent.
result = left operator right makes a lot of sense, and X86 is already using this naming.

IMHO vector/value would better describe the parameters.

Arguments starting with the same letter will slightly slow comprehension. People look at the first and last letters and at word length to recognize differences quickly.

vector/data seems like a better combination.

@eerhardt
Member

What scope do you intend for that statement? Methods with similar usage? All methods?

My intended scope was that for things that are the same, or very similar, if there is no value in doing something different, then they should be consistent.

Another inconsistency is x86 has the order vector, data, index, but this has the order vector, index, data - the index and the data parameters are switched around. Maybe the value in being different there is that the machine instructions are different order between the two chips, and we want the C# API to match the low-level intrinsic. (I'm not saying that we want that - just giving an example where inconsistency may be OK - because there is a reason to be inconsistent.) But in general, when there is no real reason to be different, I think we should be consistent.

@4creators
Contributor

But in general, when there is no real reason to be different, I think we should be consistent.

This has quite an impact on learning curve, having good and intuitive naming makes it less steep.

Arguments starting with the same letter will slightly slow comprehension.

This is a good point - I have myself played word-recognition games that exploit this kind of cognitive rule.

@eerhardt
Member

This has quite an impact on learning curve, having good and intuitive naming makes it less steep.

Agreed. Let's have good, intuitive, and consistent naming. 😄
We can still change the x86 names if there are better names than what currently exist.

@sdmaclea
Contributor Author

Another inconsistency is x86 has the order vector, data, index, but this has the order vector, index, data

The Arm64 order is consistent with left-to-right reading order:

result = vector[index];
vector[index] = data;
vector[index1] = data[index2];

It is also consistent with Arm64 assembly syntax.

mov X2, V0[3]      // X2 = V0[3]
mov V0[3], X2      // V0[3] = X2
mov V0[3], V2[5]   // V0[3] = V2[5]

It is inconsistent with the codegen order (which always confuses me):

emit_R_R_I(... target, vector, index)
emit_R_R_I(... vector, data, index)
emit_R_R_I_I(... vector, data, index1, index2)

My personal preference would be to

  • keep the Arm64 order as proposed
  • change X86 order to match

API design is not my domain, so I will accept direction.

@sdmaclea
Contributor Author

I don't think x86 has equivalents for Insert(vector, valueIndex, data, dataIndex)

I wonder if this API is needed.

It could also be written

Insert(vector, valueIndex, Extract(data, dataIndex))

During lowering, the pattern could be recognized and the Extract contained in the Insert.

During import, the tree was going to be created like this anyway, marking the Extract as contained.

I guess it is in part a philosophical question:

  • should lowering be done on HW intrinsics
  • OR should it be treated like inline assembly and kept intact

I guess I prefer dropping this API and handling it in containment analysis.
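Under a toy model of the lane semantics, the composite form really is just a composition of the two simpler operations, which is what makes the containment argument work (a sketch with hypothetical helper names, not the proposed API itself):

```python
def extract(vector, index):
    """Model of Extract: read one lane."""
    return vector[index]

def insert(vector, index, data):
    """Model of Insert: copy the vector, overwrite one lane."""
    result = list(vector)
    result[index] = data
    return result

def insert_element(vector, value_index, data, data_index):
    """Model of the composite vector[index1] = data[index2] form,
    written as Insert(vector, valueIndex, Extract(data, dataIndex))."""
    return insert(vector, value_index, extract(data, data_index))

v = [1, 2, 3, 4]
d = [5, 6, 7, 8]
assert insert_element(v, 0, d, 3) == [8, 2, 3, 4]
```

Since the composite is definable from the two primitives, the JIT can pattern-match the nested call during containment analysis and emit a single INS without a separate API.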

@sdmaclea
Contributor Author

when there is no real reason to be different, I think we should be consistent.

I certainly agree.

The issue is that the X86 API surface is already large enough that I miss overlaps. The unfamiliar X86 terminology, and documentation written in terms of assembly instructions, certainly do not make it easier.

In this case, when the overlap was identified, I made the mistake of just renaming the function instead of copying the API from X86. In fact, until this discussion I didn't realize I had made that mistake.

@tannergooding
Member

should it be treated like inline assembly and kept intact

I believe it should be treated as inline assembly (effectively). That is, outside of the helper functions which don't strictly map to a single instruction (e.g. SetAll, StaticCast, etc.), we should be emitting exactly what the user requested (meaning, we wouldn't do something like convert Sse.Shuffle(left, right, 68) into Sse2.UnpackLow(left, right) or Sse.MoveLowToHigh(left, right), even though they are all functionally equivalent).

If there are cases where a user can get better codegen/perf by changing their code, an analyzer may be better suited to do that analysis and suggest the "fix".

This not only keeps the intrinsics simple, but it means the user gets exactly what they ask for which is predictable.
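The functional equivalence claimed above can be checked with a small bit-agnostic model. With a shuffle control of 68 (0b01_00_01_00), SHUFPS, MOVLHPS, and UNPCKLPD (viewing the same 128 bits as two 64-bit lanes) all produce {left[0], left[1], right[0], right[1]}. A sketch, with elements listed low lane first:

```python
def shuffle_ps(a, b, control):
    """Model of SHUFPS: two lanes picked from a, two from b, by 2-bit fields."""
    sel = [(control >> (2 * i)) & 0b11 for i in range(4)]
    return [a[sel[0]], a[sel[1]], b[sel[2]], b[sel[3]]]

def move_low_to_high(a, b):
    """Model of MOVLHPS: low 64 bits of b placed in the high half of a."""
    return [a[0], a[1], b[0], b[1]]

def unpack_low_pd(a, b):
    """Model of UNPCKLPD on the same 128 bits, where each 64-bit lane
    is a pair of 32-bit elements: low lane of a, then low lane of b."""
    return a[0:2] + b[0:2]

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
r = shuffle_ps(a, b, 68)
assert r == move_low_to_high(a, b) == unpack_low_pd(a, b) == [1.0, 2.0, 5.0, 6.0]
```

This is exactly why "emit what the user wrote" is a meaningful policy: the three spellings are interchangeable at the bit level, so choosing among them is the user's call, not the JIT's.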

@sdmaclea
Contributor Author

sdmaclea commented Jan 10, 2018

I believe it should be treated as inline assembly (effectively)

In general I agree

outside of the helper functions which don't strictly map to a single instruction

I believe this vector[index1] = data[index2] case is actually one of the rare exceptions.

If the base type is an integer, the vector element accesses involve a transfer between the vector and general-purpose register files, which can be slower. So fusing will eliminate the slow operations.

However, both indices must be immediate constants. If either index is not a constant, we need a switch table to implement it; in that case I would not want to fuse the instructions.

On ARM/ARM64 this case is probably best handled with structured loads and stores. The need for this flexibility seems like a rare corner case.

Finally allowing Insert to contain Extract is equivalent to silently eliminating a mov and dropping a temporary register. So containment is not risky.

My preference is still to remove the composite API, and contain.
If containment were prohibited, I would still defer adding the composite API.

I updated the OP to remove the composite API and to use vector/data.

I left the Insert arguments in left-to-right order, pending further discussion.

@tannergooding
Member

Finally allowing Insert to contain Extract is equivalent to silently eliminating a mov and dropping a temporary register. So containment is not risky.

Fair enough 👍

@sdmaclea
Contributor Author

Sse2.UnpackLow(left, right) or Sse.MoveLowToHigh(left, right) even though they are all functionally equivalent

If they are functionally equivalent, why do they use different C# method names?
Doesn't this violate "good, intuitive, and consistent naming"?

@tannergooding
Member

tannergooding commented Jan 10, 2018

Sse2.UnpackLow operates on double and does:
DEST[63:0] = SRC1[63:0]
DEST[127:64] = SRC2[63:0]

Sse.MoveLowToHigh operates on float and does:
DEST[31:0] = SRC1[31:0]
DEST[63:32] = SRC1[63:32]
DEST[95:64] = SRC2[31:0]
DEST[127:96] = SRC2[63:32]

From a bit-agnostic view, they are functionally the same (and will do the same thing regardless of the data you operate on).

@tannergooding
Member

The functionality does differ when you operate on Vector256.

Avx.UnpackLow operates on double and does:
DEST[63:0] = SRC1[63:0]
DEST[127:64] = SRC2[63:0]
DEST[191:128] = SRC1[191:128]
DEST[255:192] = SRC2[191:128]

and MoveLowToHigh doesn't have a Vector256 overload.
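The per-lane Vector256 behavior quoted above can be modeled the same way: VUNPCKLPD operates independently within each 128-bit lane, so for four double elements the result interleaves the even-indexed elements of the two sources (a sketch, not real AVX code):

```python
def avx_unpack_low_pd(a, b):
    """Model of 256-bit VUNPCKLPD: within each 128-bit lane (a pair of
    64-bit elements), take the low element of each source."""
    return [a[0], b[0], a[2], b[2]]

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
assert avx_unpack_low_pd(a, b) == [1.0, 5.0, 3.0, 7.0]
```

This lane-local behavior is why the 128-bit equivalence breaks down at 256 bits: the operation no longer treats the register as one contiguous vector.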

@sdmaclea
Contributor Author

@tannergooding 👍 X86 assembly is too ... for me to try to figure it out...

@sdmaclea
Contributor Author

@CarolEidt @eerhardt The argument order for the Insert<T>() case is the only open issue here.

I think the proposed order above is best, but it is inconsistent with X86. I would prefer X86 order got fixed.

However, if consensus favors the current X86 argument order, I am ready to change.

@eerhardt
Member

I think we can mark this as ready for review. We can bring up the ordering during the review - whether they should be consistent and if so, which one should be used.

@sdmaclea
Contributor Author

sdmaclea commented Feb 5, 2018

@eerhardt Any reason why this is marked future? For Extract, the CoreCLR work is complete and ready for 2.1. The remaining Insert part is in review and should be ready for 2.1.

@terrajobst
Member

Looks good as proposed. Feedback:

  • No need to match the X86 ordering of parameters; just match the ARM instruction ordering.

@tannergooding tannergooding self-assigned this Oct 16, 2019
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@echesakov
Contributor

@tannergooding Are you currently working on this? If not - can I take this issue?

@tannergooding
Member

No, I am not working on it right now; I am trying to finish going through the architecture manual to find which APIs still need to be proposed.

@echesakov
Contributor

echesakov commented Mar 23, 2020

@TamarChristinaArm Can you please help me understand whether

public static Vector64<float> Insert(Vector64<float> vector, byte index, float data);
public static Vector128<float> Insert(Vector128<float> vector, byte index, float data);

public static Vector64<double> Insert(Vector64<double> vector, byte index, double data);
public static Vector128<double> Insert(Vector128<double> vector, byte index, double data);

should be lowered to

INS Vd.S[lane], Vn.S[0]
INS Vd.S[lane], Vn.S[0]

INS Vd.D[lane], Vn.D[0]
INS Vd.D[lane], Vn.D[0]

or something else?

https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vset_lane says that

float32x2_t vset_lane_f32 (float32_t a, float32x2_t v, const int lane);
float32x4_t vsetq_lane_f32 (float32_t a, float32x4_t v, const int lane);

float64x1_t vset_lane_f64 (float64_t a, float64x1_t v, const int lane);
float64x2_t vsetq_lane_f64 (float64_t a, float64x2_t v, const int lane);

is lowered to

INS Vd.S[lane],Rn
INS Vd.S[lane],Rn

INS Vd.D[lane],Rn
INS Vd.D[lane],Rn

and Visual Studio 2019 seems to be doing what is prescribed on the website, i.e. it generates INS Vd.S[lane],Wn preceded by FMOV Wn, Sm.

For example,

	float r0 = 5.1f;
	float r1 = 4.2f;
	
	float32x4_t v0 = { 0 };
	
	v0 = vsetq_lane_f32(r0, v0, 3);
	v0 = vsetq_lane_f32(r1, v0, 1);

	vst1q_f32(p, v0);

compiles to

  00000001400010E8: 910003E8  mov         x8,sp
  00000001400010EC: 1C0001F0  ldr         s16,0000000140001128
  00000001400010F0: A9007D1F  stp         xzr,xzr,[x8]
  00000001400010F4: 3DC003F2  ldr         q18,[sp]
  00000001400010F8: 1C0001B1  ldr         s17,000000014000112C
  00000001400010FC: 910043E9  add         x9,sp,#0x10
  0000000140001100: 1E260208  fmov        w8,s16
  0000000140001104: 52800000  mov         w0,#0
  0000000140001108: 4E1C1D12  mov         v18.s[3],w8
  000000014000110C: 1E260228  fmov        w8,s17
  0000000140001110: 4E0C1D12  mov         v18.s[1],w8
  0000000140001114: 4C007932  st1         {v18.4s},[x9]

Presumably, this could be compiled to?

mov         x8,sp
ldr         s16,0000000140001128
ldr         s17,000000014000112C
stp         xzr,xzr,[x8]
ldr         q18,[sp]
mov         v18.s[3],v16.s[0]
mov         v18.s[1],v17.s[0]
add         x9,sp,#0x10
st1         {v18.4s},[x9]

Am I missing something?

@TamarChristinaArm
Contributor

Presumably, this could be compiled to?

mov         x8,sp
ldr         s16,0000000140001128
ldr         s17,000000014000112C
stp         xzr,xzr,[x8]
ldr         q18,[sp]
mov         v18.s[3],v16.s[0]
mov         v18.s[1],v17.s[0]
add         x9,sp,#0x10
st1         {v18.4s},[x9]

Am I missing something?

No, there's no particular reason it can't; in fact, in GCC we do this. We only generate the instruction in the ACLE docs in the case where we have constructed the float value on the integer side to begin with.

e.g. modern GCC for

float r0 = 5.1f;

won't generate a literal pool load but will instead create the bit pattern on the general-register side:

        mov     w1, 13107
        movk    w1, 0x40a3, lsl 16
        fmov    s2, w1

This is because the mov/movk pair is handled in the same cycle, so constructing the value this way ends up cheaper than addressing the literal pool and doing a load; you just replace the fmov with the mov.
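The mov/movk constants above can be checked against the IEEE-754 encoding of 5.1f: its 32-bit pattern is 0x40A33333, whose low half is 0x3333 = 13107 (the mov) and whose high half is 0x40A3 (the movk ..., lsl 16). A quick check:

```python
import struct

# Bit pattern of the float32 literal 5.1f
bits = struct.unpack('<I', struct.pack('<f', 5.1))[0]
assert bits == 0x40A33333

# mov w1, 13107 ; movk w1, 0x40a3, lsl 16 reconstructs the same pattern
assert (0x40A3 << 16) | 13107 == bits
```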

@echesakov
Contributor

@TamarChristinaArm Thank you for clarifying this for me!

@echesakov echesakov modified the milestones: Future, 5.0 Apr 4, 2020
@echesakov echesakov removed the help wanted [up-for-grabs] Good issue for external contributors label Apr 4, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 19, 2020