
API Proposal: Arm64 Simd Insert and Extract elements #24588

Closed
sdmaclea opened this issue Jan 5, 2018 · 33 comments · Fixed by #35030
Labels
api-approved API was approved in API review, it can be implemented arch-arm64 area-System.Runtime.Intrinsics
@sdmaclea
Contributor

sdmaclea commented Jan 5, 2018

@eerhardt @CarolEidt @RussKeldorph

Target Framework netcoreapp2.1

namespace System.Runtime.Intrinsics.Arm.Arm64
{
    /// <summary>
    /// This class provides access to the Arm64 AdvSIMD intrinsics
    ///
    /// Arm64 CPUs indicate support for this feature by setting
    /// ID_AA64PFR0_EL1.AdvSIMD == 0 or better.
    /// </summary>
    public static class Simd
    {
        public static bool IsSupported { get { throw null; } }

        /// <summary>
        /// Vector extract item
        ///
        /// result = vector[index]
        ///
        /// Note: In order to be inlined, index must be a JIT-time constant expression which can be used to
        /// populate the literal immediate field. Use of a non-constant will result in generation of a switch table.
        ///
        /// Corresponds to vector forms of ARM64 MOV
        /// </summary>
        public static T Extract<T>(Vector64<T>  vector, byte index) where T : struct { throw null; }
        public static T Extract<T>(Vector128<T> vector, byte index) where T : struct { throw null; }
        
        /// <summary>
        /// Vector insert item
        ///
        /// result = vector;
        /// result[index] = data;
        ///
        /// Note: In order to be inlined, index must be a JIT-time constant expression which can be used to
        /// populate the literal immediate field. Use of a non-constant will result in generation of a switch table.
        ///
        /// Corresponds to vector forms of ARM64 INS
        /// </summary>
        public static Vector64<T>  Insert<T>(Vector64<T>  vector, byte index, T data) where T : struct { throw null; }
        public static Vector128<T> Insert<T>(Vector128<T> vector, byte index, T data) where T : struct { throw null; }
    }
}

Argument order and parameter names should be consistent between X86 and ARM64; they currently are not.
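For readers less familiar with the contract described in the XML doc comments above, the element-wise semantics can be sketched as a toy Python model (this is a hypothetical illustration of the behavior, not the actual implementation):

```python
def extract(vector, index):
    """Model of Simd.Extract: result = vector[index]."""
    return vector[index]

def insert(vector, index, data):
    """Model of Simd.Insert: copy the vector, then overwrite one lane."""
    result = list(vector)
    result[index] = data
    return result

v = [10, 20, 30, 40]            # a Vector128<int> with four lanes
assert extract(v, 2) == 30
assert insert(v, 1, 99) == [10, 99, 30, 40]
assert v == [10, 20, 30, 40]    # Insert returns a new vector; the input is unchanged
```

Note that, like the proposed API, the model is non-mutating: Insert produces a new vector value rather than writing through to its argument.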

@jkotas
Member

jkotas commented Jan 5, 2018

Same methods are called Insert* and Extract* in Intel intrinsics.

@sdmaclea
Contributor Author

sdmaclea commented Jan 5, 2018

Updated for generics and to match XARCH naming conventions.

@sdmaclea sdmaclea changed the title API Proposal: Arm64 Simd GetItem & SetItem API Proposal: Arm64 Simd Insert and Extract elements Jan 5, 2018
@eerhardt
Member

eerhardt commented Jan 9, 2018

public static Vector64<T> Insert<T>(Vector64<T> left, byte index, T right) where T : struct { throw null; }

(nit) Are left and right good names for the parameters here? Maybe vector and value?

public static Vector64<T> Insert<T>(Vector64<T> left, byte index1, Vector64<T> right, byte index2) where T : struct { throw null; }

Should index1 and index2 be renamed to leftIndex and rightIndex?

@sdmaclea
Contributor Author

sdmaclea commented Jan 9, 2018

@eerhardt Done

@tannergooding
Member

For reference, x86 is using value and data, rather than vector and value. I don't have a preference either way, but it might be nice for them to be named consistently.

@eerhardt
Member

eerhardt commented Jan 9, 2018

Good call - I don't have a preference either way, either. Consistency is my preference. I just wanted something better than left and right.

@tannergooding
Member

I don't have a preference either way, either. Consistency is my preference. I just wanted something better than left and right.

Agreed.

@sdmaclea
Contributor Author

sdmaclea commented Jan 9, 2018

For reference, x86 is using value and data, rather than vector and value

@tannergooding Like this?

Extract(value, valueIndex);
Insert(value, valueIndex, data);
Insert(value, valueIndex, data, dataIndex);

@tannergooding
Member

I don't think x86 has equivalents for Insert(value, valueIndex, data, dataIndex) (shuffle and unpack are generally used for these). But yes, there are basically the following signatures:

Vector256<T> Insert<T>(Vector256<T> value, T data, byte index)
Vector256<T> Insert<T>(Vector256<T> value, Vector128<T> data, byte index)
Vector256<T> Insert<T>(Vector256<T> value, T* address, byte index)

Like @eerhardt, I don't have a preference whether we use value/data or vector/value.

CC @fiigii

@4creators
Contributor

Like @eerhardt, I don't have a preference whether we use value/data or vector/value.

IMHO vector/value would better describe the parameters.

@sdmaclea
Contributor Author

Consistency is my preference

What scope do you intend for that statement? Methods with similar usage? All methods?

For simple binary operands, left and right are simple and can be consistent.
result = left operator right makes a lot of sense, and X86 is already using this naming.

IMHO vector/value would better describe the parameters.

Arguments starting with the same letter will slightly slow comprehension. People look at the first and last letters and at word length to recognize differences quickly.

vector/data seems like a better combination.

@eerhardt
Member

What scope do you intend for that statement? Methods with similar usage? All methods?

My intended scope was that for things that are the same, or very similar, if there is no value in doing something different, then they should be consistent.

Another inconsistency is x86 has the order vector, data, index, but this has the order vector, index, data - the index and the data parameters are switched around. Maybe the value in being different there is that the machine instructions are different order between the two chips, and we want the C# API to match the low-level intrinsic. (I'm not saying that we want that - just giving an example where inconsistency may be OK - because there is a reason to be inconsistent.) But in general, when there is no real reason to be different, I think we should be consistent.

@4creators
Contributor

But in general, when there is no real reason to be different, I think we should be consistent.

This has quite an impact on learning curve, having good and intuitive naming makes it less steep.

Arguments starting with the same letter will slightly slow comprehension.

This is a good point - I have myself played word-recognition games that exploit this kind of cognitive rule.

@eerhardt
Member

This has quite an impact on learning curve, having good and intuitive naming makes it less steep.

Agreed. Let's have good, intuitive, and consistent naming. 😄
We can still change the x86 names if there are better names than what currently exist.

@sdmaclea
Contributor Author

Another inconsistency is x86 has the order vector, data, index, but this has the order vector, index, data

The Arm64 order is consistent with left-to-right reading order:

result = vector[index];
vector[index] = data;
vector[index1] = data[index2];

It is also consistent with Arm64 assembly syntax.

mov X2, V0[3]      // X2 = V0[3]
mov V0[3], X2      // V0[3] = X2
mov V0[3], V2[5]   // V0[3] = V2[5]

It is inconsistent with the codegen order (which always confuses me):

emit_R_R_I(... target, vector, index)
emit_R_R_I(... vector, data, index)
emit_R_R_I_I(... vector, data, index1, index2)

My personal preference would be to

  • keep the Arm64 order as proposed
  • change X86 order to match

API design is not my domain, so I will accept direction.

@sdmaclea
Contributor Author

I don't think x86 has equivalents for Insert(vector, valueIndex, data, dataIndex)

I wonder if this API is needed.

It could also be written

Insert(vector, valueIndex, Extract(data, dataIndex))

During lowering, the pattern could be recognized and the Extract contained in the Insert.

During import, the tree was going to be created like this anyway, marking the Extract as contained.

I guess it is in part a philosophical question:

  • should lowering be done on HW intrinsics
  • OR should it be treated like inline assembly and kept intact

I guess I prefer dropping this API and handling it in containment analysis.
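Under a toy model of the lane semantics, the composite form really is just a composition of the two simpler operations, which is what makes the containment argument work (a sketch with hypothetical helper names, not the proposed API itself):

```python
def extract(vector, index):
    """Model of Extract: read one lane."""
    return vector[index]

def insert(vector, index, data):
    """Model of Insert: copy the vector, overwrite one lane."""
    result = list(vector)
    result[index] = data
    return result

def insert_element(vector, value_index, data, data_index):
    """Model of the composite vector[index1] = data[index2] form,
    written as Insert(vector, valueIndex, Extract(data, dataIndex))."""
    return insert(vector, value_index, extract(data, data_index))

v = [1, 2, 3, 4]
d = [5, 6, 7, 8]
assert insert_element(v, 0, d, 3) == [8, 2, 3, 4]
```

Since the composite is definable from the two primitives, the JIT can pattern-match the nested call during containment analysis and emit a single INS without a separate API.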

@sdmaclea
Contributor Author

when there is no real reason to be different, I think we should be consistent.

I certainly agree.

The issue is that the X86 API surface is already large enough that I miss overlaps. The unfamiliar X86 terminology, and documentation written in terms of assembly instructions, certainly do not make it easier.

In this case, when the overlap was identified, I made the mistake of just renaming the function instead of copying the API from X86. In fact, until this discussion I didn't realize I had made that mistake.

@tannergooding
Member

should it be treated like inline assembly and kept intact

I believe it should be treated as inline assembly (effectively). That is, outside of the helper functions which don't strictly map to a single instruction (e.g. SetAll, StaticCast, etc.), we should be emitting exactly what the user requested (meaning, we wouldn't do something like convert Sse.Shuffle(left, right, 68) into Sse2.UnpackLow(left, right) or Sse.MoveLowToHigh(left, right), even though they are all functionally equivalent).

If there are cases where a user can get better codegen/perf by changing their code, an analyzer may be better suited to do that analysis and suggest the "fix".

This not only keeps the intrinsics simple, but it means the user gets exactly what they ask for which is predictable.
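The functional equivalence claimed above can be checked with a small bit-agnostic model. With a shuffle control of 68 (0b01_00_01_00), SHUFPS, MOVLHPS, and UNPCKLPD (viewing the same 128 bits as two 64-bit lanes) all produce {left[0], left[1], right[0], right[1]}. A sketch, with elements listed low lane first:

```python
def shuffle_ps(a, b, control):
    """Model of SHUFPS: two lanes picked from a, two from b, by 2-bit fields."""
    sel = [(control >> (2 * i)) & 0b11 for i in range(4)]
    return [a[sel[0]], a[sel[1]], b[sel[2]], b[sel[3]]]

def move_low_to_high(a, b):
    """Model of MOVLHPS: low 64 bits of b placed in the high half of a."""
    return [a[0], a[1], b[0], b[1]]

def unpack_low_pd(a, b):
    """Model of UNPCKLPD on the same 128 bits, where each 64-bit lane
    is a pair of 32-bit elements: low lane of a, then low lane of b."""
    return a[0:2] + b[0:2]

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
r = shuffle_ps(a, b, 68)
assert r == move_low_to_high(a, b) == unpack_low_pd(a, b) == [1.0, 2.0, 5.0, 6.0]
```

This is exactly why "emit what the user wrote" is a meaningful policy: the three spellings are interchangeable at the bit level, so choosing among them is the user's call, not the JIT's.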

@sdmaclea
Contributor Author

sdmaclea commented Jan 10, 2018

I believe it should be treated as inline assembly (effectively)

In general I agree

outside of the helper functions which don't strictly map to a single instruction

I believe this vector[index1] = data[index2] case is actually one of the rare exceptions.

If the base type is an integer, the vector element accesses involve a transfer between the vector and general-purpose register files, which can be slower. So fusing will eliminate the slow operations.

However, both indices must be immediate constants. If either index is not a constant, we need a switch table to implement it; in that case I would not want to fuse the instructions.

On ARM/ARM64 this case is probably best handled with structured loads and stores. The need for this flexibility seems like a rare corner case.

Finally allowing Insert to contain Extract is equivalent to silently eliminating a mov and dropping a temporary register. So containment is not risky.

My preference is still to remove the composite API, and contain.
If containment were prohibited, I would still defer adding the composite API.

I updated the OP to remove the composite API and to use vector/data.

I left the Insert arguments in left-to-right order, pending further discussion.

@tannergooding
Member

Finally allowing Insert to contain Extract is equivalent to silently eliminating a mov and dropping a temporary register. So containment is not risky.

Fair enough 👍

@sdmaclea
Contributor Author

Sse2.UnpackLow(left, right) or Sse.MoveLowToHigh(left, right) even though they are all functionally equivalent

If they are functionally equivalent, why do they use different C# method names?
Doesn't this violate "good, intuitive, and consistent naming"?

@tannergooding
Member

tannergooding commented Jan 10, 2018

Sse2.UnpackLow operates on double and does:
DEST[63:0] = SRC1[63:0]
DEST[127:64] = SRC2[63:0]

Sse.MoveLowToHigh operates on float and does:
DEST[31:0] = SRC1[31:0]
DEST[63:32] = SRC1[63:32]
DEST[95:64] = SRC2[31:0]
DEST[127:96] = SRC2[63:32]

From a bit-agnostic view, they are functionally the same (and will do the same thing regardless of the data you operate on).

@tannergooding
Member

The functionality does differ when you operate on Vector256.

Avx.UnpackLow operates on double and does:
DEST[63:0] = SRC1[63:0]
DEST[127:64] = SRC2[63:0]
DEST[191:128] = SRC1[191:128]
DEST[255:192] = SRC2[191:128]

and MoveLowToHigh doesn't have a Vector256 overload.
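The per-lane Vector256 behavior quoted above can be modeled the same way: VUNPCKLPD operates independently within each 128-bit lane, so for four double elements the result interleaves the even-indexed elements of the two sources (a sketch, not real AVX code):

```python
def avx_unpack_low_pd(a, b):
    """Model of 256-bit VUNPCKLPD: within each 128-bit lane (a pair of
    64-bit elements), take the low element of each source."""
    return [a[0], b[0], a[2], b[2]]

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
assert avx_unpack_low_pd(a, b) == [1.0, 5.0, 3.0, 7.0]
```

This lane-local behavior is why the 128-bit equivalence breaks down at 256 bits: the operation no longer treats the register as one contiguous vector.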

@sdmaclea
Contributor Author

@tannergooding 👍 X86 assembly is too ... for me to try to figure it out...

@sdmaclea
Contributor Author

@CarolEidt @eerhardt The argument order for the Insert<T>() case is the only open issue here.

I think the proposed order above is best, but it is inconsistent with X86. I would prefer X86 order got fixed.

However, if consensus favors the current X86 argument order, I am ready to change.

@eerhardt
Member

I think we can mark this as ready for review. We can bring up the ordering during the review - whether they should be consistent and if so, which one should be used.

@sdmaclea
Contributor Author

sdmaclea commented Feb 5, 2018

@eerhardt Any reason why this is marked future? For Extract, the CoreCLR work is complete and ready for 2.1. The remaining Insert part is in review and should be ready for 2.1.

@terrajobst
Member

Looks good as proposed. Feedback:

  • No need to match the X86 ordering of parameters; just match the ARM instruction ordering.

@tannergooding tannergooding self-assigned this Oct 16, 2019
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@echesakov
Contributor

@tannergooding Are you currently working on this? If not - can I take this issue?

@tannergooding
Member

No, I am not working on it right now; I am trying to finish going through the architecture manual to find which APIs still need to be proposed.

@echesakov
Contributor

echesakov commented Mar 23, 2020

@TamarChristinaArm Can you please help me understand whether

public static Vector64<float> Insert(Vector64<float> vector, byte index, float data);
public static Vector128<float> Insert(Vector128<float> vector, byte index, float data);

public static Vector64<double> Insert(Vector64<double> vector, byte index, double data);
public static Vector128<double> Insert(Vector128<double> vector, byte index, double data);

should be lowered to

INS Vd.S[lane], Vn.S[0]
INS Vd.S[lane], Vn.S[0]

INS Vd.D[lane], Vn.D[0]
INS Vd.D[lane], Vn.D[0]

or something else?

https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vset_lane says that

float32x2_t vset_lane_f32 (float32_t a, float32x2_t v, const int lane);
float32x4_t vsetq_lane_f32 (float32_t a, float32x4_t v, const int lane);

float64x1_t vset_lane_f64 (float64_t a, float64x1_t v, const int lane);
float64x2_t vsetq_lane_f64 (float64_t a, float64x2_t v, const int lane);

is lowered to

INS Vd.S[lane],Rn
INS Vd.S[lane],Rn

INS Vd.D[lane],Rn
INS Vd.D[lane],Rn

and Visual Studio 2019 seems to be doing what is prescribed on the website, i.e. it generates INS Vd.S[lane],Wn preceded by FMOV Wn, Sm.

For example,

	float r0 = 5.1f;
	float r1 = 4.2f;
	
	float32x4_t v0 = { 0 };
	
	v0 = vsetq_lane_f32(r0, v0, 3);
	v0 = vsetq_lane_f32(r1, v0, 1);

	vst1q_f32(p, v0);

compiles to

  00000001400010E8: 910003E8  mov         x8,sp
  00000001400010EC: 1C0001F0  ldr         s16,0000000140001128
  00000001400010F0: A9007D1F  stp         xzr,xzr,[x8]
  00000001400010F4: 3DC003F2  ldr         q18,[sp]
  00000001400010F8: 1C0001B1  ldr         s17,000000014000112C
  00000001400010FC: 910043E9  add         x9,sp,#0x10
  0000000140001100: 1E260208  fmov        w8,s16
  0000000140001104: 52800000  mov         w0,#0
  0000000140001108: 4E1C1D12  mov         v18.s[3],w8
  000000014000110C: 1E260228  fmov        w8,s17
  0000000140001110: 4E0C1D12  mov         v18.s[1],w8
  0000000140001114: 4C007932  st1         {v18.4s},[x9]

Presumably, this could be compiled to?

mov         x8,sp
ldr         s16,0000000140001128
ldr         s17,000000014000112C
stp         xzr,xzr,[x8]
ldr         q18,[sp]
mov         v18.s[3],v16.s[0]
mov         v18.s[1],v17.s[0]
add         x9,sp,#0x10
st1         {v18.4s},[x9]

Am I missing something?

@TamarChristinaArm
Contributor

Presumably, this could be compiled to?

mov         x8,sp
ldr         s16,0000000140001128
ldr         s17,000000014000112C
stp         xzr,xzr,[x8]
ldr         q18,[sp]
mov         v18.s[3],v16.s[0]
mov         v18.s[1],v17.s[0]
add         x9,sp,#0x10
st1         {v18.4s},[x9]

Am I missing something?

No, there's no particular reason it can't; in fact, in GCC we do this. We only generate the instruction in the ACLE docs in the case where we have constructed the float value on the integer side to begin with.

e.g. modern GCC for

float r0 = 5.1f;

won't generate a literal pool load but will instead create the bit pattern on the general-register side:

        mov     w1, 13107
        movk    w1, 0x40a3, lsl 16
        fmov    s2, w1

This is because the mov/movk pair is handled in the same cycle, so constructing the value this way ends up cheaper than addressing the literal pool and doing a load; you just replace the fmov with the mov.
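The mov/movk constants above can be checked against the IEEE-754 encoding of 5.1f: its 32-bit pattern is 0x40A33333, whose low half is 0x3333 = 13107 (the mov) and whose high half is 0x40A3 (the movk ..., lsl 16). A quick check:

```python
import struct

# Bit pattern of the float32 literal 5.1f
bits = struct.unpack('<I', struct.pack('<f', 5.1))[0]
assert bits == 0x40A33333

# mov w1, 13107 ; movk w1, 0x40a3, lsl 16 reconstructs the same pattern
assert (0x40A3 << 16) | 13107 == bits
```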

@echesakov
Contributor

@TamarChristinaArm Thank you for clarifying this for me!

@echesakov echesakov modified the milestones: Future, 5.0 Apr 4, 2020
@echesakov echesakov removed the help wanted [up-for-grabs] Good issue for external contributors label Apr 4, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 19, 2020