API Proposal: Arm64 Simd Insert and Extract elements #24588
Comments
Same methods are called
Updated for generics and to match XARCH naming conventions.
(nit) Are
Should
@eerhardt Done
For reference,
Good call - I don't have a preference either way; consistency is my preference. I just wanted something better than
Agreed.
@tannergooding Like this?

```csharp
Extract(value, valueIndex);
Insert(value, valueIndex, data);
Insert(value, valueIndex, data, dataIndex);
```
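For illustration, the per-lane semantics of the three shapes above can be modeled in scalar C. This is a sketch with hypothetical names (`f32x4`, `extract_lane`, `insert_lane`, `insert_elem`), not the actual .NET surface; it only shows what each overload computes.

```c
#include <assert.h>

/* Scalar model of the proposed shapes (illustrative only):
   Extract(value, valueIndex)
   Insert(value, valueIndex, data)
   Insert(value, valueIndex, data, dataIndex)  -- composite form */
typedef struct { float e[4]; } f32x4;

static float extract_lane(f32x4 value, unsigned char valueIndex) {
    return value.e[valueIndex];           /* read one lane */
}

static f32x4 insert_lane(f32x4 value, unsigned char valueIndex, float data) {
    value.e[valueIndex] = data;           /* replace one lane, keep the rest */
    return value;
}

/* The composite overload is equivalent to composing the two above. */
static f32x4 insert_elem(f32x4 value, unsigned char valueIndex,
                         f32x4 data, unsigned char dataIndex) {
    return insert_lane(value, valueIndex, extract_lane(data, dataIndex));
}
```

The composite form is the one later discussed as a candidate for removal, since it is exactly `Insert` of an `Extract`.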
I don't think x86 has equivalents for
Like @eerhardt, I don't have a preference if we use CC @fiigii
IMHO
What scope do you intend for that statement? Methods with similar usage? All methods? For simple binary operands, left and right are simple and can be consistent.
Arguments starting with the same letter will slow comprehension slightly. People look at the first and last letters and the word length to recognize differences quickly.
My intended scope was that for things that are the same, or very similar, if there is no value in doing something different, then they should be consistent. Another inconsistency is that x86 has the order
This has quite an impact on the learning curve; good and intuitive naming makes it less steep.
This is a good point - I have myself played word-recognition games using these kinds of cognitive rules.
Agreed. Let's have good, intuitive, and consistent naming. 😄
The Arm64 order is consistent with left-to-right reading order.
It is also consistent with Arm64 assembly syntax.
It is inconsistent with codegen order (which always confuses me).
My personal preference would be to
API is not my dominion, so I will accept instruction.
I wonder if this API is needed. It could also be written `Insert(vector, valueIndex, Extract(data, dataIndex))`. During lowering the pattern could be recognized and the
During import the tree was going to be created like this anyway, marking the
I guess it is in part a philosophical question,
I guess I am preferring to drop this API and handle it in containment analysis.
I certainly agree. The issue is that the X86 API surface is already large enough that I miss overlap. The unfamiliar X86 terminology, and documentation in terms of assembly instructions, certainly does not make it easier. In this case, when the overlap was identified, I made the mistake of just renaming the function instead of copying the API from X86. In fact, until this discussion I didn't realize I had made that mistake.
I believe it should be treated as inline assembly (effectively). That is, outside of the helper functions which don't strictly map to a single instruction (e.g.
If there are cases where a user can get better codegen/perf by changing their code, an analyzer may be better suited to do that analysis and suggest the "fix". This not only keeps the intrinsics simple, but it means the user gets exactly what they ask for, which is predictable.
In general I agree
I believe this
If the base type is integer, the vector element accesses represent a vector <--> general purpose register file transfer, which can be slower. So fusing will eliminate the slow operations.

However, both indices must be immediate constants. In the case where either index is not a constant, we need a switch to implement it. In this case I would not want to fuse the instructions. On ARM/ARM64 this case is probably best handled with structured loads and stores. The need for this flexibility seems like a rare corner case.

Finally, allowing

My preference is still to remove the composite API, and contain.

I updated the OP to remove the composite API & use vector/data. I left the Insert arguments in left-to-right order, pending further discussion.
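The switch fallback mentioned for non-constant indices can be sketched in scalar C. This is a sketch only, with hypothetical names; the point is that an instruction like `INS` encodes the lane as an immediate, so a runtime index forces a branch per lane (or a spill to memory).

```c
#include <assert.h>

/* Hypothetical model: inserting into a lane chosen at runtime.
   Each case stands in for one INS Vd.S[k], ... encoding, since the
   lane number must be an immediate in the instruction itself. */
typedef struct { float e[4]; } f32x4;

static f32x4 insert_dyn(f32x4 v, int lane, float data) {
    switch (lane) {
    case 0: v.e[0] = data; break;   /* INS Vd.S[0], ... */
    case 1: v.e[1] = data; break;   /* INS Vd.S[1], ... */
    case 2: v.e[2] = data; break;   /* INS Vd.S[2], ... */
    default: v.e[3] = data; break;  /* INS Vd.S[3], ... */
    }
    return v;
}
```

As the comment above notes, fusing would not be desirable here; a structured load/store sequence is likely the better lowering for this corner case.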
Fair enough 👍
If they are functionally equivalent, why do they use different C# method names?
From a bit-agnostic view, they are functionally the same (and will do the same thing regardless of the data you operate on)
The functionality does differ when you operate on Vector256. Avx.UnpackLow operates on double and does: and
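The Vector256 subtlety being referred to can be modeled in scalar C: `Avx.UnpackLow` on `Vector256<double>` (the `vunpcklpd ymm` form) works per 128-bit lane, not across the whole 256-bit vector. A sketch, with hypothetical names:

```c
#include <assert.h>

/* Scalar model of Avx.UnpackLow for Vector256<double>:
   the instruction interleaves the low element of EACH 128-bit lane,
   so the result is { a[0], b[0], a[2], b[2] },
   not the full-width { a[0], b[0], a[1], b[1] }. */
typedef struct { double e[4]; } f64x4;

static f64x4 unpack_low_pd(f64x4 a, f64x4 b) {
    f64x4 r;
    r.e[0] = a.e[0]; r.e[1] = b.e[0];   /* low 128-bit lane  */
    r.e[2] = a.e[2]; r.e[3] = b.e[2];   /* high 128-bit lane */
    return r;
}
```

This per-lane behavior is why two operations that look equivalent at 128 bits can diverge at 256 bits.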
@tannergooding 👍 X86 assembly is too ... for me to try to figure it out...
@CarolEidt @eerhardt The argument order for the
I think the proposed order above is best, but it is inconsistent with X86. I would prefer that the X86 order got fixed. However, if consensus favors the current X86 argument order, I am ready to change.
I think we can mark this as ready for review. We can bring up the ordering during the review - whether they should be consistent and, if so, which one should be used.
@eerhardt Any reason why this is marked future? For
Looks good as proposed. Feedback:
@tannergooding Are you currently working on this? If not - can I take this issue?
No, I am not working on it right now; I am trying to finish going through the architecture manual to find which APIs still need to be proposed.
@TamarChristinaArm Can you please help me to understand whether

```csharp
public static Vector64<float> Insert(Vector64<float> vector, byte index, float data);
public static Vector128<float> Insert(Vector128<float> vector, byte index, float data);
public static Vector64<double> Insert(Vector64<double> vector, byte index, double data);
public static Vector128<double> Insert(Vector128<double> vector, byte index, double data);
```

should be lowered to

```asm
INS Vd.S[lane], Vn.S[0]
INS Vd.S[lane], Vn.S[0]
INS Vd.D[lane], Vn.D[0]
INS Vd.D[lane], Vn.D[0]
```

or something else? https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=vset_lane says that

```c
float32x2_t vset_lane_f32 (float32_t a, float32x2_t v, const int lane);
float32x4_t vsetq_lane_f32 (float32_t a, float32x4_t v, const int lane);
float64x1_t vset_lane_f64 (float64_t a, float64x1_t v, const int lane);
float64x2_t vsetq_lane_f64 (float64_t a, float64x2_t v, const int lane);
```

is lowered to

```asm
INS Vd.S[lane],Rn
INS Vd.S[lane],Rn
INS Vd.D[lane],Rn
INS Vd.D[lane],Rn
```

and Visual Studio 2019 seems to be doing what is prescribed on the website, i.e. it generates
For example,

```c
float r0 = 5.1f;
float r1 = 4.2f;
float32x4_t v0 = { 0 };
v0 = vsetq_lane_f32(r0, v0, 3);
v0 = vsetq_lane_f32(r1, v0, 1);
vst1q_f32(p, v0);
```

compiles to

```asm
00000001400010E8: 910003E8 mov x8,sp
00000001400010EC: 1C0001F0 ldr s16,0000000140001128
00000001400010F0: A9007D1F stp xzr,xzr,[x8]
00000001400010F4: 3DC003F2 ldr q18,[sp]
00000001400010F8: 1C0001B1 ldr s17,000000014000112C
00000001400010FC: 910043E9 add x9,sp,#0x10
0000000140001100: 1E260208 fmov w8,s16
0000000140001104: 52800000 mov w0,#0
0000000140001108: 4E1C1D12 mov v18.s[3],w8
000000014000110C: 1E260228 fmov w8,s17
0000000140001110: 4E0C1D12 mov v18.s[1],w8
0000000140001114: 4C007932 st1 {v18.4s},[x9]
```

Presumably, this could be compiled to

```asm
mov x8,sp
ldr s16,0000000140001128
ldr s17,000000014000112C
stp xzr,xzr,[x8]
ldr q18,[sp]
mov v18.s[3],v16.s[0]
mov v18.s[1],v17.s[0]
add x9,sp,#0x10
st1 {v18.4s},[x9]
```

Am I missing something?
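Whichever encoding the compiler picks for the lane move (an `FMOV` to a general register followed by `INS Vd.S[lane],Wn`, or a direct `INS Vd.S[lane],Vn.S[0]`), the value-level result of the two `vsetq_lane_f32` calls in the example is the same. A scalar C sketch of that semantics (model name is hypothetical):

```c
#include <assert.h>

/* Scalar model of vsetq_lane_f32: return v with lane `lane` set to a.
   Only the value semantics are modeled; the register-file path the
   compiler chooses (fmov+ins vs ins element form) does not affect this. */
typedef struct { float e[4]; } f32x4;

static f32x4 vsetq_lane_f32_model(float a, f32x4 v, int lane) {
    v.e[lane] = a;
    return v;
}
```

This is why the two assembly sequences above are interchangeable as far as the stored vector is concerned.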
No, there's no particular reason it can't. In fact, in GCC we do this. We only generate the instruction in the ACLE docs in the case that we have constructed the float value on the integer side to begin with, e.g. modern GCC for
won't generate a literal pool but will instead create the bit pattern on the genreg side.
This is because the first
@TamarChristinaArm Thank you for clarifying this for me! |
@eerhardt @CarolEidt @RussKeldorph
Target Framework netcoreapp2.1
Argument order and parameter names should be consistent between X86 and ARM64. They are currently not.