Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm64 apple vm fixes for arg alignment. #46665

Merged
merged 25 commits into from
Jan 21, 2021
Merged

Conversation

sandreenko
Copy link
Contributor

@sandreenko sandreenko commented Jan 7, 2021

Contributes to #46456.

Move to byte sizes/offsets in VM so it works correctly with arguments that do not occupy TARGET_POINTER_SIZE stack slots on arm64.

It also unifies some platforms, for example, before ArgIteratorBase was using byte offsets for some and stack slot index for the others, now it uses byte offset for all.

Use TARGET_POINTER_SIZE instead of STACK_ELEM_SIZE because:

  1. there were many places where TARGET_POINTER_SIZE was already used in such context;
  2. it is confusing for arm64 apple to have STACK_ELEM_SIZE and I did not come up with a better name;

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 7, 2021
@sandreenko sandreenko marked this pull request as draft January 7, 2021 04:30
@sandreenko sandreenko added area-VM-coreclr and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Jan 7, 2021
@sandreenko sandreenko force-pushed the arm64AppleVM branch 3 times, most recently from ade1e53 to c50a5d6 Compare January 11, 2021 22:02
@sandreenko sandreenko changed the title Arm64 apple vm Arm64 apple vm fixes for arg alignment. Jan 11, 2021
@sandreenko sandreenko marked this pull request as ready for review January 11, 2021 22:06
@sandreenko
Copy link
Contributor Author

PTAL @janvorli , @jkotas , @sdmaclea , with these changes arm64 passes my small repro from sandreenko@678ce4b and does not show any new failures in my local run, generic test reflection is passing further but still failing, probably with an independent issue.

Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one more place to change, the https://github.com/dotnet/runtime/blob/master/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/DependencyAnalysis/ReadyToRun/ArgIterator.cs. This one is basically a port of the callingconvention.h to managed code.

src/coreclr/vm/arm64/cgencpu.h Show resolved Hide resolved
if (TransitionBlock::IsFloatArgumentRegisterOffset(argOffset))
{
pLoc->m_idxFloatReg = (argOffset - TransitionBlock::GetOffsetOfFloatArgumentRegisters()) / 4;
pLoc->m_cFloatReg = cSlots;
pLoc->m_idxFloatReg = (argOffset - TransitionBlock::GetOffsetOfFloatArgumentRegisters()) / TARGET_POINTER_SIZE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit misleading, the size is not a target pointer size, but a floating point size. I know they are the same on arm, but they are not the same on arm64 and x64 and the equivalent functions for those two use / 16. I would prefer defining FLOAT_REGISTER_SIZE in the cgencpu.h files and using it here and in the same arm64 and x64 methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would avoid the term FLOAT_REGISTER_SIZE on arm/arm64. I would prefer FLOAT_SIZE or SIZEOF_FLOAT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added FLOAT_REGISTER_SIZE,

I would prefer FLOAT_SIZE or SIZEOF_FLOAT.

would not it be confusing to have #define FLOAT_SIZE 16 on arm64?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I didn't expect the value to be 16.

  • For Arm32, a float register i.e. s0 would be 4 bytes, but it would be one of many in the q0/v0 register.
  • For arm64 s0 is the lower 4 bytes of the v0 SIMD register which is 16 bytes.

FLOAT_SIZE is definitely, worse, but I still think FLOAT_REGISTER_SIZE might be ambiguous for ARM architectures. VECTOR_REGISTER_SIZE might be better for ARM/ARM64.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing is that we use m_idxFloatReg, m_cFloatReg fields in the surrounding code, thus the FLOAT_REGISTER_SIZE matches those.

@@ -620,35 +632,35 @@ class ArgIteratorTemplate : public ARGITERATOR_BASE

pLoc->m_fRequires64BitAlignment = m_fRequires64BitAlignment;

const unsigned byteArgSize = StackElemSize(GetArgSize());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable name is misleading, as it looks as if it was the size of argument in bytes, but is it the size aligned to stack element size. The more I look at the code below, the more I feel like we should just keep using the cSlots except for the places where we place arguments on stack (we set the m_byteStackSize). This comment is meant for all architectures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does cSlots meant slots count? Does slot size depend on context, like it could be a general reg size, float reg size, stack slot size?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cSlots meant slot count where slots were considered both register slots in the transition block and stack slots. Only on ARM32 the term has leaked into the floating point register count setting, which may be misleading. Now the stack slot concept is kind of broken by the Apple ARM64 calling convention, but the register slot still holds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed this code so we don't create cSlots and don't call StackElemSize unless we put byteArgSize on the stack.

src/coreclr/vm/callingconvention.h Outdated Show resolved Hide resolved

unsigned cSlots = (byteArgSize + TARGET_POINTER_SIZE - 1) / TARGET_POINTER_SIZE;

// Question: why do not arm and x86 have similar checks?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this check is needed here. The ArgIteratorTemplate<ARGITERATOR_BASE>::GetNextOffset() is responsible for getting the right offset depending on the available registers and argument type and size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are cases where we get into this check, for example, JIT\Stress\ABI\pinvokes_d:

CoreCLR!ArgIteratorTemplate<ArgIteratorBase>::GetArgLoc+0x10c
CoreCLR!GenerateShuffleArrayPortable+0x27c
CoreCLR!GenerateShuffleArray+0x14

comes here with byteArgSize = 17 and we change its size to 8 here, if we don't do this we will get a wrong result.

src/coreclr/vm/callingconvention.h Outdated Show resolved Hide resolved
src/coreclr/vm/callingconvention.h Outdated Show resolved Hide resolved
src/coreclr/vm/callingconvention.h Outdated Show resolved Hide resolved
src/coreclr/vm/callingconvention.h Outdated Show resolved Hide resolved
src/coreclr/vm/arm/cgencpu.h Outdated Show resolved Hide resolved
@sandreenko
Copy link
Contributor Author

There is one more place to change, the https://github.com/dotnet/runtime/blob/master/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/DependencyAnalysis/ReadyToRun/ArgIterator.cs. This one is basically a port of the callingconvention.h to managed code.

would do it in a separate PR because it looks like the same amount of work there.

@janvorli
Copy link
Member

would do it in a separate PR because it looks like the same amount of work there.

Makes sense

Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo the comments.


// Delegates cannot handle overly large argument stacks due to shuffle entry encoding limitations.
if (index >= ShuffleEntry::REGMASK)
{
COMPlusThrow(kNotSupportedException);
}

return (UINT16)index;
return -(int)byteIndex;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should make this change and the one in the caller when we can only handle the case when the offsets are multiples of TARGET_POINTER_SIZE anyways. I believe it would require quite complex change to the shuffle thunk generation to make it work for non-aligned stuff, so it seems that doing it half way doesn't bring any benefit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is now exactly half-way and it is required to pass existing tests.
We have cases on arm64 apple when we return byteIndex that is not byteIndex % 8 == 0 but it is the same for source and destination, so we don't need to generate a move and don't hit the NYI assert.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, makes sense then.

src/coreclr/vm/amd64/cgencpu.h Outdated Show resolved Hide resolved
@sdmaclea
Copy link
Contributor

@sandreenko Looks like this is breaking a pinvoke test on linux-arm64 & windows-arm64

@sandreenko sandreenko force-pushed the arm64AppleVM branch 2 times, most recently from 16c8b29 to 69195a9 Compare January 20, 2021 09:38
pLoc->m_cFloatReg = cSlots;
const int floatRegOfsInBytes = argOffset - TransitionBlock::GetOffsetOfFloatArgumentRegisters();
_ASSERTE((floatRegOfsInBytes % FLOAT_REGISTER_SIZE) == 0);
pLoc->m_idxFloatReg = (argOffset - TransitionBlock::GetOffsetOfFloatArgumentRegisters()) / FLOAT_REGISTER_SIZE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nit - can you please use the floatRegOfsInBytes constant here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, thanks for catching

Sergey added 6 commits January 21, 2021 00:07
…atforms.

Before some platforms were using stackSlots, some curOfs in bytes.
Fix arm32.

x86 fixes.

use StackSize on ArgSizes

Add `GetStackArgumentByteIndexFromOffset` and return back the old values for asserts.

another fix
@sandreenko
Copy link
Contributor Author

I have checked that the updated version still works on arm64 apple and fixes JIT/HardwareIntrinsics/General/Vector* tests.

@sandreenko sandreenko merged commit 648437b into dotnet:master Jan 21, 2021
@sandreenko sandreenko deleted the arm64AppleVM branch January 21, 2021 21:37
@ghost ghost locked as resolved and limited conversation to collaborators Feb 20, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants