Skip to content

Commit

Permalink
Reimplement stubs to improve performance (dotnet#65738)
Browse files Browse the repository at this point in the history
* Reimplement stubs to improve performance

This change implements `FixupPrecodeStub`, `PrecodeStub`
and `CallCountingStub` using a new mechanism with fixed code and separate RW
data. The `LoaderHeap` was updated to support a new kind of allocation
using interleaved code and data pages to support this new mechanism.
The JIT now generates code that uses indirection slot to jump to the
methods using `FixupPrecode`, improving performance of the ASPNet
plaintext benchmark by 3-4% depending on the target platform (measured
on x64 Windows / Linux and arm64 Linux).

I have also removed the Holders, as the stubs are naturally properly
aligned due to the way they are allocated.

There is now only a single variant of each stub, there are no long /
short ones anymore as they are not needed - the indirect jumps we use
now are not range limited.

Most of the stubs stuff is now target agnostic and the originally split
implementation is now in single place for all targets. Only a few
constants are defined as target specific in these.

The code for the stubs is no longer generated as bytes by C++ code, but
rather written in asm and compiled. These precompiled templates are then
used as a source to copy the code from. The x86 is a bit more complex
than that due to the fact that it doesn't support PC relative indirect
addressing, so we need to relocate all access to the data slots when
generating the code pages.

As a further improvement, we could generate just a single page of the
code and then just map it many times. This is left for future work.

ARM64 Unix differs from the other targets / platforms - there are
various page sizes being used. So the asm templates are generated for
4k..64k page sizes and the variant is then picked at runtime based on
the page size extracted from the OS.

This also removes a lot of writeable mappings created for modifications
of the stub code when W^X is enabled, in the plaintext benchmark they
were reduced by 75%. That results in a significant reducing of the .NET
application startup time with W^X enabled.

I think the `LoaderHeap` would benefit from some refactoring, but I'd
prefer leaving it for a follow up. It seems that for the sake of the
review, it is better to keep it as is.

The change also implements logging of number of mappings and their exact
locations. This helped me to drive the work and I am planning to use it
for further changes. It can be removed in the future once we reach a
final state.

There are still opportunities for improvement, but these stubs allowed
me to scrape off the most significant portion of the mappings.
  • Loading branch information
janvorli authored and radekdoulik committed Mar 30, 2022
1 parent 9293446 commit e93b158
Show file tree
Hide file tree
Showing 69 changed files with 2,063 additions and 2,851 deletions.
1 change: 0 additions & 1 deletion src/coreclr/inc/daccess.h
Original file line number Diff line number Diff line change
Expand Up @@ -614,7 +614,6 @@ typedef struct _DacGlobals
#endif // TARGET_ARM

ULONG fn__ThePreStubPatchLabel;
ULONG fn__PrecodeFixupThunk;
#ifdef FEATURE_COMINTEROP
ULONG fn__Unknown_AddRef;
ULONG fn__Unknown_AddRefSpecial;
Expand Down
54 changes: 53 additions & 1 deletion src/coreclr/inc/executableallocator.h
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,8 @@

#ifndef DACCESS_COMPILE

//#define LOG_EXECUTABLE_ALLOCATOR_STATISTICS

// This class is responsible for allocation of all the executable memory in the runtime.
class ExecutableAllocator
{
Expand Down Expand Up @@ -49,7 +51,17 @@ class ExecutableAllocator
};

typedef void (*FatalErrorHandler)(UINT errorCode, LPCWSTR pszMessage);

#ifdef LOG_EXECUTABLE_ALLOCATOR_STATISTICS
static int64_t g_mapTimeSum;
static int64_t g_mapTimeWithLockSum;
static int64_t g_unmapTimeSum;
static int64_t g_unmapTimeWithLockSum;
static int64_t g_mapFindRXTimeSum;
static int64_t g_mapCreateTimeSum;

static int64_t g_releaseCount;
static int64_t g_reserveCount;
#endif
// Instance of the allocator
static ExecutableAllocator* g_instance;

Expand Down Expand Up @@ -142,8 +154,28 @@ class ExecutableAllocator
// Initialize the allocator instance
bool Initialize();

#ifdef LOG_EXECUTABLE_ALLOCATOR_STATISTICS
static CRITSEC_COOKIE s_LoggerCriticalSection;

struct LogEntry
{
const char* source;
const char* function;
int line;
int count;
};

static LogEntry s_usageLog[256];
static int s_logMaxIndex;
#endif

public:

#ifdef LOG_EXECUTABLE_ALLOCATOR_STATISTICS
static void LogUsage(const char* source, int line, const char* function);
static void DumpHolderUsage();
#endif

// Return the ExecuteAllocator singleton instance
static ExecutableAllocator* Instance();

Expand Down Expand Up @@ -201,6 +233,8 @@ class ExecutableAllocator
void UnmapRW(void* pRW);
};

#define ExecutableWriterHolder ExecutableWriterHolderNoLog

// Holder class to map read-execute memory as read-write so that it can be modified without using read-write-execute mapping.
// At the moment the implementation is dummy, returning the same addresses for both cases and expecting them to be read-write-execute.
// The class uses the move semantics to ensure proper unmapping in case of re-assigning of the holder value.
Expand Down Expand Up @@ -274,6 +308,24 @@ class ExecutableWriterHolder
{
return m_addressRW;
}

void AssignExecutableWriterHolder(T* addressRX, size_t size)
{
*this = ExecutableWriterHolder(addressRX, size);
}
};

#ifdef LOG_EXECUTABLE_ALLOCATOR_STATISTICS
#undef ExecutableWriterHolder
#ifdef TARGET_UNIX
#define ExecutableWriterHolder ExecutableAllocator::LogUsage(__FILE__, __LINE__, __PRETTY_FUNCTION__); ExecutableWriterHolderNoLog
#define AssignExecutableWriterHolder(addressRX, size) AssignExecutableWriterHolder(addressRX, size); ExecutableAllocator::LogUsage(__FILE__, __LINE__, __PRETTY_FUNCTION__);
#else
#define ExecutableWriterHolder ExecutableAllocator::LogUsage(__FILE__, __LINE__, __FUNCTION__); ExecutableWriterHolderNoLog
#define AssignExecutableWriterHolder(addressRX, size) AssignExecutableWriterHolder(addressRX, size); ExecutableAllocator::LogUsage(__FILE__, __LINE__, __FUNCTION__);
#endif
#else
#define ExecutableWriterHolder ExecutableWriterHolderNoLog
#endif

#endif // !DACCESS_COMPILE
16 changes: 13 additions & 3 deletions src/coreclr/inc/holder.h
Original file line number Diff line number Diff line change
Expand Up @@ -934,15 +934,25 @@ using NonVMComHolder = SpecializedWrapper<_TYPE, DoTheRelease<_TYPE>>;
// } // foo->DecRef() on out of scope
//
//-----------------------------------------------------------------------------

template<typename _TYPE>
class ExecutableWriterHolder;
class ExecutableWriterHolderNoLog;

template <typename TYPE>
class ExecutableAllocator;

template <typename TYPE, typename LOGGER=ExecutableAllocator>
FORCEINLINE void StubRelease(TYPE* value)
{
if (value)
{
ExecutableWriterHolder<TYPE> stubWriterHolder(value, sizeof(TYPE));
#ifdef LOG_EXECUTABLE_ALLOCATOR_STATISTICS
#ifdef TARGET_UNIX
LOGGER::LogUsage(__FILE__, __LINE__, __PRETTY_FUNCTION__);
#else
LOGGER::LogUsage(__FILE__, __LINE__, __FUNCTION__);
#endif
#endif // LOG_EXECUTABLE_ALLOCATOR_STATISTICS
ExecutableWriterHolderNoLog<TYPE> stubWriterHolder(value, sizeof(TYPE));
stubWriterHolder.GetRW()->DecRef();
}
}
Expand Down
42 changes: 33 additions & 9 deletions src/coreclr/inc/loaderheap.h
Original file line number Diff line number Diff line change
Expand Up @@ -191,6 +191,15 @@ class UnlockedLoaderHeap
friend class ClrDataAccess;
#endif

public:

enum class HeapKind
{
Data,
Executable,
Interleaved
};

private:
// Linked list of ClrVirtualAlloc'd pages
PTR_LoaderHeapBlock m_pFirstBlock;
Expand All @@ -208,12 +217,16 @@ class UnlockedLoaderHeap
// When we need to commit pages from our reserved list, number of bytes to commit at a time
DWORD m_dwCommitBlockSize;

// For interleaved heap (RX pages interleaved with RW ones), this specifies the allocation granularity,
// which is the individual code block size
DWORD m_dwGranularity;

// Range list to record memory ranges in
RangeList * m_pRangeList;

size_t m_dwTotalAlloc;

DWORD m_Options;
HeapKind m_kind;

LoaderHeapFreeBlock *m_pFirstFreeBlock;

Expand Down Expand Up @@ -263,6 +276,7 @@ class UnlockedLoaderHeap

public:
BOOL m_fExplicitControl; // Am I a LoaderHeap or an ExplicitControlLoaderHeap?
void (*m_codePageGenerator)(BYTE* pageBase, BYTE* pageBaseRX);

#ifdef DACCESS_COMPILE
public:
Expand All @@ -283,7 +297,9 @@ class UnlockedLoaderHeap
const BYTE* dwReservedRegionAddress,
SIZE_T dwReservedRegionSize,
RangeList *pRangeList = NULL,
BOOL fMakeExecutable = FALSE);
HeapKind kind = HeapKind::Data,
void (*codePageGenerator)(BYTE* pageBase, BYTE* pageBaseRX) = NULL,
DWORD dwGranularity = 1);

~UnlockedLoaderHeap();
#endif
Expand Down Expand Up @@ -400,6 +416,7 @@ class UnlockedLoaderHeap
}

BOOL IsExecutable();
BOOL IsInterleaved();

public:
#ifdef _DEBUG
Expand Down Expand Up @@ -443,14 +460,18 @@ class LoaderHeap : public UnlockedLoaderHeap, public ILoaderHeapBackout
LoaderHeap(DWORD dwReserveBlockSize,
DWORD dwCommitBlockSize,
RangeList *pRangeList = NULL,
BOOL fMakeExecutable = FALSE,
BOOL fUnlocked = FALSE
UnlockedLoaderHeap::HeapKind kind = UnlockedLoaderHeap::HeapKind::Data,
BOOL fUnlocked = FALSE,
void (*codePageGenerator)(BYTE* pageBase, BYTE* pageBaseRX) = NULL,
DWORD dwGranularity = 1
)
: UnlockedLoaderHeap(dwReserveBlockSize,
dwCommitBlockSize,
NULL, 0,
pRangeList,
fMakeExecutable),
kind,
codePageGenerator,
dwGranularity),
m_CriticalSection(fUnlocked ? NULL : CreateLoaderHeapLock())
{
WRAPPER_NO_CONTRACT;
Expand All @@ -463,15 +484,18 @@ class LoaderHeap : public UnlockedLoaderHeap, public ILoaderHeapBackout
const BYTE* dwReservedRegionAddress,
SIZE_T dwReservedRegionSize,
RangeList *pRangeList = NULL,
BOOL fMakeExecutable = FALSE,
BOOL fUnlocked = FALSE
UnlockedLoaderHeap::HeapKind kind = UnlockedLoaderHeap::HeapKind::Data,
BOOL fUnlocked = FALSE,
void (*codePageGenerator)(BYTE* pageBase, BYTE* pageBaseRX) = NULL,
DWORD dwGranularity = 1
)
: UnlockedLoaderHeap(dwReserveBlockSize,
dwCommitBlockSize,
dwReservedRegionAddress,
dwReservedRegionSize,
pRangeList,
fMakeExecutable),
kind,
codePageGenerator, dwGranularity),
m_CriticalSection(fUnlocked ? NULL : CreateLoaderHeapLock())
{
WRAPPER_NO_CONTRACT;
Expand Down Expand Up @@ -776,7 +800,7 @@ class ExplicitControlLoaderHeap : public UnlockedLoaderHeap
)
: UnlockedLoaderHeap(0, 0, NULL, 0,
pRangeList,
fMakeExecutable)
fMakeExecutable ? UnlockedLoaderHeap::HeapKind::Executable : UnlockedLoaderHeap::HeapKind::Data)
{
WRAPPER_NO_CONTRACT;
m_fExplicitControl = TRUE;
Expand Down
4 changes: 2 additions & 2 deletions src/coreclr/minipal/Windows/doublemapping.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -184,8 +184,8 @@ void *VMToOSInterface::CommitDoubleMappedMemory(void* pStart, size_t size, bool

bool VMToOSInterface::ReleaseDoubleMappedMemory(void *mapperHandle, void* pStart, size_t offset, size_t size)
{
// Zero the memory before the unmapping
VirtualAlloc(pStart, size, MEM_COMMIT, PAGE_READWRITE);
LPVOID result = VirtualAlloc(pStart, size, MEM_COMMIT, PAGE_READWRITE);
assert(result != NULL);
memset(pStart, 0, size);
return UnmapViewOfFile(pStart);
}
Expand Down
2 changes: 1 addition & 1 deletion src/coreclr/pal/inc/unixasmmacrosamd64.inc
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@

.macro PATCH_LABEL Name
.global C_FUNC(\Name)
C_FUNC(\Name):
C_FUNC(\Name) = .
.endm

.macro LEAF_ENTRY Name, Section
Expand Down
Loading

0 comments on commit e93b158

Please sign in to comment.