Migrate coreclr's worker thread pool to be able to use the portable thread pool in opt-in fashion (#38225)

- Enables using the portable thread pool with coreclr as an opt-in. The change is off by default for now and can be enabled with COMPlus_ThreadPool_UsePortableThreadPool=1. Once it has had bake time and been shown to be stable, at a reasonable time in the future the config flag would ideally be removed and the relevant parts of the native implementation deleted.
- The IO thread pool is not being migrated in this change and remains on the native side.
- The goal was to get compatible behavior, compatibility with diagnostics tools, and perf similar to the native implementation in coreclr. Tried to avoid changing scheduling behavior, behavior of heuristics, etc., compared with that implementation.
- The eventual goal is to have one mostly managed thread pool implementation that can be shared between runtimes, to ease maintenance going forward.

Commit descriptions:

- "Add dependencies"
  - Ported LowLevelLock from CoreRT and moved LowLevelSpinWaiter to shared. Since we support Thread.Interrupt(), they were necessary in the wait subsystem in CoreRT partly to support that, and were also used in the portable thread pool implementation, where a pending thread interrupt on a thread pool thread would otherwise crash the process. Interruptible waits are already used on the managed side of the thread pool in the queue implementations. It may be reasonable to ignore the thread interrupt problem and suggest that it not be used on thread pool threads, but for now I just brought in the dependencies to keep behavior consistent with the native implementation.
- "Add config var"
  - Added config var COMPlus_ThreadPool_UsePortableThreadPool (disabled by default for now)
  - Flowed the new config var to the managed side and set up a mechanism to flow all of the thread pool config vars
  - Removed the debug-only config var COMPlus_ThreadpoolTickCountAdjustment, which didn't seem to be too useful
  - Specialized native and managed thread pool paths based on the config var
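As an illustration of the opt-in switch, a minimal sketch of reading such an on/off config variable, assuming the usual 0/1 usage (the helper name and parsing here are hypothetical, not the runtime's actual config reader):

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: treat the switch as enabled only when the variable is
// set to "1"; unset or any other value leaves the portable pool disabled.
bool IsPortableThreadPoolEnabled(const char* value)
{
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// A caller would read the environment once at startup, e.g.:
// bool usePortablePool =
//     IsPortableThreadPoolEnabled(std::getenv("COMPlus_ThreadPool_UsePortableThreadPool"));
```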
  - Added assertions to paths that should not be reached depending on the config var
- "Move portable RegisteredWaitHandle implementation to shared ThreadPool.cs"
  - Just moved the portable implementation, no functional changes. In preparation for merging the two implementations.
- "Merge RegisteredWaitHandle implementations"
  - Merged the implementations of RegisteredWaitHandle, using the portable version as the primary and specializing small parts of it for coreclr
  - Fixed PortableThreadPool's registered waits to track SafeWaitHandles instead of WaitHandles, similarly to the native implementation. The SafeWaitHandle in a WaitHandle can be modified, so it is retrieved once and reused thereafter. Also added/removed refs for the SafeWaitHandles that are registered.
- "Separate portable-only portion of RegisteredWaitHandle"
  - Separated RegisteredWaitHandle.UnregisterPortable into a different file, no functional changes. Those paths reference PortableThreadPool, which is conditionally included, unlike ThreadPool.cs. Just for consistency, such that the new file can be conditionally included similarly to PortableThreadPool.
- "Fix timers, tiered compilation, introduced time-sensitive work item queue to simulate coreclr behavior"
  - Wired work items queued from the native side (appdomain timer callback, tiered compilation background work callback) to queue them into the managed side
  - The timer thread calls into managed code to queue the callback
  - Some tiered compilation work item queuing paths cannot call managed code, so used a timer with zero due time instead
  - Added a queue of "time-sensitive" work items to the managed side to mimic how work items queued from the native side ran previously. In particular, if the global queue is backed up, when using the native thread pool the native work items still run ahead of them periodically (based on the Dispatch quantum).
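The time-sensitive queue behavior can be sketched roughly as follows (types and names are hypothetical, not the actual PortableThreadPool code): once per dispatch quantum the dispatcher polls the time-sensitive queue before draining more of the global queue, so those items still run when the global queue is backed up.

```cpp
#include <deque>
#include <string>

// Hypothetical work item; the real queues hold managed work item objects.
struct WorkItem { std::string name; };

class DispatchSketch
{
public:
    std::deque<WorkItem> globalQueue;
    std::deque<WorkItem> timeSensitiveQueue;

    // Returns the order in which items run within one quantum that has a
    // budget of `budget` global items: time-sensitive work is polled first.
    std::deque<std::string> RunOneQuantum(std::size_t budget)
    {
        std::deque<std::string> ran;
        if (!timeSensitiveQueue.empty()) // checked once per quantum
        {
            ran.push_back(timeSensitiveQueue.front().name);
            timeSensitiveQueue.pop_front();
        }
        while (budget > 0 && !globalQueue.empty())
        {
            --budget;
            ran.push_back(globalQueue.front().name);
            globalQueue.pop_front();
        }
        return ran;
    }
};
```

Even with many items backed up in the global queue, a time-sensitive item enqueued from the native side runs at the start of the next quantum rather than behind the whole backlog.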
  - Potentially they could be queued into the global queue, but if it's backed up, that could significantly and perhaps artificially delay the appdomain timer callback and the tiering background jitting. I didn't want to change the behavior in an observable (and potentially bad) way here for now; a good time to revisit this would be when IO completion handling is added to the portable thread pool, when the native work items could be handled somewhat similarly.
- "Implement ResetThreadPoolThread, set thread names for diagnostics"
  - Aside from that, setting the thread names (at the OS level) allows debuggers to identify the threads as before. For threads that may run user code, the thread Name property is kept null as before, such that it may be set without an exception.
- "Cache-line-separate PortableThreadPool._numRequestedWorkers similarly to coreclr"
  - Was missed before, separated it for consistency
- "Post wait completions to the IO completion port on Windows for coreclr, similarly to before"
  - On Windows, wait completions are queued to the IO thread pool, which is still implemented on the native side. On Unixes, they are queued to the global queue.
- "Reroute managed gate thread into unmanaged side to perform gate activities, don't use unmanaged gate thread"
  - When the config var is enabled, removed the gate thread from the native side. Instead, the gate thread on the managed side calls into the native side to perform gate activities for the IO completion thread pool, and the native side returns a value indicating whether the gate thread is still necessary.
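The rerouted gate thread's control flow might look roughly like this (a sketch with hypothetical callback names, not the actual runtime interface): the managed gate loop keeps running while either the native side's gate activities or the managed worker pool still need it.

```cpp
#include <functional>

// Sketch: once per gate period, the managed gate loop performs native gate
// activities; the native call's return value says whether the gate thread is
// still needed for the IO completion thread pool (names are hypothetical).
int RunGateLoop(const std::function<bool()>& performNativeGateActivities,
                const std::function<bool()>& workerPoolNeedsGateThread)
{
    int periods = 0;
    // Real code sleeps for the gate period between iterations.
    while (performNativeGateActivities() || workerPoolNeedsGateThread())
    {
        ++periods;
        if (periods > 1000) break; // safety bound for this sketch only
    }
    return periods;
}
```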
  - Also added a native-to-managed entry point to request the gate thread to run for the IO completion thread pool
- "Flow config values from CoreCLR to the portable thread pool for compat"
  - Flowed the rest of the thread pool config vars to the managed side, such that COMPlus variables continue to work with the portable thread pool
  - Config var values are stored in AppContext; made the names consistent for supported and unsupported values
- "Port - ..." * 3
  - Ported a few fixes that did not make it into the portable thread pool implementation
- "Fix ETW events"
  - Fixed the EventSource used by the portable thread pool, added missing events
  - For now, the event source uses the same name and GUID as the native side. It seems to work for now for ETW; we may switch to a separate provider (along with updating tools) before enabling the portable thread pool by default.
  - For enqueue/dequeue events, changed to use the object's hash code as the work item identifier instead of the pointer, since the pointer may change between enqueue and dequeue
- "Fix perf of counts structs"
  - Structs used for multiple counts with interlocked operations were implemented with explicit struct layout and field offsets. The JIT seems to generate stack-based code for such structs, and it was showing up as higher overhead in perf profiles compared to the equivalent native implementation. Slower code in compare-exchange loops can cause a larger gap of time between the read and the compare-exchange, which can also cause higher contention.
  - Changed the structs to use manual bit manipulation instead, and micro-optimized some paths. The code is still not as good as that generated by C++, but it seems to perform similarly based on perf profiles.
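The bit-manipulation approach can be illustrated with a small sketch (the field layout and names are hypothetical, not the actual counts struct): several narrow counts are packed into one 64-bit value and updated atomically in a compare-exchange loop.

```cpp
#include <atomic>
#include <cstdint>

// Sketch: two 16-bit counts packed into one atomic 64-bit value, updated with
// a compare-exchange loop instead of an explicit-layout struct with offsets.
class ThreadCountsSketch
{
    std::atomic<uint64_t> _data{0};
    static constexpr uint64_t ProcessingMask = 0xFFFF;
    static constexpr int ExistingShift = 16;

public:
    uint32_t NumProcessingWork() const
    {
        return (uint32_t)(_data.load() & ProcessingMask);
    }

    uint32_t NumExistingThreads() const
    {
        return (uint32_t)((_data.load() >> ExistingShift) & 0xFFFF);
    }

    // Atomically increment both counts, as when adding a working worker.
    void AddWorkingWorker()
    {
        uint64_t oldData = _data.load(std::memory_order_relaxed);
        while (true)
        {
            uint64_t newData = oldData + 1 + (1ull << ExistingShift);
            // On failure, compare_exchange_weak refreshes oldData; retry.
            if (_data.compare_exchange_weak(oldData, newData))
                return;
        }
    }
};
```

Keeping the value in a single register-friendly integer keeps the window between the initial read and the compare-exchange short, which the description above notes also reduces contention.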
  - Code size also improved in many cases. For example, one of the larger differences was in MaybeAddWorkingWorker(), which decreased from 585 bytes to 382 bytes, with far fewer stack memory operations.
- "Fix perf of dispatch loop"
  - Just some minor tweaks as I was looking at perf profiles and code of Dispatch()
- "Fix perf of ThreadInt64PersistentCounter"
  - The implementation used to count completed work items was using `ThreadLocal<T>`, which turned out to be too slow for that purpose according to perf profiles
  - Changed it to rely on the user of the component to provide an object that tracks the count, which the user of the component would obtain from a ThreadStatic field
  - Also removed the thread-local lookup per iteration in one of the hot paths in Dispatch() and improved inlining
- "Miscellaneous perf fixes"
  - A few more small tweaks as I was looking at perf profiles and code
  - In ConcurrentQueue, added a check for empty into the fast path
  - For the portable thread pool, updated to trigger the worker thread Wait event after the short spin-wait completes and before actually waiting; the event is otherwise too verbose when profiling and changes performance characteristics
  - Cache-line-separated the gate thread running state, as is done in the native implementation
  - Accessing PortableThreadPool.ThreadPoolInstance multiple times was generating less than ideal code that was noticeable in perf profiles. Tried to avoid it, especially in hot paths, and in some cases for consistency if nothing else.
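The ThreadInt64PersistentCounter change can be sketched as follows (a simplified C++ analogue; the real type is managed and its API differs): each thread registers a count object once and caches it, so the per-work-item hot path is a plain increment with no thread-local lookup or interlocked operation.

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Sketch: per-thread count objects registered with an aggregating counter.
class PersistentCounterSketch
{
public:
    struct ThreadLocalCount { int64_t count = 0; };

    // Called once per thread; the caller caches the returned pointer
    // (e.g. in a thread_local / ThreadStatic field).
    ThreadLocalCount* CreateThreadLocalCountObject()
    {
        std::lock_guard<std::mutex> guard(_lock);
        _counts.push_back(std::make_unique<ThreadLocalCount>());
        return _counts.back().get();
    }

    // Aggregate total across all threads (approximate while threads run).
    int64_t Count()
    {
        std::lock_guard<std::mutex> guard(_lock);
        int64_t total = 0;
        for (const auto& c : _counts)
            total += c->count;
        return total;
    }

private:
    std::mutex _lock;
    std::vector<std::unique_ptr<ThreadLocalCount>> _counts;
};
```

The hot path is then just `++countObject->count` on the cached object; the lock is taken only on registration and on the (rare) aggregated read.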
  - Removed an extra call to Environment.TickCount per iteration in Dispatch()
  - Noticed that a field that was intended to be cache-line-separated was not actually being separated; see #38215, fixed
- "Fix starvation heuristic"
  - Described in comment
- "Implement worker tracking"
  - Implemented the equivalent in the portable thread pool, along with raising the relevant event
- "Use smaller stack size for threads that don't run user code"
  - Using the same stack size as on the native side for those threads
- "Note some SOS dependencies, small fixes in hill climbing to make equivalent to coreclr"
  - Corresponds with the PR that updates SOS: dotnet/diagnostics#1274
  - Also fixed a couple of things to work similarly to the native implementation
- "Port some tests from CoreRT"
  - Also improved some of the tests
- "Fail-fast in thread pool native entry points specific to thread pool implementations based on config"
  - Scanned all of the managed-to-native entry points from the thread pool and thread-entry functions, and promoted some assertions to be verified in all builds with fail-fast. May help to know in release builds when a path that should not be taken is taken, and to avoid running further along that path.
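Cache-line separation, which comes up in several of the items above, can be sketched like this (64 bytes is a common cache-line size; the field name is illustrative):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Sketch: give a hot, frequently written field its own cache line so writes
// to it do not cause false sharing with reads of neighboring fields.
constexpr std::size_t CacheLineSize = 64; // common value; real code may differ

struct GateThreadState
{
    alignas(CacheLineSize) std::atomic<int32_t> gateThreadRunning{0};
    // alignas makes sizeof a multiple of CacheLineSize, so the next object in
    // an array or enclosing struct starts on a fresh cache line.
};

static_assert(alignof(GateThreadState) == CacheLineSize,
              "field is aligned to its own cache line");
static_assert(sizeof(GateThreadState) % CacheLineSize == 0,
              "struct occupies whole cache lines");
```

The #38215 fix mentioned above was exactly this kind of problem: a field intended to be separated this way was not actually landing on its own cache line.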
- "Fix SetMinThreads() and SetMaxThreads() to return true only when both changes are successful, with synchronization"
  - These are a bit awkward when the portable thread pool is enabled: they should return true only when both changes are valid, and return false without making any changes otherwise, yet the worker thread pool is on the managed side while the IO thread pool is on the native side
  - Added some managed-to-native entry points to allow checking validity before making the changes, all under a lock taken by the managed side
- "Fix registered wait removals for fairness since there can be duplicate system wait objects in the wait array"
  - Described in comment
- "Allow multiple DotNETRuntime event providers/sources in EventPipe"
  - Temporary change to EventPipe to be able to get events from dotnet-trace
  - For now, the event source uses the same name and GUID as the native side. It seems to work for now for ETW, and with this change it seems to work with EventPipe for getting events. Subscribing to the NativeRuntimeEventSource does not get thread pool events yet; that is left for later. We may switch to a separate provider (along with updating tools) before enabling the portable thread pool by default, as a long-term solution.
- "Fix registered wait handle timeout logic in the wait thread"
  - The timeout logic was comparing against how long the last wait took and sometimes was not timing out waits; fixed to consider the total time since the last reset of the timeout instead
- "Fix Browser build"
  - Updated the Browser-specific thread pool variant based on the other changes

Corresponding PR to update SOS: dotnet/diagnostics#1274
Fixes #32020
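As a footnote to the wait-thread timeout fix above, the corrected computation can be sketched as follows (names are hypothetical): the remaining timeout is derived from the total time elapsed since the timeout was last reset, not from the duration of the most recent wait alone.

```cpp
#include <cstdint>

// Sketch: remaining timeout for a registered wait, measured from the last
// reset of the timeout rather than from the end of the previous wait.
int32_t RemainingTimeoutMs(int32_t timeoutMs, int64_t nowMs, int64_t timeoutResetMs)
{
    int64_t elapsed = nowMs - timeoutResetMs; // total time since last reset
    if (elapsed >= timeoutMs)
        return 0; // timed out: invoke the callback with timedOut = true
    return (int32_t)(timeoutMs - elapsed);
}
```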