Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Hostfxr] [Windows] High CPU usage on background thread after invoking hostfxr_get_runtime_delegate on non-main thread. #39176

Closed
Sewer56 opened this issue Jul 13, 2020 · 7 comments · Fixed by #43711
Labels
Milestone

Comments

@Sewer56
Copy link
Contributor

Sewer56 commented Jul 13, 2020

Description

In .NET 5, invoking hostfxr_get_runtime_delegate from a non-main thread appears to leave a background task running, consuming a full CPU thread.

Specifically, the issue can be reproduced in the following fashion on Windows:

int main(int argc, char* argv[])
{
    // load_runtime sets up the runtime, calling `hostfxr_get_runtime_delegate` down the road
    auto handle = CreateThread(nullptr, 0, &load_runtime, 0, 0, nullptr);
    Sleep(INFINITE); // Stall the program to observe the effect.
}

This does not occur if the runtime is initialised on the main thread.

Reproduction

Minimal reproduction is available at Reloaded.Core.Bootstrap/runtimebug.
Simply execute Reloaded.Core.Bootstrap.Example and observe CPU utilisation once the program begins to sleep.

The offending function call that triggers the issue can be found at Reloaded.Core.Bootstrap/CoreCLR.cpp#L71.

(Requires Visual Studio 16.7 preview or newer, with Desktop development with C++ workload as well as the latest .NET 5 SDK).

Reproduction Debugging

To trigger the correct, intended behaviour (no loop/bug) simply load the runtime on main thread. i.e. replace

initializeThreadHandle = CreateThread(nullptr, 0, &load_bootstrapper_async, 0, 0, nullptr);
WaitForSingleObject(initializeThreadHandle, INFINITE);

with

load_bootstrapper_async(0);

in the example's entry point.
I have set it up this way to make control flow comparisons easier between intended and unintended behaviour should that be helpful.

Configuration

Runtime: 5.0.0-preview.6
OS: Win 10 2004 (Build 19041.329)
Architecture: x86, x64

Regression?

This is a regression from .NET Core 3.X.
(i.e. Code runs as expected all the way from 3.0 Preview 6 when it became first available to the latest 3.1 release).

I have not yet identified the first .NET 5 Preview/build with the issue intact.

Images

.NET 5:
HLFar4wrQt

netcoreapp3.1:
HxklikQaVN

netcoreapp3.0:
WdN7XsRATY

@Sewer56 Sewer56 added the tenet-performance Performance related issue label Jul 13, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-Host untriaged New issue has not been triaged by the area owner labels Jul 13, 2020
@ghost
Copy link

ghost commented Jul 13, 2020

Tagging subscribers to this area: @vitek-karas, @swaroop-sridhar, @agocke
Notify danmosemsft if you want to be subscribed.

@vitek-karas
Copy link
Member

Thanks a lot for the repro @Sewer56 . I debugged into it a little bit and it seems to be related to the diagnostic server thread. It loops in this callstack:

>	coreclr.dll!IpcStreamFactory::GetNextAvailableStream(void(*)(const char *, unsigned int) callback) Line 166	C++
 	coreclr.dll!DiagnosticServer::DiagnosticsServerThread(void * __formal) Line 50	C++
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

It calls the IpcStream::DiagnosticsIpc::Poll which "successfully" waits (seems to be no wait time actually) on the handle, then gets a stream which is always NULL - that causes the caller to loop and try again... and so on.

I don't know what's different running it from managed exe versus hosted via the hosting APIs - obviously a simple console app doesn't repro this.

@ghost
Copy link

ghost commented Jul 13, 2020

Tagging subscribers to this area: @tommcdon
Notify danmosemsft if you want to be subscribed.

@Sewer56
Copy link
Contributor Author

Sewer56 commented Jul 13, 2020

I don't know what's different running it from managed exe versus hosted via the hosting APIs - obviously a simple console app doesn't repro this.

Indeed, would be interesting to know, even if at a high level.
I'd be interested digging into the runtime internals myself someday.
Thanks for reaching out so quickly.

On an unrelated note.
Seems there was something I forgot to mention in the repro; this should be fairly obvious and implicit from the issue description but I should mention just in case.

To trigger the correct, intended behaviour (no loop) simply load the runtime on main thread. i.e. replace

initializeThreadHandle = CreateThread(nullptr, 0, &load_bootstrapper_async, 0, 0, nullptr);
WaitForSingleObject(initializeThreadHandle, INFINITE);

with

load_bootstrapper_async(0);

in the example's entry point. I edited the opening post to reflect this.

Explicitly, I set it up this way in the repro so it's easier to test & compare control flow.

@hoyosjs
Copy link
Member

hoyosjs commented Jul 13, 2020

@dotnet/dotnet-diag @josalem

@josalem
Copy link
Contributor

josalem commented Jul 13, 2020

Interesting! I haven't had a chance to try the repro since I'm on my Mac right now.

@vitek-karas when you run this under the debugger, does a NamedPipe with a name like dotnet-diagnostic-<pid> show up for the application? IpcStreamFactory::GetNextAvailableStream returning nullptr indicates a failure state of the underlying NamedPipe (on Windows). Could you give some more details about what's happening inside the call to GetNextAvailableStream and Poll? Alternatively, could you turn on StressLog with the diagnostics log facility (0x00001000) and post the result? I'll try to repro this tomorrow.

@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Jul 20, 2020
@tommcdon tommcdon added this to the 5.0.0 milestone Jul 20, 2020
@tommcdon tommcdon modified the milestones: 5.0.0, 6.0.0 Jul 30, 2020
@josalem
Copy link
Contributor

josalem commented Oct 20, 2020

I put this under a debugger this afternoon. The server is being put into an error state where the overlapped IO is continually returning ERROR_OPERATION_ABORTED. The description for that error states that

The I/O operation has been aborted because of either a thread exit or an application request.

In other words, the fact that you loaded the runtime on one thread but ended that thread broke the asynchronous IO. The abort is being recognized here at line 327:

case IpcStream::DiagnosticsIpc::PollEvents::SIGNALED:
if (pStream == nullptr) // only use first signaled stream; will get others on subsequent calls
{
pStream = ((DiagnosticPort*)(rgIpcPollHandles[i].pUserData))->GetConnectedStream(callback);
s_currentPort = (DiagnosticPort*)(rgIpcPollHandles[i].pUserData);
}
STRESS_LOG2(LF_DIAGNOSTICS_PORT, LL_INFO10, "IpcStreamFactory::GetNextAvailableStream - SIG :: Poll attempt: %d, connection %d signalled.\n", nPollAttempts, i);

which unfortunately doesn't set fSawError if the GetConnectedStream callback returns nullptr. In that case we never reset the asynchronous call to ConnectNamedPipe, so the pipe always comes back marked as signaled, and then fails in the same way.

This should be fixable by special casing the abort case and/or resetting the connection if the callback returns nullptr.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants