Environment.Exit hanging in RemoteExecutor #13563
This test doesn't even use any I/O! My guess is that it is hanging due to bugs(?) in RemoteExecutor, or perhaps the CI machine is too busy to dispatch a new process to run the RemoteExecutor.
I'm fairly sure it's neither. It looks like a hang in Environment.Exit. This has been happening with some frequency in CI over the past week at least.
@ViktorHofer, this isn't specific to this particular test.
Do you suggest closing the issue? Do we have a tracking issue for the Environment.Exit hang?
I don't know if we have an issue for it. Assuming we don't, I suggest just re-appropriating this one for that.
cc @jkotas |
@stephentoub from the stack above, it seems you have a dump? Presumably this would be a bug in the runtime, as that's all Environment.Exit calls.
No. You can see the same text in the message Viktor pasted in his initial post, just not as nicely formatted. A while back I augmented RemoteInvoke to (on Windows) use clrmd to attach to the child process and walk/print out the child stacks just before RemoteInvoke terminates it; that's what this is showing. A dump is not getting created/uploaded for the child process. According to the infrastructure folks, we in theory should be able to cause a dump by getting the child process to throw an unhandled exception instead of killing it, but doing that will require some work.
Yes, @jkotas thinks it's most likely an issue with EventPipe shutdown.
Or we can call MiniDumpWriteDump to create the dump. Does RemoteExecutor have a way to get the right path to save the dump to, so that it gets included with the test results?
That's a good idea. If @MattGal can tell us where we would need to save the dump for it to be discovered, I can make the relevant change.
@stephentoub simply write anything you want archived in the results container to $HELIX_WORKITEM_UPLOAD_ROOT / %HELIX_WORKITEM_UPLOAD_ROOT% (pick the appropriate env var). As long as the work item manages to finish, all these files can later be accessed via the Helix API's work item "files" call. Do note we have a bug right now (fix merged but not yet rolled out): if you put so many files in that directory that uploading them takes more than {the entire work item timeout + 10 minutes}, things will go very, very badly (this is what happened recently on OSX 10.13). It's less of a problem on Azure VMs, though.
Thanks, @MattGal. It works.
@jkotas, didn't take long, we got a dump here:
Yes, this is EventPipe shutdown as expected.
@dotnet/dotnet-diag Could you please take a look? This hang in EventPipe is one of the top causes of intermittent test failures and affects CI stability.
@josalem It looks like CancelSynchronousIo inside of coreclr!DiagnosticServer::Shutdown is causing intermittent hangs on shutdown.
Why do we need to call
It looks like it was introduced in dotnet/coreclr#25602 and modified in dotnet/coreclr#25786. If I recall, there was a small slowdown on the shutdown path from breaking out of the
Trying to do cleanup on process shutdown is a waste of time and a bug farm. You should only do the minimal cleanup during shutdown needed to clean up machine-wide state. The process-local state will be taken care of by the OS.
I think we can safely remove this call, but that will regress https://github.com/dotnet/coreclr/issues/25463, which tracked a 20-40ms shutdown delay showing up in PSCore's performance tests. A potential solution is to remove the call to
I do not understand why deleting the
If I recall correctly, I think what Jose said when he fixed the original issue by adding this call was that calling ::CloseHandle() on the diagnostics server thread was taking longer on some special hardware with Hyper-V disabled, because it was waiting for pending I/Os to complete. I didn't investigate the issue myself, so I don't have the full context here, but I think we can remove it and come up with a better workaround for that particular case.
You do not need to call CloseHandle on the diagnostic server thread during shutdown. You can leave the handle open and leave it to the OS to take care of it. Since the server thread handle is not really needed for anything, you may consider closing it immediately after CreateThread returns it and not keeping it around at all; but that is orthogonal to the shutdown fix.
Reopening for 3.1 port.
ok |
https://dev.azure.com/dnceng/public/_build/results?buildId=380511&view=ms.vss-test-web.build-test-results-tab&runId=11763518&paneView=debug&resultId=147963
Configuration:
netcoreapp-Windows_NT-Debug-x64-Windows.81.Amd64.Open
cc @davidsh, @wfurt, @scalablecory, @eiriktsarpalis, @karelz
cc @stephentoub for another hang