
Environment.Exit hanging in RemoteExecutor #13563

Closed
ViktorHofer opened this issue Oct 7, 2019 · 28 comments · Fixed by dotnet/coreclr#27136

@ViktorHofer
Member

https://dev.azure.com/dnceng/public/_build/results?buildId=380511&view=ms.vss-test-web.build-test-results-tab&runId=11763518&paneView=debug&resultId=147963

Configuration: netcoreapp-Windows_NT-Debug-x64-Windows.81.Amd64.Open

System.Net.Tests.AuthenticationManagerTest.CustomTargetNameDictionary_ValidCollection

Timed out at 10/7/2019 8:39:55 PM after 60000ms waiting for remote process.
    Process ID: 5404
    Handle: 1096
    Name: dotnet
    MainModule: C:\dotnetbuild\work\ddad0c4a-3d2b-409b-b8b2-09e253a66230\Payload\dotnet.exe
    StartTime: 10/7/2019 8:38:55 PM
    TotalProcessorTime: 00:00:00.1875000
    Threads:
        Thread #1 (OS 0x152C)
            [InlinedCallFrame] (System.Environment._Exit)
            [InlinedCallFrame] (System.Environment._Exit)
            System.Environment.Exit(Int32)
            Microsoft.DotNet.RemoteExecutor.Program.Main(System.String[])
            [GCFrame]
            [GCFrame]
        Thread #2 (OS 0x1538) [Finalizer]
            [DebuggerU2MCatchHandlerFrame]
        Thread #3 (OS 0x1528) [Thread pool worker]
   at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing) in /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs:line 131
   at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose() in /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs:line 55
   at System.Net.Tests.AuthenticationManagerTest.CustomTargetNameDictionary_ValidCollection() in /_/src/System.Net.Requests/tests/AuthenticationManagerTest.cs:line 93

cc @davidsh @wfurt @scalablecory @eiriktsarpalis @karelz
cc @stephentoub for another hang

@davidsh
Contributor

davidsh commented Oct 7, 2019

This test doesn't even use any I/O!

My guess is that it is hanging due to bugs(?) in RemoteExecutor, or perhaps the CI machine is too busy to dispatch a new process to run the RemoteExecutor.

@stephentoub
Member

My guess is that it is hanging due to bugs(?) in RemoteExecutor, or perhaps the CI machine is too busy to dispatch a new process to run the RemoteExecutor.

I'm fairly sure it's neither. It looks like a hang in Environment.Exit. This has been happening with some frequency in CI over the past week at least.

    Thread #1 (OS 0x152C)
        [InlinedCallFrame] (System.Environment._Exit)
        [InlinedCallFrame] (System.Environment._Exit)
        System.Environment.Exit(Int32)
        Microsoft.DotNet.RemoteExecutor.Program.Main(System.String[])
        [GCFrame]
        [GCFrame]
    Thread #2 (OS 0x1538) [Finalizer]
        [DebuggerU2MCatchHandlerFrame]
    Thread #3 (OS 0x1528) [Thread pool worker]

@stephentoub
Member

@ViktorHofer, this isn't specific to this particular test.

@ViktorHofer
Member Author

Do you suggest closing the issue? Do we have a tracking issue for the Environment.Exit hang?

@stephentoub
Member

I don't know if we have an issue for it. Assuming we don't, I suggest just re-appropriating this for that.

@ViktorHofer
Member Author

cc @jkotas

@danmoseley
Member

@stephentoub from the stack above, it seems you have a dump?

Presumably this would be a bug in the runtime, as that's all Environment.Exit calls.

@stephentoub
Member

from the stack above, it seems you have a dump?

No. You can see the same text in the message Viktor pasted in his initial post, just not as nicely formatted. A while back I augmented RemoteInvoke to (on Windows) use clrmd to attach to the child process and walk/print out the child stacks just before RemoteInvoke terminates it; that's what this is showing. A dump is not getting created/uploaded for the child process. According to the infrastructure folks, we in theory should be able to cause a dump by getting the child process to throw an unhandled exception instead of killing it, but doing that will require some work.
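
For illustration, the stack printing described above can be done with the ClrMD package (Microsoft.Diagnostics.Runtime) roughly as in the sketch below. This is not the actual RemoteExecutor code: the ChildStackDumper/PrintChildStacks names are made up, the API shape follows ClrMD 2.x (1.x differs slightly), and attach is Windows-only here.

```csharp
// Sketch only: attach to the child process and print each thread's managed stack,
// similar in spirit to what RemoteExecutor does before killing a timed-out child.
using System;
using Microsoft.Diagnostics.Runtime;

internal static class ChildStackDumper
{
    // "PrintChildStacks" is an illustrative name, not an actual RemoteExecutor API.
    public static void PrintChildStacks(int pid)
    {
        using DataTarget dataTarget = DataTarget.AttachToProcess(pid, suspend: false);
        ClrRuntime runtime = dataTarget.ClrVersions[0].CreateRuntime();

        foreach (ClrThread thread in runtime.Threads)
        {
            Console.WriteLine($"Thread (OS 0x{thread.OSThreadId:X})");
            foreach (ClrStackFrame frame in thread.EnumerateStackTrace())
                Console.WriteLine($"    {frame}");
        }
    }
}
```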

Presumably this would be a bug in the runtime, as that's all Environment.Exit calls.

Yes, @jkotas thinks it's most likely an issue with EventPipe shutdown.

@jkotas
Member

jkotas commented Oct 8, 2019

Or we can call MiniDumpWriteDump to create the dump. Does RemoteExecutor have a way to get the right path to save the dump to, so that it gets included with the test results?

@ViktorHofer
Member Author

ViktorHofer commented Oct 9, 2019

@MattGal how are uploaded dumps discovered today? (see @jkotas's question above)

@stephentoub
Member

Or we can call MiniDumpWriteDump to create the dump

That's a good idea.

If @MattGal can tell us where we would need to save the dump for it to be discovered, I can make the relevant change.

@MattGal
Member

MattGal commented Oct 9, 2019

@stephentoub simply write anything you want archived in the results container to $HELIX_WORKITEM_UPLOAD_ROOT / %HELIX_WORKITEM_UPLOAD_ROOT% (pick the appropriate env var). As long as the work item manages to finish, all these files can later be accessed via the Helix API's workitem "files" call.

Do note we currently have a bug (fix merged but not yet rolled out): if you put so many files in that directory that uploading them takes longer than the entire work item timeout plus 10 minutes, things will go very, very badly (this is what happened recently on OSX 10.13). It's less of a problem on Azure VMs, though.
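
For illustration, a change along these lines could combine the MiniDumpWriteDump suggestion above with the Helix upload directory. The sketch below is an assumption about what such a helper might look like, not the actual RemoteExecutor change; the DumpHelper name and the full-memory dump type are illustrative, and it is Windows-only.

```csharp
// Sketch only: write a minidump of a (hung) child process into the Helix upload
// directory so it gets archived with the work item results.
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

internal static class DumpHelper
{
    // P/Invoke into DbgHelp's MiniDumpWriteDump.
    [DllImport("DbgHelp.dll", SetLastError = true)]
    private static extern bool MiniDumpWriteDump(
        IntPtr hProcess, uint processId, SafeFileHandle hFile, int dumpType,
        IntPtr exceptionParam, IntPtr userStreamParam, IntPtr callbackParam);

    private const int MiniDumpWithFullMemory = 0x2;

    // Windows-only; returns the path of the written dump.
    public static string WriteDump(Process process)
    {
        string uploadRoot = Environment.GetEnvironmentVariable("HELIX_WORKITEM_UPLOAD_ROOT")
                            ?? Environment.CurrentDirectory;
        string path = Path.Combine(uploadRoot, $"{process.Id}.dmp");

        using FileStream fs = File.Create(path);
        if (!MiniDumpWriteDump(process.Handle, (uint)process.Id, fs.SafeFileHandle,
                               MiniDumpWithFullMemory, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero))
        {
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }

        return path;
    }
}
```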

@stephentoub
Member

Thanks, @MattGal. It works.

@stephentoub changed the title from "System.Net.Tests.AuthenticationManagerTest.CustomTargetNameDictionary_ValidCollection failed in CI" to "Environment.Exit hanging in RemoteExecutor" on Oct 10, 2019
@stephentoub
Member

stephentoub commented Oct 10, 2019

@jkotas, didn't take long, we got a dump here:
https://helix.dot.net/api/2019-06-17/jobs/58ed3af6-a888-4c33-ad66-2a0d6ed754a3/workitems/System.Runtime.Tests/files/3492.2rf1xaxh.qwv.dmp
from dotnet/corefx#41712 in @bartonjs' PR.

0:000> kP
 # Child-SP          RetAddr           Call Site
00 00000021`54b1e108 00007ffc`7ae70391 ntdll!NtCancelSynchronousIoFile+0xa
01 00000021`54b1e110 00007ffc`5a183ebe KERNELBASE!CancelSynchronousIo+0x11
02 00000021`54b1e150 00007ffc`5a12fd51 coreclr!DiagnosticServer::Shutdown(void)+0x6a [f:\workspace\_work\1\s\src\vm\diagnosticserver.cpp @ 223] 
03 00000021`54b1e1f0 00007ffc`5a12fc53 coreclr!EEShutDownHelper(
			int fIsDllUnloading = 0n0)+0xc5 [f:\workspace\_work\1\s\src\vm\ceemain.cpp @ 1339] 
04 00000021`54b1e410 00007ffc`5a2b2f23 coreclr!EEShutDown(
			int fIsDllUnloading = 0n0)+0x73 [f:\workspace\_work\1\s\src\vm\ceemain.cpp @ 1821] 
05 00000021`54b1e460 00007ffc`5a3cce8f coreclr!HandleExitProcessHelper(
			__MIDL___MIDL_itf_mscoree_0000_0000_0009 action = <Value unavailable error>, 
			unsigned int exitCode = 0, 
			ShutdownCompleteAction sca = SCA_ExitProcessWhenShutdownComplete (0n0))+0x47 [f:\workspace\_work\1\s\src\vm\eepolicy.cpp @ 567] 
06 00000021`54b1e490 00007ffb`fa674f7f coreclr!SystemNative::Exit(
			int exitcode = 0n42)+0x3f [f:\workspace\_work\1\s\src\classlibnative\bcltype\system.cpp @ 218] 
07 00000021`54b1e4d0 00000000`0000002a 0x00007ffb`fa674f7f
08 00000021`54b1e4d8 00000021`54b1e500 0x2a
09 00000021`54b1e4e0 000095b7`4d660449 0x00000021`54b1e500
0a 00000021`54b1e4e8 00000021`54b1e500 0x000095b7`4d660449
0b 00000021`54b1e4f0 00000000`00000000 0x00000021`54b1e500

https://github.com/dotnet/coreclr/blob/6b13065e7ac88671795a01577e017ff7cdc26f35/src/vm/diagnosticserver.cpp#L213

@jkotas
Member

jkotas commented Oct 10, 2019

Yes, this is EventPipe shutdown as expected.

@stephentoub transferred this issue from dotnet/corefx on Oct 10, 2019
@jkotas
Member

jkotas commented Oct 10, 2019

@dotnet/dotnet-diag Could you please take a look?

This hang in EventPipe is one of the top causes of intermittent test failures and affects CI stability.

@tommcdon
Member

@josalem It looks like CancelSynchronousIo inside coreclr!DiagnosticServer::Shutdown is causing intermittent hangs on shutdown.

@jkotas
Member

jkotas commented Oct 10, 2019

Why do we need to call CancelSynchronousIo at all? What is not going to work if that call is deleted?

@josalem
Contributor

josalem commented Oct 10, 2019

It looks like it was introduced in dotnet/coreclr#25602 and modified in dotnet/coreclr#25786. If I recall, there was a small slowdown on the shutdown path from breaking out of the Accept call on the named pipe, so this was added to explicitly cancel the Accept on the server thread. I took a quick look at the docs for that Windows API, and it sounds like it shouldn't be a blocking call, so I'm not sure off the top of my head what could be going on here.
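
For illustration only: the runtime's actual code here is C++ in diagnosticserver.cpp, but the pattern being described (a server thread blocked in a synchronous named-pipe accept, unblocked at shutdown via CancelSynchronousIo) can be sketched in managed code as below. The names, pipe name, and retry constant are illustrative assumptions, not the runtime's implementation.

```csharp
// Sketch only: a managed analogy of the shutdown pattern being discussed.
// A server thread blocks in a synchronous named-pipe accept; shutdown unblocks
// it by cancelling that thread's pending synchronous I/O.
using System;
using System.IO;
using System.IO.Pipes;
using System.Runtime.InteropServices;
using System.Threading;

internal static class PipeServerShutdownDemo
{
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern IntPtr OpenThread(uint desiredAccess, bool inheritHandle, uint threadId);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool CancelSynchronousIo(IntPtr hThread);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool CloseHandle(IntPtr handle);

    [DllImport("kernel32.dll")]
    private static extern uint GetCurrentThreadId();

    private const uint THREAD_TERMINATE = 0x0001; // access right required by CancelSynchronousIo
    private const int ERROR_NOT_FOUND = 1168;     // no synchronous I/O pending on the thread yet

    public static void Main()
    {
        uint serverOsThreadId = 0;
        using var started = new ManualResetEventSlim();

        var server = new Thread(() =>
        {
            serverOsThreadId = GetCurrentThreadId();
            started.Set();
            using var pipe = new NamedPipeServerStream("demo-diagnostics-pipe");
            try
            {
                pipe.WaitForConnection(); // blocks in a synchronous accept
            }
            catch (IOException)
            {
                // The cancelled accept surfaces here (ERROR_OPERATION_ABORTED); exit cleanly.
            }
        });
        server.Start();
        started.Wait();

        // "Shutdown": break the server thread out of the accept instead of waiting on it.
        IntPtr hThread = OpenThread(THREAD_TERMINATE, inheritHandle: false, serverOsThreadId);
        while (!CancelSynchronousIo(hThread) && Marshal.GetLastWin32Error() == ERROR_NOT_FOUND)
            Thread.Sleep(10); // retry until the accept is actually pending
        CloseHandle(hThread);
        server.Join();
    }
}
```

As the following comments note, the simpler option at process shutdown is to skip this cleanup entirely and let the OS tear the state down.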

@jkotas
Member

jkotas commented Oct 10, 2019

Trying to do cleanup on process shutdown is a waste of time and a bug farm. You should only do the minimal cleanup during shutdown needed to clean up machine-wide state. The process-local state will be taken care of by the OS.

@sywhang
Contributor

sywhang commented Oct 10, 2019

I think we can safely remove this call, but that will regress https://github.com/dotnet/coreclr/issues/25463, which was tracking a shutdown delay of 20-40 ms showing up in PSCore's performance tests. A potential solution is to remove the call to CancelSynchronousIo and add an environment variable to prevent the diagnostics server from being created in the first place for cases like the PSCore tests. Does that sound ok @jkotas?

/cc @sergiy-k who reported #12991 first.

@jkotas
Member

jkotas commented Oct 10, 2019

I do not understand why deleting the CancelSynchronousIo would introduce shutdown performance regression. If we delete this call, where do you expect the shutdown to be stuck for 20-40ms?

@sywhang
Contributor

sywhang commented Oct 10, 2019

If I recall correctly, what Jose said when he fixed the original issue by adding this call was that calling ::CloseHandle() on the diagnostics server thread was taking longer on some special hardware with Hyper-V disabled, because it was waiting for pending I/Os to complete. I didn't investigate the issue myself, so I don't have the full context here, but I think we can remove it and come up with a better workaround for that particular case.

@jkotas
Member

jkotas commented Oct 10, 2019

::CloseHandle() on the diagnostics server thread

You do not need to call CloseHandle on the diagnostic server thread during shutdown. You can leave the handle open and leave it to the OS to take care of it.

Since the server thread handle is not really needed for anything, you may consider closing it immediately after CreateThread returns it and not keeping it around at all - but that is orthogonal to the shutdown fix.

@jkotas
Member

jkotas commented Oct 11, 2019

Reopening for 3.1 port.

@jkotas reopened this on Oct 11, 2019
@sywhang
Contributor

sywhang commented Oct 14, 2019

@jkotas I was preparing a 3.1 port and realized that this actually doesn't apply to 3.1, because we never ported the "fix" for #12991 to 3.0, so the relevant code doesn't exist in the 3.1 branch.

@jkotas
Member

jkotas commented Oct 14, 2019

ok

@jkotas closed this as completed on Oct 14, 2019
@msftgits transferred this issue from dotnet/coreclr on Jan 31, 2020
@msftgits added this to the 3.1 milestone on Jan 31, 2020
@ghost locked as resolved and limited conversation to collaborators on Dec 12, 2020