
Environment.Exit hanging in RemoteExecutor #13563

Closed
ViktorHofer opened this issue Oct 7, 2019 · 28 comments · Fixed by dotnet/coreclr#27136

@ViktorHofer
Member

https://dev.azure.com/dnceng/public/_build/results?buildId=380511&view=ms.vss-test-web.build-test-results-tab&runId=11763518&paneView=debug&resultId=147963

Configuration: netcoreapp-Windows_NT-Debug-x64-Windows.81.Amd64.Open

System.Net.Tests.AuthenticationManagerTest.CustomTargetNameDictionary_ValidCollection

Timed out at 10/7/2019 8:39:55 PM after 60000ms waiting for remote process.
    Process ID: 5404
    Handle: 1096
    Name: dotnet
    MainModule: C:\dotnetbuild\work\ddad0c4a-3d2b-409b-b8b2-09e253a66230\Payload\dotnet.exe
    StartTime: 10/7/2019 8:38:55 PM
    TotalProcessorTime: 00:00:00.1875000
    Threads:
        Thread #1 (OS 0x152C)
            [InlinedCallFrame] (System.Environment._Exit)
            [InlinedCallFrame] (System.Environment._Exit)
            System.Environment.Exit(Int32)
            Microsoft.DotNet.RemoteExecutor.Program.Main(System.String[])
            [GCFrame]
            [GCFrame]
        Thread #2 (OS 0x1538) [Finalizer]
            [DebuggerU2MCatchHandlerFrame]
        Thread #3 (OS 0x1528) [Thread pool worker]
   at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing) in /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs:line 131
   at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose() in /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs:line 55
   at System.Net.Tests.AuthenticationManagerTest.CustomTargetNameDictionary_ValidCollection() in /_/src/System.Net.Requests/tests/AuthenticationManagerTest.cs:line 93

cc @davidsh @wfurt @scalablecory @eiriktsarpalis @karelz
cc @stephentoub for another hang

@davidsh
Contributor

davidsh commented Oct 7, 2019

This test doesn't even use any I/O!

My guess is that it is hanging due to bugs(?) in RemoteExecutor, or perhaps the CI machine is too busy to dispatch a new process to run the RemoteExecutor.

@stephentoub
Member

My guess is that it is hanging due to bugs(?) in RemoteExecutor, or perhaps the CI machine is too busy to dispatch a new process to run the RemoteExecutor.

I'm fairly sure it's neither. It looks like a hang in Environment.Exit. This has been happening with some frequency in CI over the past week at least.

    Thread #1 (OS 0x152C)
        [InlinedCallFrame] (System.Environment._Exit)
        [InlinedCallFrame] (System.Environment._Exit)
        System.Environment.Exit(Int32)
        Microsoft.DotNet.RemoteExecutor.Program.Main(System.String[])
        [GCFrame]
        [GCFrame]
    Thread #2 (OS 0x1538) [Finalizer]
        [DebuggerU2MCatchHandlerFrame]
    Thread #3 (OS 0x1528) [Thread pool worker]

@stephentoub
Member

@ViktorHofer, this isn't specific to this particular test.

@ViktorHofer
Member Author

Do you suggest closing the issue? Do we have a tracking issue for the Environment.Exit hang?

@stephentoub
Member

I don't know if we have an issue for it. Assuming we don't, I suggest just re-appropriating this for that.

@ViktorHofer
Member Author

cc @jkotas

@danmoseley
Member

@stephentoub from the stack above, it seems you have a dump?

Presumably this would be a bug in the runtime, as that's all Environment.Exit calls.

@stephentoub
Member

from the stack above, it seems you have a dump?

No. You can see the same text in the message Viktor pasted in his initial post, just not as nicely formatted. A while back I augmented RemoteInvoke to (on Windows) use clrmd to attach to the child process and walk/print out the child stacks just before RemoteInvoke terminates it; that's what this is showing. A dump is not getting created/uploaded for the child process. According to the infrastructure folks, we in theory should be able to cause a dump by getting the child process to throw an unhandled exception instead of killing it, but doing that will require some work.
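
For illustration, the stack printing described above can be done with the ClrMD package (Microsoft.Diagnostics.Runtime) roughly as in the sketch below. This is not the actual RemoteExecutor code: the ChildStackDumper/PrintChildStacks names are made up, the API shape follows ClrMD 2.x (1.x differs slightly), and attach is Windows-only here.

```csharp
// Sketch only: attach to the child process and print each thread's managed stack,
// similar in spirit to what RemoteExecutor does before killing a timed-out child.
using System;
using Microsoft.Diagnostics.Runtime;

internal static class ChildStackDumper
{
    // "PrintChildStacks" is an illustrative name, not an actual RemoteExecutor API.
    public static void PrintChildStacks(int pid)
    {
        using DataTarget dataTarget = DataTarget.AttachToProcess(pid, suspend: false);
        ClrRuntime runtime = dataTarget.ClrVersions[0].CreateRuntime();

        foreach (ClrThread thread in runtime.Threads)
        {
            Console.WriteLine($"Thread (OS 0x{thread.OSThreadId:X})");
            foreach (ClrStackFrame frame in thread.EnumerateStackTrace())
                Console.WriteLine($"    {frame}");
        }
    }
}
```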

Presumably this would be a bug in the runtime, as that's all Environment.Exit calls.

Yes, @jkotas thinks it's most likely an issue with EventPipe shutdown.

@jkotas
Member

jkotas commented Oct 8, 2019

Or we can call MiniDumpWriteDump to create the dump. Does RemoteExecutor have a way to get the right path to save the dump to, so that it gets included with the test results?

@ViktorHofer
Member Author

ViktorHofer commented Oct 9, 2019

@MattGal how are uploaded dumps discovered today? (see @jkotas's question above)

@stephentoub
Member

Or we can call MiniDumpWriteDump to create the dump

That's a good idea.

If @MattGal can tell us where we would need to save the dump for it to be discovered, I can make the relevant change.

@MattGal
Member

MattGal commented Oct 9, 2019

@stephentoub simply write anything you want archived in the results container to $HELIX_WORKITEM_UPLOAD_ROOT / %HELIX_WORKITEM_UPLOAD_ROOT% (pick the appropriate env var). As long as the work item manages to finish, all these files can later be accessed via the Helix API's workitem "files" call.

Do note we currently have a bug (fix merged but not yet rolled out): if you put so many files in that directory that uploading them takes longer than the entire work item timeout plus 10 minutes, things will go very, very badly (this is what happened recently on OSX 10.13). It's less of a problem on Azure VMs, though.
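
For illustration, a change along these lines could combine the MiniDumpWriteDump suggestion above with the Helix upload directory. The sketch below is an assumption about what such a helper might look like, not the actual RemoteExecutor change; the DumpHelper name and the full-memory dump type are illustrative, and it is Windows-only.

```csharp
// Sketch only: write a minidump of a (hung) child process into the Helix upload
// directory so it gets archived with the work item results.
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

internal static class DumpHelper
{
    // P/Invoke into DbgHelp's MiniDumpWriteDump.
    [DllImport("DbgHelp.dll", SetLastError = true)]
    private static extern bool MiniDumpWriteDump(
        IntPtr hProcess, uint processId, SafeFileHandle hFile, int dumpType,
        IntPtr exceptionParam, IntPtr userStreamParam, IntPtr callbackParam);

    private const int MiniDumpWithFullMemory = 0x2;

    // Windows-only; returns the path of the written dump.
    public static string WriteDump(Process process)
    {
        string uploadRoot = Environment.GetEnvironmentVariable("HELIX_WORKITEM_UPLOAD_ROOT")
                            ?? Environment.CurrentDirectory;
        string path = Path.Combine(uploadRoot, $"{process.Id}.dmp");

        using FileStream fs = File.Create(path);
        if (!MiniDumpWriteDump(process.Handle, (uint)process.Id, fs.SafeFileHandle,
                               MiniDumpWithFullMemory, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero))
        {
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }

        return path;
    }
}
```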

@stephentoub
Member

Thanks, @MattGal. It works.

@stephentoub changed the title from "System.Net.Tests.AuthenticationManagerTest.CustomTargetNameDictionary_ValidCollection failed in CI" to "Environment.Exit hanging in RemoteExecutor" on Oct 10, 2019
@stephentoub
Member

stephentoub commented Oct 10, 2019

@jkotas, didn't take long, we got a dump here:
https://helix.dot.net/api/2019-06-17/jobs/58ed3af6-a888-4c33-ad66-2a0d6ed754a3/workitems/System.Runtime.Tests/files/3492.2rf1xaxh.qwv.dmp
from dotnet/corefx#41712 in @bartonjs' PR.

0:000> kP
 # Child-SP          RetAddr           Call Site
00 00000021`54b1e108 00007ffc`7ae70391 ntdll!NtCancelSynchronousIoFile+0xa
01 00000021`54b1e110 00007ffc`5a183ebe KERNELBASE!CancelSynchronousIo+0x11
02 00000021`54b1e150 00007ffc`5a12fd51 coreclr!DiagnosticServer::Shutdown(void)+0x6a [f:\workspace\_work\1\s\src\vm\diagnosticserver.cpp @ 223] 
03 00000021`54b1e1f0 00007ffc`5a12fc53 coreclr!EEShutDownHelper(
			int fIsDllUnloading = 0n0)+0xc5 [f:\workspace\_work\1\s\src\vm\ceemain.cpp @ 1339] 
04 00000021`54b1e410 00007ffc`5a2b2f23 coreclr!EEShutDown(
			int fIsDllUnloading = 0n0)+0x73 [f:\workspace\_work\1\s\src\vm\ceemain.cpp @ 1821] 
05 00000021`54b1e460 00007ffc`5a3cce8f coreclr!HandleExitProcessHelper(
			__MIDL___MIDL_itf_mscoree_0000_0000_0009 action = <Value unavailable error>, 
			unsigned int exitCode = 0, 
			ShutdownCompleteAction sca = SCA_ExitProcessWhenShutdownComplete (0n0))+0x47 [f:\workspace\_work\1\s\src\vm\eepolicy.cpp @ 567] 
06 00000021`54b1e490 00007ffb`fa674f7f coreclr!SystemNative::Exit(
			int exitcode = 0n42)+0x3f [f:\workspace\_work\1\s\src\classlibnative\bcltype\system.cpp @ 218] 
07 00000021`54b1e4d0 00000000`0000002a 0x00007ffb`fa674f7f
08 00000021`54b1e4d8 00000021`54b1e500 0x2a
09 00000021`54b1e4e0 000095b7`4d660449 0x00000021`54b1e500
0a 00000021`54b1e4e8 00000021`54b1e500 0x000095b7`4d660449
0b 00000021`54b1e4f0 00000000`00000000 0x00000021`54b1e500

https://github.com/dotnet/coreclr/blob/6b13065e7ac88671795a01577e017ff7cdc26f35/src/vm/diagnosticserver.cpp#L213

@jkotas
Member

jkotas commented Oct 10, 2019

Yes, this is EventPipe shutdown as expected.

@stephentoub transferred this issue from dotnet/corefx on Oct 10, 2019
@jkotas
Member

jkotas commented Oct 10, 2019

@dotnet/dotnet-diag Could you please take a look?

This hang in EventPipe is one of the top causes of intermittent test failures and affects CI stability.

@tommcdon
Member

@josalem It looks like CancelSynchronousIo inside coreclr!DiagnosticServer::Shutdown is causing intermittent hangs on shutdown.

@jkotas
Member

jkotas commented Oct 10, 2019

Why do we need to call CancelSynchronousIo at all? What is not going to work if that call is deleted?

@josalem
Contributor

josalem commented Oct 10, 2019

It looks like it was introduced in dotnet/coreclr#25602 and modified in dotnet/coreclr#25786. If I recall, there was a small slowdown on the shutdown path from breaking out of the Accept call on the named pipe, so this was added to explicitly cancel the Accept on the server thread. I took a quick look at the docs for that Windows API, and it sounds like it shouldn't be a blocking call, so I'm not sure off the top of my head what could be going on here.
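
For illustration only: the runtime's actual code here is C++ in diagnosticserver.cpp, but the pattern being described (a server thread blocked in a synchronous named-pipe accept, unblocked at shutdown via CancelSynchronousIo) can be sketched in managed code as below. The names, pipe name, and retry constant are illustrative assumptions, not the runtime's implementation.

```csharp
// Sketch only: a managed analogy of the shutdown pattern being discussed.
// A server thread blocks in a synchronous named-pipe accept; shutdown unblocks
// it by cancelling that thread's pending synchronous I/O.
using System;
using System.IO;
using System.IO.Pipes;
using System.Runtime.InteropServices;
using System.Threading;

internal static class PipeServerShutdownDemo
{
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern IntPtr OpenThread(uint desiredAccess, bool inheritHandle, uint threadId);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool CancelSynchronousIo(IntPtr hThread);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool CloseHandle(IntPtr handle);

    [DllImport("kernel32.dll")]
    private static extern uint GetCurrentThreadId();

    private const uint THREAD_TERMINATE = 0x0001; // access right required by CancelSynchronousIo
    private const int ERROR_NOT_FOUND = 1168;     // no synchronous I/O pending on the thread yet

    public static void Main()
    {
        uint serverOsThreadId = 0;
        using var started = new ManualResetEventSlim();

        var server = new Thread(() =>
        {
            serverOsThreadId = GetCurrentThreadId();
            started.Set();
            using var pipe = new NamedPipeServerStream("demo-diagnostics-pipe");
            try
            {
                pipe.WaitForConnection(); // blocks in a synchronous accept
            }
            catch (IOException)
            {
                // The cancelled accept surfaces here (ERROR_OPERATION_ABORTED); exit cleanly.
            }
        });
        server.Start();
        started.Wait();

        // "Shutdown": break the server thread out of the accept instead of waiting on it.
        IntPtr hThread = OpenThread(THREAD_TERMINATE, inheritHandle: false, serverOsThreadId);
        while (!CancelSynchronousIo(hThread) && Marshal.GetLastWin32Error() == ERROR_NOT_FOUND)
            Thread.Sleep(10); // retry until the accept is actually pending
        CloseHandle(hThread);
        server.Join();
    }
}
```

As the following comments note, the simpler option at process shutdown is to skip this cleanup entirely and let the OS tear the state down.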

@jkotas
Member

jkotas commented Oct 10, 2019

Trying to do cleanup on process shutdown is a waste of time and a bug farm. You should only do the minimal cleanup during shutdown needed to clean up machine-wide state. The process-local state will be taken care of by the OS.

@sywhang
Contributor

sywhang commented Oct 10, 2019

I think we can safely remove this call, but that will regress https://github.com/dotnet/coreclr/issues/25463, which was tracking a shutdown delay of 20-40 ms showing up in PSCore's performance tests. A potential solution is to remove the call to CancelSynchronousIo and add an environment variable to prevent the diagnostics server from being created in the first place for cases like the PSCore tests. Does that sound ok @jkotas?

/cc @sergiy-k who reported #12991 first.

@jkotas
Member

jkotas commented Oct 10, 2019

I do not understand why deleting the CancelSynchronousIo would introduce shutdown performance regression. If we delete this call, where do you expect the shutdown to be stuck for 20-40ms?

@sywhang
Contributor

sywhang commented Oct 10, 2019

If I recall correctly, what Jose said when he fixed the original issue by adding this call was that calling ::CloseHandle() on the diagnostics server thread was taking longer on some special hardware with Hyper-V disabled, because it was waiting for pending I/Os to complete. I didn't investigate the issue myself, so I don't have the full context here, but I think we can remove it and come up with a better workaround for that particular case.

@jkotas
Member

jkotas commented Oct 10, 2019

::CloseHandle() on the diagnostics server thread

You do not need to call CloseHandle on the diagnostic server thread during shutdown. You can leave the handle open and leave it to the OS to take care of it.

Since the server thread handle is not really needed for anything, you may consider closing it immediately after CreateThread returns it and not keeping it around at all - but that is orthogonal to the shutdown fix.

@jkotas
Member

jkotas commented Oct 11, 2019

Reopening for 3.1 port.

@jkotas reopened this on Oct 11, 2019
@sywhang
Contributor

sywhang commented Oct 14, 2019

@jkotas I was preparing a 3.1 port and realized that this actually doesn't apply to 3.1, because we never ported the "fix" for #12991 to 3.0, so the relevant code doesn't exist in the 3.1 branch.

@jkotas
Member

jkotas commented Oct 14, 2019

ok

@jkotas closed this as completed on Oct 14, 2019
@msftgits transferred this issue from dotnet/coreclr on Jan 31, 2020
@msftgits added this to the 3.1 milestone on Jan 31, 2020
@ghost locked as resolved and limited conversation to collaborators on Dec 12, 2020