Modified PR for reverse diagnostics server #35850

josalem
Contributor

@josalem josalem commented May 5, 2020

This PR un-reverts the reverse feature of the Diagnostics Server and adds fixes for a HANDLE-use-after-close issue that cropped up in CI (see #35451).

I recommend looking at this PR commit by commit, to filter out the revert changes (776724f) from the fixes introduced in this PR.

I'll open this PR in draft mode to run some outer loop CI over it and then open it for further review.

Details of changes:

CC - @tommcdon @noahfalk @jkotas @sywhang

@josalem josalem added this to the 5.0 milestone May 5, 2020
@ghost

ghost commented May 5, 2020

Tagging subscribers to this area: @tarekgh, @tommcdon
Notify danmosemsft if you want to be subscribed.

@josalem
Contributor Author

josalem commented May 5, 2020

/azp list

@josalem
Contributor Author

josalem commented May 5, 2020

/azp run runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@josalem
Contributor Author

josalem commented May 5, 2020

Failed leg's test failure looks to have been hit before in #32955. Rerunning...

@josalem
Contributor Author

josalem commented May 6, 2020

Socket test failures look to be pre-existing, with failures reported as recently as yesterday. It repros on my Mac for all currently shipped previews of 5.0, albeit with a slightly different failure:

> runtime/artifacts/bin/testhost/netcoreapp5.0-OSX-Release-x64/dotnet exec --runtimeconfig System.Net.Sockets.Tests.runtimeconfig.json --depsfile System.Net.Sockets.Tests.deps.json xunit.console.dll System.Net.Sockets.Tests.dll -xml testResults.xml -nologo -method System.Net.Sockets.Tests.TcpClientTest.Dispose_CancelsConnectAsync -notrait category=failing
  Discovering: System.Net.Sockets.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Net.Sockets.Tests (found 1 of 1201 test case)
  Starting:    System.Net.Sockets.Tests (parallel test collections = on, max threads = 8)
    System.Net.Sockets.Tests.TcpClientTest.Dispose_CancelsConnectAsync(connectByName: True) [FAIL]
      System.Net.Sockets.SocketException : Unknown error
      Stack Trace:
        runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs(1138,0): at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
        runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs(1123,0): at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
        runtime/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/ValueTask.cs(220,0): at System.Threading.Tasks.ValueTask.ValueTaskSourceAsTask.<>c.<.cctor>b__4_0(Object state)
        --- End of stack trace from previous location ---
        runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/TCPClient.cs(320,0): at System.Net.Sockets.TcpClient.CompleteConnectAsync(Task task)
        runtime/src/libraries/System.Net.Sockets/tests/FunctionalTests/TcpClientTest.cs(435,0): at System.Net.Sockets.Tests.TcpClientTest.Dispose_CancelsConnectAsync(Boolean connectByName)
        --- End of stack trace from previous location ---
  Finished:    System.Net.Sockets.Tests
=== TEST EXECUTION SUMMARY ===
   System.Net.Sockets.Tests  Total: 2, Errors: 0, Failed: 1, Skipped: 0, Time: 0.253s

Edit: Has an issue tracking it now -> #35886

Linq test failures also appear to be pre-existing and have been failing libraries-outerloop since at least last Friday 4/15.

The TestOnStartWithArgsThenStop failure appears to be a recurrence of #34801.

I'm not convinced that any of these failures are due to this change, but will do some due diligence to be more confident in that assertion.

@josalem
Contributor Author

josalem commented May 6, 2020

Ran some benchmarking (avg, max, min for 1000 measurements) for the scenario in #12991 and got the following results:

3.1

$ C:\git\scratch\benchmark.ps1 { ./ConsoleApp.exe } -Samples 1000 -Silent
Average: 70.7435069ms
Minimum: 64.9551ms
Maximum: 87.1662ms

5.0-preview3

$ C:\git\scratch\benchmark.ps1 { ./ConsoleApp.exe } -Samples 1000 -Silent
Average: 62.6742732000001ms
Minimum: 56.4682ms
Maximum: 98.195ms

This PR's coreclr.dll built on top of master in a self-contained publish of preview 3:

$ C:\git\scratch\benchmark.ps1 { ./ConsoleApp.exe } -Samples 1000 -Silent
Average: 65.2096325ms
Minimum: 59.7252ms
Maximum: 85.6353ms

so preview 3 and this PR on top of master (AKA preview 5) are roughly equivalent.
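
The benchmark.ps1 script itself isn't included in this thread. As a rough illustration of this kind of measurement (not the actual script), a minimal Python sketch might look like the following. All function names here are hypothetical, and it assumes wall-clock time per process launch is the quantity of interest.

```python
import statistics
import subprocess
import time


def time_command_ms(cmd):
    """Run cmd once and return its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms):
    """Return the average, minimum, and maximum of a list of timings (ms)."""
    return {
        "average": statistics.fmean(samples_ms),
        "minimum": min(samples_ms),
        "maximum": max(samples_ms),
    }


def benchmark(cmd, samples=1000):
    """Time cmd `samples` times and summarize the results."""
    return summarize([time_command_ms(cmd) for _ in range(samples)])


def percent_change(baseline_ms, candidate_ms):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0
```

Something like `benchmark(["./ConsoleApp.exe"], samples=1000)` would mirror the invocation above. Applied to the reported averages, `percent_change(62.6742732, 65.2096325)` works out to roughly +4% versus preview 3, and `percent_change(70.7435069, 65.2096325)` to roughly -8% versus 3.1.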

If I recall, the original issue for #12991 only showed when the app was executed on a machine with Hyper-V turned off. I'm not sure if I can turn off Hyper-V on my work desktop since I'm remoting into it from home and can't touch the UEFI options, but I'll try to run these same experiments on a Windows box with Hyper-V off.

@jkotas
Member

jkotas commented May 6, 2020

the original issue for #12991 only showed when the app was executed on a machine with Hyper-V turned off

That sounds suspect. Did we get to a root cause for #12991 ? I am wondering whether it could have been another shutdown race in eventpipe that - depending on timing - destroyed something that was still in use. Hyper-V turned off just changed the timing to trigger the problem.

@josalem
Contributor Author

josalem commented May 6, 2020

I'm trying to find explicit documentation of the issue repro steps. In the meantime, this was the code added to fix the issue: https://github.com/dotnet/coreclr/pull/25602/files#diff-776272f612a5b488defb167f8afb2333R212-R232

According to the associated PR, the root cause was a timing difference between two shutdown paths: letting the OS unblock the ConnectNamedPipe call and collect the server thread, versus manually canceling the ConnectNamedPipe call so that the server thread can run to completion.

@jkotas
Member

jkotas commented May 6, 2020

https://github.com/dotnet/coreclr/pull/25602/files#diff-776272f612a5b488defb167f8afb2333R212-R232

This code was deleted in dotnet/coreclr#27136 because it was again causing intermittent hangs during shutdown.

@josalem
Contributor Author

josalem commented May 6, 2020

Ah, found a more detailed description/discussion: #13563

In light of the discussion in that issue and the fact that 3.1 doesn't have this code path (it only attempts to unlink the Unix domain socket on Linux), I'm starting to think I should discount #12991 until I can find explicit documentation that this issue is applicable. Thoughts?

@sergiy-k I believe the original report for that issue came through you from a partner team. Do you have any further information on this floating around your inbox?

@josalem
Contributor Author

josalem commented May 6, 2020

Talked with Sergiy offline, and it sounds like the original issue was specifically on x64 with "virtualization support" turned off at the BIOS/UEFI level. Additionally, the issue wasn't consistent in nature and was more that there was a variation in the execution time that sometimes had this delay. I'm working on getting a machine with virtualization support turned off to run these benchmarks on and see if this behavior is still present.

@josalem josalem force-pushed the dev/josalem/issue-35451-reverse-diagnostics-server branch from 91ecfac to 8df50c8 Compare May 7, 2020 00:51
@josalem
Contributor Author

josalem commented May 7, 2020

Worked offline with @noahfalk to test runtimes on a machine with "Virtualization Support" turned off. We couldn't reproduce any slowdown effects. Furthermore, we concluded that there wasn't an apparent slowdown from removing the cleanup logic. I've modified the PR to only perform minimal cleanup for Diagnostics Server resources on shutdown, i.e., it will only unlink the Unix Domain Socket on non-Windows platforms.
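
For illustration only, the "minimal cleanup" policy described above could be sketched as follows. This is Python rather than the runtime's native code, and `shutdown_cleanup` and its `socket_path` parameter are hypothetical names, not the runtime's actual functions.

```python
import os
import sys


def shutdown_cleanup(socket_path):
    """Minimal shutdown cleanup: unlink the socket file on non-Windows only.

    Returns True if a file was removed, False otherwise.
    """
    if sys.platform == "win32":
        # Windows: no cleanup is performed in this sketch.
        return False
    try:
        # Remove the filesystem entry for the Unix Domain Socket.
        os.unlink(socket_path)
        return True
    except FileNotFoundError:
        # Already gone (or never created); nothing else to do.
        return False
```

The point of the sketch is what it leaves out: no attempt to cancel pending connects or join the server thread at shutdown, only removal of the socket's filesystem entry.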

@josalem josalem marked this pull request as ready for review May 7, 2020 02:58
@josalem
Contributor Author

josalem commented May 7, 2020

CI failure looks to be #35877. I'm going to rebase and rerun CI.

@josalem josalem force-pushed the dev/josalem/issue-35451-reverse-diagnostics-server branch from 8df50c8 to 08267ee Compare May 7, 2020 17:17
@josalem
Contributor Author

josalem commented May 8, 2020

@msftbot merge after @noahfalk approves

Member

@noahfalk noahfalk left a comment

LGTM

@josalem josalem merged commit 0b32ff7 into dotnet:master May 8, 2020
@josalem josalem deleted the dev/josalem/issue-35451-reverse-diagnostics-server branch May 8, 2020 16:50
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020