-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test hang in System.Diagnostics.Tracing.EventProvider.EventUnregister #35451
Comments
I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label. |
Hit by #35427 in https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-35427-merge-6fdb541e69394f00ba/System.Threading.Thread.Tests/console.72689c94.log?sv=2019-02-02&se=2020-05-05T01%3A21%3A24Z&sr=c&sp=rl&sig=Aalm%2BlDqE2HM88uUCuqEEJtCb%2F6EW2vB%2BQH9i9G1F6Y%3D (test System.Threading.Threads.Tests.ThreadTests.WindowsPrincipalPolicyTest_Windows) |
Saved dump at \jkotas9\drops\issue-35451 We are stuck inside
This may be duplicate of #13610 |
@noahfalk for thoughts on what might be causing this |
I'll see if I can take a closer look tomorrow (I'll need to get back on corpnet to access the dump). Off the cuff that callstack looked like the thread was blocked on the loader lock so probably some other thread is holding it. If so we'll spot the culprit in the dump. |
@jkotas - I wasn't able to access the dump on your share. I can list directories in \\jkotas9 and \\jkotas9\public, but I get access denied trying to open \\jkotas9\drops. For comparison I am able to access shares on \\cpvsbuild\ so I suspect this is a security configuration issue on your share rather than a broader issue with my credentials or VPN connection. |
Attaching windbg showed the loader lock was unowned:
Best guess the loader lock was owned earlier causing the OS to switch the thread out, later the lock was released and for whatever reason the OS hasn't yet switched the thread back in. Jan's suggestion of #13610 sounds very plausible to me. If we wanted more data on this failure (or failures of this class) we'd probably need to upgrade the telemetry to determine if the thread is making progress and what behavior is happening machine wide. A better set of telemetry for a suspected hung test to capture would be:
Closing this as duplicate of #13610 |
This was most likely introduced by #33307 according the CI telemetry. See #13610 (comment) for more context. |
Another data point for this, based on the Kusto data I see, this only happens in Windows7 and Windows8.1 -- I tried to repro locally on Windows 10 and after 150 runs it didn't repro. |
Interesting! I'll take a look at this today. Thanks for triaging things @safern! I'm not sure what would be causing this issue, but based on the versions being hit, I'm guessing it might be something with Overlapped IO in old versions of Windows. I'll have to do some investigating. |
I spun up a win7 VM and I'm trying to catch this, but I can't seem to reproduce the failure. I am running the System.Collections.Concurrent.Tests suite in a loop and having it ping me if the test takes longer than 60 seconds. So far I've noticed that the tests either take almost exactly 30 seconds or they take almost exactly 65 seconds. My VM is 2 vCPU 4GB, so it's a little bigger than the ones in CI unfortunately. I've narrowed the test cycle down to just the System.Collections.Concurrent.Tests.EtwTests test that I saw a failure on in the other thread. That test usually completes in <1 second, but every once in a while takes >10 seconds (1 in 750+ loops so far). I'll leave it running overnight and see if it gets hit more often. I might try artificially bogging down the machine as well and see if that changes things. @safern, were you ever able to locally reproduce this? While I try to get a repro set up, I'll take another look at the dump that Jan shared. Based on what I've already seen and what Noah found, I'm not sure there is enough in the dump to root cause this. The thing I found interesting, is that the dump shows that the DiagnosticsServer is inactive, i.e., it's in the middle of a |
By "ping" I meant that it literally made a ping sound from the console. I had stepped away from my computer for a while, and it started beeping. It seems like after 1000 loops most of the runs are taking >10 seconds! The xunit output claims the tests ran in 5 seconds (still a big increase over the 0.6 second runtime from the first 900 loops), but the |
I left my script running overnight. I'm seeing some interesting behavior that I'm going to try and reproduce the rest of today. I wrote up a powershell script does the following in an infinite loop: $Now = Get-Date
$job = Start-Job -ScriptBlock {
# invoke RuntTests.cmd scoped to just System.Collections.Concurrent.Tests.EtwTests
}
While ($job.State -eq 'Running') {
if ((Get-Date).AddSeconds(-10) -gt $Now) {
# beep and log
}
}
$Output = $job | Receive-Job
Write-Host $Output The time check is wall clock time, so it isn't timing how long xunit claims the tests executed for. For the first ~1000 iterations, the job would compelete in <10s and xunit would claim ~0.6s of test time. After that, almost every job took >10s and xunit claimed the tests tool 3-6s, which is 10x time increase at the upper end. After ~3000 iterations, it looks like something went haywire and the powershell job died, but was still in the running state. The time check loop never exited, but the state was still in I want to build a more robust script that does some more interesting logging, especially better timing. Right now, my thinking is there might be some kind of OS scheduling issue that we are exacerbating with high volumes of |
Can we revert the offending PR? This is causing at least one leg in the majority of PRs to fail. It's going to make it much more likely that other regressions slip in, as we're having to rerun lots of legs and merge with failures that might be masking other failures, and it's going to make it more likely devs assume that failures aren't theirs even if they are in fact caused by their PR. Thanks! |
I agree this is being hit in 50+ runs a day and we’re having a 0% pass rate now in rolling builds as well. |
Reverting in #35767 |
I have stepped through the Event Pipe IPC shutdown sequence. I believe that there is a clear HANDLE use-after-close bug: Handle that is being closed here:
is being waited on - or about to be waited on - here:
How does this explain the hang:
Why is it more likely to repro on Win7/8:
The shutdown sequence of the event pipe need careful review. Also, it would be a good idea to minimize work done during shutdown to only what's absolutely necessary, e.g. these event handles do not really need to be closed since the OS will close them anyway as part of the process shutdown. |
The dump at |
Thanks, Jan! |
Oof! That's a tricky one! Thanks for pinpointing the issue @jkotas! I'm writing a fix for it now. Earlier iterations of the PR specifically didn't close all the handles and didn't attempt to cancel any waits, but I recalled #12991 which was "fixed" by closing handles to break out of a blocking. I'll come up with a more elegant approach that cancels the blocking |
The text was updated successfully, but these errors were encountered: