Fatal CLR exception in .NET 5.0.0-RC2 high async/await load application #43707

Closed
hoyosjs opened this issue Oct 21, 2020 · 12 comments

@hoyosjs
Member

hoyosjs commented Oct 21, 2020

I spoke too soon. Ignore the previous post.

For some bizarre reason I am getting this error today, even though the code I am running has been mostly the same and working for a while now. Bizarre.

I don't have a debugger attached; I am running a console app in RELEASE mode under a massive parallel async/await load that consumes many CPU cores.

I am running .NET 5 RC2 and possibly forgot to set $env:DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS = 1, which I have been using and experimenting with extensively. I left it off; maybe that is what was different. The app is running again, but this setting has never done anything like this before.
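
For reference, this switch is read by the .NET 5 runtime from the environment at process start, so it has to be set in the shell before launching the app. A minimal sketch using the same PowerShell syntax as above (the dotnet run invocation is just a placeholder for however the app is normally launched):

    # PowerShell: opt in to pooled async ValueTasks for processes started from this session
    $env:DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS = 1
    dotnet run -c Release    # or start the published executable directly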

This is all the debug info I could find:

Console stack that was dumped this time:

Fatal error. Internal CLR error. (0x80131506)
   at Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.CheckPoint(Int32[], Int32[])
   at Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.EncodePoint(PointAccum, Byte[], Int32)
   at Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ImplVerify(Byte[], Int32, Byte[], Int32, Byte[], Byte, Byte[], Int32, Int32)
   at zero.cocoon.models.CcSubspaceMessage+<ConsumeAsync>d__28.MoveNext()
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[zero.cocoon.models.CcSubspaceMessage+<ConsumeAsync>d__28, zero.cocoon, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]](<ConsumeAsync>d__28 ByRef)
 ...

dotnet-dump output when trying to get the stack of the CLR thread that had the exception on it (thread id 486):

Unhandled exception: Microsoft.Diagnostics.DebugServices.DiagnosticsException: Invalid thread index: 486 ---> System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values.
Parameter name: index
   at System.Linq.Enumerable.ElementAt[TSource](IEnumerable`1 source, Int32 index)
   at Microsoft.Diagnostics.DebugServices.ThreadService.Microsoft.Diagnostics.DebugServices.IThreadService.GetThreadInfoFromIndex(Int32 threadIndex) in /_/src/Microsoft.Diagnostics.DebugServices/ThreadService.cs:line 285
   --- End of inner exception stack trace ---
   at Microsoft.Diagnostics.Repl.CommandProcessor.Handler.Invoke(MethodInfo methodInfo, InvocationContext context) in /_/src/Microsoft.Diagnostics.Repl/Command/CommandProcessor.cs:line 297
   at Microsoft.Diagnostics.Repl.CommandProcessor.Handler.System.CommandLine.Invocation.ICommandHandler.InvokeAsync(InvocationContext context) in /_/src/Microsoft.Diagnostics.Repl/Command/CommandProcessor.cs:line 256
--- End of stack trace from previous location where exception was thrown ---
   at System.CommandLine.Invocation.InvocationPipeline.<>c__DisplayClass4_0.<<BuildInvocationChain>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.CommandLine.Invocation.InvocationExtensions.<>c.<<UseParseErrorReporting>b__22_0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.CommandLine.Invocation.InvocationExtensions.<>c.<<UseSuggestDirective>b__7_0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.CommandLine.Invocation.InvocationExtensions.<>c.<<UseParseDirective>b__6_0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.CommandLine.Invocation.InvocationExtensions.<>c.<<UseHelp>b__20_0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.CommandLine.Invocation.InvocationExtensions.<>c__DisplayClass5_0.<<UseExceptionHandler>b__0>d.MoveNext()
>

Originally posted by @tactical-drone in #13590 (comment)

Dotnet-GitSync-Bot added the area-Diagnostics-coreclr and untriaged (New issue has not been triaged by the area owner) labels Oct 21, 2020
@ghost

ghost commented Oct 21, 2020

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

@hoyosjs
Member Author

hoyosjs commented Oct 21, 2020

@tactical-drone the issue you were describing was really different. However, the data you provided was not. The second one caused dotnet-dump to fail, which is a completely different issue altogether (@mikem8361 FYI). However, the frames I see there only tell me that some check in Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.CheckPoint threw an exception. Is there any way you can share the dump you captured, or share a repro with us? There's not much information on this issue to do an investigation with as it stands right now. I see it was doing an operation here https://github.com/bcgit/bc-csharp/blob/5bd4c8c70f80f1e7ead8e3c73459b78eb93d0ef7/crypto/src/math/ec/rfc8032/Ed25519.cs#L120 and then something threw a critical exception that shouldn't get to the user. I am not sure what to classify this as; @jkotas, would this be VM?

@mikem8361
Member

The dotnet-dump setthread issue is just that you need to add the --tid option, or use this thread's "index" (0 - N) from the threads command.

setthread --tid 486
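
In a dotnet-dump analyze session that sequence looks roughly like this (the dump path is a placeholder, and 486 is the OS thread id from the report above):

    dotnet-dump analyze <path-to-dump>
    > threads                # list managed threads with their indices and OS thread ids
    > setthread --tid 486    # select the faulting thread by OS thread id
    > clrstack               # print the managed stack for the selected thread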

@tactical-drone

tactical-drone commented Oct 22, 2020

This commit produced the error.

To reproduce, just check out the net5 branch and run on RC2. But you do need a beefy PC for those test parameters; I run my tests on a 3950X with 64 GB of RAM, which is effectively a mini datacenter. Set the total node count to something more manageable, like 500, and bump the value until your memory runs out; 2000 nodes require about 10 GB of RAM. Then wait for the error to occur. (All the program does is spawn a bunch of TCP/IP clients that connect to each other and continuously exchange test bits; 2000 nodes create about 30K connections, so this starts to add up.)
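
Roughly, those steps translate to the sketch below; the repository URL and directory are placeholders since they are not given in this thread, and the node count is whatever you set in the test parameters:

    git clone <repo-url>      # the reporter's project; URL not included here
    cd <repo-dir>
    git checkout net5
    dotnet run -c Release     # on the .NET 5 RC2 SDK, with total nodes dialed down to ~500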

With regards to the failure: I haven't seen it since. It occurred on a fresh boot (a boot that felt unstable and was rebooted not long after). I am 100% sure $env:DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS = 1 was not set. I don't think it has anything to do with this setting, but that is the only change I made (apart from these commits: a0b10c9f, af69df5) that fixed the issue.

The reason I initially thought a0b10c9f was the fix is that no other changes I brought in could cause a CLR crash. The very first crash looked to be caused by a TaskCanceledException coming from zerosemaphore, so I thought I understood the cause and made a0b10c9f. But I spoke too soon: the subsequent test failed with the same CLR error, this time with an accompanying stack trace and not a TaskCanceledException but something in Bouncy Castle. This made it look like a new random CLR error not caused by me. (The CLR can only be brought down by unhandled exceptions in certain types of closures.)
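
To illustrate that parenthetical point about closures: an exception that escapes an async void method (or any fire-and-forget continuation with no Task left to observe it) will take the whole process down. A minimal, self-contained C# sketch of that failure mode follows; it is hypothetical code, not the reporter's, and it produces an ordinary unhandled-exception crash rather than the internal CLR error shown above:

    using System;
    using System.Threading.Tasks;

    class AsyncVoidCrash
    {
        // An exception escaping an async void method has no Task to carry it,
        // so it is rethrown on the thread pool and terminates the process.
        static async void FireAndForget()
        {
            await Task.Delay(10);
            throw new InvalidOperationException("unobserved failure");
        }

        static void Main()
        {
            FireAndForget();
            Console.ReadLine(); // the process dies before this returns
        }
    }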

I then thought wtf, started googling, set $env:DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS = 1, and never saw the issue again. I am still not sure what exactly happened or why things are back to normal; I just got that CLR error back to back out of nowhere and then it was gone. My other IDE (which is not vscode or vs2019 but uses dotnet.exe to build) sometimes screws up the build after a small core lib refactor, so I am not sure if the subsequent run's CLR error was caused by something like that. I have seen roughly 3 out of 2000 .NET 5 builds where part of the previous code refactor did not show up, but a full rebuild in vs2019 then fixes the missing refactor. It could be a race condition between that IDE's build and its save-all-files logic. Who knows?

In any case, I am not using vs2019 or its debugger, since from RC1 onwards those became massively unstable. Before RC1 I could regularly debug my program from the vs2019 preview, but after RC1 only very small test parameters work; otherwise the debug session just hangs. Actually, it hangs regardless.

@hoyosjs
Member Author

hoyosjs commented Oct 22, 2020

> In any case, I am not using vs2019 or it's debugger since from RC1 those became massively unstable. Before RC1 I could regularly debug my program from vs2019 preview, but after rc1 only very small test parameters work. Otherwise the debug session just hangs. It hangs regardless actually.

Do you have any feedback on this? It is concerning, and I am wondering if it was us who caused such instability. There was a bug that had to be fixed not too long ago (1744896) that would explain this behavior; the fix was part of the RC2 release.

@tactical-drone

tactical-drone commented Oct 23, 2020

@hoyosjs RC2 has exactly the same debugger issue as RC1.

I would say that fairly regularly a debug session would just lock up when I run fewer than 30 nodes. The lockup happens within 20 seconds of the debug session starting, or NEVER.

Between 30 and 100 nodes, the lockup becomes close to a certainty.

Failure type 1: Small test

Sometimes the debug session just locks up. This means the IDE's debug area with CPU and Memory becomes unresponsive and frozen. Stopping the debugger, closing the purple cmd window, and rerunning normally mends the issue. It feels like closing that window increases the probability of the next run working, and not closing it almost guarantees failure. I could be wrong.

Failure type 2: Progressively larger test

This one can go two ways: a total lockup as in failure type 1, or a "lockup" where the IDE's CPU and Memory region still works, showing no CPU activity while the Memory region shows consistent linear growth in memory usage of about 20 MB/s (with no GCs), which looks like a runaway process in the CLR.

@tactical-drone

tactical-drone commented Oct 23, 2020

Another issue I am having since RC1 is the following:

I am testing for memory leaks; this consists of a longish run in RELEASE mode and then a teardown cycle that ends with a manual GC.Collect.
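
As a side note on that manual collection step: for finalizable objects to actually be gone before the snapshot, the teardown usually needs the collect/finalize/collect sequence. A minimal sketch of such a helper (hypothetical, standard GC APIs only, not the reporter's code):

    using System;

    static class TeardownGc
    {
        // Force a full, blocking collection so finalizable objects are
        // reclaimed before a heap snapshot is taken.
        public static void FullCollect()
        {
            GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true);
            GC.WaitForPendingFinalizers();
            GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, blocking: true);
        }
    }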

I then go to vs2019, attach the debugger to the process, and take a memory snapshot. During that heap snapshot calculation (a fat one that sometimes shows a terminate dialog but completes), the debugger sometimes crashes, taking the process with it.

This did not happen pre-RC1, even with larger heap snapshot calculations.

Furthermore, since RC1 debugging has become equally unstable in vs2019 and in Rider; they both exhibit exactly the same issues, so I don't think it has anything to do with the IDE itself.

Something else I have seen since RC1: with these super large tests, when they do work, I see strange things in the heap after teardown. Some of my objects are still floating around, but the numbers are not right. For example, I spawned 2K nodes and the heap shows exactly ONE still alive after teardown; that just does not make sense. I have been doing this leak debugging for a while now, and since RC1 the large tests produce snapshots that I cannot understand. In smaller (but still large) runs everything works.

@hoyosjs
Member Author

hoyosjs commented Oct 23, 2020

@dotnet/dotnet-diag, as the last couple of points seem worth investigating.

@sdmaclea
Contributor

> strange things in the HEAP after teardown
> I have been doing this leak debugging for a while now and since RC1 the large tests produce snapshots that I cannot understand

Could this be related to the addition of the pinned object heap?

/cc @Maoni0
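
For context on that question: the pinned object heap is new in .NET 5 and only receives objects that are explicitly allocated as pinned; regular allocations are unaffected. A minimal sketch of an allocation that lands on the POH (hypothetical, purely for illustration):

    using System;

    class PohSample
    {
        static void Main()
        {
            // GC.AllocateArray with pinned: true (new in .NET 5) places the
            // array on the pinned object heap instead of the normal heaps.
            byte[] buffer = GC.AllocateArray<byte>(4096, pinned: true);
            Console.WriteLine(buffer.Length);
        }
    }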

mangod9 removed the untriaged label Oct 26, 2020
mangod9 added this to the 6.0.0 milestone Oct 26, 2020
@mangod9
Member

mangod9 commented Oct 26, 2020

Is there a repro for this?

@tactical-drone

The .NET 5 release seems to have fixed these issues I had.

I can now debug again, like I could before.

Good job men!

@hoyosjs
Member Author

hoyosjs commented Nov 11, 2020

Glad to hear the issues have gone away so far. I'll close this, but feel free to ping again if the behavior surfaces again.

hoyosjs closed this as completed Nov 11, 2020
ghost locked as resolved and limited conversation to collaborators Dec 11, 2020