Segmentation fault in LibraryImportGenerator.Unit.Tests #67031

Closed
adamsitnik opened this issue Mar 23, 2022 · 13 comments
Labels
arch-arm64 area-System.Runtime.InteropServices blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' os-linux-musl Linux distributions using musl library.
Milestone

Comments

@adamsitnik
Member

Observed in https://dev.azure.com/dnceng/public/_build/results?buildId=1676597&view=results:

  Discovering: LibraryImportGenerator.Unit.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  LibraryImportGenerator.Unit.Tests (found 117 test cases)
  Starting:    LibraryImportGenerator.Unit.Tests (parallel test collections = on, max threads = 4)
    LibraryImportGenerator.UnitTests.Compiles.ValidateSnippetsWithMarshalType [SKIP]
      No current scenarios to test.
./RunTests.sh: line 168:    23 Segmentation fault      (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig LibraryImportGenerator.Unit.Tests.runtimeconfig.json --depsfile LibraryImportGenerator.Unit.Tests.deps.json xunit.console.dll LibraryImportGenerator.Unit.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/root/helix/work/workitem/e
----- end Tue Mar 22 21:09:57 UTC 2022 ----- exit code 139 ----------------------------------------------------------
exit code 139 means SIGSEGV Illegal memory access. Deref invalid pointer, overrunning buffer, stack overflow etc. Core dumped.

For more details please go to the console logs

@dotnet/interop-contrib
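
As a quick aid for reading the exit codes in logs like the one above, here is a minimal sketch. It assumes the usual POSIX shell convention that a status above 128 means the process was killed by signal number (status - 128), and it uses Linux signal numbering. The function name `describe_exit_code` and the small signal table are illustrative only and are not part of the Helix or runtime tooling.

```python
# Linux signal numbers relevant to this thread (assumption: Linux numbering).
LINUX_SIGNALS = {
    6: "SIGABRT",   # abort(), e.g. a failed runtime assert
    11: "SIGSEGV",  # invalid memory access
}


def describe_exit_code(code: int) -> str:
    """Describe a shell exit status using the 128 + signal convention."""
    if 128 < code < 256:
        signum = code - 128
        name = LINUX_SIGNALS.get(signum, f"signal {signum}")
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(139))  # killed by SIGSEGV
print(describe_exit_code(134))  # killed by SIGABRT
```

Read this way, 139 is the segmentation fault reported here, while 134 corresponds to an abort such as a failed assert in a checked runtime.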

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 23, 2022
@adamsitnik adamsitnik added arch-arm64 os-linux-musl Linux distributions using musl library. labels Mar 23, 2022
@AaronRobinsonMSFT AaronRobinsonMSFT added this to the 7.0.0 milestone Mar 23, 2022
@AaronRobinsonMSFT AaronRobinsonMSFT removed the untriaged New issue has not been triaged by the area owner label Mar 23, 2022
@joperezr joperezr added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Mar 23, 2022
@danmoseley
Member

Looking at the last few failures, they are either

Assert failure(PID 25 [0x00000019], Thread: 49 [0x0031]): m_alignpad == 0
    File: /__w/1/s/src/coreclr/vm/syncblk.cpp Line: 2952
    Image: /root/helix/work/correlation/dotnet

or

Assert failure(PID 24 [0x00000018], Thread: 34 [0x0022]): Assertion failed 'arg.IsTemp()' in 'System.Type:op_Equality(System.Type,System.Type):bool' during 'Optimize Valnum CSEs' (IL size 38; hash 0xa9f2805d; Tier1)

    File: /__w/1/s/src/coreclr/jit/morph.cpp Line: 3207
    Image: /root/helix/work/correlation/dotnet

They seem to have stopped on 4/15, though (the first kind was the last one seen). @jkotas do you recall either of these getting fixed? I can't find them in some quick searching.


https://engsrvprod.kusto.windows.net/engineeringdata

WorkItems
| where FriendlyName == "LibraryImportGenerator.Unit.Tests"
//| where Queued > ago(1d)
| where Status == "BadExit"
| where ExitCode  == 134
| join Jobs on JobId
| project
  Queued,
  FriendlyName, ExitCode,
  ConsoleUri,
  PhaseName = tostring(parse_json(Properties)["System.PhaseName"]),
  Pipeline = tostring(parse_json(Properties).DefinitionName),
  BuildId = tostring(parse_json(Properties).BuildId),
  QueueName, Source
| where Pipeline == "runtime"

@danmoseley
Member

There is also a Windows dump (no Unix dumps for some reason). It may be unrelated. This was 4/16 on Windows 10 x64. @AndyAyersMS is this something already fixed?

0:004> k
 # ChildEBP RetAddr  
00 0ed4cbf8 70f02174 coreclr!TerminateOnAssert+0x17 [D:\a\_work\1\s\src\coreclr\utilcode\debug.cpp @ 189]
01 0ed4ccd0 70f017a9 coreclr!_DbgBreakCheck+0x411 [D:\a\_work\1\s\src\coreclr\utilcode\debug.cpp @ 427]
02 0ed4cd40 70f01ad1 coreclr!_DbgBreakCheckNoThrow+0x51 [D:\a\_work\1\s\src\coreclr\utilcode\debug.cpp @ 534]
03 0ed4cddc 70bdd6c0 coreclr!DbgAssertDialog+0x20b [D:\a\_work\1\s\src\coreclr\utilcode\debug.cpp @ 695]
04 0ed4cec4 70b89e7c coreclr!CLRLastThrownObjectException::Validate+0x144 [D:\a\_work\1\s\src\coreclr\vm\clrex.cpp @ 2170]
05 0ed4f0a4 70b8bcc1 coreclr!TieredCompilationManager::CompileCodeVersion+0x251 [D:\a\_work\1\s\src\coreclr\vm\tieredcompilation.cpp @ 919]
06 0ed4f120 70b8aa28 coreclr!TieredCompilationManager::OptimizeMethod+0xc1 [D:\a\_work\1\s\src\coreclr\vm\tieredcompilation.cpp @ 877]
07 0ed4f314 70b89b1a coreclr!TieredCompilationManager::DoBackgroundWork+0x635 [D:\a\_work\1\s\src\coreclr\vm\tieredcompilation.cpp @ 763]
08 0ed4f3c4 70b89917 coreclr!TieredCompilationManager::BackgroundWorkerStart+0x1d7 [D:\a\_work\1\s\src\coreclr\vm\tieredcompilation.cpp @ 482]
09 0ed4f480 70b80b9f coreclr!TieredCompilationManager::BackgroundWorkerBootstrapper1+0xb7 [D:\a\_work\1\s\src\coreclr\vm\tieredcompilation.cpp @ 432]
0a 0ed4f4f4 70b80c25 coreclr!ManagedThreadBase_DispatchInner+0x93 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7325]
0b 0ed4f5d8 70b82d7f coreclr!ManagedThreadBase_DispatchMiddle+0x68 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7369]
0c 0ed4f614 70b82e2a coreclr!``ManagedThreadBase_DispatchOuter'::`8'::__Body::Run'::`5'::__Body::Run+0x43 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7528]
0d 0ed4f660 70b80f98 coreclr!`ManagedThreadBase_DispatchOuter'::`8'::__Body::Run+0x5a [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7528]
0e 0ed4f6bc 70b8105e coreclr!ManagedThreadBase_DispatchOuter+0x7e [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7548]
0f 0ed4f73c 70b80860 coreclr!ManagedThreadBase_FullTransition+0x9c [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7573]
10 0ed4f750 70b8980d coreclr!ManagedThreadBase::KickOff+0x10 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7608]
11 0ed4f818 779162c4 coreclr!TieredCompilationManager::BackgroundWorkerBootstrapper0+0x10d [D:\a\_work\1\s\src\coreclr\vm\tieredcompilation.cpp @ 414]
12 0ed4f82c 77a41b69 kernel32!BaseThreadInitThunk+0x24
13 0ed4f874 77a41b34 ntdll!__RtlUserThreadStart+0x2f
14 0ed4f884 00000000 ntdll!_RtlUserThreadStart+0x1b

@jkotas
Member

jkotas commented Apr 22, 2022

@jkotas do you recall either of these getting fixed?

  • m_alignpad == 0 is a classic GC hole type of crash. I do not recall a fix going in within the last week that would have fixed a GC hole.
  • `Assertion failed 'arg.IsTemp()'` is a failure due to a problem in the PR (which was not merged yet).

There is also a Windows dump (no Unix dumps for some reason)

I do not see this dump. Also, the stacktrace does not look like anything else mentioned in this issue so far.

no Unix dumps for some reason

The source generator tests are running orders of magnitude more code than other libraries tests, so they will tend to hit intermittent runtime stress bugs more frequently.

The next action for this one should be to figure out why we are not getting dumps. We are not going to make progress on intermittent crashes like this without dumps.

@AndyAyersMS
Member

@AndyAyersMS is this something already fixed?

I don't know -- from the stack this seems to be related to the TC mechanism.

Looks like there is a windows dump for this merged PR: #67184 under https://dev.azure.com/dnceng/public/_build/results?buildId=1701571&view=ms.vss-test-web.build-test-results-tab

  Starting:    LibraryImportGenerator.Unit.Tests (parallel test collections = on, max threads = 2)
    LibraryImportGenerator.UnitTests.Compiles.ValidateSnippetsWithMarshalType [SKIP]
      No current scenarios to test.
Fatal error. Internal CLR error. (0x80131506)

@AndyAyersMS
Member

Updated query to see Windows failures (-1234 is a cancelled job, so I excluded those):

WorkItems
| where FriendlyName == "LibraryImportGenerator.Unit.Tests"
//| where Queued > ago(1d)
| where Status == "BadExit"
| where ExitCode != -1234
| join Jobs on JobId
| project
  Queued,
  FriendlyName, ExitCode,
  ConsoleUri,
  PhaseName = tostring(parse_json(Properties)["System.PhaseName"]),
  Pipeline = tostring(parse_json(Properties).DefinitionName),
  BuildId = tostring(parse_json(Properties).BuildId),
  QueueName, Source
| where Pipeline == "runtime"

@danmoseley
Member

There are in fact Linux dumps - query error.


https://engsrvprod.kusto.windows.net/engineeringdata

let wi = 
WorkItems
| join kind=leftsemi (Jobs | where Queued > ago (30d) | where Source == "ci/public/dotnet/runtime/refs/heads/main") on $left.JobName == $right.Name
| where ExitCode != 0
| where FriendlyName == "LibraryImportGenerator.Unit.Tests";
Files 
| lookup kind=inner wi on $left.WorkItemName == $right.Name
| where FileName == "how-to-debug-dump.md"
| project Timestamp, QueueName, ExitCode, Uri, ConsoleUri

This gives URLs to get "how-to-debug-dump.md" for these.

@jkotas
Member

jkotas commented Apr 22, 2022

There are in fact Linux dumps - query error.

I have looked at a number of these and they are all from JIT stress runs, typically hitting asserts in the JIT, not something that runs in standard CI by default.

@danmoseley
Member

There are non-JIT-stress ones too, but on checked CLR, and only up to 4/9.

let wi =
WorkItems
| join kind=leftsemi (Jobs | where Queued > ago (30d) | where Source == "ci/public/dotnet/runtime/refs/heads/main") on $left.JobName == $right.Name
| where ExitCode != 0
| where FriendlyName == "LibraryImportGenerator.Unit.Tests";
Files
| lookup kind=inner wi on $left.WorkItemName == $right.Name
| where ExitCode == 134
| where FileName == "how-to-debug-dump.md"
| join WorkItems on $left.WorkItemName == $right.Name
| join Jobs on $left.JobName == $right.Name
| extend PhaseName = tostring(parse_json(Properties)["System.PhaseName"]),
Pipeline = tostring(parse_json(Properties).DefinitionName),
BuildId = tostring(parse_json(Properties).BuildId)
| where Pipeline !contains("jitstress")
| project Timestamp, QueueName, ExitCode, Uri, ConsoleUri, PhaseName, Pipeline, BuildId

@danmoseley
Member

It seems to me we can put aside the Linux issues and use this to track the Windows issues - @AndyAyersMS's query works for those.

I probably can't debug further myself. Just trying to figure out next actions on these.

@AndyAyersMS
Member

Looks like there is a windows dump for this merged PR: #67184 under https://dev.azure.com/dnceng/public/_build/results?buildId=1701571&view=ms.vss-test-web.build-test-results-tab

Seems like this crash is perhaps related to #67144 which was fixed on 4/15 by #67922.

Perhaps these tests are also stressing collectible assemblies?

00 (Inline Function)     : --------`-------- --------`-------- --------`-------- --------`-------- : coreclr!LoaderAllocator::GetLoaderAllocatorObjectHandle+0x4 [D:\a\_work\1\s\src\coreclr\vm\loaderallocator.hpp @ 483] 
01 (Inline Function)     : --------`-------- --------`-------- --------`-------- --------`-------- : coreclr!MethodTable::GetLoaderAllocatorObjectHandle+0x4 [D:\a\_work\1\s\src\coreclr\vm\methodtable.inl @ 1360] 
02 00007fff`53a0a571     : 00000106`00000000 000000c6`fbfd4c50 000000c7`060ae3e8 000000c7`060451c8 : coreclr!MethodTable::GetLoaderAllocatorObjectForGC+0xf [D:\a\_work\1\s\src\coreclr\vm\methodtable.cpp @ 8093] 
03 (Inline Function)     : --------`-------- --------`-------- --------`-------- --------`-------- : coreclr!GCToEEInterface::GetLoaderAllocatorObjectForGC+0x5 [D:\a\_work\1\s\src\coreclr\vm\gcenv.ee.cpp @ 444] 
04 00007fff`5399ea6d     : 00000000`00000000 0000b1c4`032d9d27 00000107`95fd87b0 00000107`95fd8718 : coreclr!WKS::gc_heap::mark_object_simple+0x101 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 24021] 
05 (Inline Function)     : --------`-------- --------`-------- --------`-------- --------`-------- : coreclr!WKS::gc_heap::mark_through_cards_helper+0x4e [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 36746] 
06 00007fff`53997999     : 00007fff`53a0a470 00000000`00000001 00000000`00000000 00007fff`00000000 : coreclr!WKS::gc_heap::mark_through_cards_for_uoh_objects+0x79d [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 42158] 
07 00007fff`53996a15     : 00000000`00000000 00000000`00000000 00000003`ae4bb4f2 00000000`00000000 : coreclr!WKS::gc_heap::mark_phase+0x51d [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 25768] 
08 00007fff`5399888b     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : coreclr!WKS::gc_heap::gc1+0xc1 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 20590] 
09 (Inline Function)     : --------`-------- --------`-------- --------`-------- --------`-------- : coreclr!GCToOSInterface::GetLowPrecisionTimeStamp+0x5 [D:\a\_work\1\s\src\coreclr\vm\gcenv.os.cpp @ 1023] 
0a 00007fff`539f1f17     : 00000000`00000001 00000000`00000000 00000000`00000000 00000000`00000000 : coreclr!WKS::gc_heap::garbage_collect+0x4df [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 22351] 
0b 00007fff`539f1ca3     : 00000000`00000000 00000000`00000000 00007fff`53d9b480 00000000`00000018 : coreclr!WKS::GCHeap::GarbageCollectGeneration+0x14f [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 45922] 
0c 00007fff`53999aeb     : 00000000`00000000 00000107`940ff238 00007fff`53d9b480 00000000`00000000 : coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x2b [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 17320] 
0d 00007fff`53999879     : 00000000`00000000 00000107`940ff238 00000000`00000028 00007fff`53d9b480 : coreclr!WKS::gc_heap::try_allocate_more_space+0x24b [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 17466] 
0e 00007fff`5399a934     : 000000c7`046d8638 00000000`00000002 00000000`00000028 00000000`00000000 : coreclr!WKS::gc_heap::allocate_more_space+0x31 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 17934] 
0f (Inline Function)     : --------`-------- --------`-------- --------`-------- --------`-------- : coreclr!WKS::gc_heap::allocate+0x58 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 17965] 
10 00007fff`5395beb2     : 00000107`940ff1e0 00000107`95fd8f40 00000000`00000028 00000000`00000002 : coreclr!WKS::GCHeap::Alloc+0x84 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 44883] 
11 00007fff`5395a995     : 00007ffe`f53a7698 00000000`00000002 00007ffe`f53a7698 00007fff`5397301a : coreclr!Alloc+0x9a [D:\a\_work\1\s\src\coreclr\vm\gchelpers.cpp @ 237] 
12 00007fff`5396bedc     : 000000c7`046c93f0 00007ffe`f53a7698 00000107`95fd8bd8 00000107`940ff238 : coreclr!AllocateObject+0x7d [D:\a\_work\1\s\src\coreclr\vm\gchelpers.cpp @ 979] 
13 00007fff`5396ae22     : 00007ffe`f53a7698 00000000`00000000 00000000`00000078 00000107`940ff238 : coreclr!MethodTable::FastBox+0x28 [D:\a\_work\1\s\src\coreclr\vm\methodtable.cpp @ 3446] 
14 00007ffe`f6a13eb4     : 00007ffe`f53a7698 7fffffff`ffffffe0 00000000`00000000 000000c7`046d85c0 : coreclr!JIT_Box+0x112 [D:\a\_work\1\s\src\coreclr\vm\jithelpers.cpp @ 2714] 
15 00007ffe`f53a7698     : 7fffffff`ffffffe0 00000000`00000000 000000c7`046d85c0 000000c7`046d85c0 : System_Collections_Concurrent!System.Collections.Concurrent.ConcurrentDictionary<Key,Microsoft.CodeAnalysis.CSharp.Symbols.NamedTypeSymbol>.TryGetValue+0x84 [/_/src/libraries/System.Collections.Concurrent/src/System/Collections/Concurrent/ConcurrentDictionary.cs @ 408] 
16 7fffffff`ffffffe0     : 00000000`00000000 000000c7`046d85c0 000000c7`046d85c0 00000000`00000000 : 0x00007ffe`f53a7698
17 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x7fffffff`ffffffe0

@jkotas
Member

jkotas commented Apr 22, 2022

Seems like this crash is perhaps related to #67144 which was fixed on 4/15 by #67922.

Yes, that sounds plausible.

@jkotas
Member

jkotas commented Apr 23, 2022

This failed on linux-arm64 again in #68436. I have opened a fresh issue that is explicitly about the Segmentation fault on linux-arm64 only.

@ghost ghost locked as resolved and limited conversation to collaborators May 23, 2022