Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert failure: !CREATE_CHECK_STRING(pMT && pMT->Validate()) #67046

Closed
BruceForstall opened this issue Mar 23, 2022 · 6 comments · Fixed by #68105
Closed

Assert failure: !CREATE_CHECK_STRING(pMT && pMT->Validate()) #67046

BruceForstall opened this issue Mar 23, 2022 · 6 comments · Fixed by #68105
Assignees
Labels
arch-arm32 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI GCStress JitStress CLR JIT issues involving JIT internal stress modes os-linux Linux OS (any supported distro)
Milestone

Comments

@BruceForstall
Copy link
Member

https://dev.azure.com/dnceng/public/_build/results?buildId=1675437&view=ms.vss-test-web.build-test-results-tab&runId=45952896&resultId=184681&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

+ grep COMPlus
COMPlus_TieredCompilation=1
COMPlus_JitStress=2
COMPlus_DbgMiniDumpName=/home/helixbot/dotnetbuild/dumps/coredump.%d.dmp
COMPlus_DbgEnableMiniDump=1
+ ./RunTests.sh --runtime-path /root/helix/work/correlation
----- start Tue Mar 22 07:59:07 UTC 2022 =============== To repro directly: =====================================================
pushd .
/root/helix/work/correlation/dotnet exec --runtimeconfig System.Text.Encodings.Web.Tests.runtimeconfig.json --depsfile System.Text.Encodings.Web.Tests.deps.json xunit.console.dll System.Text.Encodings.Web.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing 
popd
===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.Text.Encodings.Web.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.Encodings.Web.Tests (found 212 test cases)
  Starting:    System.Text.Encodings.Web.Tests (parallel test collections = on, max threads = 4)

Assert failure(PID 62 [0x0000003e], Thread: 86 [0x0056]): !CREATE_CHECK_STRING(pMT && pMT->Validate())
    File: /__w/1/s/src/coreclr/vm/object.cpp Line: 514
    Image: /root/helix/work/correlation/dotnet

Gathering state for process 62 dotnet
Crashing thread 00000056 signal 00000006
Writing minidump with heap to file /home/helixbot/dotnetbuild/dumps/coredump.62.dmp
Written 164274176 bytes (40106 pages) to core file

GCStress wasn't set, but this is typically a GC info bug. JitStress is probably irrelevant.

@BruceForstall BruceForstall added arch-arm32 os-linux Linux OS (any supported distro) GCStress JitStress CLR JIT issues involving JIT internal stress modes area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Mar 23, 2022
@BruceForstall BruceForstall added this to the 7.0.0 milestone Mar 23, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 23, 2022
@ghost
Copy link

ghost commented Mar 23, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

https://dev.azure.com/dnceng/public/_build/results?buildId=1675437&view=ms.vss-test-web.build-test-results-tab&runId=45952896&resultId=184681&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

+ grep COMPlus
COMPlus_TieredCompilation=1
COMPlus_JitStress=2
COMPlus_DbgMiniDumpName=/home/helixbot/dotnetbuild/dumps/coredump.%d.dmp
COMPlus_DbgEnableMiniDump=1
+ ./RunTests.sh --runtime-path /root/helix/work/correlation
----- start Tue Mar 22 07:59:07 UTC 2022 =============== To repro directly: =====================================================
pushd .
/root/helix/work/correlation/dotnet exec --runtimeconfig System.Text.Encodings.Web.Tests.runtimeconfig.json --depsfile System.Text.Encodings.Web.Tests.deps.json xunit.console.dll System.Text.Encodings.Web.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing 
popd
===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.Text.Encodings.Web.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.Encodings.Web.Tests (found 212 test cases)
  Starting:    System.Text.Encodings.Web.Tests (parallel test collections = on, max threads = 4)

Assert failure(PID 62 [0x0000003e], Thread: 86 [0x0056]): !CREATE_CHECK_STRING(pMT && pMT->Validate())
    File: /__w/1/s/src/coreclr/vm/object.cpp Line: 514
    Image: /root/helix/work/correlation/dotnet

Gathering state for process 62 dotnet
Crashing thread 00000056 signal 00000006
Writing minidump with heap to file /home/helixbot/dotnetbuild/dumps/coredump.62.dmp
Written 164274176 bytes (40106 pages) to core file

GCStress wasn't set, but this is typically a GC info bug. JitStress is probably irrelevant.

Author: BruceForstall
Assignees: -
Labels:

arch-arm32, os-linux, GCStress, JitStress, area-CodeGen-coreclr, untriaged

Milestone: 7.0.0

@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Mar 28, 2022
@BruceForstall
Copy link
Member Author

BruceForstall commented Apr 15, 2022

After configuring a Linux/arm32 environment (using Docker on Linux/arm64), with Release libraries and Checked coreclr, I can easily reproduce this failure, with COMPlus_TieredCompilation=1 and COMPlus_JitStress=2. However, it won't repro with:

  1. COMPlus_TieredCompilation=0
  2. COMPlus_JitStress unset
  3. COMPlus_GCStress=4
  4. COMPlus_ReadyToRun=0
  5. xunit -parallel none -maxthreads 1
  6. a Debug CLR or JIT
  7. Just a Checked JIT, but Release CoreCLR

The stack is:

Thread 23 ".NET Long Runni" received signal SIGSEGV, Segmentation fault.
MethodTable::SanityCheck (this=0x4) at /home/bruce/gh/runtime2/src/coreclr/vm/methodtable.cpp:5967
5967        _ASSERTE((GetComponentSize() <= 2) || IsArray());
(gdb) bt
#0  MethodTable::SanityCheck (this=0x4) at /home/bruce/gh/runtime2/src/coreclr/vm/methodtable.cpp:5967
#1  0xf6e939f4 in MethodTable::Validate (this=0x4) at /home/bruce/gh/runtime2/src/coreclr/vm/methodtable.cpp:8046
#2  0xf6e976b0 in Object::ValidateInner (this=0xe8429654, bDeep=0x1, bVerifyNextHeader=0x1, bVerifySyncBlock=0x1) at /home/bruce/gh/runtime2/src/coreclr/vm/object.cpp:514
#3  0xf70466fa in GcInfoDecoder::ReportStackSlotToGC (this=0xe7c233f8, spOffset=0x20, spBase=<optimized out>, gcFlags=0x0, pRD=0xe7c238d0, flags=<optimized out>, pCallBack=
    0xf6f6a61d <GcEnumObject(void*, OBJECTREF*, unsigned int)>, hCallBack=0xe7c23eb4) at /home/bruce/gh/runtime2/src/coreclr/vm/gcinfodecoder.cpp:1868
#4  0xf7044b02 in GcInfoDecoder::ReportSlotToGC (this=<optimized out>, slotDecoder=..., slotIndex=<optimized out>, pRD=<optimized out>, reportScratchSlots=<optimized out>, inputFlags=<optimized out>, pCallBack=<optimized out>,
    hCallBack=<optimized out>) at /home/bruce/gh/runtime2/src/coreclr/inc/gcinfodecoder.h:692
#5  GcInfoDecoder::ReportUntrackedSlots (this=0xe7c233f8, slotDecoder=..., pRD=0xe7c238d0, inputFlags=0x1, pCallBack=0xf6f6a61d <GcEnumObject(void*, OBJECTREF*, unsigned int)>, hCallBack=0xe7c23eb4)
    at /home/bruce/gh/runtime2/src/coreclr/vm/gcinfodecoder.cpp:1018
#6  GcInfoDecoder::EnumerateLiveSlots (this=0xe7c233f8, pRD=0xe7c238d0, reportScratchSlots=0x1, inputFlags=<optimized out>, pCallBack=0xf6f6a61d <GcEnumObject(void*, OBJECTREF*, unsigned int)>, hCallBack=0xe7c23eb4)
    at /home/bruce/gh/runtime2/src/coreclr/vm/gcinfodecoder.cpp:967
#7  0xf6e1f866 in EECodeManager::EnumGcRefs (this=<optimized out>, pRD=0xe7c238d0, pCodeInfo=0xe7c237f4, flags=0x1, pCallBack=0xf6f6a61d <GcEnumObject(void*, OBJECTREF*, unsigned int)>, hCallBack=0xe7c23eb4,
    relOffsetOverride=0xffffffff) at /home/bruce/gh/runtime2/src/coreclr/vm/eetwain.cpp:5308
#8  0xf6f6a9c6 in GcStackCrawlCallBack (pCF=0xe7c235d8, pData=0xe7c23eb4) at /home/bruce/gh/runtime2/src/coreclr/vm/gcenv.ee.common.cpp:282
#9  0xf6ebd5e4 in Thread::MakeStackwalkerCallback (this=0xbca218, pCF=0xe7c235d8, pCallback=0xf6f6a735 <GcStackCrawlCallBack(CrawlFrame*, void*)>, pData=0xe7c23eb4, uFramesProcessed=0x2)
    at /home/bruce/gh/runtime2/src/coreclr/vm/stackwalk.cpp:833
#10 0xf6ebd7da in Thread::StackWalkFramesEx (this=0xbca218, pRD=0xe7c238d0, pCallback=0xf6f6a735 <GcStackCrawlCallBack(CrawlFrame*, void*)>, pData=0xe7c23eb4, flags=0x8500, pStartFrame=0x0)
    at /home/bruce/gh/runtime2/src/coreclr/vm/stackwalk.cpp:913
#11 0xf6ebe03a in Thread::StackWalkFrames (this=0xbca218, pCallback=0xf6f6a735 <GcStackCrawlCallBack(CrawlFrame*, void*)>, pData=0xe7c23eb4, flags=0x8500, pStartFrame=0x0)
    at /home/bruce/gh/runtime2/src/coreclr/vm/stackwalk.cpp:996
#12 0xf6f67474 in ScanStackRoots (pThread=0xbca218, fn=0xf70a77b1 <WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int)>, sc=0xe7c23f98) at /home/bruce/gh/runtime2/src/coreclr/vm/gcenv.ee.cpp:169
#13 0xf6f671b0 in GCToEEInterface::GcScanRoots (fn=0xf70a77b1 <WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int)>, condemned=0x0, max_gen=0x2, sc=0xe7c23f98) at /home/bruce/gh/runtime2/src/coreclr/vm/gcenv.ee.cpp:268
#14 0xf709d6f2 in WKS::gc_heap::mark_phase (condemned_gen_number=0x0, mark_only_p=0x0) at /home/bruce/gh/runtime2/src/coreclr/gc/gc.cpp:25695
#15 0xf709afd2 in WKS::gc_heap::gc1 () at /home/bruce/gh/runtime2/src/coreclr/gc/gc.cpp:20608
#16 0xf70a357e in WKS::gc_heap::garbage_collect (n=0x0) at /home/bruce/gh/runtime2/src/coreclr/gc/gc.cpp:22368
#17 0xf709611c in WKS::GCHeap::GarbageCollectGeneration (this=<optimized out>, gen=0x0, reason=<optimized out>) at /home/bruce/gh/runtime2/src/coreclr/gc/gc.cpp:45988
#18 0xf7098436 in WKS::gc_heap::try_allocate_more_space (acontext=<optimized out>, size=<optimized out>, flags=<optimized out>, gen_number=<optimized out>) at /home/bruce/gh/runtime2/src/coreclr/gc/gc.cpp:17479
#19 0xf70b7bf6 in WKS::gc_heap::allocate_more_space (acontext=0xbc6b18, size=<optimized out>, flags=0x2, alloc_generation_number=0x0) at /home/bruce/gh/runtime2/src/coreclr/gc/gc.cpp:17954
#20 WKS::gc_heap::allocate (jsize=0x1c, acontext=0xbc6b18, flags=0x2) at /home/bruce/gh/runtime2/src/coreclr/gc/gc.cpp:17985
#21 WKS::GCHeap::Alloc (this=<optimized out>, context=0xbc6b18, size=0x1c, flags=0x2) at /home/bruce/gh/runtime2/src/coreclr/gc/gc.cpp:44950
#22 0xf6f6ced6 in Alloc (size=0x1c, flags=GC_ALLOC_CONTAINS_REF) at /home/bruce/gh/runtime2/src/coreclr/vm/gchelpers.cpp:226
#23 0xf6f6cb44 in AllocateObject (pMT=0xe6aaae34) at /home/bruce/gh/runtime2/src/coreclr/vm/gchelpers.cpp:979
#24 0xf6f881f2 in JIT_New (typeHnd_=0xe6aaae34) at /home/bruce/gh/runtime2/src/coreclr/vm/jithelpers.cpp:2312
#25 0xf6f8803c in JIT_NewS_MP_FastPortable (typeHnd_=0xe6aaae34) at /home/bruce/gh/runtime2/src/coreclr/vm/jithelpers.cpp:2286

andn it looks like we're reporting slot SP offset 0x20 in

System.UInt32.TryFormat(System.Span`1<Char>, Int32 ByRef, System.ReadOnlySpan`1<Char>, System.IFormatProvider)

:

(gdb) disas /r 0xe69893e4,+0x40
Dump of assembler code from 0xe69893e4 to 0xe6989424:
   0xe69893e4:  00 99   ldr     r1, [sp, #0]
   0xe69893e6:  01 9a   ldr     r2, [sp, #4]
   0xe69893e8:  00 68   ldr     r0, [r0, #0]
   0xe69893ea:  73 46   mov     r3, lr
   0xe69893ec:  44 f6 e8 3e     movw    lr, #19432      ; 0x4be8
   0xe69893f0:  cf f2 d1 0e     movt    lr, #61649      ; 0xf0d1j
   0xe69893f4:  de f8 00 c0     ldr.w   r12, [lr]
   0xe69893f8:  03 b0   add     sp, #12
   0xe69893fa:  5d f8 04 eb     ldr.w   lr, [sp], #4
   0xe69893fe:  02 b0   add     sp, #8
   0xe6989400:  60 47   bx      r12
   0xe6989402:  00 00   movs    r0, r0
   0xe6989404:  1f 00   movs    r7, r3                                                  // this is offset 0x20
   0xe6989406:  90 30   adds    r0, #144        ; 0x90

but I'm less sure of that. This does match the JitStress=2 codegen from TryFormat, but the m_codeAddress field of a codeInfo struct I found doesn't point to the first address of the function, which makes me think this might not be the correct function, or maybe there's some corruption?

I don't currently have a working LLDB in my Linux/arm32 Docker setup, so no SOS either.

@BruceForstall
Copy link
Member Author

BruceForstall commented Apr 15, 2022

I set up a Raspberry Pi 4 with a 32-bit server Ubuntu installation (Ubuntu 20.04.4 LTS), and I have a working LLDB 10 there, but I can't get the bug to repro in that environment.

@BruceForstall
Copy link
Member Author

@janvorli @AndyAyersMS I probably need your help to make any more progress on this (or maybe @janvorli I can hand this off to you?)

@BruceForstall
Copy link
Member Author

@BruceForstall
Copy link
Member Author

BruceForstall commented Apr 16, 2022

Ok, I've figured this out.

The problem was introduced by the enabling of fast tail call optimization for arm32 (#66282).

The code that fails is:

public bool TryFormat(Span<char> destination, out int charsWritten, [StringSyntax(StringSyntaxAttribute.NumericFormat)] ReadOnlySpan<char> format = default, IFormatProvider? provider = null)
{
  return Number.TryFormatUInt32(m_value, format, provider, destination, out charsWritten);
}

which does a fast tail call. When it is setting up the tail call arguments, it turns off GC reporting, since the incoming argument space GC info might be incompatible with the GC info needed by the tail call arguments put in the same location. However, one of the IR nodes in the argument setup is:

STORE_BLK struct<System.ReadOnlySpan`1[System.Char], 8> (copy) (Unroll)

When we generate this, we disable GC tracking (again) then re-enable GC tracking (which we shouldn't do). Thus, we end up reporting a range of the tailcall argument setup as interruptible. Most of the slots end up being marked untracked, and gc pointers match with gc pointers. However, one slot is a gcref incoming, and is a byref outgoing, but the untracked slot is marked gcref, not byref. So the byref value is not a legal object pointer, leading to a GC assert.

I think the solution will be to have a "ref count" of "disable gc reporting" requests, not just a true/false.

cc @jakobbotsch

BruceForstall added a commit to BruceForstall/runtime that referenced this issue Apr 16, 2022
When doing fast tail call on arm32, we disable GC reporting.
But some IR nodes, like unrolled STORE_BLK, also disable GC
reporting. If one of those is within a fast tail call argument
setup region, we'll attempt to disable GC reporting twice. Since
we didn't keep track of nesting, we end up marking some of the
tail call region after the STORE_BLK as interruptible, leading
to be GC info in the argument area.

Change the enable/disable GC calls to keep a nesting count,
and only re-enable GC reporting when the count reaches zero.

Fixes dotnet#67046
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 16, 2022
BruceForstall added a commit that referenced this issue Apr 18, 2022
When doing fast tail call on arm32, we disable GC reporting.
But some IR nodes, like unrolled STORE_BLK, also disable GC
reporting. If one of those is within a fast tail call argument
setup region, we'll attempt to disable GC reporting twice. Since
we didn't keep track of nesting, we end up marking some of the
tail call region after the STORE_BLK as interruptible, leading
to be GC info in the argument area.

Change the enable/disable GC calls to keep a nesting count,
and only re-enable GC reporting when the count reaches zero.

Fixes #67046
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 18, 2022
directhex pushed a commit to directhex/runtime that referenced this issue Apr 21, 2022
When doing fast tail call on arm32, we disable GC reporting.
But some IR nodes, like unrolled STORE_BLK, also disable GC
reporting. If one of those is within a fast tail call argument
setup region, we'll attempt to disable GC reporting twice. Since
we didn't keep track of nesting, we end up marking some of the
tail call region after the STORE_BLK as interruptible, leading
to be GC info in the argument area.

Change the enable/disable GC calls to keep a nesting count,
and only re-enable GC reporting when the count reaches zero.

Fixes dotnet#67046
@ghost ghost locked as resolved and limited conversation to collaborators May 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm32 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI GCStress JitStress CLR JIT issues involving JIT internal stress modes os-linux Linux OS (any supported distro)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants