Assertion failed: (GetComponentSize() <= 2) || IsArray() #86273

Closed
BruceForstall opened this issue May 15, 2023 · 38 comments · Fixed by #100376
Labels: arch-arm32, area-VM-coreclr, blocking-clean-ci, in-pr, Known Build Error

Comments

@BruceForstall
Member

BruceForstall commented May 15, 2023

net8.0-linux-Release-arm-CoreCLR_checked-jitstress2_jitstressregs1-(Ubuntu.1804.Arm32.Open)[email protected]/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7

System.Memory.Tests Work Item

https://dev.azure.com/dnceng-public/public/_build/results?buildId=272702&view=ms.vss-test-web.build-test-results-tab&runId=5384424&paneView=debug&resultId=195466

export DOTNET_TieredCompilation=0
export DOTNET_DbgEnableMiniDump=1
export DOTNET_EnableCrashReport=1
export DOTNET_DbgMiniDumpName=$HELIX_DUMP_FOLDER/coredump.%d.dmp
export DOTNET_JitStress=2
export DOTNET_JitStressRegs=1
+ ./RunTests.sh --runtime-path /root/helix/work/correlation
----- start Sat May 13 09:40:19 UTC 2023 =============== To repro directly: =====================================================
pushd .
/root/helix/work/correlation/dotnet exec --runtimeconfig System.Memory.Tests.runtimeconfig.json --depsfile System.Memory.Tests.deps.json xunit.console.dll System.Memory.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing 
popd
===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.Memory.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Memory.Tests (found 2682 of 2709 test cases)
  Starting:    System.Memory.Tests (parallel test collections = on, max threads = 2)

Assert failure(PID 28 [0x0000001c], Thread: 42 [0x002a]): (GetComponentSize() <= 2) || IsArray()
    File: /__w/1/s/src/coreclr/vm/methodtable.cpp Line: 7265
    Image: /root/helix/work/correlation/dotnet

[createdump] Gathering state for process 28 dotnet
[createdump] Crashing thread 002a signal 6 (0006)
[createdump] Writing crash report to file /home/helixbot/dotnetbuild/dumps/coredump.28.dmp.crashreport.json
[createdump] Crash report successfully written
[createdump] Writing minidump with heap to file /home/helixbot/dotnetbuild/dumps/coredump.28.dmp
[createdump] Written 201752576 bytes (49256 pages) to core file
[createdump] Target process is alive
[createdump] Dump successfully written in 389ms
waitpid() returned successfully (wstatus 00000000) WEXITSTATUS 0 WTERMSIG 0
./RunTests.sh: line 168:    28 Aborted                 (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Memory.Tests.runtimeconfig.json --depsfile System.Memory.Tests.deps.json xunit.console.dll System.Memory.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/root/helix/work/workitem/e
----- end Sat May 13 09:40:40 UTC 2023 ----- exit code 134 ----------------------------------------------------------

Known Issue Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "CancelItselfOutsideOfTryCatchFinally",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: 🔎
Result validation: ⚠️ Validation could not be done without an Azure DevOps build URL on the issue. Please add it to the "Build: 🔎" line.
Validation performed at: 3/22/2024 4:10:23 PM UTC

Report

| Build | Definition | Test | Pull Request |
|---|---|---|---|
| 660346 | dotnet/runtime | WasmTestOnChrome-MT-System.Runtime.Tests.WorkItemExecution | #101162 |
| 659851 | dotnet/runtime | WasmTestOnChrome-MT-System.Runtime.Tests.WorkItemExecution | |

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 2 | 2 | 2 |

ghost added the untriaged label May 15, 2023
mangod9 removed the untriaged label Jul 24, 2023
mangod9 added this to the Future milestone Jul 24, 2023
@jkotas
Member

jkotas commented Feb 26, 2024

Assert hit in #98744 - Libraries Test Run checked coreclr linux_musl arm Release, System.Runtime.Tests

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-98744-merge-c7e869af78dc49fb82/System.Runtime.Tests/1/console.c4c36baf.log

@kunalspathak
Member

kunalspathak commented Feb 27, 2024

jkotas added the blocking-clean-ci and Known Build Error labels Feb 27, 2024
BruceForstall modified the milestones: Future, 9.0.0 Mar 7, 2024
@BruceForstall
Member Author

@mangod9 Milestone Future is not appropriate for this continuing failure. I marked it as 9.0.

@mangod9
Member

mangod9 commented Mar 7, 2024

OK, will take a look. It looks like it has recently started occurring in regular CI, not just under JITStress.

@mangod9
Member

mangod9 commented Mar 11, 2024

From what I see, the collected dumps don't seem to be very useful. @tommcdon FYI.

@jkotas
Member

jkotas commented Mar 12, 2024

It looks like the compiler generates broken unwind info for PROCAbort because of its NORETURN annotation:

(gdb) bt
#0  0xf7c6f196 in __syscall4 (n=175, d=8, c=0, b=-386086992, a=2) at ./arch/arm/syscall_arch.h:75
#1  __restore_sigs (set=set@entry=0xe8fcc7b0) at src/signal/block.c:43
#2  0xf7c6f236 in raise (sig=sig@entry=6) at src/signal/raise.c:11
#3  0xf7c4bf8a in abort () at src/exit/abort.c:11
#4  0xf78ab788 in PROCAbort (signal=<optimized out>, siginfo=0x0)
    at /__w/1/s/src/coreclr/pal/src/thread/process.cpp:2558
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

jkotas added a commit to jkotas/runtime that referenced this issue Mar 12, 2024
jkotas added a commit that referenced this issue Mar 13, 2024
* Disable NORETURN annotation on PROCAbort for Arm

Contributes to #86273

* Fix build break
@BruceForstall
Member Author

Another hit:

DOTNET_EnableCrashReport=1
DOTNET_TieredCompilation=0
DOTNET_DbgMiniDumpName=/home/helixbot/dotnetbuild/dumps/coredump.%d.dmp
DOTNET_JitStressRegs=1
DOTNET_DbgEnableMiniDump=1
+ ./RunTests.sh --runtime-path /root/helix/work/correlation
========================= Begin custom configuration settings ==============================
export __IsXUnitLogCheckerSupported=1
========================== End custom configuration settings ===============================
----- start Mon Mar 18 17:24:11 UTC 2024 =============== To repro directly: =====================================================
pushd .
/root/helix/work/correlation/dotnet exec --runtimeconfig System.Runtime.Tests.runtimeconfig.json --depsfile System.Runtime.Tests.deps.json xunit.console.dll System.Runtime.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=AdditionalTimezoneChecks -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing 
popd
===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: System.Runtime.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Runtime.Tests (found 9294 of 9341 test cases)
  Starting:    System.Runtime.Tests (parallel test collections = on [2 threads], stop on fail = off)
    System.Tests.DateTimeOffsetTests.ToLocalTime_MaxValue [SKIP]
      Condition(s) not met: "IsMaxValuePositiveLocalOffset"
    System.Tests.DateTimeOffsetTests.ToLocalTime_MinValue [SKIP]
      Condition(s) not met: "IsMinValueNegativeLocalOffset"
    System.Tests.DateTimeOffsetTests.ToLocalTime_Ambiguous [SKIP]
      Condition(s) not met: "IsPacificTime"
    System.Tests.TimeZoneInfoTests.UnsupportedImplicitConversionTest [SKIP]
      Condition(s) not met: "DoesNotSupportIanaNamesConversion"
    System.Tests.GCExtendedTests.GetGCMemoryInfo [SKIP]
      Condition(s) not met: "IsNotArmProcessAndRemoteExecutorSupported"
    System.Tests.StringTests.IndexOf_SingleLetter(s: "Hello", target: '\0', startIndex: 0, count: 5, expected: -1) [SKIP]
      Target \0 is not supported in ICU

Assert failure(PID 27 [0x0000001b], Thread: 40 [0x0028]): (GetComponentSize() <= 2) || IsArray()
    File: /__w/1/s/src/coreclr/vm/methodtable.cpp:7366
    Image: /root/helix/work/correlation/dotnet

net9.0-linux-Release-arm-jitstressregs1

https://dev.azure.com/dnceng-public/public/_build/results?buildId=607036&view=ms.vss-test-web.build-test-results-tab&runId=14798014&paneView=debug

@jkotas apparently your fix didn't cover all cases?

@BruceForstall
Member Author

BruceForstall commented Mar 18, 2024

Another: https://dev.azure.com/dnceng-public/public/_build/results?buildId=607034&view=ms.vss-test-web.build-test-results-tab

(net9.0-linux-Release-arm-jitstress2_jitstressregs8)

@jkotas
Member

jkotas commented Mar 18, 2024

> @jkotas apparently your fix didn't cover all cases?

My fix was meant to address #86273 (comment) and make the dumps diagnosable. The fix works as expected as far as I can tell. The debugger produces good stacktraces now. This should unblock further investigation.

Crash stacktrace from https://dev.azure.com/dnceng-public/public/_build/results?buildId=607034&view=ms.vss-test-web.build-test-results-tab

#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
#1  0xf78a3b32 in __libc_signal_restore_set (set=0xe628f03c) at ../sysdeps/unix/sysv/linux/nptl-signals.h:80
#2  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:48
#3  0xf78a482e in __GI_abort () at abort.c:79
#4  0xf76f6250 in RaiseFailFastException () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#5  0xf75e0030 in FailFastOnAssert() () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#6  0xf75dfee2 in _DbgBreakCheck () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#7  0xf75e008a in _DbgBreakCheckNoThrow(char const*, int, char const*, int) ()
   from /helix_payload/System.Runtime.Tests/libcoreclr.so
#8  0xf75e0278 in DbgAssertDialog () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#9  0xf7379fae in MethodTable::SanityCheck() () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#10 0xf737d4bc in MethodTable::Validate() () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#11 0xf7380268 in Object::ValidateInner(int, int, int) () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#12 0xf74ffe4c in GcInfoDecoder::ReportRegisterToGC(int, unsigned int, REGDISPLAY*, unsigned int, void (*)(void*, OBJECTREF*, unsigned int), void*) () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#13 0xf74ff098 in GcInfoDecoder::ReportSlotToGC(GcSlotDecoder&, unsigned int, REGDISPLAY*, bool, unsigned int, void (*)(void*, OBJECTREF*, unsigned int), void*) () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#14 0xf74fdbb6 in GcInfoDecoder::EnumerateLiveSlots(REGDISPLAY*, bool, unsigned int, void (*)(void*, OBJECTREF*, unsigned int), void*) () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#15 0xf7310046 in EECodeManager::EnumGcRefs(REGDISPLAY*, EECodeInfo*, unsigned int, void (*)(void*, OBJECTREF*, unsigned int), void*, unsigned int) () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#16 0xf7438e80 in GcStackCrawlCallBack(CrawlFrame*, void*) () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#17 0xf73a6f04 in Thread::MakeStackwalkerCallback(CrawlFrame*, StackWalkAction (*)(CrawlFrame*, void*), void*, unsigned int) () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#18 0xf73a70a6 in Thread::StackWalkFramesEx(REGDISPLAY*, StackWalkAction (*)(CrawlFrame*, void*), void*, unsigned int, Frame*) () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#19 0xf73a78d8 in Thread::StackWalkFrames(StackWalkAction (*)(CrawlFrame*, void*), void*, unsigned int, Frame*) ()
   from /helix_payload/System.Runtime.Tests/libcoreclr.so
#20 0xf74355fe in ScanStackRoots(Thread*, void (*)(Object**, ScanContext*, unsigned int), ScanContext*) ()
   from /helix_payload/System.Runtime.Tests/libcoreclr.so
#21 0xf743530e in GCToEEInterface::GcScanRoots(void (*)(Object**, ScanContext*, unsigned int), int, int, ScanContext*)
    () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#22 0xf7559be8 in WKS::gc_heap::mark_phase(int) () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#23 0xf75573ca in WKS::gc_heap::gc1() () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#24 0xf755fc48 in WKS::gc_heap::garbage_collect(int) () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#25 0xf7552a62 in WKS::GCHeap::GarbageCollectGeneration(unsigned int, gc_reason) ()
   from /helix_payload/System.Runtime.Tests/libcoreclr.so
#26 0xf7553a98 in WKS::gc_heap::trigger_gc_for_alloc(int, gc_reason, WKS::GCDebugSpinLock*, bool, WKS::msl_take_state)
    () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#27 0xf75545d0 in WKS::gc_heap::try_allocate_more_space(alloc_context*, unsigned int, unsigned int, int) ()
   from /helix_payload/System.Runtime.Tests/libcoreclr.so
#28 0xf7574a08 in WKS::GCHeap::Alloc(gc_alloc_context*, unsigned int, unsigned int) ()
   from /helix_payload/System.Runtime.Tests/libcoreclr.so
#29 0xf7439776 in Alloc(unsigned int, GC_ALLOC_FLAGS) () from /helix_payload/System.Runtime.Tests/libcoreclr.so
#30 0xf7439610 in AllocateSzArray(MethodTable*, int, GC_ALLOC_FLAGS) ()
   from /helix_payload/System.Runtime.Tests/libcoreclr.so
#31 0xf7455c8e in JIT_NewArr1(CORINFO_CLASS_STRUCT_*, int) () from /helix_payload/System.Runtime.Tests/libcoreclr.so

@jkotas
Member

jkotas commented Mar 19, 2024

> Crash stacktrace from https://dev.azure.com/dnceng-public/public/_build/results?buildId=607034&view=ms.vss-test-web.build-test-results-tab

This crashed while stack-walking an aborted CancellationTokenSource.ExecuteCallbackHandlers in ControlledExecutionTests. The managed stacktrace of the thread with the bad object references is:

System.SR.InternalGetResourceString(System.String)
System.SR.GetResourceString(System.String)
System.Threading.ThreadAbortException..ctor()
--- RedirectForThreadAbort ---
System.Threading.CancellationTokenSource.ExecuteCallbackHandlers(Boolean) <- the bad object reference is in this frame
System.Threading.CancellationTokenSource.NotifyCancellation(Boolean)
System.Threading.CancellationTokenSource.Cancel()
System.Runtime.Tests.ControlledExecutionTests+<>c__DisplayClass13_0.<CancelItselfOutsideOfTryCatchFinally>g__Test|0()
System.Runtime.ControlledExecution.Run(System.Action, System.Threading.CancellationToken)
System.Runtime.Tests.ControlledExecutionTests.RunTest(System.Action, System.Threading.CancellationToken)
System.Runtime.Tests.ControlledExecutionTests.CancelItselfOutsideOfTryCatchFinally()

@mangod9
Member

mangod9 commented Mar 19, 2024

Thanks for looking into it, Jan. @VSadov, since this appears to be a heap corruption and you were recently trying to repro an arm32 issue, could you please check whether this repros with these settings: #86273 (comment).

Also adding @janvorli in case you have seen these during your exceptions gc-hole investigations.

@VSadov
Member

VSadov commented Mar 19, 2024

I will take a look. It looks ThreadAbort-specific, though, and does not involve GCStress, so it may be quite different from the things I was looking at recently.

@VSadov
Member

VSadov commented Mar 22, 2024

This is not arm32-specific. There are failures on x64 as well.

(in Work Item System.Threading.Channels.Tests)

+ export __TestArchitecture=x64
+ ./RunTests.sh --runtime-path /datadisks/disk1/work/C2E10A96/p
========================= Begin custom configuration settings ==============================
export __IsXUnitLogCheckerSupported=1
========================== End custom configuration settings ===============================
----- start Thu Mar 21 11:48:28 PM UTC 2024 =============== To repro directly: =====================================================
pushd .
/datadisks/disk1/work/C2E10A96/p/dotnet exec --runtimeconfig System.Threading.Channels.Tests.runtimeconfig.json --depsfile System.Threading.Channels.Tests.deps.json xunit.console.dll System.Threading.Channels.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing 
popd
===========================================================================================================
/datadisks/disk1/work/C2E10A96/w/A52F0964/e /datadisks/disk1/work/C2E10A96/w/A52F0964/e
  Discovering: System.Threading.Channels.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Threading.Channels.Tests (found 454 test cases)
  Starting:    System.Threading.Channels.Tests (parallel test collections = on [2 threads], stop on fail = off)
    System.Threading.Channels.Tests.StressTests.CanceledReads [SKIP]
      Condition(s) not met: "IsStressModeEnabled"
    System.Threading.Channels.Tests.StressTests.ReadWriteVariations [SKIP]
      Condition(s) not met: "IsStressModeEnabled"

Assert failure(PID 101455 [0x00018c4f], Thread: 101477 [0x18c65]): (GetComponentSize() <= 2) || IsArray()
    File: /__w/1/s/src/coreclr/vm/methodtable.cpp:7298
    Image: /datadisks/disk1/work/C2E10A96/p/dotnet

https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-99183-merge-a382fbcfd3cc4a53b9/System.Threading.Channels.Tests/1/console.11466e8a.log?helixlogtype=result

VSadov removed the arch-arm32 label Mar 22, 2024
@jkotas
Member

jkotas commented Mar 22, 2024

> This is not arm32-specific. There are failures on x64 as well.

This assert is a general symptom of a GC hole, bad GCInfo or corrupted GC heap. It can have many different root causes.

I think it is very likely that the root cause for the arm32 failure in System.Runtime tests is different from the root cause for the System.Threading.Channels.Tests failure.
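
For readers unfamiliar with the assert, here is a rough Python model of the condition it checks. This is only a sketch under stated assumptions: `FakeMethodTable` and `passes_sanity_check` are hypothetical names, and the comments reflect the usual reading of the check (strings have a component size of 2, arrays use their element size, other types use 0). It is not the runtime's code.

```python
from dataclasses import dataclass


@dataclass
class FakeMethodTable:
    """Hypothetical stand-in for the two MethodTable properties the assert reads."""
    component_size: int
    is_array: bool


def passes_sanity_check(mt: FakeMethodTable) -> bool:
    # Mirrors the asserted expression: (GetComponentSize() <= 2) || IsArray()
    return mt.component_size <= 2 or mt.is_array


# Assumed typical values, for illustration only.
string_like = FakeMethodTable(component_size=2, is_array=False)
int_array = FakeMethodTable(component_size=4, is_array=True)
plain_object = FakeMethodTable(component_size=0, is_array=False)

# A pointer to junk is read as if it were a MethodTable, so the fields are arbitrary.
junk = FakeMethodTable(component_size=0x4141, is_array=False)

results = {
    "string_like": passes_sanity_check(string_like),
    "int_array": passes_sanity_check(int_array),
    "plain_object": passes_sanity_check(plain_object),
    "junk": passes_sanity_check(junk),
}
print(results)
```

In this model the check only catches junk whose fields happen to look implausible, which is consistent with the assert being a symptom that says nothing about where the bad reference came from.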

@VSadov
Member

VSadov commented Mar 22, 2024

I've changed the error to be CancelItselfOutsideOfTryCatchFinally

Confusingly, the original report is for System.Memory.Tests, which is probably a different, less common, issue

@mangod9
Member

mangod9 commented Mar 22, 2024

Yeah agree, the original error is possibly no longer occurring. Note that it was from over a year ago.

@VSadov
Member

VSadov commented Mar 22, 2024

I am able to reproduce this locally

@VSadov
Member

VSadov commented Mar 22, 2024

It looks like we are reporting junk to GC in register R4

The crash does not require DOTNET_JitStressRegs. Enabling GCStress=0xC triggers the crash as well.
Turning R2R off makes the crash disappear.

Is there a way to do JitDisasm for R2R methods? (on arm32)

Forcing JIT to emit code similar to R2R would work too as then I could just do regular JitDisasm

@mangod9
Member

mangod9 commented Mar 22, 2024

yeah R2RDump should be able to disasm them. Adding @cshung since ILSpy would also work here.

@VSadov
Member

VSadov commented Mar 22, 2024

With DOTNET_JitStressRegs=1 it fails in GCStress even with R2R disabled. I have a JitDisasm for that.

@cshung
Member

cshung commented Mar 22, 2024

It is probably best to get a JIT dump, which you can capture by turning on some switches and running crossgen again. Thanks to determinism, you should always get the same code. With the JIT dump, look at register allocation to see why the JIT believes R4 holds a GC reference at that point when it actually holds junk.

@VSadov
Member

VSadov commented Mar 23, 2024

This is for the JIT code (with DOTNET_JitStressRegs=1)

G_M33701_IG06:        ; bbWeight=1, gcVars=0000000100010002 {V00 V02 V05}, gcrefRegs=0030 {r4 r5}, byrefRegs=0000 {}, gcvars, byref
                       ; gcrRegs +[r4-r5]
                       ; GC ptr vars +{V00 V01 V02 V05 V16 V32}
000086      movw    r6, 0x8141
00008A      movt    r6, 0xf2ac
00008E      blx     r6          // System.Environment:get_CurrentManagedThreadId():int
                       ; gcr arg pop 0
000090      dmb     15
000094      str     r0, [r5+0x24]
000096      movs    r6, 0
000098      str     r6, [sp+0x08]       // [V03 loc1]
                       ; GC ptr vars +{V03}

  /===  GC happens when we are about to execute the next instruction  (offset 00009A).
 V       if we'd come from the above, R4 would be live (although unused), but if we branched from below, R4 contains junk.

         label G_M33701_IG07 seems to be killing R4, so why is it reported at 00009A?

                                                ;; size=20 bbWeight=1 PerfScore 7.00
G_M33701_IG07:        ; bbWeight=8, gcVars=0000000100410002 {V00 V02 V03 V05}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, isz
                       ; gcrRegs -[r4-r5]
                       ; GC ptr vars -{V01 V16 V32}
00009A      ldr     r5, [sp+0x0C]       // [V02 loc0]
                       ; gcrRegs +[r5]
00009C      add     r4, r5, 40
                       ; byrRegs +[r4]
0000A0      cmp     r4, 0
0000A2      beq     SHORT G_M33701_IG20
0000A4      mov     r0, r4
                       ; byrRegs +[r0]

@mangod9
Member

mangod9 commented Mar 23, 2024

so feels like a codegen issue?

@VSadov
Member

VSadov commented Mar 25, 2024

> so feels like a codegen issue?

Not completely sure.
What the JIT dump shows should not, in theory, result in reporting R4 to the GC.
Either the GC info is bad (which would be a codegen issue), or the GC info actually matches what we see in the dump, in which case the problem could be in how the GC info is interpreted.

Another confusing part is that this code pattern does not seem particularly uncommon, so why do we see a failure only here?
The failure is in a test that validates a ThreadAbort scenario. Maybe that is a contributing factor, but I do not see how.

@VSadov
Member

VSadov commented Mar 27, 2024

The issue is understood.

  • The GC info is emitted correctly in this case (and parsed correctly too).
  • We are doing a stack walk for GC reporting after the thread took a fault for a thread abort.
  • We are not in a leaf method, so we try to adjust our location by -1.
    Since GC liveness is not always contiguous around calls, this adjustment improves our chances of getting "good" GC info by pretending that our IP is somewhere inside the preceding call instruction.
  • However, in this scenario the method is fully interruptible and we are not even near a call. We are on some arbitrary instruction that faulted.
  • The crash happens when that instruction is at the beginning of a basic block that kills R4.
  • Moving by -1 shifts us into the preceding basic block, where R4 is reported as live. But since we did not arrive through that block, R4 contains junk.

We are using GC info for an instruction different from the one we are actually at (off by -1). Our compensating heuristic for GC discontinuities works against us in this case.
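
The effect of the -1 adjustment can be illustrated with a small Python model. This is a minimal sketch, not the runtime's GC info decoder: `LiveRange`, `live_regs`, and `regs_to_report` are hypothetical names, and the ranges are toy values loosely based on the disassembly posted earlier (r4 and r5 hold GC references in IG06 from 0x86 until IG07 starts at 0x9A; r5 becomes live again after the load at 0x9A; the end offset of that last range is made up).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiveRange:
    """Half-open code-offset interval [start, end) where `reg` holds a GC reference."""
    reg: str
    start: int
    end: int


def live_regs(ranges, offset):
    """Registers whose live range covers the given code offset."""
    return {r.reg for r in ranges if r.start <= offset < r.end}


def regs_to_report(ranges, ip_offset, adjust_for_call):
    """Model of the lookup: optionally back up by one, as if the IP were a return address."""
    lookup = ip_offset - 1 if adjust_for_call else ip_offset
    return live_regs(ranges, lookup)


# Toy liveness data (assumption, for illustration only).
RANGES = [
    LiveRange("r4", 0x86, 0x9A),  # live through IG06, killed at the start of IG07
    LiveRange("r5", 0x86, 0x9A),
    LiveRange("r5", 0x9C, 0xA6),  # live again after `ldr r5, [sp+0x0C]`; end is made up
]

FAULT_OFFSET = 0x9A  # first instruction of IG07, where the thread was interrupted

exact = regs_to_report(RANGES, FAULT_OFFSET, adjust_for_call=False)
adjusted = regs_to_report(RANGES, FAULT_OFFSET, adjust_for_call=True)

print("exact lookup reports:   ", sorted(exact))
print("adjusted lookup reports:", sorted(adjusted))
```

In this model the exact lookup reports nothing at the faulting offset, while the adjusted lookup lands in the previous block and reports r4 and r5, even though control may have reached 0x9A by a branch that left junk in r4.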

@VSadov
Member

VSadov commented Mar 27, 2024

Ideally, the GC info would be contiguous around calls, at least for nonvolatile registers (that is more or less the definition of nonvolatile: calls do not trash them). This is a bigger issue, though, and may be addressed in #95565.

In this particular case, I think we can ignore the bigger issue for a bit longer and just pull the piece from #95565 that addresses continuity at throwing calls. With that, we would not need to adjust in throwing cases, regardless of leaf/nonleaf or whether the method is interruptible.

I am testing a fix.

@VSadov
Member

VSadov commented Mar 27, 2024

Also, this does not look ARM32-specific.
Maybe ARM32 just has "not too many, not too few" nonvolatile registers, which affects register allocation in a way that makes the situation more likely there.

@VSadov
Member

VSadov commented Mar 30, 2024

The root cause for this issue should be fixed now.

Note: we may still see Assertion failed: (GetComponentSize() <= 2) || IsArray() - since that is just a general indication of a corrupted heap object. Other GC hole bugs may end up causing this assert as well.

If the scenario does not involve the CancelItselfOutsideOfTryCatchFinally test, it is a different issue.
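
The triage rule above can be sketched as a plain substring check over a console log. This is only an illustration: the `triage` function and its return labels are hypothetical, and it assumes the test name actually appears somewhere in the log (for example in a managed stack trace), which is not guaranteed.

```python
ASSERT_TEXT = "(GetComponentSize() <= 2) || IsArray()"
TEST_NAME = "CancelItselfOutsideOfTryCatchFinally"


def triage(console_log: str) -> str:
    """Classify a console log against the rule stated in this thread (hypothetical labels)."""
    if ASSERT_TEXT not in console_log:
        return "no-assert"
    if TEST_NAME in console_log:
        return "this-issue"
    # Same assert, different scenario: some other GC hole or heap corruption.
    return "different-issue"


# Made-up log fragments, for illustration only.
sample_same = (
    "System.Runtime.Tests.ControlledExecutionTests.CancelItselfOutsideOfTryCatchFinally()\n"
    "Assert failure: (GetComponentSize() <= 2) || IsArray()\n"
)
sample_other = "Assert failure: (GetComponentSize() <= 2) || IsArray()\n"
sample_clean = "Discovered: System.Memory.Tests\n"

print(triage(sample_same), triage(sample_other), triage(sample_clean))
```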

github-actions bot locked and limited conversation to collaborators Apr 29, 2024
JulieLeeMSFT reopened this Apr 29, 2024
@JulieLeeMSFT
Member

@VSadov, this assert happened in runtime-coreclr superpmi-collect pipeline in System.Security.Cryptography.Tests Work Item. Could you please take a look?

@VSadov
Member

VSadov commented Apr 29, 2024

Assertion failed: (GetComponentSize() <= 2) || IsArray() is just an indication of a GC hole or a heap corruption. There are many ways a hole can be introduced.

If this does not involve the CancelItselfOutsideOfTryCatchFinally test, it is a different failure that needs a different fix, so a separate issue should be opened.

@VSadov
Member

VSadov commented Apr 29, 2024

There are many jobs in https://dev.azure.com/dnceng/internal/_build?definitionId=977&_a=summary queue. Most jobs are failing, which is not surprising since the pipeline is triggered on PRs and PRs often have failures.

I've checked a few jobs, but can't see one that failed with Assertion failed: (GetComponentSize() <= 2) || IsArray().

@JulieLeeMSFT - do you have a link to the actual job which failed this way?

@VSadov
Member

VSadov commented Apr 30, 2024

There is also a bunch of asserts like the following:

Assert failure(PID 3516 [0x00000dbc], Thread: 33049 [0x8119]): !"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\n" "Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it."
    File: /Users/runner/work/1/s/src/coreclr/utilcode/clrhost_nodependencies.cpp:290
    Image: /private/tmp/helix/working/A91D0961/p/coreclr/superpmi

I think these are more indicative of what went wrong. I will open a separate bug for that.

VSadov closed this as completed Apr 30, 2024
@VSadov
Member

VSadov commented Apr 30, 2024

Actually, there is already a bug for that:

#101708

@JulieLeeMSFT
Member

Thanks. We will check what went wrong.

dotnet deleted a comment from JulieLeeMSFT Apr 30, 2024
@AlitzelMendez
Member

Hi @JulieLeeMSFT,
I deleted one of your comments with a link to a console log because it included a SAS token. Please be careful when sharing this kind of information, as secrets should not be exposed to the public.
