Mono GC crash in CI #58062

Closed
lewing opened this issue Aug 24, 2021 · 11 comments · Fixed by #59007
Labels: area-GC-mono, disabled-test

@lewing
Member

lewing commented Aug 24, 2021

      =================================================================
      Managed Stacktrace:
      =================================================================
        at <unknown> <0xffffffff>
        at System.Object:__icall_wrapper_mono_gc_alloc_vector <0x0008a>
        at System.Object:AllocVector <0x000f2>
        at System.Collections.Generic.Dictionary`2:Initialize <0x0006e>
        at System.Collections.Generic.Dictionary`2:.ctor <0x0003f>
        at System.Collections.Generic.Dictionary`2:.ctor <0x00021>
        at Microsoft.Cci.MetadataHeapsBuilder:.ctor <0x000ed>
        at Microsoft.Cci.FullMetadataWriter:Create <0x00077>
        at Microsoft.Cci.PeWriter:WritePeToStream <0x00123>
        at Microsoft.CodeAnalysis.Compilation:SerializeToPeStream <0x0061b>
        at Microsoft.CodeAnalysis.Compilation:Emit <0x00716>
        at Microsoft.CodeAnalysis.Compilation:Emit <0x001a7>
        at Microsoft.CodeAnalysis.Compilation:Emit <0x000bf>
        at CscBench:CompileBench <0x0028b>
        at CscBench:Main <0x00025>
        at <Module>:runtime_invoke_int <0x0005f>
      =================================================================

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-58028-merge-0f18162b6a714630a6/JIT/1/console.71179669.log?sv=2019-07-07&se=2021-09-13T17%3A30%3A21Z&sr=c&sp=rl&sig=pNe9viQHA2TTqAOjNwkXgMqxXDAXFiw0aGviMCTofe4%3D

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-57928-merge-872ab46efd2b494988/JIT/1/console.607ded22.log?sv=2019-07-07&se=2021-09-12T13%3A23%3A40Z&sr=c&sp=rl&sig=t5BAIXU9yCaDHr2wqN84Z3t79kp2WWqPzP86f4JD2RA%3D

cc @marek-safar @lambdageek @vargaz


The native stack traces look like this:

  Thread 1 (Thread 0x7fd2c37bd740 (LWP 3830)):
      #0  0x00007fd2c319632a in __waitpid (pid=3938, stat_loc=0x7ffe506b3700, options=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
      #1  0x00007fd2c08738f7 in dump_native_stacktrace (signal=<optimized out>, mctx=<optimized out>) at /__w/1/s/src/mono/mono/mini/mini-posix.c:842
      #2  mono_dump_native_crash_info (signal=<optimized out>, mctx=0x7ffe506b4270, info=<optimized out>) at /__w/1/s/src/mono/mono/mini/mini-posix.c:869
      #3  0x00007fd2c081660e in mono_handle_native_crash (signal=0x7fd2c1d0fcb3 "SIGSEGV", mctx=0x7ffe506b4270, info=0x7ffe506b4530) at /__w/1/s/src/mono/mono/mini/mini-exceptions.c:2940
      #4  0x00007fd2c0777ee1 in mono_sigsegv_signal_handler_debug (_dummy=11, _info=0x7ffe506b4530, context=0x7ffe506b4400, debug_fault_addr=0x8) at /__w/1/s/src/mono/mono/mini/mini-runtime.c:3658
      #5  <signal handler called>
      #6  copy_object_no_checks (obj=0x7fd2bf8f1680, queue=0x7ffe506b4c60) at /__w/1/s/src/mono/mono/mini/../sgen/sgen-copy-object.h:69
      #7  0x00007fd2c07570b0 in simple_nursery_serial_copy_object_from_obj (obj_slot=0x7fd2bfcff5e8, queue=0x7ffe506b4c60) at /__w/1/s/src/mono/mono/mini/../sgen/sgen-minor-copy-object.h:238
      #8  simple_nursery_serial_scan_object (full_object=<optimized out>, desc=<optimized out>, queue=0x7ffe506b4c60) at /__w/1/s/src/mono/mono/mini/../sgen/sgen-scan-object.h:64
      #9  0x00007fd2c0758157 in simple_nursery_serial_drain_gray_stack (queue=0x7ffe506b4c60) at /__w/1/s/src/mono/mono/mini/../sgen/sgen-minor-scan-object.h:142
      #10 0x00007fd2c072db4b in sgen_drain_gray_stack (ctx=...) at /__w/1/s/src/mono/mono/sgen/sgen-gc.c:578
      #11 finish_gray_stack (generation=0, ctx=...) at /__w/1/s/src/mono/mono/sgen/sgen-gc.c:1140
      #12 0x00007fd2c072ccb9 in collect_nursery (reason=0x7fd2c1d112cf "Nursery full", is_overflow=<optimized out>) at /__w/1/s/src/mono/mono/sgen/sgen-gc.c:1932
      #13 0x00007fd2c0729003 in sgen_perform_collection_inner (requested_size=<optimized out>, generation_to_collect=<optimized out>, reason=<optimized out>, forced_serial=<optimized out>, stw=<optimized out>) at /__w/1/s/src/mono/mono/sgen/sgen-gc.c:2656
      #14 sgen_perform_collection (requested_size=3176, generation_to_collect=0, reason=0x7fd2c1d112cf "Nursery full", forced_serial=0, stw=1) at /__w/1/s/src/mono/mono/sgen/sgen-gc.c:2768
      #15 0x00007fd2c0728e03 in sgen_ensure_free_space (size=3176, generation=<optimized out>) at /__w/1/s/src/mono/mono/sgen/sgen-gc.c:2622
      #16 0x00007fd2c071cf86 in sgen_alloc_obj_nolock (vtable=0x5579c627f110, size=3176) at /__w/1/s/src/mono/mono/sgen/sgen-alloc.c:258
      #17 0x00007fd2c06ecc97 in mono_gc_alloc_vector (vtable=0x5579c627f110, size=3176, max_length=131) at /__w/1/s/src/mono/mono/metadata/sgen-mono.c:1340
      #18 0x0000000040688d6b in ?? ()
      #19 0x00007fd2bfc00438 in ?? ()
      #20 0x0000000000000083 in ?? ()
      #21 0x00005579c643b610 in ?? ()
      #22 0x0000000000000083 in ?? ()
      #23 0x00005579c627f110 in ?? ()
      #24 0x00005579c07bbd10 in ?? ()
      #25 0x00005579c627f110 in ?? ()
      #26 0x00007ffe506b4e70 in ?? ()
      #27 0x0000000000000000 in ?? ()
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Aug 24, 2021
@ghost

ghost commented Aug 24, 2021

Tagging subscribers to this area: @BrzVlad
See info in area-owners.md if you want to be subscribed.


@lambdageek
Member

Grabbed one of the crash logs here: https://gist.github.com/lambdageek/cc4e2e34186e1b183c44d2c85bd00e96 in case they expire.

@BrzVlad BrzVlad self-assigned this Aug 24, 2021
@SamMonoRT SamMonoRT removed the untriaged New issue has not been triaged by the area owner label Aug 26, 2021
@SamMonoRT SamMonoRT added this to the 7.0.0 milestone Aug 26, 2021
@SamMonoRT
Member

Adding to milestone 7.0, since the issue is not 100% reproducible. @BrzVlad will investigate further and we can consider a backport if necessary.

@BrzVlad
Member

BrzVlad commented Aug 26, 2021

This seems to happen only on LLVM. I suspect a relatively recent regression there. @vargaz @imhameed Any potential culprits?

@imhameed
Contributor

When did this start happening? I can't think of anything offhand that has happened in the last week or two that would break allocation or GC.

@imhameed
Contributor

imhameed commented Sep 1, 2021

Possibly related (not as a cause, but as a workaround for a similar issue): #55598

@BrzVlad
Member

BrzVlad commented Sep 3, 2021

A few observations about the causes of this issue:

  • It happens because LLVM generates code that reconstructs an object pointer using arithmetic operations instead of simply loading it into a register. The GC therefore fails to pin the object.
  • The issue was triggered by the move to LLVM 11; LLVM 9 does not reproduce it.
  • This pointer-reconstruction capability also seems to exist in LLVM 9, so I wouldn't be confident that LLVM 9 is safe with the GC.
  • Disabling SROA fixes the issue.
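
The first observation can be illustrated with a small C sketch. It assumes the collector pins by conservatively scanning machine words (registers and stack slots) for values that fall inside the nursery; the names `nursery_range`, `count_pinnable_words`, `split_into_bytes`, and `rebuild_from_bytes` are hypothetical and are not sgen APIs. A pointer that is present as one whole word is found by the scan, while the same pointer held as byte-sized fragments is not, even though shifts and ORs can rebuild it afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical nursery bounds; a real collector tracks these itself. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} nursery_range;

/* Conservative scan: count the words that look like pointers into the
 * nursery. A real collector would pin the objects they refer to. */
static size_t count_pinnable_words(const uintptr_t *words, size_t n,
                                   nursery_range r)
{
    assert(r.start <= r.end);
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (words[i] >= r.start && words[i] < r.end)
            hits++;
    }
    return hits;
}

/* Split a pointer-sized value into byte fragments, one fragment per word,
 * roughly as a sequence of byte-sized loads would leave it. */
static void split_into_bytes(uintptr_t value, uintptr_t out[sizeof(uintptr_t)])
{
    for (size_t i = 0; i < sizeof(uintptr_t); i++)
        out[i] = (value >> (8 * i)) & 0xff;
}

/* Rebuild the original value from its fragments with shifts and ORs. */
static uintptr_t rebuild_from_bytes(const uintptr_t in[sizeof(uintptr_t)])
{
    uintptr_t value = 0;
    for (size_t i = 0; i < sizeof(uintptr_t); i++)
        value |= in[i] << (8 * i);
    return value;
}
```

If a collection runs while only the fragments are live, a scan like this sees nothing to pin, so the object may be moved and the rebuilt pointer is stale.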

@marek-safar marek-safar modified the milestones: 7.0.0, 6.0.0 Sep 3, 2021
@fanyang-mono fanyang-mono added the disabled-test The test is disabled in source code against the issue label Sep 3, 2021
@imhameed imhameed self-assigned this Sep 7, 2021
@BrzVlad
Member

BrzVlad commented Sep 7, 2021

We'll see if anything can be done so that LLVM doesn't reconstruct object references. Otherwise we'll just disable SROA for 6.0 and investigate this issue later.

@SamMonoRT
Member

@BrzVlad @imhameed - any update on this? Do we have a perf comparison with and without SROA disabled, or another possible LLVM fix?

@imhameed
Contributor

imhameed commented Sep 9, 2021

I'm taking a look at workarounds now. I don't have a perf comparison.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 12, 2021
imhameed added a commit to imhameed/runtime that referenced this issue Sep 13, 2021
LLVM's SROA can decompose loads and stores of aggregate type into a
sequence of aggregate-element-typed loads and stores. Before this
change, Mono translated .NET-level value types into LLVM IR-level
structs containing nothing but `i8` elements.

When a value type field has reference type, and a value of this value
type is copied using a `memcpy` intrinsic or an LLVM IR load followed by
a store, LLVM will emit code that loads managed references in multiple
byte-sized fragments before reconstructing the original pointer using a
sequence of ALU ops. This causes sgen to fail to pin the referent.

This change works around this by translating value types to LLVM IR
structs with pointer-sized fields. Packed value types with non-standard
alignment will be translated into LLVM IR structs containing
alignment-sized fields.

Note that this does not completely guarantee that the code we generate
will respect sgen's requirements. No specific guarantees are provided
about the translation of non-atomic LLVM IR loads and stores to machine
code. And we'll need some alternative means (perhaps a special
`gc_copy_unaligned` runtime call or similar) to copy packed or
misaligned value types that contain managed references. For stronger
LLVM IR-level guarantees, we'll want to make use of unordered atomic
loads and stores and unordered atomic memcpy, but that work is out of
scope for this change.

Fixes dotnet#58062, but see the previous paragraph for caveats.

See:
- https://github.com/dotnet/llvm-project/blob/release/11.x/llvm/lib/Transforms/Scalar/SROA.cpp#L3371-L3388
- https://github.com/dotnet/llvm-project/blob/release/11.x/llvm/lib/Transforms/Scalar/SROA.cpp#L3327-L3340
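
The field-width choice described in the commit message can be sketched in C as follows. This is only an illustration under stated assumptions, not the code from the change: `vt_layout` and `choose_layout` are hypothetical names, the rule "use the smaller of the pointer size and the type's alignment" is one reading of the message, and the handling of leftover tail bytes is an assumption that the message does not cover.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a value type modeled as a struct of
 * equal-width integer fields. */
typedef struct {
    size_t field_bytes;  /* width of each field in bytes */
    size_t field_count;  /* number of full-width fields */
    size_t tail_bytes;   /* leftover bytes (assumption: handled separately) */
} vt_layout;

/* Pick pointer-sized fields when the alignment allows it, otherwise
 * alignment-sized fields (the packed / non-standard alignment case). */
static vt_layout choose_layout(size_t size, size_t align)
{
    /* Alignment must be a non-zero power of two. */
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t width = align < sizeof(void *) ? align : sizeof(void *);
    vt_layout layout = { width, size / width, size % width };
    return layout;
}
```

With pointer-sized fields, an element-wise decomposition of a copy moves each managed reference as one whole word rather than as byte fragments. A packed type with alignment 1 would still get byte-sized fields under this sketch, which matches the caveat in the message that such types need some other means of copying.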
imhameed added a commit that referenced this issue Sep 14, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 14, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021