Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mono] Deadlock on mono_loader_lock in mono_class_create_from_typedef #93686

Open
uweigand opened this issue Oct 18, 2023 · 9 comments
Open

[mono] Deadlock on mono_loader_lock in mono_class_create_from_typedef #93686

uweigand opened this issue Oct 18, 2023 · 9 comments

Comments

@uweigand
Copy link
Contributor

We've seen a deadlock in msbuild when performing a source-build of rc2 natively on s390x (using the Mono runtime). In a nutshell, the deadlock occurs due to a lock-order inversion between the Mono JIT's mono_loader_lock and the managed Microsoft.Build.BackEnd.Logging.LoggingService._lockObject.

Specifically, we have two involved threads, the main thread and a thread pool thread.

The main thread runs

  • NuGet.Build.Tasks.Console.Program:Main which calls
  • NuGet.Build.Tasks.Console.MSBuildStaticGraphRestore:LoadProjects which calls
  • Microsoft.Build.Evaluation.ProjectCollection:.ctor which calls
  • Microsoft.Build.BackEnd.Logging.LoggingService:RegisterLogger which TAKES Microsoft.Build.BackEnd.Logging.LoggingService._lockObject and calls
  • Microsoft.Build.Evaluation.ProjectCollection/ReusableLogger:RegisterForEvents
  • At this point, the JIT compiler for Microsoft.Build.BackEnd.Logging.EventSourceSink:add_BuildStarted gets invoked
  • This calls at some point mono_class_create_from_typedef which TAKES mono_loader_lock
  • And deadlocks here

The thread pool thread runs

  • NuGet.Build.Tasks.Console.ConsoleLoggingQueue:Process which calls
  • NuGet.Build.Tasks.ConsoleOutLogMessage:ToJson which calls
  • System.Collections.Concurrent.ConcurrentDictionary'2<TKey_REF, TValue_REF>:GetOrAdd
  • At this point, the JIT compiler for Newtonsoft.Json.Serialization.DefaultContractResolver:CreateContract gets invoked
  • This calls at some point mono_class_create_from_typedef which TAKES mono_loader_lock
  • And calls mono_metadata_interfaces_from_typedef_full, which loads for the first time System.Linq.Expressions.dll
  • This invokes the assembly load hook chain, in particular calling
  • Microsoft.Build.BackEnd.Components.RequestBuilder.AssemblyLoadsTracker:CurrentDomainOnAssemblyLoad which calls
  • Microsoft.Build.BackEnd.Logging.LoggingService:ProcessLoggingEvent which TAKES Microsoft.Build.BackEnd.Logging.LoggingService._lockObject
  • And deadlocks here

It seems to me the root cause of the deadlock is that mono_class_create_from_typedef holds mono_loader_lock across function calls that may trigger invoking managed code (the assembly load hooks) which might do anything - this doesn't seem a good idea.

CC - @directhex @lambdageek @vargaz @akoeplinger
FYI - @giritrivedi @alhad-deshpande @janani66 @omajid @tmds

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 18, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 18, 2023
@lambdageek
Copy link
Member

It seems to me the root cause of the deadlock is that mono_class_create_from_typedef holds mono_loader_lock across function calls that may trigger invoking managed code (the assembly load hooks) which might do anything - this doesn't seem a good idea.

Good catch!

@lambdageek lambdageek added this to the 8.0.0 milestone Oct 18, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Oct 18, 2023
@lambdageek lambdageek removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 18, 2023
@lambdageek lambdageek modified the milestones: 8.0.0, 9.0.0 Oct 18, 2023
@lambdageek
Copy link
Member

Related to / duplicate of #51864

This is hard

@lambdageek
Copy link
Member

@uweigand this will be quite tricky to fix and I think even if we put something together, it is highly unlikely that we will backport the fix in upstream net8.0 as it's bound to be risky (any time we make big changes to class-init.c we uncover issues with the change).

I would encourage, to whatever extent possible, to fix this in user code by rearranging the calling code to trigger assembly loading early. (I realize in this case I'm proposing that you change MSBuild... sorry.)

@Vishwanatha-HD
Copy link

I did reproduce the hang issue on dotnet 8 on different Redhat beaker machines with different RHEL distros.

I was able to reproduce for around 10 or 12 hang situations and in each of the case, when I attached gdb and looked at the backtrace information, I see the same backtrace as attached in the file..

I did work with Ulrich and local team in IBM, and Ulrich helped us to figure out that its the exact same issue as this one.

I have attached the logfile with all the details of all the process and threads backtrace..
dotnet_8_hang_issue_gdb_all_proc_info.txt

@uweigand
Copy link
Contributor Author

uweigand commented Apr 5, 2024

@Vishwanatha-HD has now found an instance of this deadlock in the .NET 6 source-build. Unfortunately, while one of the involved locks is also the mono_loader_lock, the other involved lock is a different one; not a Microsoft.Build.BackEnd.Logging.LoggingService._lockObject, but rather an _lock object inside the implementation of ArrayPool<byte>.Shared in the runtime.

Again, we have a deadlock between two of msbuild's threads, this time a dedicated event logging thread and some thread pool thread.

The event logging thread

  • handles some event via Microsoft.Build.BackEnd.Logging.LoggingService:RouteBuildEvent
  • which in the end boils down to a System.IO.BinaryWriter:Write (string) call
  • which has a "fast UTF-8" shortcut using a temporary buffer allocated via ArrayPool<byte>.Shared.Rent and returned via ArrayPool<byte>.Shared.Return
  • Shared.Return stores the returned buffer for re-use in a per-thread pool, which on first invocation is initialized by InitializeTlsBucketsAndTrimming
  • the latter calls ConditionalWeakTable::Add on its _allTlsBuckets variable
  • and Add TAKES an internal _lock object
  • while holding this lock object, we call some more routines, one of which is not yet JITted
  • so Mono calls mono_jit_compile_method, which TAKES the mono_loader_lock, where we deadlock

The thread pool thread

  • is initializing by calling Microsoft.Build.BackEnd.AssemblyTaskFactory:InitializeFactory
  • which loads the task via Microsoft.Build.Shared.TypeLoader:Load
  • which in a subroutine calls Microsoft.Build.Shared.TypeLoader/AssemblyInfoToLoadedTypes:ScanAssemblyForPublicTypes
  • which loads the assembly via a non-default Microsoft.Build.Shared.MSBuildLoadContext and then inspects it via System.Reflection.RuntimeAssembly:GetExportedTypes
  • which is implemented as a Mono intrinsic, which in a subroutine calls mono_class_create_from_typedef
  • which TAKES the mono_loader_lock
  • and then calls mono_class_get_checked, which ends up calling mono_alc_invoke_resolve_using_load
  • which calls the non-default handler Microsoft.Build.Shared.MSBuildLoadContext:Load installed above
  • which performs various file system checks via System.IO.File:Exists
  • which on Linux ends up calling Interop.Sys.LStat
  • which creates and disposes of a ValueUtf8Converter to perform string marshalling
  • which uses a temporary Array<byte>.Shared buffer, and the dispose operation calls Return on it
  • and as above, Shared.Return calls InitializeTlsBucketsAndTrimming
  • which TAKES the internal lock object of _allTlsBuckets via ConditionalWeakTable::Add and deadlocks

@Vishwanatha-HD
Copy link

Hi All,
I have attached a txt file with the complete backtrace information of all the processes and threads involved in reproducing the .NET6 source build hang issue.
dotnet-6-source-build-hang-process-thread-bt-info.txt

@lambdageek
Copy link
Member

@uweigand overall how frequent are these deadlocks?

I tried to prototype one idea for a fix (essentially avoid loader callbacks into managed code while holding the global loader lock by pre-loading any assemblies mentioned by a a MonoClass before taking the lock), and it was very invasive, was likely to result in overall performance degradation and also never quite converged on a correct solution (I could still see the loader lock held when loading assemblies - and I ran out of time on the experiment to fix them all).

I'm going to think about some alternate approaches during .NET 9, but I'm not certain there is any reasonable approach here.


mono_jit_compile_method specifically might be somewhat fixable:

		// FIXME Locking here is somewhat historical due to mono_register_jit_icall_wrapper taking loader lock.
		// atomic_compare_exchange should suffice.
		mono_loader_lock ();
		mono_jit_lock ();
		if (!callinfo->wrapper) {
			callinfo->wrapper = p;
		}
		mono_jit_unlock ();
		mono_loader_unlock ();

This will likely just shift around the specific circumstances of the deadlock, however. It wont' fix the fundamental problem.

lambdageek added a commit to lambdageek/runtime that referenced this issue Apr 11, 2024
Related to dotnet#93686

While this doesn't eliminate all deadlocks related to the global
loader lock and managed locks, it removes one unneeded use of the
loader lock.

The wrapper (and trampoline) of a JIT icall are only ever set from
NULL to non-NULL.  We can use atomics to deal with races instad of
double checked locking.  This was not the case historically, because
the JIT info was dynamically allocated - so we used the loader lock to
protect the integrity of the hash table
lambdageek added a commit to lambdageek/runtime that referenced this issue Apr 11, 2024
Related to dotnet#93686

While this doesn't eliminate all deadlocks related to the global
loader lock and managed locks, it removes one unneeded use of the
loader lock.

The wrapper (and trampoline) of a JIT icall are only ever set from
NULL to non-NULL.  We can use atomics to deal with races instad of
double checked locking.  This was not the case historically, because
the JIT info was dynamically allocated - so we used the loader lock to
protect the integrity of the hash table
lambdageek added a commit to lambdageek/runtime that referenced this issue Apr 11, 2024
Related to dotnet#93686

While this doesn't eliminate all deadlocks related to the global
loader lock and managed locks, it removes one unneeded use of the
loader lock.

The wrapper (and trampoline) of a JIT icall are only ever set from
NULL to non-NULL.  We can use atomics to deal with races instad of
double checked locking.  This was not the case historically, because
the JIT info was dynamically allocated - so we used the loader lock to
protect the integrity of the hash table
lambdageek added a commit to lambdageek/runtime that referenced this issue Apr 20, 2024
Related to dotnet#93686

While this doesn't eliminate all deadlocks related to the global
loader lock and managed locks, it removes one unneeded use of the
loader lock.

The wrapper (and trampoline) of a JIT icall are only ever set from
NULL to non-NULL.  We can use atomics to deal with races instad of
double checked locking.  This was not the case historically, because
the JIT info was dynamically allocated - so we used the loader lock to
protect the integrity of the hash table
@uweigand
Copy link
Contributor Author

uweigand commented May 1, 2024

I've now done a bit of code-flow analysis to try and figure out the places where Mono invokes a C# routine while holding the loader lock. The following list contains functions that take the loader lock, and lists all calls to subroutines while holding the lock that potentially invoke a C# routine on some path. This may not be 100% accurate (there may be false positives as in many cases, the invocation of the C# routine is conditional, and it might happen that at this call site the condition is never true; and there may be false negatives in case I missed part of the CFG due to indirect calls or macro magic), but it does show that this problem might be more widespread that I originally thought.

I've done the same analysis on the code after applying the "ready levels" patch #95927, but unfortunately this doesn't look significantly different - even when accounting for the fact that some of the recursive class loader calls under the lock no longer load the parent classes, most of these functions still have other code paths where C# is invoked.

mono/metadata/class-init.c:mono_class_setup_fields
--> mono_class_layout_fields
mono/metadata/class-init.c:mono_class_create_from_typedef[_at_level]
--> mono_metadata_load_generic_params  (LEVEL: no longer under loader lock, but now under image lock)
--> mono_metadata_lookup_generic_class  (LEVEL: no longer under loader lock, but now under image lock)
--> mono_class_get_checked
--> mono_class_inflate_generic_class_checked
--> mono_class_setup_mono_type
--> fix_gclass_incomplete_instantiation
--> mono_class_create_from_typedef
--> mono_metadata_interfaces_from_typedef_full
--> mono_class_find_enum_basetype
--> mono_class_from_mono_type_internal
--> class_has_isbyreflike_attribute
--> class_has_inlinearray_attribute
mono/metadata/class-init.c:mono_class_create_generic_inst[_at_level]
--> check_valid_generic_inst_arguments  (LEVEL: additionally under lock)
--> mono_type_get_full_name  (LEVEL: additionally under lock)
--> mono_class_get_nullable_param_internal
--> mono_generic_class_setup_parent
--> class_loaded
mono/metadata/class-init.c:mono_class_create_ptr_at_level:
--> class_loaded  (LEVEL: additionally under lock)
mono/metadata/class-init.c:mono_class_layout_fields
--> mono_class_full_name
mono/metadata/class-init.c:mono_class_setup_interface_id
--> mono_class_setup_interface_id_nolock --> mono_get_unique_iid

mono/metadata/class-setup-vtable.c:mono_class_setup_interface_offsets_internal
--> mono_class_setup_interface_id_nolock
--> mono_class_get_implemented_interfaces
--> mono_type_get_full_name
--> mono_class_setup_count_virtual_methods

mono/metadata/object.c:mono_vtable_build_imt_slot
--> build_imt_slot
mono/metadata/object.c:mono_method_add_generic_virtual_invocation
--> mini_get_imt_trampoline (via callbacks.get_imt_trampoline)
--> mini_get_vtable_trampoline (via callbacks.get_vtable_trampoline)
--> mono_arch_build_imt_trampoline (via mono_get_runtime_callbacks ()->build_imt_trampoline)
mono/metadata/object.c:allocate_collectible_static_fields
--> mono_class_get_fields_internal
--> allocate_loader_alloc_slot
--> mono_class_from_mono_type_internal
--> mono_object_new_pinned
mono/metadata/object.c:mono_class_create_runtime_vtable
--> mono_class_init_internal
--> mono_class_setup_vtable
--> mono_class_check_vtable_constraints
--> mono_class_has_finalizer
--> mono_class_data_size
--> alloc_vtable
--> mono_mem_manager_get_loader_alloc
--> mono_class_compute_gc_descriptor
--> allocate_collectible_static_fields
--> compute_class_bitmap
--> mono_class_get_fields_internal
--> field_is_special_static
--> mono_class_from_mono_type_internal
--> mono_type_size
--> mono_field_get_data
--> mono_class_value_size
--> mono_type_get_object_handle

mono/metadata/sre.c:reflection_setup_class_hierarchy
--> mono_class_from_mono_type_internal
--> mono_class_setup_mono_type
mono/metadata/sre.c:reflection_setup_internal_class_internal
--> reflection_init_generic_class
--> reflection_setup_internal_class
--> mono_reflection_type_handle_mono_type
--> mono_class_setup_vtable_general
--> mono_class_setup_mono_type
--> mono_class_from_mono_type_internal
mono/metadata/sre.c:reflection_methodbuilder_to_mono_method
--> mono_reflection_type_get_handle
--> mono_metadata_type_dup
--> mono_get_shared_generic_inst
--> mono_marshal_spec_from_builder
mono/metadata/sre.c:ctorbuilder_to_mono_method
--> mono_reflection_methodbuilder_from_ctor_builder
--> ctor_builder_to_signature_raw
mono/metadata/sre.c:methodbuilder_to_mono_method
--> method_builder_to_signature
mono/metadata/sre.c:ves_icall_TypeBuilder_create_runtime_class
--> mono_class_setup_mono_type
--> ensure_runtime_vtable
--> reflection_setup_internal_class
--> mono_class_from_mono_type_internal
--> typebuilder_setup_fields
--> remove_instantiations_of_and_ensure_contents
mono/metadata/sre.c:reflection_create_dynamic_method
--> mono_gc_reference_queue_add_internal

mono/metadata/reflection.c:mono_type_get_object_checked /*FIXME mono_class_init_internal and mono_class_vtable acquire it*/
--> mono_type_normalize
--> mono_type_get_full_name
--> mono_weak_hash_table_insert
--> mono_mem_manager_get_loader_alloc
--> mono_object_new_pinned
--> mono_type_get_object_checked (!!! recursive call under lock !!!)
--> mono_mem_manager_init_reflection_hashes  (recursive!)
--> mono_class_from_mono_type_internal  (recursive!)
mono/metadata/reflection.c:mono_reflection_bind_generic_parameters
--> mono_reflection_type_handle_mono_type
--> mono_class_from_mono_type_internal

mono/metadata/metadata.c:mono_metadata_get_canonical_generic_inst // Hashtable key equal func can take loader lock
--> mono_metadata_generic_inst_equal
--> mono_metadata_type_dup
mono/metadata/metadata.c:mono_metadata_get_canonical_aggregate_modifiers // Hashtable key equal func can take loader lock
--> aggregate_modifiers_equal
--> mono_metadata_type_dup
mono/metadata/metadata.c:mono_metadata_lookup_generic_class // Hashtable key equal func can take loader lock
--> mono_generic_class_equal (via mono_conc_hashtable_insert)

mono/mini/mini.c:mono_jit_compile_method_inner
--> mono_method_is_generic_sharable
--> mono_method_desc_full_match
--> mono_method_get_full_name
--> mono_emit_jit_map

mono/mini/mini-generic-sharing.c:mono_class_get_runtime_generic_context_template
--> mono_class_get_runtime_generic_context_template  (!!! recursive call under lock !!!)
--> class_get_rgctx_template_oti

mono/mini/mini-llvm.c:mono_llvm_emit_method
--> mono_aot_get_mangled_method_name
--> mono_aot_get_method_name
--> mono_method_full_name
--> emit_method_inner
--> get_callee

mono/mini/aot-runtime.c:init_amodule_got
--> decode_patches
--> mini_llvm_init_method
--> mono_resolve_patch_target
mono/mini/aot-runtime.c:load_container_amodule
--> load_aot_module

mono/component/hot_reload.c:hot_reload_get_or_add_ginst_update_info
--> recompute_ginst_update_info

mono/component/debugger-engine.c:mono_de_add_pending_breakpoints
--> mono_jit_search_all_backends_for_jit_info
--> insert_breakpoint
mono/component/debugger-engine.c:mono_de_set_breakpoint
--> set_bp_in_method
--> mono_jit_search_all_backends_for_jit_info
--> mono_debugger_log_add_bp
mono/component/debugger-engine.c:mono_de_clear_breakpoint
--> mono_debugger_log_remove_bp
mono/component/debugger-engine.c:mono_de_clear_breakpoints_for_domain
--> remove_breakpoint
mono/component/debugger-engine.c:mono_de_process_breakpoint
--> mono_find_prev_seq_point_for_native_offset
--> no_seq_points_found
--> mono_debugger_log_bp_hit
--> mono_de_frame_async_id
--> mono_de_ss_update
--> mono_de_ss_start
mono/component/debugger-engine.c:mono_de_ss_start
--> mono_find_prev_seq_point_for_native_offset
--> mono_de_start_single_stepping

mono/component/debugger-agent.c:suspend_vm
--> mono_de_start_single_stepping
--> notify_thread
mono/component/debugger-agent.c:process_frame
--> calc_il_offset
--> mono_method_full_name
--> mono_marshal_method_from_wrapper
mono/component/debugger-agent.c:send_types_for_domain
--> emit_type_load
mono/component/debugger-agent.c:event_requests_cleanup
--> mono_de_clear_breakpoint
mono/component/debugger-agent.c:clear_event_request
--> mono_de_clear_breakpoint
--> mono_de_cancel_ss
mono/component/debugger-agent.c:clear_event_requests_for_assembly
--> clear_event_request
--> ss_clear_for_assembly
mono/component/debugger-agent.c:clear_types_for_assembly
--> type_comes_from_assembly
mono/component/debugger-agent.c:dispose_vm
--> clear_event_request
mono/component/debugger-agent.c:vm_commands
--> clear_event_request
--> get_types_for_source_file
--> get_types
mono/component/debugger-agent.c:event_commands
--> emit_appdomain_load
--> send_assemblies_for_domain
--> emit_thread_start
--> send_types_for_domain
--> clear_event_request
--> mono_de_clear_breakpoint

As an aside, it seems unclear what exactly the loader lock is protecting - many of these uses of the lock should probably use some other serialization mechanism (if any is needed at all) ...

lambdageek added a commit to lambdageek/runtime that referenced this issue Jun 26, 2024
Related to dotnet#93686

While this doesn't eliminate all deadlocks related to the global
loader lock and managed locks, it removes one unneeded use of the
loader lock.

The wrapper (and trampoline) of a JIT icall are only ever set from
NULL to non-NULL.  We can use atomics to deal with races instad of
double checked locking.  This was not the case historically, because
the JIT info was dynamically allocated - so we used the loader lock to
protect the integrity of the hash table
lambdageek added a commit that referenced this issue Jun 27, 2024
…lines (#104038)

* [mini] Use atomics, instead of loader lock, for JIT wrappers

   Related to #93686

   While this doesn't eliminate all deadlocks related to the global loader lock and managed locks, it removes one unneeded use of the loader lock.

   The wrapper (and trampoline) of a JIT icall are only ever set from NULL to non-NULL.  We can use atomics to deal with races instead of double checked locking.  This was not the case historically, because the JIT info was dynamically allocated - so we used the loader lock to protect the integrity of the hash table
@steveisok steveisok modified the milestones: 9.0.0, Future Aug 5, 2024
@WeebOwO
Copy link

WeebOwO commented Oct 10, 2024

I have encountered a similar issue. In my case, two different threads are JIT compiling the same method. One of the threads' call chain goes through mono_class_create_from_typedef and cause deadlock. So, I would like to ask if the fix for this issue is still in progress? 😣

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants