Bug Report: Deadlock in messager engine #17229
Comments
I looked at the other engines that use the schema engine, and it turns out the health streamer has a similar deadlock - Operations for close -
The order of operations during a broadcast call from the schema engine are as follows -
I'm glad you found this. There have been some deadlock issues in the past that were hard to diagnose. I wonder if this was the underlying cause, and the other changes mitigated the issue without solving the root concern. We haven't run into this internally that I'm aware of. I'm curious, were you doing something related to messaging that caused you to find it, or was this a side effect of other work? I'm always curious how much usage messaging is getting.
I don't know a lot of details, but I was investigating a failure that caused
Overview of the Issue
It was noticed that there is a deadlock in the messager engine code. When we `Close` the messager engine, the order of operations is as follows:

1. Acquire the `mu` mutex from the messager engine.
2. Call `UnregisterNotifier`, which acquires the `notifierMu` mutex.

The order of operations during a broadcast call from the schema engine is as follows:

1. Acquire the `notifierMu` mutex.
2. Call the `schemaChanged` method.
3. Acquire the `mu` mutex lock.

From the order of operations it is clear that we can reach a deadlock if two goroutines, each running one of the sequences above, both manage to acquire their first lock. Each will then fail to acquire its second lock and will wait indefinitely.
This can cause `DemotePrimary` to block, as the messager engine `Close()` is a synchronous call in that flow.

Reproduction Steps
This is very hard to reproduce in an e2e fashion, but it can be observed manually by looking at the code and trying to call `schemaChanged` and `Close` in parallel.

Binary Version
Operating System and Environment details
Log Fragments
No response