Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: Deadlock in messager engine #17229

Closed
GuptaManan100 opened this issue Nov 14, 2024 · 3 comments · Fixed by #17230
Closed

Bug Report: Deadlock in messager engine #17229

GuptaManan100 opened this issue Nov 14, 2024 · 3 comments · Fixed by #17230

Comments

@GuptaManan100
Copy link
Member

Overview of the Issue

It was noticed that there is a deadlock in the messager engine code. When we Close the messager engine. The order of operations are as follows -

  1. Acquire mu mutex from messager engine.
  2. Call UnregisterNotifier, which acquires the notifierMu mutex.

The order of operations during a broadcast call from the schema engine are as follows -

  1. We acquire the notifierMu mutex.
  2. Go over all the notifier functions and call them. In the case of messager engine, it is the schemaChanged method.
  3. This function acquires the mu mutex lock.

From the order of operations it is clear that we can reach a deadlock if two go routines running the order of operations defined above, are able to acquire the first lock respectively. They will fail to acquire the second lock and will continue to wait indefinitely.

This can cause DemotePrimary to block as messager engine Close() is a synchronous call in that flow.

Reproduction Steps

This is very hard to reproduce in a e2e fashion, but can be observed manually by looking at the code, and trying to call schemaChange and Close in parallel.

Binary Version

main

Operating System and Environment details

-

Log Fragments

No response

@GuptaManan100
Copy link
Member Author

I looked at the other engines that use the schema engine, and it turns out the health streamer has a similar deadlock -

Operations for close -

  1. Acquire mu mutex from messager engine.
  2. Call UnregisterNotifier, which acquires the notifierMu mutex.

The order of operations during a broadcast call from the schema engine are as follows -

  1. We acquire the notifierMu mutex.
  2. Go over all the notifier functions and call them. In the case of health streamer, it is the reload method.
  3. This function acquires the mu mutex lock.

@derekperkins
Copy link
Member

I'm glad you found this. There have been some deadlock issues in the past that were hard to diagnose. I wonder if this was the underlying cause, and the other changes mitigated the issue without solving the root concern. We haven't run into this internally that I'm aware of.

I'm curious, were you doing something related to messaging that caused you to find it, or was this a side effect of other work? I'm always curious how much usage messaging is getting.

@GuptaManan100
Copy link
Member Author

I don't know a lot of details, but I was investigating a failure that caused DemotePrimary to be indefinitely stuck, and then I found that we had this deadlock! The messager was stuck in Close waiting for the mutex, and then I realized, an outstanding Broadcast call was holding it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants