Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mutex deadlock of roundrobin balancer #1353

Merged
merged 1 commit into from
Jul 12, 2017
Merged

Conversation

nghialv
Copy link
Contributor

@nghialv nghialv commented Jul 6, 2017

When it occurs?

  1. roundrobin: watchAddrUpdates adds a new addrs list to addCh (some addrs have been deleted from the previous list)
  2. clientconn: lbWatcher calls tearDown to close the addrConn
  3. clientconn: tearDown calls ac.down to notify the balancer that the connection has been closed
  4. deadlock (we are making a lock from the first step and all steps are synchronous)

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If your company signed a CLA, they designated a Point of Contact who decides which employees are authorized to participate. You may need to contact the Point of Contact for your company and ask to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the project maintainer to go/cla#troubleshoot.
  • In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again.

@nghialv
Copy link
Contributor Author

nghialv commented Jul 6, 2017

@googlebot I signed it!

@googlebot
Copy link

CLAs look good, thanks!

@dfawley dfawley requested a review from menghanl July 7, 2017 20:39
@nghialv
Copy link
Contributor Author

nghialv commented Jul 11, 2017

@menghanl Could you take a look at this and #1354?

@menghanl
Copy link
Contributor

menghanl commented Jul 11, 2017

Thanks for reporting this.

If my understanding is correct, what happened is:

  • roundrobin sends an update to gRPC, lbWatcher starts to process the update
  • roundrobin sends another update to gRPC (while holding the mutex)
    • this send blocks because the reader lbWatcher is not reading
    • also, the mutex is not released until the send unblocks
  • lbWatcher calls down when processing the previous update, because it removed some address
    • down tries to hold the mutex, and blocks
  • watchAddrUpdates is waiting for lbwatcher to read from the channel, lbwatcher is waiting for watchAddrUpdates to release the mutex

The problem here seems to be that, watchAddrUpdates should not send to addrCh while holding the mutex. If we release the mutex before sending to addrCh, this deadlock will be resolved.
Is that correct?

Another option would be, making sending to addrCh never block.
We could make addrCh a channel with buffer size 1 (make(chan []resolver.Address, 1), and pull the old update from it if there's any before sending

select {
  <-addrChs:
  default:
}
addrCh <- open

@nghialv
Copy link
Contributor Author

nghialv commented Jul 12, 2017

Thanks for your review.
You're right. Removing the old update before sending is a great idea.

Copy link
Contributor

@menghanl menghanl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!
LGTM.

@menghanl menghanl merged commit 27b2052 into grpc:master Jul 12, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019
@nghialv nghialv deleted the deadlock branch October 29, 2024 06:55
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants