HBASE-27768 Race conditions in BlockingRpcConnection #5154
Conversation
@Apache9 any chance you can look at this? Some of this comes from your original refactor of RpcClientImpl many years ago.
Thanks for taking care of this.
Reusing the same Runnable seems really terrible, but as you said, we recommend users use the netty RPC client now, so it is not worth doing a large refactoring.
🎊 +1 overall
This message was automatically generated.
Thank you both for taking a look. I'll merge once pre-commit comes back positive and after we do some internal testing.
💔 -1 overall
This message was automatically generated.
Any updates here? Thanks.
The bug was pretty intermittent (every 2–3 days), so we've been letting it run in some of our highest-volume clients for a few days. It's now been almost a week without issues, so I'm going to merge it in the next day or two when I get time. Thanks again.
Signed-off-by: Duo Zhang <[email protected]>
Signed-off-by: Xiaolin Ha <[email protected]>
(cherry picked from commit 6b6902d)
Change-Id: Ib28dc6afb4b0e3a84e6e2dbecf6d9af49f5fc865
The basic idea here is that we should usually have two threads: the main thread and a reader thread. Writes come through the main thread, which also handles calling `setupIOStreams()` if the connection has not yet been made. The reader thread continually polls for work, calling `readResponse()` when calls are found. The writer methods are all synchronized, while the reader thread is not; the reader thread uses the `waitForWork()` poll method, which is itself synchronized.

If an exception occurs while writing a request, `closeConn` will be called. This interrupts and nulls out the reader thread, along with the socket and streams, and fails all calls that were pending read. The next write to come in will go through `setupIOStreams()`, which will create new sockets/streams and start a new reader thread.

In an ideal world, when `closeConn` is called, the reader thread will be waiting on the `wait()` call in `waitForWork()`. In that case, it's likely (though not guaranteed) that when the thread is interrupted by `closeConn`, the `wait()` will finish and the first check in `waitForWork()` will be true (`thread == null`), so the reader thread will properly end.

Synchronization order is unspecified, though. So it's possible that while the existing `writeRequest`/`closeConn` was running, another write came in and was waiting on the monitor. When the original call releases the monitor, the new write proceeds and, since the socket is null, goes through `setupIOStreams()`. In this case, when the `wait()` finishes in `waitForWork()`, the check for `thread == null` will fail: the field will not be null, but will have changed to a new thread, not the current thread.

This can also occur if `closeConn` is called while we are in `readResponse()`, which is not synchronized at all. The same scenario can happen where a new write comes in after `closeConn` and creates a new reader thread before `readResponse()` finishes. The old reader then goes into `waitForWork()`, sees that `thread` is not null, and never dies.

A larger refactor would probably be in order here if `BlockingRpcConnection` weren't being replaced. As it is, I solved this issue by adding checks for whether `Thread.currentThread()` is equal to `thread`. I added this check in three places where better handling is necessary.

I don't really see any specific tests for `BlockingRpcConnection` beyond `TestBlockingIPC`, and I don't really know how I'd add a new test for this logic given our setup. Currently I'm working on doing some manual testing of this change in our environment with a live cluster and lots of multigets.
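The stale-reader fix described above can be sketched as follows. This is a minimal, self-contained model rather than the actual `BlockingRpcConnection` code: beyond `thread`, `waitForWork()`, and `closeConn`, the class and method names are hypothetical, and the real implementation differs. It shows a reader thread detecting that `thread` no longer refers to itself and exiting instead of living forever:

```java
public class ReaderRaceSketch {

  static class Conn {
    private Thread thread; // current reader thread; null after closeConn

    synchronized void setReader(Thread t) {
      thread = t;
    }

    // Reader blocks here between calls. Returns false when this thread is
    // no longer the connection's reader and should exit.
    synchronized boolean waitForWork() throws InterruptedException {
      if (thread != Thread.currentThread()) {
        // closeConn nulled the field, or setupIOStreams installed a new
        // reader while we were outside the monitor (e.g. in readResponse):
        // this thread is stale and must die.
        return false;
      }
      wait(100); // timed wait only so this sketch always terminates
      return thread == Thread.currentThread();
    }

    synchronized void closeConn() {
      if (thread != null) {
        thread.interrupt();
        thread = null;
      }
      notifyAll();
    }
  }

  public static void main(String[] args) throws Exception {
    Conn conn = new Conn();
    Thread reader = new Thread(() -> {
      try {
        while (conn.waitForWork()) {
          // readResponse() would run here
        }
      } catch (InterruptedException e) {
        // interrupted by closeConn; fall through and exit
      }
      System.out.println("stale reader exited");
    });
    conn.setReader(reader);
    reader.start();
    Thread.sleep(50);
    conn.closeConn(); // simulate a failed write tearing down the connection
    reader.join(5000);
    System.out.println("reader alive: " + reader.isAlive());
  }
}
```

Without the `Thread.currentThread()` comparison, a reader woken after a new reader had been installed would see a non-null `thread` field and loop forever; with it, the stale reader exits on its next pass through `waitForWork()` regardless of which interleaving occurred.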