HBASE-27768 Race conditions in BlockingRpcConnection #5154

Merged
merged 2 commits into from
Apr 10, 2023

bbeaudreault
Contributor

The basic idea here is that we should usually have two threads: the main thread and a reader thread. Writes come through the main thread, which also calls setupIOStreams() if the connection is not yet established. The reader thread continually polls for work, calling readResponse() when calls are found. The writer methods are all synchronized, while the reader thread's loop is not. The reader thread uses a waitForWork() poll method, which is itself synchronized.

If an exception occurs while writing a request, closeConn will be called. This interrupts and nulls out the reader thread, along with the socket and streams, and fails all calls that were pending read. The next write to come in will go through setupIOStreams(), which will create new sockets/streams and start a new reader thread.

In an ideal world, when closeConn is called, the reader thread is blocked in the wait() call in waitForWork(). It is then likely (but not guaranteed) that the interrupt from closeConn ends the wait() and the first check in waitForWork() (thread == null) is true, so the reader thread ends properly.

However, the order in which waiting threads acquire the monitor is unspecified. So it's possible that while the existing writeRequest/closeConn was running, another write came in and was waiting on the monitor. When the original call releases the monitor, the new write may get it first and, since the socket is null, go through setupIOStreams(). In that case, when the wait() finishes in waitForWork(), the thread == null check fails: thread is not null, because it now points to a new reader thread rather than the current one.

This can also occur if closeConn is called while we are in readResponse(), which is not synchronized at all. The same scenario can happen: a new write comes in after closeConn and creates a new thread before readResponse finishes. The old reader then goes into waitForWork(), sees that thread is not null, and never dies.
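
The interleaving described above can be replayed on a single thread to make the failed exit check concrete. This is only a minimal sketch, not the actual BlockingRpcConnection code: the class StaleReaderCheckSketch, the method readerShouldExit(), and the simplified setupIOStreams(Thread)/closeConn() signatures are hypothetical stand-ins that model nothing but the thread field.

```java
/**
 * Hypothetical model of the reader bookkeeping described above. It is NOT the
 * HBase implementation; it only tracks which reader thread is "current".
 */
final class StaleReaderCheckSketch {
  private Thread thread; // current reader thread, guarded by "this"

  // Assumption: stands in for the part of setupIOStreams() that installs a new reader.
  synchronized void setupIOStreams(Thread reader) {
    thread = reader;
  }

  // Assumption: stands in for the part of closeConn() that nulls out the reader.
  synchronized void closeConn() {
    thread = null;
  }

  // Exit check as described in the text: only a null field tells a reader to stop.
  synchronized boolean readerShouldExit() {
    return thread == null;
  }

  /** Replays the interleaving; returns {exit seen right after close, exit seen after replacement}. */
  static boolean[] replay() {
    StaleReaderCheckSketch conn = new StaleReaderCheckSketch();
    Thread oldReader = new Thread(() -> { }, "old-reader"); // never started, identity only
    Thread newReader = new Thread(() -> { }, "new-reader"); // never started, identity only

    conn.setupIOStreams(oldReader);
    conn.closeConn(); // a failed write closes the connection
    boolean exitIfCheckedImmediately = conn.readerShouldExit();

    conn.setupIOStreams(newReader); // a queued write wins the monitor first
    boolean exitIfCheckedLate = conn.readerShouldExit();

    return new boolean[] { exitIfCheckedImmediately, exitIfCheckedLate };
  }

  public static void main(String[] args) {
    boolean[] result = replay();
    System.out.println("old reader exits if it checks right after closeConn: " + result[0]); // true
    System.out.println("old reader exits if a new reader was installed first: " + result[1]); // false
  }
}
```

The second value is false: once a replacement reader has been installed, a null check alone can no longer tell the old reader that it has been retired.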


A larger refactor would probably be in order here if BlockingRpcConnection weren't being replaced. As it is, I solved this issue by adding two things:

  1. Add an isCurrentThreadExpected check, which tests whether Thread.currentThread() is equal to thread. I added this check in three places where better handling is necessary:
    1. In the waitForWork loop, when thread != null. We should also check that thread is the current thread; otherwise we need to exit.
    2. When handling InterruptedException. Here we want to call closeConn only if the interrupt didn't come from closeConn itself, for example if the process is ending or an external actor interrupted us.
    3. In readResponse, when deciding whether to call closeConn after an error. I've seen cases where we unnecessarily close and restart the same connection thread multiple times: closeConn causes readResponse to fail, but in the meantime a new connection thread has been created. The readResponse failure then calls closeConn again even though the new connection is fine.
  2. Synchronize the reader threads, so that two reader threads can't read from the same socket. Concurrent reads from one socket cause corruption and other oddities.
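
A rough sketch of how the isCurrentThreadExpected check from item 1 lets a retired reader exit, under stated assumptions: the class ReaderOwnershipSketch, the pendingCalls counter, the 100 ms wait timeout, and the readLoop() and scenario method names are all hypothetical, and readResponse() and the socket handling are left out. It is not the patch itself.

```java
/**
 * Hypothetical sketch of the ownership check described above; NOT the HBase patch.
 * A reader keeps running only while it is the thread the connection expects.
 */
final class ReaderOwnershipSketch {
  private Thread thread;    // reader the connection currently expects, guarded by "this"
  private int pendingCalls; // assumption: stand-in for the queue of calls awaiting a response

  private synchronized boolean isCurrentThreadExpected() {
    return thread == Thread.currentThread();
  }

  // Assumption: models only the "start a new reader" part of setupIOStreams().
  synchronized Thread setupIOStreams() {
    Thread reader = new Thread(this::readLoop, "reader-sketch");
    reader.setDaemon(true);
    thread = reader; // set before start() so the new reader sees itself as expected
    reader.start();
    return reader;
  }

  synchronized void writeRequest() {
    pendingCalls++;
    notifyAll();
  }

  // Assumption: models only the "interrupt and null out the reader" part of closeConn().
  synchronized void closeConn() {
    if (thread != null) {
      thread.interrupt();
      thread = null;
    }
    notifyAll();
  }

  /** Returns true when there is work, false when the calling reader must exit. */
  private synchronized boolean waitForWork() {
    while (isCurrentThreadExpected()) { // false when closed OR replaced by a newer reader
      if (pendingCalls > 0) {
        pendingCalls--;
        return true;
      }
      try {
        wait(100);
      } catch (InterruptedException e) {
        // Still expected: the interrupt did not come from closeConn, so close now.
        if (isCurrentThreadExpected()) {
          closeConn();
        }
      }
    }
    return false;
  }

  private void readLoop() {
    while (waitForWork()) {
      // readResponse() would run here in the real connection.
    }
  }

  /** Close, immediately install a new reader, and report whether the old one exited. */
  static boolean oldReaderExitsAfterReplacement() {
    ReaderOwnershipSketch conn = new ReaderOwnershipSketch();
    try {
      Thread oldReader = conn.setupIOStreams();
      conn.closeConn();
      Thread newReader = conn.setupIOStreams();
      oldReader.join(5000);
      boolean oldExited = !oldReader.isAlive();
      boolean newStillRunning = newReader.isAlive();
      conn.closeConn();
      newReader.join(5000);
      return oldExited && newStillRunning && !newReader.isAlive();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("old reader exited while new reader kept running: "
      + oldReaderExitsAfterReplacement());
  }
}
```

Because the loop condition compares against Thread.currentThread() rather than null, the old reader exits whether it wakes up before or after the replacement reader is installed.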

I don't really see any specific tests for BlockingRpcConnection, beyond TestBlockingIPC. I don't really know how I'd add a new test for this logic given our setup. Currently I'm working on doing some manual testing of this change in our environment with a live cluster and lots of multigets.

@bbeaudreault bbeaudreault requested a review from Apache9 March 30, 2023 15:17
@bbeaudreault
Contributor Author

@Apache9 any chance you can look at this? Some of this comes from your original refactor of RpcClientImpl many years ago.

Contributor

@Apache9 Apache9 left a comment

Thanks for taking care of this.

Reusing the same Runnable seems really terrible, but as you said, we now recommend that users use netty rpc, so a large refactoring is not worth doing.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 36s Docker mode activated.
-0 ⚠️ yetus 0m 5s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 37s branch-2 passed
+1 💚 compile 0m 18s branch-2 passed
+1 💚 shadedjars 4m 51s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 17s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 3m 18s the patch passed
+1 💚 compile 0m 18s the patch passed
+1 💚 javac 0m 18s the patch passed
+1 💚 shadedjars 4m 53s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 16s the patch passed
_ Other Tests _
+1 💚 unit 7m 53s hbase-client in the patch passed.
27m 38s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #5154
Optional Tests javac javadoc unit shadedjars compile
uname Linux f1f026bd62c1 5.4.0-1094-aws #102~18.04.1-Ubuntu SMP Tue Jan 10 21:07:03 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 9f4b31e
Default Java Eclipse Adoptium-11.0.17+8
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/1/testReport/
Max. process+thread count 196 (vs. ulimit of 30000)
modules C: hbase-client U: hbase-client
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/1/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@bbeaudreault
Contributor Author

Thank you both for taking a look. I'll merge once pre-commit comes back positive and after we do some internal testing.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 57s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 37s branch-2 passed
+1 💚 compile 0m 45s branch-2 passed
+1 💚 checkstyle 0m 19s branch-2 passed
+1 💚 spotless 0m 43s branch has no errors when running spotless:check.
+1 💚 spotbugs 0m 53s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 3m 21s the patch passed
+1 💚 compile 0m 43s the patch passed
+1 💚 javac 0m 43s the patch passed
+1 💚 checkstyle 0m 17s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 17m 51s Patch does not cause any errors with Hadoop 2.10.2 or 3.2.4 3.3.4.
+1 💚 spotless 0m 40s patch has no errors when running spotless:check.
+1 💚 spotbugs 0m 58s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 12s The patch does not generate ASF License warnings.
32m 59s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/1/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #5154
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname Linux 6ac43c824ea4 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 9f4b31e
Default Java Eclipse Adoptium-11.0.17+8
Max. process+thread count 85 (vs. ulimit of 30000)
modules C: hbase-client U: hbase-client
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/1/console
versions git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 50s Docker mode activated.
-0 ⚠️ yetus 0m 6s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
-1 ❌ mvninstall 4m 20s root in branch-2 failed.
+1 💚 compile 0m 27s branch-2 passed
+1 💚 shadedjars 6m 52s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 24s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 3m 59s the patch passed
+1 💚 compile 0m 27s the patch passed
+1 💚 javac 0m 27s the patch passed
+1 💚 shadedjars 6m 8s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 18s the patch passed
_ Other Tests _
+1 💚 unit 8m 8s hbase-client in the patch passed.
33m 27s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/1/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR #5154
Optional Tests javac javadoc unit shadedjars compile
uname Linux a0786dd54419 5.4.0-1094-aws #102~18.04.1-Ubuntu SMP Tue Jan 10 21:07:03 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / 9f4b31e
Default Java Temurin-1.8.0_352-b08
mvninstall https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/1/artifact/yetus-jdk8-hadoop2-check/output/branch-mvninstall-root.txt
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/1/testReport/
Max. process+thread count 172 (vs. ulimit of 30000)
modules C: hbase-client U: hbase-client
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/1/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache9
Contributor

Apache9 commented Apr 8, 2023

Any updates here?

Thanks.

@bbeaudreault
Contributor Author

The bug was pretty intermittent (every 2-3 days), so we’ve been letting the fix run in some of our highest volume clients for a few days. It’s now been almost a week without issues, so I’m going to merge it in the next day or two when I get time. Thanks again.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 48s Docker mode activated.
-0 ⚠️ yetus 0m 5s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 6s branch-2 passed
+1 💚 compile 0m 19s branch-2 passed
+1 💚 shadedjars 4m 57s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 15s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 2m 58s the patch passed
+1 💚 compile 0m 20s the patch passed
+1 💚 javac 0m 20s the patch passed
+1 💚 shadedjars 4m 49s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 16s the patch passed
_ Other Tests _
+1 💚 unit 7m 48s hbase-client in the patch passed.
27m 2s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/2/artifact/yetus-jdk8-hadoop2-check/output/Dockerfile
GITHUB PR #5154
Optional Tests javac javadoc unit shadedjars compile
uname Linux 5619a47c28c4 5.4.0-1094-aws #102~18.04.1-Ubuntu SMP Tue Jan 10 21:07:03 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / a67a8f7
Default Java Temurin-1.8.0_352-b08
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/2/testReport/
Max. process+thread count 172 (vs. ulimit of 30000)
modules C: hbase-client U: hbase-client
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/2/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 39s Docker mode activated.
-0 ⚠️ yetus 0m 5s Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck
_ Prechecks _
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 45s branch-2 passed
+1 💚 compile 0m 18s branch-2 passed
+1 💚 shadedjars 4m 48s branch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 18s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 3m 17s the patch passed
+1 💚 compile 0m 20s the patch passed
+1 💚 javac 0m 20s the patch passed
+1 💚 shadedjars 4m 48s patch has no errors when building our shaded downstream artifacts.
+1 💚 javadoc 0m 16s the patch passed
_ Other Tests _
+1 💚 unit 7m 54s hbase-client in the patch passed.
27m 46s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/2/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile
GITHUB PR #5154
Optional Tests javac javadoc unit shadedjars compile
uname Linux 470d7d7fece1 5.4.0-1094-aws #102~18.04.1-Ubuntu SMP Tue Jan 10 21:07:03 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / a67a8f7
Default Java Eclipse Adoptium-11.0.17+8
Test Results https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/2/testReport/
Max. process+thread count 196 (vs. ulimit of 30000)
modules C: hbase-client U: hbase-client
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/2/console
versions git=2.34.1 maven=3.8.6
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 1m 41s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+1 💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ branch-2 Compile Tests _
+1 💚 mvninstall 3m 26s branch-2 passed
+1 💚 compile 0m 44s branch-2 passed
+1 💚 checkstyle 0m 18s branch-2 passed
+1 💚 spotless 0m 42s branch has no errors when running spotless:check.
+1 💚 spotbugs 0m 50s branch-2 passed
_ Patch Compile Tests _
+1 💚 mvninstall 3m 19s the patch passed
+1 💚 compile 0m 43s the patch passed
+1 💚 javac 0m 43s the patch passed
+1 💚 checkstyle 0m 18s the patch passed
+1 💚 whitespace 0m 0s The patch has no whitespace issues.
+1 💚 hadoopcheck 17m 57s Patch does not cause any errors with Hadoop 2.10.2 or 3.2.4 3.3.4.
+1 💚 spotless 0m 42s patch has no errors when running spotless:check.
+1 💚 spotbugs 0m 57s the patch passed
_ Other Tests _
+1 💚 asflicense 0m 11s The patch does not generate ASF License warnings.
33m 30s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/2/artifact/yetus-general-check/output/Dockerfile
GITHUB PR #5154
Optional Tests dupname asflicense javac spotbugs hadoopcheck hbaseanti spotless checkstyle compile
uname Linux 779f4ad5467c 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/hbase-personality.sh
git revision branch-2 / a67a8f7
Default Java Eclipse Adoptium-11.0.17+8
Max. process+thread count 80 (vs. ulimit of 30000)
modules C: hbase-client U: hbase-client
Console output https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-5154/2/console
versions git=2.34.1 maven=3.8.6 spotbugs=4.7.3
Powered by Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

@bbeaudreault bbeaudreault merged commit 0848936 into apache:branch-2 Apr 10, 2023
@bbeaudreault bbeaudreault deleted the HBASE-27768 branch April 10, 2023 19:23
bbeaudreault added a commit that referenced this pull request Apr 10, 2023
bbeaudreault added a commit that referenced this pull request Apr 10, 2023
vinayakphegde pushed a commit to vinayakphegde/hbase that referenced this pull request Apr 4, 2024
Signed-off-by: Duo Zhang <[email protected]>
Signed-off-by: Xiaolin Ha <[email protected]>
(cherry picked from commit 6b6902d)
Change-Id: Ib28dc6afb4b0e3a84e6e2dbecf6d9af49f5fc865