
HBASE-26955 Improvement of the pause time between retries in Rpc caller #4349

Open · wants to merge 1 commit into base: master

Conversation

functioner

Propose a patch for HBASE-26955
This patch probably needs further improvement, because it ignores the backoff factor when the exception is not a dead-server exception and always uses the short delay for retries, roughly as in the sketch below. I guess we may need a separate retry loop to handle this issue and enforce the retry backoff mechanism there. I am looking for suggestions and comments. Thanks!
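For context, a minimal sketch of that idea (the class name, constants, and backoff table are illustrative assumptions, not the actual RpcRetryingCaller code): apply the regular backoff only when the failure points at a dead server, and use a short constant pause otherwise.

```java
// Illustrative sketch only, not the actual patch.
public final class PauseSketch {

  private static final long NORMAL_PAUSE_MS = 100L; // assumed base pause
  private static final long SHORT_PAUSE_MS = 10L;   // assumed short pause

  // Simplified stand-in for HBase's retry backoff multiplier table.
  private static final int[] BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100};

  static long pauseFor(boolean serverIsDead, int tries) {
    if (serverIsDead) {
      int idx = Math.min(tries, BACKOFF.length - 1);
      return NORMAL_PAUSE_MS * BACKOFF[idx]; // exponential-style backoff
    }
    return SHORT_PAUSE_MS;                   // retry quickly otherwise
  }

  public static void main(String[] args) {
    for (int t = 0; t < 5; t++) {
      System.out.println("dead=" + pauseFor(true, t) + "ms, other=" + pauseFor(false, t) + "ms");
    }
  }
}
```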

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
| ---- | --------- | ------- | ------- |
| +0 🆗 | reexec | 0m 25s | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 2s | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck |
| | _ Prechecks _ | | |
| | _ master Compile Tests _ | | |
| +1 💚 | mvninstall | 2m 25s | master passed |
| +1 💚 | compile | 0m 18s | master passed |
| +1 💚 | shadedjars | 3m 38s | branch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 16s | master passed |
| | _ Patch Compile Tests _ | | |
| +1 💚 | mvninstall | 2m 14s | the patch passed |
| +1 💚 | compile | 0m 17s | the patch passed |
| +1 💚 | javac | 0m 17s | the patch passed |
| +1 💚 | shadedjars | 3m 40s | patch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 16s | the patch passed |
| | _ Other Tests _ | | |
| +1 💚 | unit | 1m 9s | hbase-client in the patch passed. |
| | | 15m 31s | |

| Subsystem | Report/Notes |
| --------- | ------------ |
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile |
| GITHUB PR | #4349 |
| Optional Tests | javac javadoc unit shadedjars compile |
| uname | Linux f9795482b488 5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f990f56 |
| Default Java | AdoptOpenJDK-1.8.0_282-b08 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/testReport/ |
| Max. process+thread count | 161 (vs. ulimit of 30000) |
| modules | C: hbase-client U: hbase-client |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/console |
| versions | git=2.17.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
| ---- | --------- | ------- | ------- |
| +0 🆗 | reexec | 3m 53s | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 3s | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck |
| | _ Prechecks _ | | |
| | _ master Compile Tests _ | | |
| +1 💚 | mvninstall | 2m 58s | master passed |
| +1 💚 | compile | 0m 16s | master passed |
| +1 💚 | shadedjars | 4m 6s | branch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 18s | master passed |
| | _ Patch Compile Tests _ | | |
| +1 💚 | mvninstall | 2m 39s | the patch passed |
| +1 💚 | compile | 0m 17s | the patch passed |
| +1 💚 | javac | 0m 17s | the patch passed |
| +1 💚 | shadedjars | 3m 59s | patch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 14s | the patch passed |
| | _ Other Tests _ | | |
| +1 💚 | unit | 1m 19s | hbase-client in the patch passed. |
| | | 20m 51s | |

| Subsystem | Report/Notes |
| --------- | ------------ |
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile |
| GITHUB PR | #4349 |
| Optional Tests | javac javadoc unit shadedjars compile |
| uname | Linux 77378cdca4c4 5.4.0-1025-aws #25~18.04.1-Ubuntu SMP Fri Sep 11 12:03:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f990f56 |
| Default Java | AdoptOpenJDK-11.0.10+9 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/testReport/ |
| Max. process+thread count | 189 (vs. ulimit of 30000) |
| modules | C: hbase-client U: hbase-client |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/console |
| versions | git=2.17.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
| ---- | --------- | ------- | ------- |
| +0 🆗 | reexec | 0m 40s | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 💚 | dupname | 0m 0s | No case conflicting files found. |
| +1 💚 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 💚 | @author | 0m 0s | The patch does not contain any @author tags. |
| | _ master Compile Tests _ | | |
| +1 💚 | mvninstall | 2m 26s | master passed |
| +1 💚 | compile | 0m 41s | master passed |
| +1 💚 | checkstyle | 0m 16s | master passed |
| +1 💚 | spotbugs | 0m 39s | master passed |
| | _ Patch Compile Tests _ | | |
| +1 💚 | mvninstall | 2m 4s | the patch passed |
| +1 💚 | compile | 0m 36s | the patch passed |
| +1 💚 | javac | 0m 36s | the patch passed |
| +1 💚 | checkstyle | 0m 14s | the patch passed |
| +1 💚 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 💚 | hadoopcheck | 12m 3s | Patch does not cause any errors with Hadoop 3.1.2 3.2.2 3.3.1. |
| +1 💚 | spotbugs | 0m 41s | the patch passed |
| | _ Other Tests _ | | |
| +1 💚 | asflicense | 0m 9s | The patch does not generate ASF License warnings. |
| | | 25m 25s | |

| Subsystem | Report/Notes |
| --------- | ------------ |
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | #4349 |
| Optional Tests | dupname asflicense javac spotbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 0ed52d39af65 5.4.0-1071-aws #76~18.04.1-Ubuntu SMP Mon Mar 28 17:49:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f990f56 |
| Default Java | AdoptOpenJDK-1.8.0_282-b08 |
| Max. process+thread count | 60 (vs. ulimit of 30000) |
| modules | C: hbase-client U: hbase-client |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/console |
| versions | git=2.17.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@Apache9
Contributor

Apache9 commented Apr 15, 2022

In general I do not think we should use the same logic as in RemoteProcedureResultReporter. On the region server side we know the procedure result is very important, so we should try to send the result back to the master soon.

But on the client side, changing the backoff logic has a big side effect: it will greatly increase the load on the cluster, especially since what you change here applies to almost all RPC requests, not only admin RPC requests.

So in general I suggest we keep the code unchanged. For your scenario, if you really want to retry quickly, you can set the retry number to 1; the HBase client will then fail fast and you can retry immediately yourself (a rough sketch follows).
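For reference, a minimal sketch of that workaround, assuming the standard HBase client API (the table name, pause length, and attempt cap are arbitrary choices for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public final class FastFailRetryExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Fail fast inside the HBase client instead of backing off for a long time.
    conf.setInt("hbase.client.retries.number", 1);
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      for (int attempt = 0; attempt < 20; attempt++) {
        try {
          admin.disableTable(TableName.valueOf("test")); // example admin RPC
          break;
        } catch (Exception e) {
          Thread.sleep(50); // application-chosen short pause before the next attempt
        }
      }
    }
  }
}
```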

WDYT?

Thanks.

@functioner
Author

@Apache9
I think you have a fair point.
I think the key question here is how quickly we can retry without taking a high risk of greatly increasing the load on the server. To answer this question, we need to re-examine the goals of using exponential backoff and whether our implementation complies with those goals.
So far I haven't found a very detailed discussion of the rationale for adding exponential backoff (maybe HBASE-511?) to the current HBase client. But from the code comments, Jira issues, and Wikipedia, I guess the goals of HBase's exponential backoff may include:

  1. When the server is not available, the client should avoid frequently resending requests without getting substantive responses. The exponential backoff is a trade-off between waiting for server startup and resending the request.
  2. When the clients are sending too many requests to the server, a failed request is a signal for the clients to slow down, so that the clients' request rates reach an acceptable equilibrium.
  3. When the server tries to handle the request and encounters an error, it may want the client to wait a little longer so that it has enough time to tolerate the error. The exponential backoff is a trade-off between waiting for fault tolerance and resending the request.

In general, I feel that if we can determine the reason for the request error (e.g., server unavailable), we can do fine-grained exponential backoff separately, based on the error. For example, I find that in the current master branches of Hadoop and Kafka, the client tries to connect to the server with an exponential-backoff pause time, and after establishing the connection, the client uses another exponential-backoff retry loop to handle potential exceptions. They do not share the same exponential-backoff retry loop because they are handling different problems. Blindly handling all of those errors in one single exponential-backoff retry loop may result in unnecessarily long pause times without solving any problem; see the sketch below.
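Here is a minimal, self-contained sketch of that "separate loops, separate counters" idea (illustrative code only, not the actual Hadoop, Kafka, or HBase implementation):

```java
import java.util.concurrent.ThreadLocalRandom;

public final class SeparateBackoffLoops {

  // Exponential backoff with a small random jitter.
  static long pause(int tries, long baseMs) {
    long ms = baseMs * (1L << Math.min(tries, 10));
    return ms + ThreadLocalRandom.current().nextLong(baseMs);
  }

  interface Step { void run() throws Exception; }

  // Each call owns its own retry counter, so the backoff of one phase
  // never inflates the pause time of the next phase.
  static void retryLoop(String phase, Step step, int maxTries, long baseMs) throws Exception {
    for (int tries = 0; ; tries++) {
      try {
        step.run();
        return;
      } catch (Exception e) {
        if (tries + 1 >= maxTries) throw e;
        System.out.println(phase + " attempt " + tries + " failed, backing off");
        Thread.sleep(pause(tries, baseMs));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    retryLoop("connect", () -> { /* e.g. wait for master initialization */ }, 10, 100);
    // A fresh loop, and therefore a fresh counter, for the actual request:
    retryLoop("request", () -> { /* e.g. send the admin RPC */ }, 10, 100);
  }
}
```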

Specifically, in our case, the HMaster checks its initialization before handling clients' requests. My client gets these exceptions and waits for this initialization using the exponential backoff:

```java
void checkInitialized() throws PleaseHoldException, ServerNotRunningYetException,
    MasterNotRunningException, MasterStoppedException {
  checkServiceStarted();
  if (!isInitialized()) {
    throw new PleaseHoldException("Master is initializing");
  }
  if (isStopped()) {
    throw new MasterStoppedException();
  }
}
```

And after the successful initialization, the server can handle my clients' requests but somehow gets an error in:

```java
@Override
public void update(Procedure<?> proc) {
  runWithoutRpcCall(() -> {
    try {
      ProcedureProtos.Procedure proto = ProcedureUtil.convertToProtoProcedure(proc);
      region.update(r -> r.put(new Put(Bytes.toBytes(proc.getProcId())).addColumn(PROC_FAMILY,
        PROC_QUALIFIER, proto.toByteArray())));
    } catch (IOException e) {
      LOG.error(HBaseMarkers.FATAL, "Failed to update proc {}", proc, e);
      throw new UncheckedIOException(e);
    }
  });
}
```

The UncheckedIOException is passed to my client, and then, unfortunately, the accumulated retry count (due to multiple previous PleaseHoldExceptions) results in a very long pause time.

As you suggested, I can "set the retry number to 1". But I think this should be done automatically by the client. The PleaseHoldException and the UncheckedIOException from the update operation represent two different reasons for exponential backoff. PleaseHoldException asks the client to wait for server initialization in one exponential-backoff retry loop. After that, to handle potential errors while processing client requests, we should start another exponential-backoff retry loop, i.e., effectively "set the retry number to 1".

WDYT?
If you are convinced, do you have suggestions about how to enforce this kind of logic in the code? I feel we may need to add a flag indicating whether we are waiting for server initialization or handling errors in request processing, and then reset the retry count when the flag changes. A rough sketch of that idea follows.
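For illustration only, one possible shape of that flag idea (the class, phase, and exception-name checks are placeholders, not the actual RpcRetryingCaller code):

```java
// Track which "phase" the last failure belonged to, and reset the retry counter
// when the phase changes, so PleaseHoldException retries do not inflate the
// backoff applied to later, unrelated errors.
public final class PhaseAwareRetryCounter {

  enum Phase { WAITING_FOR_INITIALIZATION, PROCESSING_REQUEST }

  private Phase lastPhase = null;
  private int tries = 0;

  // Classify the failure; matching on the exception class name is a stand-in here.
  private static Phase phaseOf(Throwable t) {
    String name = t.getClass().getSimpleName();
    return name.equals("PleaseHoldException") || name.equals("ServerNotRunningYetException")
        ? Phase.WAITING_FOR_INITIALIZATION
        : Phase.PROCESSING_REQUEST;
  }

  // Returns the retry count to use for computing the backoff after this failure.
  int onFailure(Throwable t) {
    Phase phase = phaseOf(t);
    if (phase != lastPhase) {
      tries = 0;          // new failure category: start backoff from scratch
      lastPhase = phase;
    }
    return tries++;
  }
}
```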

Thanks.
