
HBASE-26955 Improvement of the pause time between retries in Rpc caller #4349

Open · wants to merge 1 commit into base: master

Conversation

functioner

Propose a patch for HBASE-26955
This patch probably needs further improvement, because it ignores the backoff factor when the exception is not a dead-server exception and always uses the short delay for retries, roughly as in the sketch below. I guess we may need a separate retry loop to handle this issue and enforce the retry backoff mechanism there. I am looking for suggestions and comments. Thanks!
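For context, a minimal sketch of that idea (the class name, constants, and backoff table are illustrative assumptions, not the actual RpcRetryingCaller code): apply the regular backoff only when the failure points at a dead server, and use a short constant pause otherwise.

```java
// Illustrative sketch only, not the actual patch.
public final class PauseSketch {

  private static final long NORMAL_PAUSE_MS = 100L; // assumed base pause
  private static final long SHORT_PAUSE_MS = 10L;   // assumed short pause

  // Simplified stand-in for HBase's retry backoff multiplier table.
  private static final int[] BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100};

  static long pauseFor(boolean serverIsDead, int tries) {
    if (serverIsDead) {
      int idx = Math.min(tries, BACKOFF.length - 1);
      return NORMAL_PAUSE_MS * BACKOFF[idx]; // exponential-style backoff
    }
    return SHORT_PAUSE_MS;                   // retry quickly otherwise
  }

  public static void main(String[] args) {
    for (int t = 0; t < 5; t++) {
      System.out.println("dead=" + pauseFor(true, t) + "ms, other=" + pauseFor(false, t) + "ms");
    }
  }
}
```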

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
| ---- | --------- | ------- | ------- |
| +0 🆗 | reexec | 0m 25s | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 2s | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck |
| | _ Prechecks _ | | |
| | _ master Compile Tests _ | | |
| +1 💚 | mvninstall | 2m 25s | master passed |
| +1 💚 | compile | 0m 18s | master passed |
| +1 💚 | shadedjars | 3m 38s | branch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 16s | master passed |
| | _ Patch Compile Tests _ | | |
| +1 💚 | mvninstall | 2m 14s | the patch passed |
| +1 💚 | compile | 0m 17s | the patch passed |
| +1 💚 | javac | 0m 17s | the patch passed |
| +1 💚 | shadedjars | 3m 40s | patch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 16s | the patch passed |
| | _ Other Tests _ | | |
| +1 💚 | unit | 1m 9s | hbase-client in the patch passed. |
| | | 15m 31s | |

| Subsystem | Report/Notes |
| --------- | ------------ |
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile |
| GITHUB PR | #4349 |
| Optional Tests | javac javadoc unit shadedjars compile |
| uname | Linux f9795482b488 5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f990f56 |
| Default Java | AdoptOpenJDK-1.8.0_282-b08 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/testReport/ |
| Max. process+thread count | 161 (vs. ulimit of 30000) |
| modules | C: hbase-client U: hbase-client |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/console |
| versions | git=2.17.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
| ---- | --------- | ------- | ------- |
| +0 🆗 | reexec | 3m 53s | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 3s | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck |
| | _ Prechecks _ | | |
| | _ master Compile Tests _ | | |
| +1 💚 | mvninstall | 2m 58s | master passed |
| +1 💚 | compile | 0m 16s | master passed |
| +1 💚 | shadedjars | 4m 6s | branch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 18s | master passed |
| | _ Patch Compile Tests _ | | |
| +1 💚 | mvninstall | 2m 39s | the patch passed |
| +1 💚 | compile | 0m 17s | the patch passed |
| +1 💚 | javac | 0m 17s | the patch passed |
| +1 💚 | shadedjars | 3m 59s | patch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 14s | the patch passed |
| | _ Other Tests _ | | |
| +1 💚 | unit | 1m 19s | hbase-client in the patch passed. |
| | | 20m 51s | |

| Subsystem | Report/Notes |
| --------- | ------------ |
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile |
| GITHUB PR | #4349 |
| Optional Tests | javac javadoc unit shadedjars compile |
| uname | Linux 77378cdca4c4 5.4.0-1025-aws #25~18.04.1-Ubuntu SMP Fri Sep 11 12:03:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f990f56 |
| Default Java | AdoptOpenJDK-11.0.10+9 |
| Test Results | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/testReport/ |
| Max. process+thread count | 189 (vs. ulimit of 30000) |
| modules | C: hbase-client U: hbase-client |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/console |
| versions | git=2.17.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
| ---- | --------- | ------- | ------- |
| +0 🆗 | reexec | 0m 40s | Docker mode activated. |
| | _ Prechecks _ | | |
| +1 💚 | dupname | 0m 0s | No case conflicting files found. |
| +1 💚 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 💚 | @author | 0m 0s | The patch does not contain any @author tags. |
| | _ master Compile Tests _ | | |
| +1 💚 | mvninstall | 2m 26s | master passed |
| +1 💚 | compile | 0m 41s | master passed |
| +1 💚 | checkstyle | 0m 16s | master passed |
| +1 💚 | spotbugs | 0m 39s | master passed |
| | _ Patch Compile Tests _ | | |
| +1 💚 | mvninstall | 2m 4s | the patch passed |
| +1 💚 | compile | 0m 36s | the patch passed |
| +1 💚 | javac | 0m 36s | the patch passed |
| +1 💚 | checkstyle | 0m 14s | the patch passed |
| +1 💚 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 💚 | hadoopcheck | 12m 3s | Patch does not cause any errors with Hadoop 3.1.2 3.2.2 3.3.1. |
| +1 💚 | spotbugs | 0m 41s | the patch passed |
| | _ Other Tests _ | | |
| +1 💚 | asflicense | 0m 9s | The patch does not generate ASF License warnings. |
| | | 25m 25s | |

| Subsystem | Report/Notes |
| --------- | ------------ |
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | #4349 |
| Optional Tests | dupname asflicense javac spotbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 0ed52d39af65 5.4.0-1071-aws #76~18.04.1-Ubuntu SMP Mon Mar 28 17:49:57 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f990f56 |
| Default Java | AdoptOpenJDK-1.8.0_282-b08 |
| Max. process+thread count | 60 (vs. ulimit of 30000) |
| modules | C: hbase-client U: hbase-client |
| Console output | https://ci-hbase.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-4349/1/console |
| versions | git=2.17.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.12.0 https://yetus.apache.org |

This message was automatically generated.

@Apache9
Contributor

Apache9 commented Apr 15, 2022

In general I do not think we should use the same logic as in RemoteProcedureResultReporter. On the region server side we know the procedure result is very important, so we should try to send the result back to the master soon.

But on the client side, changing the backoff logic has a big side effect: it will greatly increase the load on the cluster, especially since what you change here applies to almost all RPC requests, not only admin RPC requests.

So in general I suggest we keep the code unchanged. For your scenario, if you really want to retry quickly, you can set the retry number to 1; the HBase client will then fail fast and you can retry immediately yourself (a rough sketch follows).
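For reference, a minimal sketch of that workaround, assuming the standard HBase client API (the table name, pause length, and attempt cap are arbitrary choices for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public final class FastFailRetryExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Fail fast inside the HBase client instead of backing off for a long time.
    conf.setInt("hbase.client.retries.number", 1);
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      for (int attempt = 0; attempt < 20; attempt++) {
        try {
          admin.disableTable(TableName.valueOf("test")); // example admin RPC
          break;
        } catch (Exception e) {
          Thread.sleep(50); // application-chosen short pause before the next attempt
        }
      }
    }
  }
}
```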

WDYT?

Thanks.

@functioner
Author

@Apache9
I think you have a fair point.
I think the key question here is how quickly we can retry without taking a high risk of greatly increasing the load on the server. To answer this question, we need to re-examine the goals of using exponential backoff and whether our implementation complies with those goals.
So far I haven't found a very detailed discussion of the rationale for adding exponential backoff (maybe HBASE-511?) to the current HBase client. But from the code comments, Jira issues, and Wikipedia, I guess the goals of HBase's exponential backoff may include:

  1. When the server is not available, the client should avoid frequently resending requests without getting substantive responses. The exponential backoff is a trade-off between waiting for server startup and resending the request.
  2. When the clients are sending too many requests to the server, a failed request is a signal for the clients to slow down, so that the clients' request rates reach an acceptable equilibrium.
  3. When the server tries to handle the request and encounters an error, it may want the client to wait a little longer so that it has enough time to tolerate the error. The exponential backoff is a trade-off between waiting for fault tolerance and resending the request.

In general, I feel that if we can determine the reason for the request error (e.g., server unavailable), we can do fine-grained exponential backoff separately, based on the error. For example, I find that in the current master branches of Hadoop and Kafka, the client tries to connect to the server with an exponential-backoff pause time, and after establishing the connection, the client uses another exponential-backoff retry loop to handle potential exceptions. They do not share the same exponential-backoff retry loop because they are handling different problems. Blindly handling all of those errors in one single exponential-backoff retry loop may result in unnecessarily long pause times without solving any problem; see the sketch below.
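Here is a minimal, self-contained sketch of that "separate loops, separate counters" idea (illustrative code only, not the actual Hadoop, Kafka, or HBase implementation):

```java
import java.util.concurrent.ThreadLocalRandom;

public final class SeparateBackoffLoops {

  // Exponential backoff with a small random jitter.
  static long pause(int tries, long baseMs) {
    long ms = baseMs * (1L << Math.min(tries, 10));
    return ms + ThreadLocalRandom.current().nextLong(baseMs);
  }

  interface Step { void run() throws Exception; }

  // Each call owns its own retry counter, so the backoff of one phase
  // never inflates the pause time of the next phase.
  static void retryLoop(String phase, Step step, int maxTries, long baseMs) throws Exception {
    for (int tries = 0; ; tries++) {
      try {
        step.run();
        return;
      } catch (Exception e) {
        if (tries + 1 >= maxTries) throw e;
        System.out.println(phase + " attempt " + tries + " failed, backing off");
        Thread.sleep(pause(tries, baseMs));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    retryLoop("connect", () -> { /* e.g. wait for master initialization */ }, 10, 100);
    // A fresh loop, and therefore a fresh counter, for the actual request:
    retryLoop("request", () -> { /* e.g. send the admin RPC */ }, 10, 100);
  }
}
```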

Specifically, in our case, the HMaster checks its initialization before handling clients' requests. My client gets these exceptions and waits for this initialization using the exponential backoff:

```java
void checkInitialized() throws PleaseHoldException, ServerNotRunningYetException,
    MasterNotRunningException, MasterStoppedException {
  checkServiceStarted();
  if (!isInitialized()) {
    throw new PleaseHoldException("Master is initializing");
  }
  if (isStopped()) {
    throw new MasterStoppedException();
  }
}
```

And after the successful initialization, the server can handle my clients' requests but somehow gets an error in:

```java
@Override
public void update(Procedure<?> proc) {
  runWithoutRpcCall(() -> {
    try {
      ProcedureProtos.Procedure proto = ProcedureUtil.convertToProtoProcedure(proc);
      region.update(r -> r.put(new Put(Bytes.toBytes(proc.getProcId())).addColumn(PROC_FAMILY,
        PROC_QUALIFIER, proto.toByteArray())));
    } catch (IOException e) {
      LOG.error(HBaseMarkers.FATAL, "Failed to update proc {}", proc, e);
      throw new UncheckedIOException(e);
    }
  });
}
```

The UncheckedIOException is passed to my client, and then, unfortunately, the accumulated retry count (due to multiple previous PleaseHoldExceptions) results in a very long pause time.

As you suggested, I can "set the retry number to 1". But I think this should be done automatically by the client. The PleaseHoldException and the UncheckedIOException from the update operation represent two different reasons for exponential backoff. PleaseHoldException asks the client to wait for server initialization in one exponential-backoff retry loop. After that, to handle potential errors while processing client requests, we should start another exponential-backoff retry loop, i.e., effectively "set the retry number to 1".

WDYT?
If you are convinced, do you have suggestions about how to enforce this kind of logic in the code? I feel we may need to add a flag indicating whether we are waiting for server initialization or handling errors in request processing, and then reset the retry count when the flag changes. A rough sketch of that idea follows.
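For illustration only, one possible shape of that flag idea (the class, phase, and exception-name checks are placeholders, not the actual RpcRetryingCaller code):

```java
// Track which "phase" the last failure belonged to, and reset the retry counter
// when the phase changes, so PleaseHoldException retries do not inflate the
// backoff applied to later, unrelated errors.
public final class PhaseAwareRetryCounter {

  enum Phase { WAITING_FOR_INITIALIZATION, PROCESSING_REQUEST }

  private Phase lastPhase = null;
  private int tries = 0;

  // Classify the failure; matching on the exception class name is a stand-in here.
  private static Phase phaseOf(Throwable t) {
    String name = t.getClass().getSimpleName();
    return name.equals("PleaseHoldException") || name.equals("ServerNotRunningYetException")
        ? Phase.WAITING_FOR_INITIALIZATION
        : Phase.PROCESSING_REQUEST;
  }

  // Returns the retry count to use for computing the backoff after this failure.
  int onFailure(Throwable t) {
    Phase phase = phaseOf(t);
    if (phase != lastPhase) {
      tries = 0;          // new failure category: start backoff from scratch
      lastPhase = phase;
    }
    return tries++;
  }
}
```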

Thanks.
