HBASE-26955 Improvement of the pause time between retries in Rpc caller #4349
base: master
🎊 +1 overall
This message was automatically generated.
In general I do not think we should use the same logic as in RemoteProcedureResultReporter. On the region server side we know the procedure result is very important, so we should try to send the result back to the master as soon as possible. On the client side, however, changing the backoff logic has a big side effect: it will greatly increase the load on the cluster, especially since what you change here affects almost all RPC requests, not only the admin RPC requests. So in general I suggest we keep the code unchanged. For your scenario, if you really want to retry quickly, you can set the retry number to 1; the HBase client will then fail fast and you can retry immediately yourself. WDYT? Thanks.
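For illustration, the "fail fast and retry yourself" suggestion could look roughly like the sketch below. It is plain Java with no HBase dependency: `CallerManagedRetry`, `retryImmediately`, and the simulated operation are hypothetical names, and in a real client the operation would be an HBase call made on a connection whose internal retry count (the `hbase.client.retries.number` setting) has been lowered. Which value actually disables internal retries is something to verify against your client version.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class CallerManagedRetry {
    /**
     * Runs op up to maxAttempts times. The caller, not the RPC layer, decides
     * how long to wait between attempts; pauseMillis is a fixed pause.
     */
    public static <T> T retryImmediately(Callable<T> op, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                // Sleep only if another attempt will follow.
                if (attempt < maxAttempts && pauseMillis > 0) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for a client call that fails twice and then succeeds.
        String result = retryImmediately(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated failure " + calls[0]);
            }
            return "ok";
        }, 5, 0L);
        System.out.println(result + " after " + calls[0] + " calls"); // prints "ok after 3 calls"
    }
}
```

The trade-off is the one raised above: the retry policy moves into application code, so the load it puts on the cluster is the application's responsibility.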
@Apache9
In general, I feel that if we can identify the reason for the request error (e.g., server unavailable), we can apply fine-grained exponential backoff separately for each kind of error. For example, I find that in the current master branches of Hadoop and Kafka, the client connects to the server with an exponential backoff pause, and once the connection is established, it uses another exponential backoff retry loop to handle subsequent exceptions. The two phases do not share one retry loop because they handle different problems. Blindly handling all of those errors in a single exponential backoff retry loop may result in unnecessarily long pauses without solving any problem.

Specifically, in our case, the HMaster checks that it has finished initializing before handling clients' requests. My client gets these exceptions and waits for the initialization using exponential backoff: hbase/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java Lines 3091 to 3100 in 2622fa0
After initialization succeeds, the server can handle my client's requests but somehow hits an error in: Lines 360 to 372 in 2622fa0
The UncheckedIOException is passed to my client, and unfortunately the accumulated retry count (from the multiple earlier PleaseHoldException failures) results in a very long pause.
As you suggested, I can "set retry number to 1", but I think the client should do this automatically. WDYT? Thanks.
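To make the accumulation effect concrete, here is a minimal sketch comparing one shared retry counter with a counter kept per error kind. It is not the HBase implementation: `PerErrorBackoff`, `pauseMillis`, and `nextPause` are hypothetical names, the multiplier table is modeled on what I believe `HConstants.RETRY_BACKOFF` contains (treat the exact values as an assumption), the 100 ms base pause and the six earlier failures are made up, and jitter is left out.

```java
import java.util.HashMap;
import java.util.Map;

public class PerErrorBackoff {
    // Assumed multipliers, modeled on HBase's HConstants.RETRY_BACKOFF; illustrative only.
    static final int[] BACKOFF = {1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200};

    /** Pause for a zero-based retry index, clamped to the table bounds. No jitter. */
    public static long pauseMillis(long basePause, int tries) {
        int idx = Math.min(Math.max(tries, 0), BACKOFF.length - 1);
        return basePause * BACKOFF[idx];
    }

    // One retry counter per error kind instead of a single shared counter.
    private final Map<String, Integer> triesByError = new HashMap<>();

    /** Records one failure of the given kind and returns the pause for that kind only. */
    public long nextPause(long basePause, String errorKind) {
        int tries = triesByError.merge(errorKind, 1, Integer::sum) - 1;
        return pauseMillis(basePause, tries);
    }

    public static void main(String[] args) {
        long base = 100;      // assumed base pause in milliseconds
        int holdFailures = 6; // hypothetical number of earlier PleaseHoldException failures

        // Shared counter: the next failure, whatever its kind, uses index 6.
        long shared = pauseMillis(base, holdFailures);

        // Per-kind counters: the first UncheckedIOException starts again at index 0.
        PerErrorBackoff perError = new PerErrorBackoff();
        for (int i = 0; i < holdFailures; i++) {
            perError.nextPause(base, "PleaseHoldException");
        }
        long separate = perError.nextPause(base, "UncheckedIOException");

        System.out.println("shared counter:    " + shared + " ms");   // 4000 ms
        System.out.println("per-error counter: " + separate + " ms"); // 100 ms
    }
}
```

Under these assumed numbers, the first failure of a new kind waits 4000 ms with a shared counter and 100 ms with per-kind counters, while each kind still backs off exponentially on its own repeats.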
Propose a patch for HBASE-26955
This patch probably needs further improvement, because it ignores the backoff factor when the exception is not a dead-server exception and always uses the short delay for retries. We probably need a separate retry loop to handle this case and enforce the retry backoff mechanism there. I am looking for suggestions and comments. Thanks!