Fix LRO callables so that the ApiCallContext is always passed through. #600

igorbernstein2 · 2018-10-09T22:32:59Z

This is the next step in integrating OpenCensus (#583). My plans for OpenCensus involve passing a wrapped span through ApiCallContext. Unfortunately as it stands, LRO callables don't respect ApiCallContext and drop it. This PR aims to fix the situation by always passing the ApiCallContext and removing null checks along the way.

Please note that this is a behavioral change: LRO polling attempts will now use the RPC in the RetrySettings that OperationTimedPollAlgorithm was instantiated with.

…rough.

codecov-io · 2018-10-09T22:36:43Z

Codecov Report

Merging #600 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@             Coverage Diff              @@
##             master     #600      +/-   ##
============================================
- Coverage     75.02%   75.02%   -0.01%     
+ Complexity      934      931       -3     
============================================
  Files           176      176              
  Lines          4076     3607     -469     
  Branches        323      297      -26     
============================================
- Hits           3058     2706     -352     
+ Misses          865      750     -115     
+ Partials        153      151       -2

Impacted Files	Coverage Δ	Complexity Δ
...ava/com/google/api/gax/rpc/RecheckingCallable.java	`90% <ø> (ø)`	`2 <0> (ø)`	⬇️
.../com/google/api/gax/rpc/OperationCallableImpl.java	`100% <100%> (ø)`	`5 <3> (ø)`	⬇️
...n/java/com/google/api/gax/rpc/AttemptCallable.java	`80% <100%> (+3.8%)`	`4 <1> (ø)`	⬇️
...om/google/api/gax/rpc/CheckingAttemptCallable.java	`78.94% <100%> (+7.51%)`	`4 <4> (+1)`	⬆️
.../google/api/gax/rpc/OperationCheckingCallable.java	`78.57% <100%> (ø)`	`4 <0> (ø)`	⬇️
...ava/com/google/api/gax/retrying/RetrySettings.java	`15.38% <0%> (-15.87%)`	`2% <0%> (ø)`
.../com/google/api/gax/batching/BatchingSettings.java	`66.66% <0%> (-14.29%)`	`2% <0%> (ø)`
...m/google/api/gax/batching/FlowControlSettings.java	`66.66% <0%> (-11.91%)`	`2% <0%> (ø)`
...om/google/longrunning/stub/GrpcOperationsStub.java	`77.77% <0%> (-11.88%)`	`10% <0%> (-1%)`
...google/api/gax/httpjson/GaxHttpJsonProperties.java	`40% <0%> (-10%)`	`2% <0%> (ø)`
... and 69 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b842b23...b212f9f. Read the comment docs.

igorbernstein2 · 2018-10-09T22:40:45Z

@vam-google Are you ok with the behavioral change? If not, I can add code that explicitly disables rpc timeouts for LRO polling, but I think that not having RPC timeouts for LRO polling is a bug

…ntext

vam-google

LGTM

gax/src/main/java/com/google/api/gax/rpc/CheckingAttemptCallable.java

      externalFuture.setAttemptFuture(new NonCancellableFuture<ResponseT>());
      if (externalFuture.isDone()) {
        return null;
      }
-      ApiFuture<ResponseT> internalFuture = callable.futureCall(null, null);
+      // NOTE: The callable here is an OperationCheckingCallable, which will compose its own


gax/src/main/java/com/google/api/gax/rpc/OperationCheckingCallable.java

   */
  @Override
-  public ApiFuture<OperationSnapshot> futureCall(RequestT request, ApiCallContext context) {
+  public ApiFuture<OperationSnapshot> futureCall(RequestT ignored, ApiCallContext callContext) {


gax/src/main/java/com/google/api/gax/rpc/OperationCheckingCallable.java

@@ -59,11 +59,11 @@
  /**
   * This method is supposed to be called from {@link AttemptCallable#call()}
   *
-   * @param request request
-   * @param context call context
+   * @param ignored request. The request will be composed based on the result of the initialFuture.


vam-google · 2018-10-11T00:27:42Z

@igorbernstein2 please check the fix for the rpcTimeout issue which you experienced (also I removed the comment about the bug from the second test, because it looked like the second test did not suffer from the issue, and was not even trying to make 3 iterations, as the comment was stating).

In general, it seems we, during retries implementation, paid all the attention to the retries from the point of view of client (i.e. the actual operaiton retries and timeouts), but did not pay enough attention to the rpcTimeout value (all the bugs found are about the rpcTimeout stuff). For example, the existing tests for retries and retrying executor do not even validate the value of rpcTimeout in any way. Additional action item should be to add the tests... But it is out of scope of this PR.

igorbernstein · 2018-10-11T12:56:53Z

@vam-google I think your commit should be done separately from this PR. The change doesn't fully address the problem. The outstanding issue now is that the initial rpcTimeout value might be skipped for the first poll. For example, take Bigtable's CreateInstance rpc, which returns an Operation to be polled:

initialCallable is invoked and return an ApiFuture<Operation>, that has not resolved yet
that future is passed to OperationCheckingCallable
which in turn is wrapped in a CheckingAttemptCallable
RecheckingCallable will then create a CallbackChainRetryingFuture which will create the first TimedAttemptSettings. Using the initialRpcTimeout.
RecheckingCallable then will call CheckingAttemptCallable.futureCall(), which will call OperationCheckingCallable.call
Since the initialFuture is incomplete,OperationCheckingCallable will not start a new rpc and will ignore the current rpcTimeout and simply return the initialFuture
CallbackChainRetryingFuture will consider this the first attempt and once it completes, it will move on to the next poll attempt using the 2nd rpcTimeout.

The first rpcTimeout will be skipped. However, if the initialFuture completes before step 5, then the initial rpcTimeout will be used. Fixing this properly is a fairly deep rabbit hole that I don't want to block my work on opencensus. The actual impact of this issue is almost nonexistent because almost all usecases will use a constant rpcTimeout.
Can we just compromise on this: I'll revert your commit and remove my comment and open a github issue that can be prioritized separately?

…thm" This reverts commit 5dc6661.

vam-google · 2018-10-11T17:31:00Z

@igorbernstein2 The described behavior is by design.

A clarification, this paragraph can be skipped {
It is not very essential to this conversation, but allows to explain why on earth there so many stupid callables here. CheckingAttemptCallable does not have futureCall() method, it has only the call() method. In general, one of the really confusing things in this whole process (it confuses me every time I see it myself) is that there are two "types" of Callables participating here: the gax-defined "callables" (like UnaryCallable, RecheckingCallable), which declare method futureCall(), and the "real" callables, which extend java.util.concurrent.Callable (like CheckingAttemptCallable).

The futureCall() is the fundamental part of GAPIC clients surface and defines the asynchronous nature of all rpc calls in the gapic libraries (i.e. if there is futureCall() called, it means that there should be an rpc call at the end of the call chain "journey").

The normal callables are implementation detail, which define how that asynchronous behavior is shoved into JDK's concurrency model. Not each call to call() is related to an actual rpc call (they are pretty much not related).
}

The behavior you described affects not only rpcTimeout but also the retryDelay (poll delay in this context), and it was by design as well. The meaning of OperationCheckingCallable is to encapsulate the "start & poll" behavior. To put it simple: the initial call (the one which starts the LRO) is a "first class member" of the polling process, and each "not ready" status from initial call is handled same way as "polled, poll returned "i'm not ready"). Or from another point of view: each poll, which returns "not ready" is considered to be equal to initialFuture not being completed yet, because the user makes only the initial request, and from his point of view the whole LRO thing is equal to one really long rpc.

In general, you are right, rpcTimeout for polling settings are supposed to use multiplier 1 (and I think it is actually the case for our clients). Also, in most cases the retryDelayMultiplier is also 1, which makes the whole described behavior irrelevant for practical cases.

About the bug
At the same time, the ExponentialRetryAlgorithm really has a bug, where it treats initialRpcTimeout same way as initialRetryDelay, which is simply wrong, because initialRetryDelay becomes relevant only on the first "retry" (i.e. the second "try"), while initialRpcTimeout is relevant for from the first try (and in most cases retry is not even a thing). I.e. initialRpcTimeout is used always, initialRpcTimeout is used only when there was a failure. They are fundamentally different things, which were treated in same way, which is wrong. This is a retrying logic bug and it exists for simple unary callables (no any realation to LRO).

googleapis#600)

WIP: Fix LRO callables so that the ApiCallContext is always passed th…

0f78b71

…rough.

googlebot added the cla: yes This human has signed the Contributor License Agreement. label Oct 9, 2018

igorbernstein2 requested a review from vam-google October 9, 2018 22:33

igorbernstein2 added 2 commits October 10, 2018 15:05

more tweaks

80b9919

Merge remote-tracking branch 'origin/master' into opencensus-2-lro-co…

e9bbb95

…ntext

igorbernstein2 changed the title ~~WIP: Fix LRO callables so that the ApiCallContext is always passed through.~~ Fix LRO callables so that the ApiCallContext is always passed through. Oct 10, 2018

Merge remote-tracking branch 'origin/master' into opencensus-2-lro-co…

70cb018

…ntext

vam-google approved these changes Oct 10, 2018

View reviewed changes

Fix misinterpretation of rpcTimeout in ExponentialRetryAlgorithm

5dc6661

igorbernstein2 added 3 commits October 11, 2018 11:57

Revert "Fix misinterpretation of rpcTimeout in ExponentialRetryAlgori…

0722736

…thm" This reverts commit 5dc6661.

remove comment about the timeout bug

ed196d1

improve comments

fbb441d

remove jcenter repo for benchmark gradle plugins

b212f9f

igorbernstein2 merged commit 8989241 into googleapis:master Oct 11, 2018

igorbernstein2 deleted the opencensus-2-lro-context branch October 11, 2018 17:47

garrettjonesgoogle mentioned this pull request Oct 19, 2018

Release gax-java v1.34.0 #611

Merged

Raibaz pushed a commit to Raibaz/gax-java that referenced this pull request Oct 25, 2018

Fix LRO callables so that the ApiCallContext is always passed through. (

f7f8688

googleapis#600)

Raibaz pushed a commit to Raibaz/gax-java that referenced this pull request Oct 25, 2018

Fix LRO callables so that the ApiCallContext is always passed through. (

e3b0750

googleapis#600)

igorbernstein2 mentioned this pull request Jan 10, 2019

Bigtable: fix handling of unset rpc timeouts googleapis/google-cloud-java#4323

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix LRO callables so that the ApiCallContext is always passed through. #600

Fix LRO callables so that the ApiCallContext is always passed through. #600

igorbernstein2 commented Oct 9, 2018

codecov-io commented Oct 9, 2018 •

edited

Loading

igorbernstein2 commented Oct 9, 2018

vam-google left a comment

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

vam-google commented Oct 11, 2018 •

edited

Loading

igorbernstein commented Oct 11, 2018

vam-google commented Oct 11, 2018

Fix LRO callables so that the ApiCallContext is always passed through. #600

Fix LRO callables so that the ApiCallContext is always passed through. #600

Conversation

igorbernstein2 commented Oct 9, 2018

codecov-io commented Oct 9, 2018 • edited Loading

Codecov Report

igorbernstein2 commented Oct 9, 2018

vam-google left a comment

Choose a reason for hiding this comment

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

This comment was marked as spam.

vam-google commented Oct 11, 2018 • edited Loading

igorbernstein commented Oct 11, 2018

vam-google commented Oct 11, 2018

codecov-io commented Oct 9, 2018 •

edited

Loading

vam-google commented Oct 11, 2018 •

edited

Loading