Skip to content
This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

Fix LRO callables so that the ApiCallContext is always passed through. #600

Merged

Conversation

igorbernstein2
Copy link
Contributor

This is the next step in integrating OpenCensus (#583). My plans for OpenCensus involve passing a wrapped span through ApiCallContext. Unfortunately as it stands, LRO callables don't respect ApiCallContext and drop it. This PR aims to fix the situation by always passing the ApiCallContext and removing null checks along the way.

Please note that this is a behavioral change: LRO polling attempts will now use the RPC in the RetrySettings that OperationTimedPollAlgorithm was instantiated with.

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Oct 9, 2018
@codecov-io
Copy link

codecov-io commented Oct 9, 2018

Codecov Report

Merging #600 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #600      +/-   ##
============================================
- Coverage     75.02%   75.02%   -0.01%     
+ Complexity      934      931       -3     
============================================
  Files           176      176              
  Lines          4076     3607     -469     
  Branches        323      297      -26     
============================================
- Hits           3058     2706     -352     
+ Misses          865      750     -115     
+ Partials        153      151       -2
Impacted Files Coverage Δ Complexity Δ
...ava/com/google/api/gax/rpc/RecheckingCallable.java 90% <ø> (ø) 2 <0> (ø) ⬇️
.../com/google/api/gax/rpc/OperationCallableImpl.java 100% <100%> (ø) 5 <3> (ø) ⬇️
...n/java/com/google/api/gax/rpc/AttemptCallable.java 80% <100%> (+3.8%) 4 <1> (ø) ⬇️
...om/google/api/gax/rpc/CheckingAttemptCallable.java 78.94% <100%> (+7.51%) 4 <4> (+1) ⬆️
.../google/api/gax/rpc/OperationCheckingCallable.java 78.57% <100%> (ø) 4 <0> (ø) ⬇️
...ava/com/google/api/gax/retrying/RetrySettings.java 15.38% <0%> (-15.87%) 2% <0%> (ø)
.../com/google/api/gax/batching/BatchingSettings.java 66.66% <0%> (-14.29%) 2% <0%> (ø)
...m/google/api/gax/batching/FlowControlSettings.java 66.66% <0%> (-11.91%) 2% <0%> (ø)
...om/google/longrunning/stub/GrpcOperationsStub.java 77.77% <0%> (-11.88%) 10% <0%> (-1%)
...google/api/gax/httpjson/GaxHttpJsonProperties.java 40% <0%> (-10%) 2% <0%> (ø)
... and 69 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b842b23...b212f9f. Read the comment docs.

@igorbernstein2
Copy link
Contributor Author

@vam-google Are you ok with the behavioral change? If not, I can add code that explicitly disables rpc timeouts for LRO polling, but I think that not having RPC timeouts for LRO polling is a bug

@igorbernstein2 igorbernstein2 changed the title WIP: Fix LRO callables so that the ApiCallContext is always passed through. Fix LRO callables so that the ApiCallContext is always passed through. Oct 10, 2018
Copy link
Contributor

@vam-google vam-google left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

externalFuture.setAttemptFuture(new NonCancellableFuture<ResponseT>());
if (externalFuture.isDone()) {
return null;
}
ApiFuture<ResponseT> internalFuture = callable.futureCall(null, null);
// NOTE: The callable here is an OperationCheckingCallable, which will compose its own

This comment was marked as spam.

*/
@Override
public ApiFuture<OperationSnapshot> futureCall(RequestT request, ApiCallContext context) {
public ApiFuture<OperationSnapshot> futureCall(RequestT ignored, ApiCallContext callContext) {

This comment was marked as spam.

This comment was marked as spam.

@@ -59,11 +59,11 @@
/**
* This method is supposed to be called from {@link AttemptCallable#call()}
*
* @param request request
* @param context call context
* @param ignored request. The request will be composed based on the result of the initialFuture.

This comment was marked as spam.

@vam-google
Copy link
Contributor

vam-google commented Oct 11, 2018

@igorbernstein2 please check the fix for the rpcTimeout issue which you experienced (also I removed the comment about the bug from the second test, because it looked like the second test did not suffer from the issue, and was not even trying to make 3 iterations, as the comment was stating).

In general, it seems we, during retries implementation, paid all the attention to the retries from the point of view of client (i.e. the actual operaiton retries and timeouts), but did not pay enough attention to the rpcTimeout value (all the bugs found are about the rpcTimeout stuff). For example, the existing tests for retries and retrying executor do not even validate the value of rpcTimeout in any way. Additional action item should be to add the tests... But it is out of scope of this PR.

@igorbernstein
Copy link

@vam-google I think your commit should be done separately from this PR. The change doesn't fully address the problem. The outstanding issue now is that the initial rpcTimeout value might be skipped for the first poll. For example, take Bigtable's CreateInstance rpc, which returns an Operation to be polled:

  1. initialCallable is invoked and return an ApiFuture<Operation>, that has not resolved yet
  2. that future is passed to OperationCheckingCallable
  3. which in turn is wrapped in a CheckingAttemptCallable
  4. RecheckingCallable will then create a CallbackChainRetryingFuture which will create the first TimedAttemptSettings. Using the initialRpcTimeout.
  5. RecheckingCallable then will call CheckingAttemptCallable.futureCall(), which will call OperationCheckingCallable.call
  6. Since the initialFuture is incomplete,OperationCheckingCallable will not start a new rpc and will ignore the current rpcTimeout and simply return the initialFuture
  7. CallbackChainRetryingFuture will consider this the first attempt and once it completes, it will move on to the next poll attempt using the 2nd rpcTimeout.

The first rpcTimeout will be skipped. However, if the initialFuture completes before step 5, then the initial rpcTimeout will be used. Fixing this properly is a fairly deep rabbit hole that I don't want to block my work on opencensus. The actual impact of this issue is almost nonexistent because almost all usecases will use a constant rpcTimeout.
Can we just compromise on this: I'll revert your commit and remove my comment and open a github issue that can be prioritized separately?

@vam-google
Copy link
Contributor

@igorbernstein2 The described behavior is by design.

A clarification, this paragraph can be skipped {
It is not very essential to this conversation, but allows to explain why on earth there so many stupid callables here. CheckingAttemptCallable does not have futureCall() method, it has only the call() method. In general, one of the really confusing things in this whole process (it confuses me every time I see it myself) is that there are two "types" of Callables participating here: the gax-defined "callables" (like UnaryCallable, RecheckingCallable), which declare method futureCall(), and the "real" callables, which extend java.util.concurrent.Callable (like CheckingAttemptCallable).

The futureCall() is the fundamental part of GAPIC clients surface and defines the asynchronous nature of all rpc calls in the gapic libraries (i.e. if there is futureCall() called, it means that there should be an rpc call at the end of the call chain "journey").

The normal callables are implementation detail, which define how that asynchronous behavior is shoved into JDK's concurrency model. Not each call to call() is related to an actual rpc call (they are pretty much not related).
}

The behavior you described affects not only rpcTimeout but also the retryDelay (poll delay in this context), and it was by design as well. The meaning of OperationCheckingCallable is to encapsulate the "start & poll" behavior. To put it simple: the initial call (the one which starts the LRO) is a "first class member" of the polling process, and each "not ready" status from initial call is handled same way as "polled, poll returned "i'm not ready"). Or from another point of view: each poll, which returns "not ready" is considered to be equal to initialFuture not being completed yet, because the user makes only the initial request, and from his point of view the whole LRO thing is equal to one really long rpc.

In general, you are right, rpcTimeout for polling settings are supposed to use multiplier 1 (and I think it is actually the case for our clients). Also, in most cases the retryDelayMultiplier is also 1, which makes the whole described behavior irrelevant for practical cases.

About the bug
At the same time, the ExponentialRetryAlgorithm really has a bug, where it treats initialRpcTimeout same way as initialRetryDelay, which is simply wrong, because initialRetryDelay becomes relevant only on the first "retry" (i.e. the second "try"), while initialRpcTimeout is relevant for from the first try (and in most cases retry is not even a thing). I.e. initialRpcTimeout is used always, initialRpcTimeout is used only when there was a failure. They are fundamentally different things, which were treated in same way, which is wrong. This is a retrying logic bug and it exists for simple unary callables (no any realation to LRO).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants