Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ClientRequestContext#cancel cancels the associated request immediately #5800

Merged
merged 11 commits into from
Aug 9, 2024

Conversation

jrhee17
Copy link
Contributor

@jrhee17 jrhee17 commented Jul 5, 2024

Motivation:

In order to handle #4591, I propose that we first define an API which allows users to cancel a request. Currently, ClientRequestContext#cancel is invoked once CancellationScheduler#start is invoked. I propose to change the behavior of ClientRequestContext#cancel such that the associated request is aborted immediately. Once this API is available, it would be trivial to implement ResponseTimeoutMode by adjusting where to call CancellationScheduler#start. Additionally, users may easily implement their own timeout mechanism if they would like.

Modifications:

  • CancellationScheduler related changes
    • DefaultCancellationScheduler is refactored to be lock based instead of event loop based. The reasoning for this change was for the scenario where the request execution didn't reach the event loop yet. i.e. If a user calls ctx.cancel in the decorator, a connection shouldn't be opened.
      • e.g.
        final Throwable t = new Throwable();
        final CountingConnectionPoolListener connListener = new CountingConnectionPoolListener();
        final AtomicReference<ClientRequestContext> ctxRef = new AtomicReference<>();
        try (ClientFactory cf = ClientFactory
        .builder()
        .connectionPoolListener(connListener)
        .workerGroup(eventLoopGroup.get(), false)
        .build()) {
        final HttpResponse res = server.webClient(cb -> {
        cb.decorator((delegate, ctx, req) -> {
        ctx.cancel(t);
        ctxRef.set(ctx);
        return delegate.execute(ctx, req);
        });
        cb.decorator(TestInfoHeaderDecorator.newDecorator(testInfo));
        cb.decorator(AttachCallbacksDecorator.newDecorator());
        cb.factory(cf);
        }).get("/");
        assertThatThrownBy(() -> res.aggregate().join())
        .isInstanceOf(CompletionException.class)
        .hasCauseInstanceOf(UnprocessedRequestException.class)
        .hasRootCause(t);
        assertThat(connListener.opened()).isEqualTo(0);
        assertThat(requests).doesNotContain(testInfo.getDisplayName());
        // don't validate the thread since we haven't started with event loop scheduling yet
        validateCallbackChecks(null);
        }
    • CancellationScheduler#updateTask is introduced. This API updates the cancellation task if the scheduler isn't invoked yet. If the scheduler is invoked already, the cancellation task will be executed eventually. This API allows ctx.cancel to attempt cancellation depending on which stage the request is at. For instance, at the decorators only req.abort needs to be called but at the request write stage, the cancellation task may need to send a reset signal.
    • Misc. an infinite timeout is internally represented as Long.MAX_VALUE instead of 0
  • AbstractHttpRequestHandler related changes
    • CancellationTask in AbstractHttpRequestHandler, HttpResponseWrapper, AbstractHttpResponseHandler is modified to be scheduled inside an event loop. The reasoning is that ctx.cancel, and hence CancellationTask#run can be invoked from any thread.
    • There is a higher chance of AbstractHttpRequestHandler calling fail or failAndReset multiple times. There is no point in doing so, so added a boolean flag failed for safety.
  • HttpResponseWrapper related changes
    • The original intention of cancelTimeoutAndLog was to not log if the response is unexpected. Modified so that if the response is cancelled or the context is cancelled, no logging is done
    • There is probably no reason to not call close when a timeout occurs. Unified the fragmented logic of closing the HttpResponseWrapper.
  • Found that DefaultRequestLog is completing futures with the context set on the event loop. We generally don't do this since we strictly validate against setting incorrect RequestContexts, so modified to use withoutContext.

Result:

@jrhee17 jrhee17 added this to the 1.30.0 milestone Jul 5, 2024
@jrhee17 jrhee17 marked this pull request as ready for review July 5, 2024 09:20
@@ -428,7 +428,8 @@ private void updateFlags(int flags) {
private static void completeSatisfiedFutures(RequestLogFuture[] satisfiedFutures, RequestLog log,
RequestContext ctx) {
if (!ctx.eventLoop().inEventLoop()) {
ctx.eventLoop().execute(() -> completeSatisfiedFutures(satisfiedFutures, log, ctx));
ctx.eventLoop().withoutContext().execute(
() -> completeSatisfiedFutures(satisfiedFutures, log, ctx));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q) Why was the context removed? Shouldn't the request context be propagated to the callbacks of satisfiedFutures?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed like we were making an effort to clear the request context when completing logs.

e.g.

try (SafeCloseable ignored = RequestContextUtil.pop()) {
req.abort(cause);
final RequestLogBuilder logBuilder = ctx.logBuilder();
logBuilder.endRequest(cause);
logBuilder.endResponse(cause);
}

I saw ContextCancellationTest failing occasionally due to this, so I modified to use withoutContext.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was an inevitable choice Netty can handle pending messages in batch mode that can cause a request context leak.

If an action does not perform sequentially in the callbacks of Netty operations, I think a context-aware executor is generally preferred.

I am not strong here. If you have any other workarounds, it is okay to keep it.

Copy link
Contributor Author

@jrhee17 jrhee17 Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Netty can handle pending messages in batch mode that can cause a request context leak

I see, I think I misunderstood the intention then. My original guess was that if request logs for multiple requests are waited on a single condition, users would easily see a request context leak.

I'm not sure I understood what you mean by batch mode though. I thought channels are tied to a single thread, and within a single thread as long as the try...close syntax is used context leaks weren't possible.

Copy link
Contributor

@ikhoon ikhoon Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try...close syntax is used context leaks weren't possible.

Context leaks could happen even with try...close.
if request logs for multiple requests are waited on a single condition seems to mean the same thing, but I'll explain it just in case.

Say we have a simple service that adds a callback to response.whenComplete().

try (ctx.push()) {
   val response = HttpResponse.streaming();
   response.write(...);
   response.whenComplete().thenAccept(unused -> {
      // Do something else.
   })
}
  • The code pushes ctx temporary using try with resource block.
  • response.whenComplete() may be completed when WriteFutureListener.operationComplete() is invoked if it is a unary type.
    future.addListener(new WriteFutureListener(response.content().isEmpty(), cause));
  • Netty Channel stores write requests to ChannelOutboundBuffer.
  • Channel is busy. response.write() for request_1 to request_9 are added to ChannelOutboundBuffer and operationComplete() isn't invoked for them.
  • When response.write() for request_10 is written, Channel can flush the outbound data so request_1 to request_10 are flushed in the same call (task).
  • Without try (SafeCloseable ignored = RequestContextUtil.pop()), the callbacks of request_1-9.whenComplete() will see the context of request_10.

However, it would be impossible to leak the context with ctx.eventLoop().execute(() -> completeSatisfiedFutures()) since it is executed asynchronously. So I thought it was safe to remove .withoutContext().

It seemed like we were making an effort to clear the request context when completing logs.

As you said, contexts are not propagated in some cases. I agree with keeping the current style in terms of consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation 👍

When response.write() for request_10 is written, Channel can flush the outbound data so request_1 to request_10 are flushed in the same call (task).

I understood that even if request_1 to request_10 is flushed together in a single task, they will be executed sequentially.
e.g.

ClientRequestContext ctx1 = null;
ClientRequestContext ctx2 = null;
// represents a single task
CommonPools.workerGroup().execute(() -> {
    // request 1
    try (SafeCloseable sf = ctx1.push()) {
        ctx1.logBuilder().endResponse();   
    }
    // request 2
    try (SafeCloseable sf = ctx2.push()) {
        ctx2.logBuilder().endResponse();
    }
});

I agree with keeping the current style in terms of consistency.

👍 for me as well. Ultimately, my thought was that it depends more on how users are using the API - for this reason it is probably more important that the context is pushed (or not) consistently.

e.g. I can imagine the following scenario

HttpResponseWriter res1 = null;
HttpResponseWriter res2 = null;
CompletableFuture.allOf(ctx1.log().whenComplete(), ctx2.log().whenComplete())
                 .handle((v, t) -> {
                     res1.close();
                     res2.close();
                     // when the res is completed, the same thread may call ctx.push somewhere
                     return null;
                 });

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for chiming in late.
One thing we need to keep in mind is that if this method is initially called by the ctx.eventLoop(), it might have the context in the thread-local.
Could you let me know which test fails in ContextCancellationTest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't expect this change to be so controversial 😅 Let me just revert this change since this isn't really critical to this PR

timeoutNanos = LongMath.saturatedSubtract(timeoutNanos, elapsedTimeNanos);
}
if (timeoutNanos != Long.MAX_VALUE) {
scheduledFuture = eventLoop().schedule(() -> invokeTask(null), timeoutNanos, NANOSECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice clean up!

if (ctxExt == null) {
return;
}
ctxExt.responseCancellationScheduler().updateTask(cause -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested removing updateTask() method, but that would result in too many changes to the existing code. As long as the implemetation is correct, I'm okay to maintain the current style and focus on implementing the response timeout mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, updateTask() doesn't necessarily have to be removed - it can still be used internally since the old CancellationScheduler#start acted similarly to updateTask().

It is probably more important to unify and simplify the request/response abortion mechanism. This seems difficult to do at the moment for client-side. In particular, when one attempts this one should take note of 1) consistency between [req|res].cause and log.[req|res]Cause, 2) setting the correct exception type (e.g. UnprocessedRequestException).

SCHEDULED,
FINISHING,
FINISHED
PENDING,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about SCHEDULED? PENDING sounds a like a status between init() and start().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, note that technically calling start() doesn't guarantee a task has been scheduled if the timeout is Long.MAX_VALUE

Copy link
Contributor

@ikhoon ikhoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks nice. Left minor comments and a corner case.

startTimeNanos = ticker.read();
if (timeoutMode == TimeoutMode.SET_FROM_NOW) {
final long elapsedTimeNanos = startTimeNanos - setFromNowStartNanos;
timeoutNanos = LongMath.saturatedSubtract(timeoutNanos, elapsedTimeNanos);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeoutNanos could be a negative number with TimeoutMode.SET_FROM_NOW if a decorator takes a long time to perform.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing it. timeoutNanos is also exposed to the userland via ctx.responseTimeoutMillis(). 0 would be better than a negative number.

public long responseTimeoutMillis() {
return TimeUnit.NANOSECONDS.toMillis(responseCancellationScheduler.timeoutNanos());
}

@@ -135,7 +135,9 @@ void shouldRunHooksWhenContextIsPushed() {
hookEvents.clear();
response = client.get("http://foo.com:" + server.httpPort() + "/virtualhost");
assertThat(response.status()).isEqualTo(HttpStatus.OK);
assertThat(hookEvents).containsExactly(
// we don't do containsExactly here because there is no easy way to guarantee that
// all context hooks from the previous request have been completed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q) Did the previous request have been completed mean the request sent to /server? Has it been a flaky test or introduced by this PR? Would you mind elaborating on the case so that I can understand it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me just revert this change and handle separately if needed since this is a change independent from this PR

@ikhoon
Copy link
Contributor

ikhoon commented Aug 2, 2024

#5156 has been merged. Would you mind resolving the conflicts?

Copy link
Contributor

@ikhoon ikhoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, @jrhee17! 👍💯

Comment on lines 28 to 40
static CancellationScheduler ofClient(long timeoutNanos) {
if (timeoutNanos == 0) {
timeoutNanos = Long.MAX_VALUE;
}
return new DefaultCancellationScheduler(timeoutNanos, false);
}

static CancellationScheduler ofServer(long timeoutNanos) {
if (timeoutNanos == 0) {
timeoutNanos = Long.MAX_VALUE;
}
return new DefaultCancellationScheduler(timeoutNanos, true);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do when a user specified Long.MAX_VALUE? Should we use Long.MAX_VALUE - 1 if we handle Long.MAX_VALUE specially? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think about this case too seriously since this is equivalent to a timeout of 292 years.
Having said this, pushed a commit handling this case as well.

jshell> System.out.println(TimeUnit.NANOSECONDS.toDays(Long.MAX_VALUE) / 365);
292

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but correctness matters 🤣

Copy link
Member

@trustin trustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice clean-up!

Copy link
Member

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review. Left a couple of suggestions.

@@ -428,7 +428,8 @@ private void updateFlags(int flags) {
private static void completeSatisfiedFutures(RequestLogFuture[] satisfiedFutures, RequestLog log,
RequestContext ctx) {
if (!ctx.eventLoop().inEventLoop()) {
ctx.eventLoop().execute(() -> completeSatisfiedFutures(satisfiedFutures, log, ctx));
ctx.eventLoop().withoutContext().execute(
() -> completeSatisfiedFutures(satisfiedFutures, log, ctx));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for chiming in late.
One thing we need to keep in mind is that if this method is initially called by the ctx.eventLoop(), it might have the context in the thread-local.
Could you let me know which test fails in ContextCancellationTest?

private final Ticker ticker;
private final ReentrantShortLock lock = new ReentrantShortLock();

private final boolean server;
@Nullable
private volatile CancellationFuture whenCancelling;
@Nullable
private volatile CancellationFuture whenCancelled;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because whenCancelled(), which creates this future, is called almost immediatly after this instance is created, how about creating this future directory at construction?
private final CancellationFuture whenCancelled = new CancellationFuture();

if (pendingTaskUpdater.compareAndSet(this, oldPendingTask, newPendingTask)) {
break;
}
whenCancelled().thenAccept(cause -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This callback is called for all updating tasks.
How about adding this callback only once and use this.task instead of task in the callback?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't really understand your comment here.

This callback is called for all updating tasks.

Technically, this code path will be called when updateTask is invoked after the CancellationScheduler is in the FINISHED state. The initial timed out task won't be invoked here

How about adding this callback only once and use this.task instead of task in the callback?

Are you suggesting that we modify this.task? My concern is this will create a race condition with invokeTask(). Do you mind adding a suggestion block so I can understand your intention?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, my bad. I didn't notice the return statement in the above if condition so I thought thenAccept will be attached for all tasks. 😓

Copy link
Member

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 👍 👍 👍

@ikhoon
Copy link
Contributor

ikhoon commented Aug 9, 2024

@jrhee17 Would you check the test failures?

ContextCancellationTest > cancel_beforeDelegate(TestInfo) FAILED
    java.lang.AssertionError: 
    Expecting empty but was: [[creqId=147071ed][http://127.0.0.1:34223/#GET]]
        at com.linecorp.armeria.client.ContextCancellationTest.validateCallbackChecks(ContextCancellationTest.java:289)
        at com.linecorp.armeria.client.ContextCancellationTest.cancel_beforeDelegate(ContextCancellationTest.java:115)

ContextCancellationTest > cancel_beforeConnection(TestInfo) FAILED
    java.lang.AssertionError: 
    Expecting empty but was: [[creqId=456bf5ae][http://foo.com:34223/#POST]]
        at com.linecorp.armeria.client.ContextCancellationTest.validateCallbackChecks(ContextCancellationTest.java:289)
        at com.linecorp.armeria.client.ContextCancellationTest.cancel_beforeConnection(ContextCancellationTest.java:153)

@jrhee17
Copy link
Contributor Author

jrhee17 commented Aug 9, 2024

Let me go ahead and merge this PR since it's reliable passing the CI again, thanks for the review all 👍

@jrhee17 jrhee17 merged commit 9a29b39 into line:main Aug 9, 2024
15 checks passed
ikhoon pushed a commit that referenced this pull request Aug 12, 2024
Motivation:

The original motivation of this PR stems from
#4591.
There has been requests for a timeout which spans over an entire
client's lifecycle. Following #5800, this can be achieved easily by
adjusting where to call `CancellationScheduler#start`.

I propose that we add an enum `ResponseTimeoutMode` which decides when a
`CancellationScheduler` will start to be scheduled. By doing so, we can
allow users to choose when a response timeout will start to be scheduled
per client.

Modifications:

- Introduced a new `ResponseTimeoutMode` enum and added options in
`AbstractClientOptionsBuilder` and `Flags` respectively
- Depending on the set `ResponseTimeoutMode`, the timeout is started on
1) context init 2) request start 3) or response read

Result:

- Closes #4591

<!--
Visit this URL to learn more about how to write a pull request
description:

https://armeria.dev/community/developer-guide#how-to-write-pull-request-description
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants