Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recreate or wrap exceptions thrown by async transport implementations to preserve caller stack traces #656

Conversation

andrewparmet
Copy link
Contributor

Description

Ensures that callers of synchronous methods will see their method name in the stracktrace of an exception thrown by an asynchronous transport.

Issues Resolved

Fixes #653.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Member

@dblock dblock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cursory review for now ...

  1. What is the the user experience change with this?
  2. Anything breaking in a semver sense?
  3. Let's also update some user guides on exceptions/error handling as part of this PR?

assertTrue(e.getMessage().contains("transient setting [no_idea_what_you_are_talking_about], not recognized"));
} catch (Exception e) {
OpenSearchException openSearchException;
if (e instanceof OpenSearchException) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't expect conditionals in tests, is this not predictable?

Copy link
Contributor Author

@andrewparmet andrewparmet Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's predictable but now varies per implementation.

This test is run against the RestClient transport and the Apache HTTP 5 transports. The RestClient transport throws OpenSearchException directly and since its transport is synchronous, this exception's stack trace contains a reference to the caller. The HC5 transport is always asynchronous and while the underlying logic will throw an OpenSearchException, that logic is executing on a pooled thread. That OpenSearchException then has to be wrapped in order to preserve a reference to the caller in the thrown exception's stack trace. I cannot construct an OpenSearchException wrapping an OpenSearchException (which would be suspicious anyways), so this now has to catch any OpenSearchException and IOException. I can modify the catch blocks to be multicatches.

The correct way to model this is to parameterize each implementation of this abstract class by the type of exception it should be expected to throw and use the assertThrows pattern here instead of try-catch.

This modeling issue combined with #656 (comment) are two overarching pieces of test tech debt I think should be handled separately from this PR, but would enable more comprehensive transport-agnostic integration tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation. Open GitHub issues for these leftovers so we don't forget / someone else can help.

import org.apache.logging.log4j.core.util.Throwables;
import org.junit.Test;

public abstract class AbstractAsyncStracktraceIT extends OpenSearchJavaClientTestCase {
Copy link
Contributor Author

@andrewparmet andrewparmet Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to be able to mimic the existing pattern for RestClient and Apache HTTP5 integration tests and write one abstract test class that both Apache HTTP5 and AWS SDK V2 clients can use. I don't have it configured correctly here. When the CI suite runs these tests the AWS test is executing when it shouldn't be.

Altogether it looks like there's no great universal parent class that achieves exactly what I'm looking for but OpenSearchJavaClientTestCase seems to be the closest thing to it.

dblock
dblock previously approved these changes Oct 9, 2023
assertTrue(e.getMessage().contains("transient setting [no_idea_what_you_are_talking_about], not recognized"));
} catch (Exception e) {
OpenSearchException openSearchException;
if (e instanceof OpenSearchException) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation. Open GitHub issues for these leftovers so we don't forget / someone else can help.

@reta
Copy link
Collaborator

reta commented Oct 13, 2023

@dblock I don't think we should be doing this change for 2 reasons:

  • there will be runtime exception in the code (wrapped in the CompletionException or not) and it is normal to let them pop
  • this could be considered a breaking change, since some exceptions that were propagated before as they should become now wrapped into IOExceptions

@andrewparmet
Copy link
Contributor Author

it is normal to let them pop

While it's generally acceptable to let RuntimeExceptions propagate, it's expected that the stacktrace will include your caller. Would it be acceptable to re-wrap with a different runtime exception here? Perhaps OpenSearchRuntimeException?

this could be considered a breaking change

Is this not permitted in 3.0.0? Since it's unreleased, such a breaking change would be permitted by semver.

@reta
Copy link
Collaborator

reta commented Oct 13, 2023

While it's generally acceptable to let RuntimeExceptions propagate, it's expected that the stacktrace will include your caller. Would it be acceptable to re-wrap with a different runtime exception here? Perhaps OpenSearchRuntimeException?

Let's take an example of the CompletionException here: as we know, this is wrapper. Ideally, what we would like to have here is the exact exception that the code being called has raised (and currently, implementation tries to do that). With respect to the caller stacktrace: async code is notoriously difficult to troubleshoot but we have only 1 single place that does that (performRequestAsync). How it is helpful to have it included? It could be convenient but could you please provide an example that it is necessary?

Is this not permitted in 3.0.0? Since it's unreleased, such a breaking change would be permitted by semver.

Sure, for 3.x we are fine, we won't be backporting it which could be a maintenance issue (we will have to support 2 different behaviours across 2.x and 3.x branches).

@andrewparmet
Copy link
Contributor Author

andrewparmet commented Oct 13, 2023

async code is notoriously difficult to troubleshoot but we have only 1 single place that does that (performRequestAsync). How it is helpful to have it included?

That one single place is on the code path of every call OpenSearchClient makes when using async transport.

A logger.error call records the stack trace of the caught exception, but we have no way of correlating it with a call to the OpenSearch client. We can't answer the question: "Which of our calls to OpenSearch during this request failed?" A caller stack trace is the idiomatic way to answer that question.

In some cases the OpenSearchClient uses async transport for synchronous calls: for example, when using the Apache HC5 transport. This also happens when making synchronous calls using the AWS async client which we reuse across synchronous and asynchronous calls to enable connection reuse. Notably we have no synchronous transport we can use in tests (see #631) and this makes it much harder to debug new code.

I think the root of the problem is really that OpenSearchClient can be built with async transport when it only offers synchronous APIs. In contrast AWS clients offered by the v2 Java SDK are completely split between sync and async variants and you cannot construct one with a transport of the other kind.

Clients built with the other kind of transport compromise their behavior: an async client built with sync transport must wrap synchronous calls with Futures or in Executors (RestClientTransport does the first). A sync client built with async transport cannot offer the same API for thrown exceptions since it must unwrap exceptions which will have obfuscated stack traces or throw their wrapping completion/execution exceptions (HC5/AwsSdkAsync do the first).

The solution is to either accept a compromise or to disable this kind of "square peg round hole" construction. I think that obfuscating caller traces is a bad compromise so I filed an issue to achieve consensus before attempting to change the behavior. I didn't dump too many hours into this, so if the consensus isn't there, I do have the functioning workaround I mentioned. I don't think it's a good pattern to require from clients of a library, especially if you're free to change behavior in a new major version.

for 3.x we are fine

Good to hear. We're upgrading from the RestHighLevelClient (1.3.2) and we won't need a backport.

@reta
Copy link
Collaborator

reta commented Oct 13, 2023

A logger.error call records the stack trace of the caught exception, but we have no way of correlating it with a call to the OpenSearch client. We can't answer the question: "Which of our calls to OpenSearch during this request failed?" A caller stack trace is the idiomatic way to answer that question.

Thanks a lot for sharing your thoughts. One of the features that we're working on is opensearch-project/OpenSearch#6750, that would also include client instrumentation for sure. This is long shot for sure but to highlight the future options.

The solution is to either accept a compromise or to disable this kind of "square peg round hole" construction. I think that obfuscating caller traces is a bad compromise so I filed an issue to achieve consensus before attempting to change the behavior.

I hear you here, I think we should have put more thoughts into API design (we largely followed the RestClientTransport to preserve the same behaviour), unless others think the stack traces is the way to go, let spend some to understand if API could offer better choice (or implementation could better branch sync / async flows).

@dblock
Copy link
Member

dblock commented Oct 13, 2023

Thanks @andrewparmet @reta. I think my questions remain, we should work backwards from the developer experience that we want. Keeping an open mind about the solution!

@reta
Copy link
Collaborator

reta commented Oct 27, 2023

Thanks @andrewparmet @reta. I think my questions remain, we should work backwards from the developer experience that we want. Keeping an open mind about the solution!

@andrewparmet my apologies for the delay, so I have looked into 2 options here:

  1. Suggested by you, the approach to rethrow exception: it will help to preserve the stacktrace but the risk of breaking things here is very high: -
    • we cannot preserve the exact exception type
    • we diverge the behaviour of the RestClientTransport and ApacheHttpClient5Transport (not only confusing but will make migration more difficult)
  2. Adding dedicated sync client. It will work but sadly requires the API changes (since configuration callbacks are all bound to AsyncHttpClient, not ideal either.

I believe we should keep the current implementation unchanged and recommend people to use the OpenSearchAsyncClient directly to dial with exception management the way it fits the need.

@andrewparmet
Copy link
Contributor Author

andrewparmet commented Oct 27, 2023

we diverge the behaviour of the RestClientTransport and ApacheHttpClient5Transport

They are already divergent in the sense that one preserves the trace of its caller and the other does not even given equivalent server-side errors. Option 2 is the only way to be fully consistent. If HC5 offers synchronous transport then perhaps it shares a configuration type with the async transport.

@reta
Copy link
Collaborator

reta commented Oct 27, 2023

They are already divergent in the sense that one preserves the trace of its caller and the other does not even given equivalent server-side errors.

Oh that I missed out, my apologies for that, that is an argument towards making it right, looking into divergence now, thank you.

If HC5 offers synchronous transport then perhaps it shares a configuration type with the async transport.

It offers but does not share anything, those are separate implementations

@andrewparmet
Copy link
Contributor Author

those are separate implementations

Boo, but that's probably the right way to do it from their side.

My ultimate position here then is to match the AWS SDK - async clients should take async transports and sync clients should take sync transports. Then, if you want to do what I'm doing and share a connection pool, do the wrapping and unwrapping of async calls yourself. Everybody gets exactly the API they want - all the way down to the way stacktraces look.

@reta
Copy link
Collaborator

reta commented Oct 27, 2023

Boo, but that's probably the right way to do it from their side.

Totally :)

My ultimate position here then is to match the AWS SDK - async clients should take async transports and sync clients should take sync transports. T

I agree, the part which is difficult to translate seamlessly is HttpClientConfigCallback that requires HttpAsyncClientBuilder explicitly.

@andrewparmet mind if I push a change to your pull request? TLDR; replicating the RestClient approach, essentially exactly the way you are trying to fix it here but with a bit more complicated logic. Thank you.

@andrewparmet
Copy link
Contributor Author

HttpClientConfigCallback that requires HttpAsyncClientBuilder explicitly

Yeah this API would probably need to be renamed to distinguish between sync and async APIs, e.g. ApacheHttpClient5AsyncTransportBuilder.

Feel free to hijack this PR - the magic button should be enabled.

Signed-off-by: Andrew Parmet <[email protected]>
Signed-off-by: Andrew Parmet <[email protected]>
Signed-off-by: Andrew Parmet <[email protected]>
Signed-off-by: Andrew Parmet <[email protected]>
Signed-off-by: Andrew Parmet <[email protected]>
Signed-off-by: Andrew Parmet <[email protected]>
@andrewparmet
Copy link
Contributor Author

andrewparmet commented May 30, 2024

@dblock @reta Sorry for the delay here. This PR is now up to date with the latest idea on exception extraction implemented in the AWS transport.

It includes two tests - one for the AWS transport and one for the HC5 transport - that lightly verify that stack trace propagation works. Ideally there would be one abstract test that can cover both transports, but I didn't see an easy way to do that given the current setup of the test suite. Let me know if there's another approach you'd prefer. It does look like this may be easier to do once Java 8 support is dropped and AWS integration tests are moved into the Java 11 test source set so code can be shared.

Due to the new logic introduced in 03b214f, it does not completely cover all potential cases. In particular I think this strategy is slightly brittle as new exceptions may (will) be introduced that are not correctly captured. For example, this happened with OpenSearchClientException between the last time we touched this PR and now.

Signed-off-by: Andrew Parmet <[email protected]>
Signed-off-by: Andrew Parmet <[email protected]>
@andrewparmet andrewparmet force-pushed the wrap-runtime-exceptions-with-io-exceptions branch from 2ccb4b9 to b4f54d6 Compare May 30, 2024 03:53
@andrewparmet andrewparmet changed the title Wrap exceptions thrown by async transport implementations in IOException Recreate or wrap exceptions thrown by async transport implementations to preserve caller stack traces May 30, 2024
Signed-off-by: Andrew Parmet <[email protected]>
java-client/build.gradle.kts Outdated Show resolved Hide resolved
Signed-off-by: Andrew Parmet <[email protected]>
@andrewparmet andrewparmet force-pushed the wrap-runtime-exceptions-with-io-exceptions branch from ef9cd5a to f7814cd Compare May 30, 2024 13:13
Signed-off-by: Andrew Parmet <[email protected]>
Signed-off-by: Andrew Parmet <[email protected]>
@reta
Copy link
Collaborator

reta commented May 30, 2024

@dblock @reta Sorry for the delay here. This PR is now up to date with the latest idea on exception extraction implemented in the AWS transport.

Thanks a lot @andrewparmet , LGTM, does it solve your problem in an implementation shape like that? Thank you

@andrewparmet
Copy link
Contributor Author

It does - we'll be able to trace our calls at the very least to the rethrown variations at the bottom of the extraction functions.

@reta
Copy link
Collaborator

reta commented May 30, 2024

@dblock do you mind to check it out? thank you!

@dblock
Copy link
Member

dblock commented May 30, 2024

@dblock do you mind to check it out? thank you!

Looks great, merging.

@andrewparmet Thanks for hanging in here with us. I think the project could use some documentation along the "handling errors" lines that could explain the various scenarios that you've fixed and how one should deal with them. Maybe you want to write something simple to start?

@dblock dblock merged commit db50b7d into opensearch-project:main May 30, 2024
54 checks passed
@reta reta added the backport 2.x Backport to 2.x branch label May 30, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request May 30, 2024
… to preserve caller stack traces (#656)

* wrap exceptions thrown by async transport implementations in IOException

Signed-off-by: Andrew Parmet <[email protected]>

* license headers

Signed-off-by: Andrew Parmet <[email protected]>

* changelog

Signed-off-by: Andrew Parmet <[email protected]>

* tolerate null causes

Signed-off-by: Andrew Parmet <[email protected]>

* fix tests

Signed-off-by: Andrew Parmet <[email protected]>

* remove print

Signed-off-by: Andrew Parmet <[email protected]>

* one more license header

Signed-off-by: Andrew Parmet <[email protected]>

* fix last non-aws test

Signed-off-by: Andrew Parmet <[email protected]>

* use multicatch to restrict caught exceptions

Signed-off-by: Andrew Parmet <[email protected]>

* Replicate the RestClient exception wrapping for async invocation flow

Signed-off-by: Andriy Redko <[email protected]>

* replicate hc5 exception extraction strategy in aws transport

Signed-off-by: Andrew Parmet <[email protected]>

* move other tests

Signed-off-by: Andrew Parmet <[email protected]>

* lint

Signed-off-by: Andrew Parmet <[email protected]>

* delete aws test for now; add support for OpenSearchClientException

Signed-off-by: Andrew Parmet <[email protected]>

* reintroduce an aws test, sadly, not extending the abstract case

Signed-off-by: Andrew Parmet <[email protected]>

* java 8

Signed-off-by: Andrew Parmet <[email protected]>

* replicate ISE

Signed-off-by: Andrew Parmet <[email protected]>

* poke

Signed-off-by: Andrew Parmet <[email protected]>

* handle some netty-specific channel exceptions

Signed-off-by: Andrew Parmet <[email protected]>

* poke

Signed-off-by: Andrew Parmet <[email protected]>

* nevermind netty

Signed-off-by: Andrew Parmet <[email protected]>

* io before rt

Signed-off-by: Andrew Parmet <[email protected]>

* no hc5

Signed-off-by: Andrew Parmet <[email protected]>

---------

Signed-off-by: Andrew Parmet <[email protected]>
Signed-off-by: Andriy Redko <[email protected]>
Co-authored-by: Andriy Redko <[email protected]>
(cherry picked from commit db50b7d)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
reta added a commit that referenced this pull request May 30, 2024
… to preserve caller stack traces (#656) (#1006)

* wrap exceptions thrown by async transport implementations in IOException



* license headers



* changelog



* tolerate null causes



* fix tests



* remove print



* one more license header



* fix last non-aws test



* use multicatch to restrict caught exceptions



* Replicate the RestClient exception wrapping for async invocation flow



* replicate hc5 exception extraction strategy in aws transport



* move other tests



* lint



* delete aws test for now; add support for OpenSearchClientException



* reintroduce an aws test, sadly, not extending the abstract case



* java 8



* replicate ISE



* poke



* handle some netty-specific channel exceptions



* poke



* nevermind netty



* io before rt



* no hc5



---------




(cherry picked from commit db50b7d)

Signed-off-by: Andrew Parmet <[email protected]>
Signed-off-by: Andriy Redko <[email protected]>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Andriy Redko <[email protected]>
@BrendonFaleiro BrendonFaleiro mentioned this pull request Jun 21, 2024
@andrewparmet andrewparmet deleted the wrap-runtime-exceptions-with-io-exceptions branch July 25, 2024 20:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Synchronous calls with async transport should preserve caller stack trace
3 participants