Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate S3 Select to use AWS Java SDK v2 #18270

Conversation

ankushChatterjee
Copy link
Contributor

Description

The PR aims to migrate AWS Java SDK usage in S3 Select to V2 from V1.

AWS SDK v2 is a rewrite of the V1 SDK, thus there is a lot of refactoring of the code.

As a part of this PR:

  • Added a new metric collector (MetricsPublisher in SDKV2 terms) for S3 Select, since the V1 metrics collector is not compatible with the V2 classes. If metrics collection for S3Select is something we do not want, it can be removed as well to reduce complexity.
  • Bumped up AWS SDK V2 version in root pom.xml to 2.20.99 from 2.20.89. This was required to incorporate the cross region client access feature.
  • Bumped up netty version in root pom.xml to 4.1.94.Final from 4.1.92.Final to avoid RequireUpperBounds errors.

Additional context and related issues

Some more context around the PR can be found on the conversation at #17522

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@ankushChatterjee ankushChatterjee marked this pull request as ready for review July 14, 2023 18:44
@ankushChatterjee
Copy link
Contributor Author

Hi, can anyone please review this?

@electrum
Copy link
Member

Can you rebase? Then I'll kick off the tests with secrets

@@ -89,6 +90,7 @@ public void configure(Binder binder)
.setDefault().toInstance(ImmutableList::of);

binder.bind(TrinoS3ClientFactory.class).in(Scopes.SINGLETON);
binder.bind(S3SelectStats.class).toInstance(new S3SelectStats());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this? The metrics aren't exposed anywhere. Do you intend to export these with JMX?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did intend that, I will go over this once again and make the metric system proper. Although, do we really need to collect these metrics? If not, we can remove this and avoid the complexity entirely.

@@ -82,4 +103,118 @@ public boolean isRequestComplete()
{
return requestComplete;
}

/**
* Below classes are required for compatibility between AWS Java SDK 1.x and 2.x
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does compatibility mean here? Since we're not supporting the 1.x SDK anymore, could we simplify the code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SDKv2 does not return an InputStream for the data like SDK v1. I think the Hadoop LineReader above this layer requires an InputStream. This glue code takes the event stream and converts it into an InputStream. Maybe in the new implementation, we can remove the need for an InputStream and directly use the event stream?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be out of our hands, but it really feels like this is something the SDK should be able to provide for us. Getting object content back as an input stream seems like a common enough use case to have it be baked in. It's not like InputStreams are a Hadoop specific thing.

{
return ClientOverrideConfiguration.builder()
.putAdvancedOption(SdkAdvancedClientOption.USER_AGENT_PREFIX, userAgentPrefix)
.putAdvancedOption(SdkAdvancedClientOption.USER_AGENT_SUFFIX, enabled ? "Trino-select" : "Trino")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code isn't needed and never was, since this client factory is only for S3 Select. If S3 Select is not enabled, then this factory won't be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, will remove this code.

@electrum
Copy link
Member

electrum commented Nov 6, 2023

Closing since we removed support for S3 Select.

@electrum electrum closed this Nov 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

3 participants