Disable S3Select pushdown when query not filters data #13477

Merged
merged 1 commit into from
Aug 15, 2022

Conversation

fuatbasik
Contributor

@fuatbasik fuatbasik commented Aug 3, 2022

Enabling S3 Select pushdown for filtering improves query performance by reducing the data on the wire. It is most effective when the queries pushed down to Select filter out some portion of the data residing on S3. This commit disables Select pushdown when a query has neither a predicate nor a projection. To retrieve an entire object, S3 GetObject is the cheaper option.

Description

Enabling S3 Select pushdown for filtering improves query performance by reducing the data on the wire. It is most effective when the queries pushed down to Select filter out a significant portion of the data residing on S3. This commit disables Select pushdown when a query has neither a predicate nor a projection. In these cases, using S3 GetObject is both cheaper and faster.
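
The decision described here can be sketched roughly as follows. This is a minimal, self-contained illustration and not the connector's actual code: `PushdownDecision`, `shouldUseS3Select`, and the `predicateIsAll` flag are hypothetical names, the table schema is reduced to plain sets of column names and types, and "no predicate, or projection" is read as "neither a predicate nor a projection".

```java
import java.util.Set;

// Hypothetical sketch: names and signatures are illustrative, not Trino's API.
final class PushdownDecision
{
    private PushdownDecision() {}

    // Use S3 Select only when the query can reduce the data read:
    // either a predicate filters rows, or the projection drops columns.
    static boolean shouldUseS3Select(
            boolean predicateIsAll,
            Set<String> projectedColumnNames,
            Set<String> projectedColumnTypes,
            Set<String> tableColumnNames,
            Set<String> tableColumnTypes)
    {
        if (!predicateIsAll) {
            return true; // some rows may be filtered out on the S3 side
        }
        boolean readsWholeRow = projectedColumnNames.equals(tableColumnNames)
                && projectedColumnTypes.equals(tableColumnTypes);
        // Every column of every row is needed, so a plain GetObject is preferred.
        return !readsWholeRow;
    }
}
```

Because the comparison works on sets, a projection that lists all columns in a different order is still treated as reading the whole row in this sketch.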

Is this change a fix, improvement, new feature, refactoring, or other?

Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

To Trino-Hive connector

How would you describe this change to a non-technical end user or system administrator?

This commit disables the use of S3 Select when it is not going to improve performance.

Related issues, pull requests, and links

Documentation

(X) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(X) No release notes entries required.
( ) Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot

cla-bot bot commented Aug 3, 2022

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Fuat Basik.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check if your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@findepi findepi requested a review from arhimondr August 3, 2022 09:05
        return isEquivalentColumns(projectedColumnNames, schema) && isEquivalentColumnTypes(projectedColumnTypes, schema);
    }

    private boolean isEquivalentColumns(Set<String> projectedColumnNames, Properties schema)
Contributor

Minor: do you think these utility methods can be static?

Contributor Author

Sure, I will fix this in the next revision.

@dnanuti
Member

dnanuti commented Aug 3, 2022

I would like to call out that our Docker tests rely on retrieving the entire table content, that is, on pushing down non-filtering queries to Select. We need to add a mechanism to pass filtering queries before merging this PR.

@fuatbasik
Contributor Author

fuatbasik commented Aug 3, 2022

You mean these tests, right (https://github.com/trinodb/trino/blob/master/plugin/trino-hive-hadoop2/src/test/java/io/trino/plugin/hive/s3select/TestHiveFileSystemS3SelectPushdown.java)? You are right, and it is a really good catch. With this change, none of these tests will be pushed down to Select.

Let me create a new test utility method, ScanAndFilterTable, that accepts Table and ColumnHandles and uses Select to return only the relevant data. Next, I can add tests in the aforementioned class that use this method instead of the readTable method. This way, we can also check the correctness and completeness of the Select filtering, which should be a good addition to test coverage.

        }
        else {
            final String columnNameDelimiter = (String) schema.getOrDefault(COLUMN_NAME_DELIMITER, ",");
            columnNames = new HashSet<>(asList(columnNameProperty.split(columnNameDelimiter)));
Member

This can be simplified to Set.of(columnNameProperty.split(columnNameDelimiter))

Contributor Author

Thanks. Fixed this.
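
For illustration, a self-contained sketch of the simplified parsing. The property keys `columns` and `column.name.delimiter` are assumed stand-ins for the `LIST_COLUMNS` and `COLUMN_NAME_DELIMITER` constants, whose values are not shown in this thread, and `SchemaColumns` is a hypothetical class. Two standard-library behaviours are worth keeping in mind with this form: `Set.of(E...)` throws `IllegalArgumentException` on duplicate elements, and `String.split` treats its argument as a regular expression.

```java
import java.util.Properties;
import java.util.Set;

// Hypothetical helper; the property keys below are assumed stand-ins.
final class SchemaColumns
{
    static final String LIST_COLUMNS = "columns";
    static final String COLUMN_NAME_DELIMITER = "column.name.delimiter";

    private SchemaColumns() {}

    static Set<String> columnNames(Properties schema)
    {
        String columnNameProperty = schema.getProperty(LIST_COLUMNS, "");
        if (columnNameProperty.isEmpty()) {
            return Set.of(); // immutable empty set
        }
        String delimiter = schema.getProperty(COLUMN_NAME_DELIMITER, ",");
        // Set.of rejects duplicate elements; split() interprets the delimiter as a regex.
        return Set.of(columnNameProperty.split(delimiter));
    }
}
```

If duplicate column names could ever occur in the property, `Set.copyOf(Arrays.asList(...))` would tolerate them while still returning an immutable set.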

        Set<String> columnNames;
        String columnNameProperty = schema.getProperty(LIST_COLUMNS);
        if (columnNameProperty.length() == 0) {
            columnNames = new HashSet<>();
Member

Should we use Collections.emptySet()?

Member

Same for the other occurrences - lines: 165, 179

Contributor Author

Thanks. Fixed this.

            columnNames = new HashSet<>();
        }
        else {
            final String columnNameDelimiter = (String) schema.getOrDefault(COLUMN_NAME_DELIMITER, ",");
Contributor

nit: drop final

Contributor Author

Thanks. Fixed this.

        Set<String> columnNames;
        String columnNameProperty = schema.getProperty(LIST_COLUMNS);
        if (columnNameProperty.length() == 0) {
            columnNames = new HashSet<>();
Contributor

Prefer Set.of or ImmutableSet.of

Contributor Author

Thanks. Fixed this.

        String columnTypeProperty = schema.getProperty(LIST_COLUMN_TYPES);
        Set<String> columnTypes;
        if (columnTypeProperty.length() == 0) {
            columnTypes = new HashSet<>();
Contributor

Same comments here

Contributor Author

Thanks. Fixed this.

    private Set<String> getColumnProperty(List<HiveColumnHandle> readerColumns, Function<HiveColumnHandle, String> mapper)
    {
        if (readerColumns == null || readerColumns.isEmpty()) {
            return new HashSet<>();
Contributor

ditto

Contributor Author

Thanks. Fixed this.


    private Set<String> getColumnProperty(List<HiveColumnHandle> readerColumns, Function<HiveColumnHandle, String> mapper)
    {
        if (readerColumns == null || readerColumns.isEmpty()) {
Contributor

Can readerColumns ever be null?

Contributor Author

It cannot be null, but it can be empty. Dropping the null check.

        }
        return readerColumns.stream()
                .map(mapper)
                .collect(Collectors.toSet());
Contributor

toImmutableSet

Contributor Author

Done!
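
A sketch of the helper shape after this round of comments, under stated assumptions: the generic parameter `T` stands in for `HiveColumnHandle`, `ColumnProperties` is a hypothetical class, and the JDK's `Collectors.toUnmodifiableSet()` is used so that the example has no dependencies. The review asks for Guava's `toImmutableSet()`, which plays the same role here.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch; T stands in for HiveColumnHandle.
final class ColumnProperties
{
    private ColumnProperties() {}

    static <T> Set<String> getColumnProperty(List<T> readerColumns, Function<T, String> mapper)
    {
        // An empty list simply yields an empty immutable set,
        // so neither a null check nor an isEmpty() special case is needed.
        return readerColumns.stream()
                .map(mapper)
                .collect(Collectors.toUnmodifiableSet());
    }
}
```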

    public void shouldReturnSelectRecordCursor()
    {
        List<HiveColumnHandle> columnHandleList = new ArrayList<>();
        s3SelectPushdownEnabled = true;
Contributor

This is not thread safe (multiple tests can run in parallel)

Contributor Author

I made s3SelectPushdownEnabled a method-local variable for the tests.


public class TestS3SelectRecordCursorProvider
{
    private static final Configuration CONFIGURATION = ConfigurationInstantiator.newEmptyConfiguration();
Contributor

I would recommend inlining all mutable fields for thread safety (CONFIGURATION, HDFS_ENVIRONMENT, S3_SELECT_RECORD_CURSOR_PROVIDER, SCHEMA)

Contributor Author

@fuatbasik fuatbasik Aug 10, 2022

I am not sure I understood this comment. Should I re-create these objects in each Test method?

Contributor

> I am not sure I understood this comment. Should I re-create these objects in each Test method?

Yeah. Those objects should be cheap to create and it would make the tests thread safe allowing multiple tests to be executed in parallel.
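
To illustrate the suggestion, a minimal sketch in which each test builds its own fixtures instead of sharing mutable fields. Everything here is hypothetical: `PerTestFixtures`, `createTestingSchema`, the `columns` key, and the two check methods are illustrative names, and plain static methods stand in for test methods so that the example needs no test framework.

```java
import java.util.Properties;

// Hypothetical sketch: fixtures are created per test rather than held in shared fields.
final class PerTestFixtures
{
    private PerTestFixtures() {}

    // Cheap to build, so every test gets a fresh, unshared instance.
    static Properties createTestingSchema()
    {
        Properties schema = new Properties();
        schema.setProperty("columns", "article,author,quantity");
        return schema;
    }

    static boolean shouldSeeOriginalColumns()
    {
        Properties schema = createTestingSchema();
        return "article,author,quantity".equals(schema.getProperty("columns"));
    }

    static boolean shouldSeeOverriddenColumns()
    {
        Properties schema = createTestingSchema();
        // Mutating the local copy cannot leak into tests running in parallel.
        schema.setProperty("columns", "author");
        return "author".equals(schema.getProperty("columns"));
    }
}
```

With a shared mutable `SCHEMA` field, the second check could change what the first one observes when both run concurrently; with per-test fixtures the order and parallelism of the tests no longer matter.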


    private static String buildPropertyFromColumns(List<HiveColumnHandle> columns, Function<HiveColumnHandle, String> mapper)
    {
        if (columns == null || columns.isEmpty()) {
Contributor

can columns ever be null?

Contributor Author

It shouldn't be; I am dropping the null check. But it can be empty.

    {
        s3SelectPushdownEnabled = true;
        TupleDomain<HiveColumnHandle> effectivePredicate = TupleDomain.all();
        final List<HiveColumnHandle> readerColumns = ImmutableList.of(QUANTITY_COLUMN, AUTHOR_COLUMN, ARTICLE_COLUMN);
Contributor

nit: drop final (here and in shouldNotReturnSelectRecordCursorWhenProjectionOrderIsDifferent)

Contributor Author

Thanks. Fixing it in the second revision.

@cla-bot cla-bot bot added the cla-signed label Aug 10, 2022
@fuatbasik fuatbasik force-pushed the select-pushdown-optimisation branch 2 times, most recently from a1b0841 to 49aa6a4 Compare August 10, 2022 13:40
Contributor

@arhimondr arhimondr left a comment

LGTM % remaining comments

@@ -80,17 +90,20 @@ public Optional<ReaderRecordCursorWithProjections> createRecordCursor(
        // Ignore predicates on partial columns for now.
        effectivePredicate = effectivePredicate.filter((column, domain) -> column.isBaseColumn());

        List<HiveColumnHandle> readerColumns = projectedReaderColumns
                .map(readColumns -> readColumns.get().stream().map(HiveColumnHandle.class::cast).collect(toUnmodifiableList()))
Contributor

nit: we usually prefer the toImmutableList collector from Guava

Contributor Author

Thanks, updating in the next commit.

Contributor Author

Done.

Contributor Author

Done.

    public void shouldReturnSelectRecordCursor()
    {
        List<HiveColumnHandle> columnHandleList = new ArrayList<>();
        boolean s3SelectPushdownEnabled = true;
Contributor

nit: I would probably simply inline it

Contributor Author

Thanks, I have inlined the parameter.

        TupleDomain<HiveColumnHandle> effectivePredicate = TupleDomain.all();
        Optional<HiveRecordCursorProvider.ReaderRecordCursorWithProjections> recordCursor =
                S3_SELECT_RECORD_CURSOR_PROVIDER.createRecordCursor(
                        CONFIGURATION, SESSION, PATH, START, LENGTH, FILESIZE, SCHEMA, columnHandleList, effectivePredicate, TESTING_TYPE_MANAGER, s3SelectPushdownEnabled);
Contributor

nit: I would recommend putting each argument on a separate line (here and in other places)

Contributor Author

Done. I encapsulated the createRecordCursor call in a method, and there I put each argument on a separate line.

Enabling S3 Select pushdown for filtering improves performance of
queries by reducing the data on the wire. It is most effective when
the queries pushed down to Select filter out a significant portion
of the data residing on S3. This commit disables Select pushdown
when a query has neither a predicate nor a projection. In these
cases, using GET is both cheaper and faster.
@arhimondr arhimondr merged commit a4b3cd7 into trinodb:master Aug 15, 2022
@github-actions github-actions bot added this to the 393 milestone Aug 15, 2022
@colebow
Member

colebow commented Aug 15, 2022

If this improves performance/reduces cost, it would be good to include it in the release notes. Could you please propose a potential release note?

cc @arhimondr

@fuatbasik
Contributor Author

Potential release note: Improve efficiency for queries over tables in CSV and JSON formats stored on S3 when no filtering or projection is needed by automatically disabling S3 Select pushdown.
