-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable S3Select pushdown when query not filters data #13477
Disable S3Select pushdown when query not filters data #13477
Conversation
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Fuat Basik.
|
return isEquivalentColumns(projectedColumnNames, schema) && isEquivalentColumnTypes(projectedColumnTypes, schema); | ||
} | ||
|
||
private boolean isEquivalentColumns(Set<String> projectedColumnNames, Properties schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: do you think these utility methods can be static?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, i will fix this in the next revision
I would like to call-out that our Docker tests are relying on retrieving the entire table content, so pushing down non-filtering queries to Select. We need to add a mechanism to pass filtering queries before merging this PR. |
You mean these tests right (https://github.com/trinodb/trino/blob/master/plugin/trino-hive-hadoop2/src/test/java/io/trino/plugin/hive/s3select/TestHiveFileSystemS3SelectPushdown.java)? You are right and it is a really good catch. With this change, none of these tests will be pushed down to Select. Let me create a new test utility method, ScanAndFilterTable that accepts Table and ColumnHandles, and uses Select to return only the relevant data. Next, i can add tests in the aforementioned class that uses this method, instead of readTable method. This way, we can check correctness and completeness of the Select Filtering too, which should be a good addition to test coverage. |
} | ||
else { | ||
final String columnNameDelimiter = (String) schema.getOrDefault(COLUMN_NAME_DELIMITER, ","); | ||
columnNames = new HashSet<>(asList(columnNameProperty.split(columnNameDelimiter))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified to Set.of(columnNameProperty.split(columnNameDelimiter))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed this.
Set<String> columnNames; | ||
String columnNameProperty = schema.getProperty(LIST_COLUMNS); | ||
if (columnNameProperty.length() == 0) { | ||
columnNames = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use Collections.emptySet()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for the other occurrences - lines: 165, 179
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed this.
columnNames = new HashSet<>(); | ||
} | ||
else { | ||
final String columnNameDelimiter = (String) schema.getOrDefault(COLUMN_NAME_DELIMITER, ","); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: drop final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed this.
Set<String> columnNames; | ||
String columnNameProperty = schema.getProperty(LIST_COLUMNS); | ||
if (columnNameProperty.length() == 0) { | ||
columnNames = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer Set.of
or ImmutableSet.of
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed this.
String columnTypeProperty = schema.getProperty(LIST_COLUMN_TYPES); | ||
Set<String> columnTypes; | ||
if (columnTypeProperty.length() == 0) { | ||
columnTypes = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed this.
private Set<String> getColumnProperty(List<HiveColumnHandle> readerColumns, Function<HiveColumnHandle, String> mapper) | ||
{ | ||
if (readerColumns == null || readerColumns.isEmpty()) { | ||
return new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed this.
|
||
private Set<String> getColumnProperty(List<HiveColumnHandle> readerColumns, Function<HiveColumnHandle, String> mapper) | ||
{ | ||
if (readerColumns == null || readerColumns.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can readerColumns
ever be null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cannot be null but can be empty. Dropping the null check.
} | ||
return readerColumns.stream() | ||
.map(mapper) | ||
.collect(Collectors.toSet()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toImmutableSet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
public void shouldReturnSelectRecordCursor() | ||
{ | ||
List<HiveColumnHandle> columnHandleList = new ArrayList<>(); | ||
s3SelectPushdownEnabled = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not thread safe (multiple tests can run in parallel)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made s3SelectPushdownEnabled a method-local variable for the tests.
|
||
public class TestS3SelectRecordCursorProvider | ||
{ | ||
private static final Configuration CONFIGURATION = ConfigurationInstantiator.newEmptyConfiguration(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend inlining all mutable fields for thread safety (CONFIGURATION
, HDFS_ENVIRONMENT
, S3_SELECT_RECORD_CURSOR_PROVIDER
, SCHEMA
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure i understood this comment. Should I re-create these objects in each Test method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure i understood this comment. Should I re-create these objects in each Test method?
Yeah. Those objects should be cheap to create and it would make the tests thread safe allowing multiple tests to be executed in parallel.
|
||
private static String buildPropertyFromColumns(List<HiveColumnHandle> columns, Function<HiveColumnHandle, String> mapper) | ||
{ | ||
if (columns == null || columns.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can columns
ever be null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it shouldn't be, i am dropping the null check. But they can be empty.
{ | ||
s3SelectPushdownEnabled = true; | ||
TupleDomain<HiveColumnHandle> effectivePredicate = TupleDomain.all(); | ||
final List<HiveColumnHandle> readerColumns = ImmutableList.of(QUANTITY_COLUMN, AUTHOR_COLUMN, ARTICLE_COLUMN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: drop final
(here an in shouldNotReturnSelectRecordCursorWhenProjectionOrderIsDifferent
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixing it in second revision.
a1b0841
to
49aa6a4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % remaining comments
@@ -80,17 +90,20 @@ public Optional<ReaderRecordCursorWithProjections> createRecordCursor( | |||
// Ignore predicates on partial columns for now. | |||
effectivePredicate = effectivePredicate.filter((column, domain) -> column.isBaseColumn()); | |||
|
|||
List<HiveColumnHandle> readerColumns = projectedReaderColumns | |||
.map(readColumns -> readColumns.get().stream().map(HiveColumnHandle.class::cast).collect(toUnmodifiableList())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we usually prefer the toImmutableList
collector from Guava
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updating in the next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
public void shouldReturnSelectRecordCursor() | ||
{ | ||
List<HiveColumnHandle> columnHandleList = new ArrayList<>(); | ||
boolean s3SelectPushdownEnabled = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would probably simply inline it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I have inlined the parameter.
TupleDomain<HiveColumnHandle> effectivePredicate = TupleDomain.all(); | ||
Optional<HiveRecordCursorProvider.ReaderRecordCursorWithProjections> recordCursor = | ||
S3_SELECT_RECORD_CURSOR_PROVIDER.createRecordCursor( | ||
CONFIGURATION, SESSION, PATH, START, LENGTH, FILESIZE, SCHEMA, columnHandleList, effectivePredicate, TESTING_TYPE_MANAGER, s3SelectPushdownEnabled); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would recommend putting each argument on a separate line (here and in other places)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, i encapsulated the createRecordCursor in a method and there I am putting each argument on a separate line.
49aa6a4
to
e11bfb2
Compare
Enabling S3 Select pushdown for filtering improves performance of queries by reducing the data on the wire. It is most effective when queries pushed down to Select is filtering out significant portion of the data residing on S3. This commits disables Select pushdown when query has no predicate, or projection. In these cases using GET is both cheaper and faster.
e11bfb2
to
bc8e2fc
Compare
If this improves performance/reduces cost, it would be good to include it in the release notes. Could you please propose a potential release note? cc @arhimondr |
Potential release note: Improve efficiency for queries over tables in CSV and JSON formats stored on S3 when no filtering or projection is needed by automatically disabling S3 Select pushdown. |
Enabling S3 Select pushdown for filtering improves performance of queries by reducing the data on the wire. It is most effective when queries pushed down to Select is filtering out some portion of the data residing on S3. This commits disables Select pushdown when query has no predicate, or projection. To retrieve entire object, S3 GetObject is a cheaper option.
Description
Enabling S3 Select pushdown for filtering improves performance of queries by reducing the data on the wire. It is most effective when queries pushed down to Select is filtering out significant portion of the data residing on S3. This commit disables Select pushdown when query has no predicate, or projection. In these cases using S3 GetObject is both cheaper and faster.
Improvement
To Trino-Hive connector
This commit disables use of S3 Select, when it is not going to improve the performance.
Related issues, pull requests, and links
Documentation
(X) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
(X) No release notes entries required.
( ) Release notes entries required with the following suggested text: