"Partition pruning" for s3 #37356

Merged
merged 3 commits into from
May 25, 2022

amosbird
Collaborator

@amosbird amosbird commented May 19, 2022

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Allow pruning the list of files via virtual columns such as _file and _path when reading from S3. This is for #37174, #23494.
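
As a rough illustration of the idea in the changelog entry, the sketch below filters a list of S3 keys by a predicate on their virtual columns before any object would be read. It is not the code from this PR: `VirtualRow`, `makeVirtualRow`, `pruneKeys` and `endsWith` are hypothetical names, and it assumes `_path` is the bucket joined with the key and `_file` is the last path component.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the virtual columns of one S3 object.
struct VirtualRow
{
    std::string path; // assumed "_path": bucket + "/" + key
    std::string file; // assumed "_file": last component of the key
};

// Derive the virtual column values from the bucket and key alone,
// so no object data has to be fetched.
VirtualRow makeVirtualRow(const std::string & bucket, const std::string & key)
{
    const auto slash = key.find_last_of('/');
    const std::string file = (slash == std::string::npos) ? key : key.substr(slash + 1);
    return VirtualRow{bucket + "/" + key, file};
}

// Small helper used to mimic a filter such as: _file like '%5'
bool endsWith(const std::string & s, const std::string & suffix)
{
    return s.size() >= suffix.size()
        && s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Keep only the keys whose virtual columns satisfy the predicate.
std::vector<std::string> pruneKeys(
    const std::string & bucket,
    const std::vector<std::string> & keys,
    const std::function<bool(const VirtualRow &)> & predicate)
{
    std::vector<std::string> kept;
    for (const auto & key : keys)
        if (predicate(makeVirtualRow(bucket, key)))
            kept.push_back(key);
    return kept;
}
```

Calling `pruneKeys` with a predicate such as `endsWith(row.file, "5")` corresponds loosely to a `where _file like '%5'` filter: keys that cannot match are dropped before reading starts.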

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label May 19, 2022
@@ -250,7 +250,10 @@ ColumnPtr IExecutableFunction::executeWithoutSparseColumns(const ColumnsWithType
: columns_without_low_cardinality.front().column->size();

auto res = executeWithoutLowCardinalityColumns(columns_without_low_cardinality, dictionary_type, new_input_rows_count, dry_run);
auto keys = res->convertToFullColumnIfConst();
bool res_is_constant = isColumnConst(*res);
Collaborator Author

When the default implementation is used, Const(LowCardinality(...)) loses its constant property and becomes LowCardinality(...), which in turn disables pruning with virtual columns.

Here is the fix.
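
The effect described in this comment can be mimicked with a toy model. The sketch below is not ClickHouse code: `ToyColumn`, `executeInner`, `wrapLosingConst` and `wrapKeepingConst` are hypothetical, and the only thing borrowed from the diff is the idea of recording whether the intermediate result is constant before converting it to a full column.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy column: either a full column (one value per row) or a constant
// (a single stored value that logically repeats `rows` times).
struct ToyColumn
{
    std::vector<std::string> values;
    std::size_t rows;
    bool is_const;
};

// Expand a constant into a full column; full columns are returned unchanged.
ToyColumn convertToFull(const ToyColumn & c)
{
    if (!c.is_const)
        return c;
    return ToyColumn{std::vector<std::string>(c.rows, c.values.front()), c.rows, false};
}

// Stand-in for the inner function: constant input yields constant output.
ToyColumn executeInner(const ToyColumn & in)
{
    ToyColumn out = in;
    for (auto & v : out.values)
        v += "_path";
    return out;
}

// Wrapper that always materializes the result: the constant flag is lost.
ToyColumn wrapLosingConst(const ToyColumn & in)
{
    const ToyColumn res = executeInner(in);
    return convertToFull(res);
}

// Wrapper that remembers the flag first and re-wraps the result afterwards.
ToyColumn wrapKeepingConst(const ToyColumn & in)
{
    const ToyColumn res = executeInner(in);
    const bool res_is_constant = res.is_const;
    const ToyColumn keys = convertToFull(res);
    if (res_is_constant && !keys.values.empty())
        return ToyColumn{std::vector<std::string>{keys.values.front()}, res.rows, true};
    return keys;
}
```

In this model, a caller that only folds constant expressions can still evaluate the result of `wrapKeepingConst` ahead of time, while the output of `wrapLosingConst` looks like ordinary per-row data.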

bool has_wildcards = s3_configuration.uri.bucket.find(PARTITION_ID_WILDCARD) != String::npos
|| keys.back().find(PARTITION_ID_WILDCARD) != String::npos;
if (partition_by && has_wildcards)
throw Exception(ErrorCodes::NOT_IMPLEMENTED, "Reading from a partitioned S3 storage is not implemented yet");
Collaborator Author

It's not possible to read from a partitioned S3 storage for now (it always returns an empty result). Let's throw an exception instead.
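
A standalone sketch of the check in the hunk above, for readers who want to try it outside ClickHouse. It makes several assumptions: the placeholder text `{_partition_id}` is assumed to be the value of `PARTITION_ID_WILDCARD`, `std::logic_error` stands in for ClickHouse's `Exception` with `NOT_IMPLEMENTED`, and `hasPartitionWildcard` and `checkReadable` are hypothetical helper names.

```cpp
#include <stdexcept>
#include <string>

// Assumption: the placeholder that marks a partitioned write target.
const std::string PARTITION_ID_WILDCARD = "{_partition_id}";

// True if either the bucket or the last key contains the placeholder.
bool hasPartitionWildcard(const std::string & bucket, const std::string & last_key)
{
    return bucket.find(PARTITION_ID_WILDCARD) != std::string::npos
        || last_key.find(PARTITION_ID_WILDCARD) != std::string::npos;
}

// Fail loudly instead of silently returning an empty result.
// std::logic_error is a stand-in for the real exception type.
void checkReadable(bool partition_by, const std::string & bucket, const std::string & last_key)
{
    if (partition_by && hasPartitionWildcard(bucket, last_key))
        throw std::logic_error("Reading from a partitioned S3 storage is not implemented yet");
}
```

The check only fires when both conditions hold, so a key containing the placeholder without a partition expression, or a partitioned table with plain keys, passes through unchanged.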

select * from s3(s3_conn, filename='test_02302_*', format=Parquet) where _file like '%5';

-- Test s3 table with explicit keys (no glob)
-- TODO support truncate table function
Collaborator Author

@amosbird amosbird May 19, 2022

To construct an S3 storage with explicit keys, I have to apply the following clumsy steps. Two things could be improved:

  1. Allow truncate table function s3(...)
  2. Allow creating an S3 storage from a list of keys (or even from different buckets)

@kssenii kssenii self-assigned this May 19, 2022
@amosbird amosbird force-pushed the partition-prune-for-s3 branch from afbb275 to 7931683 Compare May 20, 2022 00:11
@amosbird amosbird force-pushed the partition-prune-for-s3 branch from 21a979c to 1ee02a4 Compare May 24, 2022 11:02
Member

@kssenii kssenii left a comment

LGTM
