Enable multi-file partitioning in dask_cudf.read_parquet #8393
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##        branch-21.08    #8393   +/- ##
========================================
  Coverage           ?   10.60%
========================================
  Files              ?      116
  Lines              ?    18606
  Branches           ?        0
========================================
  Hits               ?     1974
  Misses             ?    16632
  Partials           ?        0
========================================
```

Continue to review the full report at Codecov.
@randerzander - dask#7557 was merged, and I just confirmed that this PR still enables (very performant) multi-file aggregation via the `chunksize` parameter.
No need for tests?
Good question, actually. If CI is using the Dask main branch, we can add an explicit test with `chunksize`.
@devavret - Just added test coverage.
```python
else:
    (path, row_group, partition_keys) = piece
if not isinstance(pieces, list):
    pieces = [pieces]
```
Should this be

```python
pieces = list(pieces)
```

in case `pieces` is a numpy array or tuple?
Sorry - I know this is a bit ugly :/

We can't use `list(pieces)`, because `pieces` will either be a tuple, a string, a list of strings, or a list of tuples. For newer versions of Dask, this should always be a list already. For older versions, however, we want to convert a string into a list of strings, and a tuple into a list of tuples.
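The distinction can be sketched in isolation (a hypothetical helper, not the actual `dask_cudf` code): wrapping in `[pieces]` keeps a lone string or tuple intact as a single element, whereas `list(pieces)` would split a string into characters or a tuple into its fields.

```python
def normalize_pieces(pieces):
    """Return `pieces` as a list, treating a bare string or tuple
    as a single piece rather than iterating over it."""
    if not isinstance(pieces, list):
        # list(pieces) would wrongly expand "a.parquet" into
        # characters, or a (path, row_group, keys) tuple into fields.
        pieces = [pieces]
    return pieces

print(normalize_pieces("part.0.parquet"))             # → ['part.0.parquet']
print(normalize_pieces(("part.0.parquet", 0, None)))  # → [('part.0.parquet', 0, None)]
print(normalize_pieces(["a.parquet", "b.parquet"]))   # → ['a.parquet', 'b.parquet']
```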
```python
@@ -40,42 +40,82 @@ def read_metadata(*args, **kwargs):

        return (new_meta, stats, parts, index)

    @classmethod
    def multi_support(cls):
```
Where is this being used?
It is used upstream (in Dask) to check if we support `pieces` being passed in as a list in `read_partition`. So, it is an "opt-in" mechanism.
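The opt-in pattern can be illustrated with a small sketch (class and method names here mirror the discussion but are not the actual Dask source): the base engine advertises no multi-piece support, and a subclass overrides the classmethod to opt in, letting upstream code check the flag before passing a list of pieces.

```python
class ParquetEngine:
    @classmethod
    def multi_support(cls):
        # Default: the engine expects one piece per output partition.
        return False

class CudfEngine(ParquetEngine):
    @classmethod
    def multi_support(cls):
        # Opt in: read_partition can accept a list of pieces, so a
        # single output partition may aggregate several files.
        return True

# Upstream code would consult the flag before passing a list:
for engine in (ParquetEngine, CudfEngine):
    print(engine.__name__, engine.multi_support())
```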
Thanks @rjzamora . This looks good and thanks for adding the tests.
Also, thank you to @rgsl888prabhu and @devavret for the review
@gpucibot merge
Depends on dask#7557
This PR updates `dask_cudf.read_parquet` to enable multi-file aggregation when the `chunksize` parameter is used. This change enables dramatic (>50x) improvements when the dataset contains many small files, which is somewhat common for hive/directory-partitioned data.

Motivating Example