
Enable multi-file partitioning in dask_cudf.read_parquet #8393

Merged · 9 commits merged into rapidsai:branch-21.08 from multi-file-parquet · Jul 21, 2021

Conversation

@rjzamora (Member) commented May 27, 2021

Depends on dask#7557

This PR updates dask_cudf.read_parquet to enable multi-file aggregation when the chunksize parameter is used. This change yields dramatic (>50x) read-time improvements when the dataset contains many small files, which is common for hive/directory-partitioned data.

Motivating Example

import dask.datasets
import dask_cudf

# `path` is the output directory for the partitioned dataset
dask.datasets.timeseries(
    start='2000-01-01',
    end='2000-02-28',
    freq='1s',
    partition_freq='1h',
    seed=42,
).to_parquet(
    path,
    engine="pyarrow",
    partition_on="name",
    write_index=False,
)

####
#### BEFORE THIS PR ####
####

# Without `chunksize`, or when using `chunksize` before this PR,
# the read produces 36,192 partitions
%time df_read = dask_cudf.read_parquet(path)
CPU times: user 1.17 s, sys: 501 ms, total: 1.67 s
Wall time: 1.67 s

# Without `chunksize`, or when using `chunksize` before this PR,
# the full read takes 5+ minutes
%time df_read.compute()
CPU times: user 5min 34s, sys: 37.1 s, total: 6min 11s
Wall time: 6min 12s

####
#### AFTER THIS PR ####
####

# Using `chunksize` WITH this PR results in 26 partitions
%time df_read = dask_cudf.read_parquet(path, chunksize="1GB", aggregate_files="name")
CPU times: user 1.03 s, sys: 119 ms, total: 1.15 s
Wall time: 1.15 s

# Using `chunksize` WITH this PR results in a ~95x speedup
%time df_read.compute()
CPU times: user 3.18 s, sys: 703 ms, total: 3.88 s
Wall time: 3.92 s

@github-actions github-actions bot added the Python Affects Python cuDF API. label May 27, 2021
codecov bot commented May 27, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@67b7aac).
The diff coverage is n/a.

❗ Current head caf2b90 differs from pull request most recent head 08d12e2. Consider uploading reports for the commit 08d12e2 to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.08    #8393   +/-   ##
===============================================
  Coverage                ?   10.60%           
===============================================
  Files                   ?      116           
  Lines                   ?    18606           
  Branches                ?        0           
===============================================
  Hits                    ?     1974           
  Misses                  ?    16632           
  Partials                ?        0           


@rjzamora rjzamora marked this pull request as ready for review July 15, 2021 20:09
@rjzamora rjzamora requested review from a team as code owners July 15, 2021 20:09
@rjzamora rjzamora requested review from devavret, davidwendt, charlesbluca and rgsl888prabhu and removed request for a team July 15, 2021 20:09
@github-actions github-actions bot added CMake CMake build issue conda Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Jul 15, 2021
@rjzamora rjzamora removed CMake CMake build issue conda Java Affects Java cuDF API. labels Jul 15, 2021
@rjzamora rjzamora added 3 - Ready for Review Ready for review by team 4 - Needs Dask Reviewer non-breaking Non-breaking change and removed gpuCI libcudf Affects libcudf (C++/CUDA) code. labels Jul 15, 2021
@rjzamora rjzamora changed the base branch from branch-21.06 to branch-21.08 July 15, 2021 20:11
@rjzamora (Member Author)

@randerzander - dask#7557 was merged, and I just confirmed that this PR still enables (very performant) multi-file aggregation via the aggregate_files parameter in read_parquet :)

@rjzamora rjzamora added the improvement Improvement / enhancement to an existing function label Jul 15, 2021
@ajschmidt8 ajschmidt8 removed the request for review from a team July 16, 2021 13:39
@ajschmidt8 (Member)

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

@devavret (Contributor) left a comment

No need for tests?

@rjzamora (Member Author)

> No need for tests?

Good question, actually. If CI is using the Dask main branch, we can add an explicit test with aggregate_files= set. Otherwise, we cannot really test for the feature we are enabling. Perhaps a simple test with a version check is appropriate.
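
For illustration, such a version-gated test might look like the following (a minimal sketch; the exact minimum Dask version, the fixture setup, and the partition-count threshold are assumptions, not the test that was ultimately added):

import dask
import dask.datasets
import dask_cudf
import pytest
from packaging.version import parse as parse_version


@pytest.mark.skipif(
    parse_version(dask.__version__) < parse_version("2021.07.0"),
    reason="multi-file aggregation requires a newer Dask",
)
def test_read_parquet_aggregate_files(tmpdir):
    # Write a small hive-partitioned dataset with many tiny files
    # (~26 "name" directories, several files each).
    path = str(tmpdir)
    dask.datasets.timeseries(
        start="2000-01-01",
        end="2000-01-02",
        freq="600s",
        partition_freq="6h",
        seed=42,
    ).to_parquet(path, partition_on="name", write_index=False)

    # Read it back with multi-file aggregation enabled; the many
    # small files should collapse into far fewer partitions.
    ddf = dask_cudf.read_parquet(
        path, chunksize="1GB", aggregate_files="name"
    )
    assert ddf.npartitions < 100
    ddf.compute()  # sanity check that the aggregated read works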

@robertmaynard robertmaynard removed the request for review from a team July 16, 2021 16:51
@robertmaynard (Contributor)

Removing cmake-codeowners from the reviews since it doesn't seem this has any build-system impact.

@rjzamora (Member Author)

@devavret - Just added test coverage

else:
    (path, row_group, partition_keys) = piece
if not isinstance(pieces, list):
    pieces = [pieces]
Contributor:

Should this be?

Suggested change:
-    pieces = [pieces]
+    pieces = list(pieces)

in case pieces is a numpy array or tuple

@rjzamora (Member Author):

Sorry - I know this is a bit ugly :/

We can't use list(pieces), because pieces will be either a tuple, a string, a list of strings, or a list of tuples. For newer versions of Dask, this should always be a list already. For older versions, however, we want to convert a string into a list of strings, and a tuple into a list of tuples.
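
To make the branching concrete, the normalization being described amounts to something like this (a standalone sketch; the helper name is hypothetical, not the actual source):

def _normalize_pieces(pieces):
    # Newer Dask always passes a list (of path strings or
    # (path, row_group, partition_keys) tuples), so pass it through.
    if isinstance(pieces, list):
        return pieces
    # Older Dask may pass a bare path string or a bare tuple; wrap it
    # so downstream code can always iterate over a list of pieces.
    return [pieces]

Using list(pieces) instead would explode a tuple into its elements and a string into individual characters, which is why the wrapping form is required.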

@@ -40,42 +40,82 @@ def read_metadata(*args, **kwargs):

        return (new_meta, stats, parts, index)

    @classmethod
    def multi_support(cls):
Contributor:

Where is this being used?

@rjzamora (Member Author):

It is used upstream (in Dask) to check if we support pieces being passed in as a list in read_partition. So, it is an "opt-in" mechanism.
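
As a rough illustration of that opt-in pattern (a sketch of the upstream call site; the helper and its arguments are assumptions, not the exact Dask code):

import pandas as pd

def read_pieces(engine, fs, pieces, columns, index):
    # If the engine opted in via multi_support(), hand over the whole
    # list of aggregated pieces in a single read_partition call.
    if getattr(engine, "multi_support", lambda: False)():
        return engine.read_partition(fs, pieces, columns, index)
    # Otherwise, read one piece per call and concatenate the results.
    return pd.concat(
        [engine.read_partition(fs, p, columns, index) for p in pieces]
    )

Engines that do not define multi_support keep the old one-piece-per-call behavior, so the change is backward compatible.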

@rjzamora rjzamora removed the request for review from a team July 20, 2021 21:32
@quasiben (Member) left a comment

Thanks @rjzamora . This looks good and thanks for adding the tests.

Also, thank you to @rgsl888prabhu and @devavret for the review

@quasiben (Member)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 39220f9 into rapidsai:branch-21.08 Jul 21, 2021
@rjzamora rjzamora deleted the multi-file-parquet branch July 21, 2021 20:31
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs Dask Reviewer labels Feb 23, 2024