Fix multi hive-partition parquet reading in dask-cudf #9122

rjzamora · 2021-08-26T13:30:24Z

This PR fixes some untested hive-partitioning edge cases in read_parquet. The "full" fix requires dask#8072. However, this PR should be merged before the upstream change goes in (otherwise dask_cudf.read_parquet CI will temporarily break). Note that this change is non-breaking, but the corresponding dask-dataframe change is breaking.
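For readers unfamiliar with the term, a hive-partitioned dataset stores column values in directory names such as year=2020/month=1/, so a reader has to recover those values from each file path and attach them as columns. The sketch below only illustrates that idea in plain Python. It is not the dask_cudf or dask implementation, and the helper name hive_partition_values and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def hive_partition_values(path):
    """Collect key=value directory components from a hive-style path.

    Hypothetical helper for illustration; not part of dask or cudf.
    """
    values = {}
    # Skip the last component, which is the file name itself.
    for part in PurePosixPath(path).parts[:-1]:
        key, sep, value = part.partition("=")
        if sep and key:
            values[key] = value
    return values


# Hypothetical dataset layout with two partition levels (year, month).
paths = [
    "dataset/year=2020/month=1/part.0.parquet",
    "dataset/year=2020/month=2/part.0.parquet",
    "dataset/year=2021/month=1/part.0.parquet",
]

for p in paths:
    print(p, hive_partition_values(p))
```

When several such files (pieces) are combined into a single output partition, each piece carries its own partition values, which is the kind of multi-piece case the commits in this PR describe.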

codecov · 2021-08-27T14:46:25Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@4d8e401).
The diff coverage is n/a.

❗ Current head 7742364 differs from pull request most recent head f9968a7. Consider uploading reports for the commit f9968a7 to get more accurate results

@@               Coverage Diff               @@
##             branch-21.10    #9122   +/-   ##
===============================================
  Coverage                ?   10.87%           
===============================================
  Files                   ?      115           
  Lines                   ?    19141           
  Branches                ?        0           
===============================================
  Hits                    ?     2082           
  Misses                  ?    17059           
  Partials                ?        0

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4d8e401...f9968a7.

galipremsagar

@rjzamora Changes look good to me. Will there be a follow-up PR after dask/dask#8072 goes in? If not, we would probably have to tag this as a breaking change so that it makes it into the breaking-change CHANGELOG.md entry.

rjzamora · 2021-08-30T20:46:04Z

> Changes look good to me. Will there be a follow-up PR after dask/dask#8072 goes in? If not, we would probably have to tag this as a breaking change so that it makes it into the breaking-change CHANGELOG.md entry.

Thanks for the review @galipremsagar! Are you referring to the fact that dask#8072 is technically a breaking change without this PR in place (even though this PR is not "breaking")? It is true that cudf<=21.08 will run into dask-cudf test failures with dask versions released after dask#8072 is merged.

galipremsagar · 2021-08-30T20:53:01Z

> Are you referring to the fact that dask#8072 is technically a breaking change without this PR in place (even though this PR is not "breaking")?

Yes, exactly.

> It is true that cudf<=21.08 will run into dask-cudf test failures with dask versions released after dask#8072 is merged.

I see. We don't guarantee that cudf<=21.08 will work with a dask version that is above the max pinned version for that release. That said, will the end user have to make changes to their code when using read_parquet after this PR and dask#8072 are merged? It seems like the function signature change in the upstream dask PR is a breaking change?

rjzamora · 2021-08-31T14:12:35Z

> That said, will the end user have to make changes to their code when using read_parquet after this PR and dask#8072 are merged? It seems like the function signature change in the upstream dask PR is a breaking change?

This is a good question. We are refactoring code and tweaking some function signatures, but the intention was to avoid removing kwargs from any public functions. Therefore, the downstream user should not need to change any code after these PRs go in. However, if the user is on dask>=2021.9.0 (assuming the dask PR gets merged for that release), then they will want to use cudf>=21.10. The parquet API will mostly work for people with older cudf versions, but hive-partitioned columns will not be detected in some cases (i.e. they may have missing columns with some settings).
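The version pairing described above can be written down as a small check. This is only a sketch: the thresholds (dask 2021.9.0 and cudf 21.10) come from this comment and assume the dask PR lands in that release, and the helper names parse_version and hive_columns_may_be_missing are hypothetical.

```python
def parse_version(version):
    """Reduce a dotted version such as '2021.9.0' or '21.08.00' to (major, minor)."""
    return tuple(int(piece) for piece in version.split(".")[:2])


def hive_columns_may_be_missing(dask_version, cudf_version):
    """Return True for the pairing flagged above: newer dask with older cudf.

    Assumption: dask#8072 ships in dask 2021.9.0 and the matching fix in cudf 21.10.
    """
    new_dask = parse_version(dask_version) >= (2021, 9)
    old_cudf = parse_version(cudf_version) < (21, 10)
    return new_dask and old_cudf


print(hive_columns_may_be_missing("2021.9.0", "21.08.00"))
print(hive_columns_may_be_missing("2021.9.0", "21.10.00"))
```

Only the first two version components are compared, which is enough for the calendar-style versions both projects use here.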

I'm certainly not completely sure of the "correct" way to label this PR, so I'm happy to defer to you :)

galipremsagar · 2021-08-31T14:15:43Z

> The parquet API will mostly work for people with older cudf versions, but hive-partitioned columns will not be detected in some cases (i.e. they may have missing columns with some settings).

Okay, got it. Then let's keep the tags as they are. Thanks for explaining this to me @rjzamora 🙏

rjzamora · 2021-09-01T21:27:09Z

rerun tests

quasiben · 2021-09-09T20:54:16Z

@gpucibot merge

rjzamora added 2 commits August 25, 2021 15:32

fix read_partition for multi-piece hive reading

bae6b75

add test coverage

1247c24

rjzamora added the 2 - In Progress, dask, and non-breaking labels Aug 26, 2021

rjzamora self-assigned this Aug 26, 2021

rjzamora requested a review from a team as a code owner August 26, 2021 13:30

github-actions bot added the Python label Aug 26, 2021

rjzamora added the bug label Aug 26, 2021

Merge remote-tracking branch 'upstream/branch-21.10' into fix-multi-hive

25b9336

rjzamora added the 3 - Ready for Review label and removed the 2 - In Progress label Aug 27, 2021

galipremsagar approved these changes Aug 30, 2021

add handling for missing partitioning arg

f9968a7

galipremsagar approved these changes Aug 31, 2021

quasiben approved these changes Sep 9, 2021

rapids-bot bot merged commit 4349232 into rapidsai:branch-21.10 Sep 9, 2021
