Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multi hive-partition parquet reading in dask-cudf #9122

Merged
merged 4 commits into from
Sep 9, 2021

Conversation

rjzamora
Copy link
Member

This PR fixes some un-tested hive-partitioning edge cases in read-parquet. The "full" fix requires dask#8072. However, this PR should be merged before the upstream change goes in (otherwise dask_cudf.read_parquet CI will temporarily break). Note that this change is non-breaking, but the corresponding dask-dataframe change is.

@rjzamora rjzamora added 2 - In Progress Currently a work in progress dask Dask issue non-breaking Non-breaking change labels Aug 26, 2021
@rjzamora rjzamora self-assigned this Aug 26, 2021
@rjzamora rjzamora requested a review from a team as a code owner August 26, 2021 13:30
@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 26, 2021
@rjzamora rjzamora added the bug Something isn't working label Aug 26, 2021
@codecov
Copy link

codecov bot commented Aug 27, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@4d8e401). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 7742364 differs from pull request most recent head f9968a7. Consider uploading reports for the commit f9968a7 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #9122   +/-   ##
===============================================
  Coverage                ?   10.87%           
===============================================
  Files                   ?      115           
  Lines                   ?    19141           
  Branches                ?        0           
===============================================
  Hits                    ?     2082           
  Misses                  ?    17059           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d8e401...f9968a7. Read the comment docs.

@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Aug 27, 2021
Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rjzamora Changes look good to me, will there be a followup PR after dask/dask#8072 goes in? If not we would probably have to tag this as a breaking change so that this makes into the breaking change CHANGELOG.md entry.

@rjzamora
Copy link
Member Author

Changes look good to me, will there be a followup PR after dask/dask#8072 goes in? If not we would probably have to tag this as a breaking change so that this makes into the breaking change CHANGELOG.md entry.

Thanks for the review @galipremsagar! Are you refering to the fact that dask#8072 is technically a breaking change without this PR in place (even though this PR is not "breaking")? It is true that cudf<=21.08 will run into dask-cudf test failures with dask versions released after dask#8072 is merged.

@galipremsagar
Copy link
Contributor

Are you refering to the fact that dask#8072 is technically a breaking change without this PR in place (even though this PR is not "breaking")?

Yes, exactly.

It is true that cudf<=21.08 will run into dask-cudf test failures with dask versions released after dask#8072 is merged.

I see, we don't guarantee cudf<=21.08 will work with a dask version which is above the max pinned version for that release. But that said, will the end-user have to make changes to their code while using read_parquet after this PR & dask#8072 are merged? Seems like the function signature change in dask upstream PR is breaking change?

@rjzamora
Copy link
Member Author

But that said, will the end-user have to make changes to their code while using read_parquet after this PR & dask#8072 are merged? Seems like the function signature change in dask upstream PR is breaking change?

This is a good question. We are refactoring code and tweaking some function signatures, but the intention was to avoid removing kwargs from any public functions. Therefore, the down-stream user should not need to change any code after these PRs go in. However, if the user is on dask>=2021.9.0 (assuming the dask PR gets merged for that release), then they will want to use cudf>=21.10. The parquet API will mostly work for people with older cudf versions, but hive-partitioned columns will not be detected in some cases (i.e. they may have missing columns with some settings).

I'm certainly not completely sure of the "correct" way to label this PR, so I'm happy to defer to you :)

@galipremsagar
Copy link
Contributor

The parquet API will mostly work for people with older cudf versions, but hive-partitioned columns will not be detected in some cases (i.e. they may have missing columns with some settings).

Okay, got it. Then lets keep the tags as is. Thanks for explaining this to me @rjzamora 🙏

@rjzamora
Copy link
Member Author

rjzamora commented Sep 1, 2021

rerun tests

@quasiben
Copy link
Member

quasiben commented Sep 9, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 4349232 into rapidsai:branch-21.10 Sep 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working dask Dask issue non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants