Fix multi hive-partition parquet reading in dask-cudf #9122
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #9122 +/- ##
===============================================
Coverage ? 10.87%
===============================================
Files ? 115
Lines ? 19141
Branches ? 0
===============================================
Hits ? 2082
Misses ? 17059
Partials ? 0
@rjzamora Changes look good to me. Will there be a follow-up PR after dask/dask#8072 goes in? If not, we would probably have to tag this as a breaking change so that it makes it into the breaking-change entry in CHANGELOG.md.
Thanks for the review @galipremsagar! Are you referring to the fact that dask#8072 is technically a breaking change without this PR in place (even though this PR is not "breaking")? It is true that cudf<=21.08 will run into dask-cudf test failures with dask versions released after dask#8072 is merged.
Yes, exactly.
I see, we don't guarantee
This is a good question. We are refactoring code and tweaking some function signatures, but the intention was to avoid removing kwargs from any public functions. Therefore, downstream users should not need to change any code after these PRs go in. However, if the user is on dask>=2021.9.0 (assuming the dask PR gets merged for that release), then they will want to use cudf>=21.10. The parquet API will mostly work for people with older cudf versions, but hive-partitioned columns will not be detected in some cases (that is, columns may be missing with some settings). I'm not completely sure of the "correct" way to label this PR, so I'm happy to defer to you :)
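To make the version pairing above concrete, here is a minimal sketch of the check a user could run. It assumes the thresholds quoted in the comment (dask>=2021.9.0 paired with cudf>=21.10), which themselves depend on the dask PR landing in that release. The helper names `parse_version` and `needs_newer_cudf` are hypothetical and not part of dask or cudf.

```python
def parse_version(version):
    """Return the (major, minor) components of a dotted version string."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def needs_newer_cudf(dask_version, cudf_version):
    """Hypothetical check: True when dask is new enough to include the
    upstream parquet change but cudf predates the matching fix.

    The thresholds are assumptions taken from the discussion above.
    """
    dask_has_change = parse_version(dask_version) >= (2021, 9)
    cudf_has_fix = parse_version(cudf_version) >= (21, 10)
    return dask_has_change and not cudf_has_fix


# Example version pairs (illustrative values only)
for dask_v, cudf_v in [("2021.9.0", "21.08"), ("2021.9.0", "21.10"), ("2021.8.1", "21.08")]:
    print(dask_v, cudf_v, needs_newer_cudf(dask_v, cudf_v))
```

Only the first pair is flagged, which matches the case described above where hive-partitioned columns may go undetected.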
Okay, got it. Then let's keep the tags as is. Thanks for explaining this to me @rjzamora 🙏
rerun tests
@gpucibot merge
This PR fixes some untested hive-partitioning edge cases in read_parquet. The "full" fix requires dask#8072. However, this PR should be merged before the upstream change goes in (otherwise dask_cudf.read_parquet CI will temporarily break). Note that this change is non-breaking, but the corresponding dask-dataframe change is.
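For readers unfamiliar with the term, a hive-partitioned dataset encodes column values in `key=value` directory names, and a "multi" partition layout nests more than one such level. The sketch below only illustrates that path convention using the standard library. It is not the dask-cudf or dask implementation, and the function name `hive_partitions` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def hive_partitions(path):
    """Return {column: value} parsed from key=value directory names in path."""
    partitions = {}
    # Only directory components can carry partition values, so skip the file name.
    for component in PurePosixPath(path).parent.parts:
        key, sep, value = component.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Hypothetical two-level layout: partitioned by year, then by month.
paths = [
    "dataset/year=2020/month=01/part.0.parquet",
    "dataset/year=2021/month=02/part.0.parquet",
]
for p in paths:
    print(p, hive_partitions(p))
```

A reader that fails to detect one of these levels would return data without the corresponding column, which is the kind of missing-column symptom discussed in this thread.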