-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for struct type in ORC writer #9025
Add support for struct type in ORC writer #9025
Conversation
This reverts commit 6940680.
…fea-orc-writer-struct
…fea-orc-writer-struct
…fea-orc-writer-struct
…fea-orc-writer-struct
…fea-orc-write-struct
…fea-orc-write-struct
…fea-orc-write-struct
…fea-orc-write-struct
…fea-orc-write-struct
…fea-orc-write-struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python/Cython approval, with a couple of small questions.
C++/CUDA looked OK, but probably best to get approval from someone with greater familiarity. I didn't review those files in depth.
Is there a reason that submodule updates are present in this PR? The current build failure may be related since it looked like it was failing to pull in jitify.
table_view_from_table(table, ignore_index=True) | ||
) | ||
if self.index is not False: | ||
if isinstance(table._index, cudf.core.multiindex.MultiIndex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In write_orc
the corresponding block checks for any non-RangeIndex type. Is that discrepancy intentional or meaningful? Or do you always get either a MultiIndex or a RangeIndex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to defer to @devavret or @galipremsagar on this one. The Python changes in this PR are based on the equivalent Parquet PR #7461.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block is a replacement of the logic of utils.get_column_names
which assigns the names of the columns to metadata. We don't want to update the metadata for RangeIndex because metadata contains only information on libcudf column_views. We could get a single series materialized index too, which is captured by the second part of the branch.
Regarding writing RangeIndex, this is currently broken #7011 (and re-reported in #9216) because it might take some smarts to merge range indices across different calls to write_table
. The RangeIndex only gets written in the pandas specific metadata and passes straight through libcudf untouched. If you have some time, I'd invite you to take on #7011
…fea-orc-write-struct
They got pulled in when I merged 21.10. Not sure what I did wrong and how to revert it. I have these two files marked as modified in git. When I try to update submodules, it fails with |
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review but GitHub won't let me reply to @Vyas' comment without publishing my review. 🤷♂️
table_view_from_table(table, ignore_index=True) | ||
) | ||
if self.index is not False: | ||
if isinstance(table._index, cudf.core.multiindex.MultiIndex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code block is a replacement of the logic of utils.get_column_names
which assigns the names of the columns to metadata. We don't want to update the metadata for RangeIndex because metadata contains only information on libcudf column_views. We could get a single series materialized index too, which is captured by the second part of the branch.
Regarding writing RangeIndex, this is currently broken #7011 (and re-reported in #9216) because it might take some smarts to merge range indices across different calls to write_table
. The RangeIndex only gets written in the pandas specific metadata and passes straight through libcudf untouched. If you have some time, I'd invite you to take on #7011
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial pass, still going through PR.
@gpucibot merge |
This reverts commit 2c6b39b.
Fixes #7830, #8443
Features:
table_input_metadata
.Changes:
Breaking because the table metadata type has changed.