Add one-level list encoding support in parquet reader #9848
Codecov Report
@@               Coverage Diff                @@
##           branch-22.02    #9848     +/-   ##
================================================
- Coverage         10.49%   10.42%   -0.07%
================================================
  Files               119      119
  Lines             20305    20470    +165
================================================
+ Hits               2130     2134      +4
- Misses            18175    18336    +161
@res-life You found this bug. Is there any additional testing that you want done?
If you're looking for a way to add tests, you can either embed the file provided in #9240 as binary data in a gtest, or add the file to python/cudf/cudf/tests/data and add a pytest.
Looks good for the most part, thanks for taking care of this issue!
column_buffer element_col(element_dtype, schema_elem.repetition_type == OPTIONAL);
// store the index of this element
nesting.push_back(static_cast<int>(output_col.children.size()));
// TODO: not sure if we should assign a name or leave it blank
Is the name included in the output schema?
In ORC the names of nested columns are generated as the index in the parent's list of children. Gives a uniform way to access nested columns of lists/maps/structs. I don't know enough about Parquet to understand if the same logic can apply here.
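The index-based naming described above can be sketched as follows. This is only a minimal illustration of the idea, not cuDF's ORC or Parquet reader code; `Column`, `name_children_by_index`, and `paths` are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    """Hypothetical stand-in for a column in an output schema."""
    name: str
    children: List["Column"] = field(default_factory=list)


def name_children_by_index(col: Column) -> None:
    """Rename every nested child to its index in the parent's child list."""
    for i, child in enumerate(col.children):
        child.name = str(i)
        name_children_by_index(child)


def paths(col: Column, prefix: str = "") -> List[str]:
    """Return the dotted path of every column in the tree, depth first."""
    here = f"{prefix}.{col.name}" if prefix else col.name
    out = [here]
    for child in col.children:
        out.extend(paths(child, here))
    return out


# A list column whose element is a map with a key child and a value child.
# The nested names start out blank, as in the TODO under discussion.
root = Column("col", [Column("", [Column(""), Column("")])])
name_children_by_index(root)
index_paths = paths(root)
```

With this scheme every nested column gets a predictable path regardless of whether the file stored a name for it, which is the uniform access the comment refers to.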
There is very little spec information I can find for this one-level list encoding, not to mention how to name the elements. I will check parquet-cpp to see how they handle it and remove the TODO once I find a proper fix (which may take some time), probably in a follow-up PR.
@res-life Just to be clear, the corresponding pytest has already been added in this PR. This file format is unusual, so I'd like to get more tests before this PR gets merged. Please let me know if you have other similar files on hand.
@PointKernel Please also check this:
rerun tests
88bb4e9 to 786b456
rerun tests
@gpucibot merge
Closes #9240
This PR adds support for one-level list encoding to the Parquet reader. It also includes cleanups such as removing the unused stream argument and fixing typos in docs and comments.
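For context, a one-level list is a bare `repeated` field with no enclosing LIST-annotated group; the Parquet format's backward-compatibility rules say such a field should be read as a required list of required elements. The sketch below shows how the repetition and definition levels of such a field could be assembled into rows. It assumes a top-level `repeated` primitive field (maximum repetition level 1, maximum definition level 1) and is not the reader's GPU implementation; `assemble_one_level_lists` is a hypothetical name.

```python
from typing import List, Sequence


def assemble_one_level_lists(
    rep_levels: Sequence[int],
    def_levels: Sequence[int],
    values: Sequence[int],
) -> List[List[int]]:
    """Rebuild list rows for a top-level `repeated` primitive field.

    Assumed level semantics for this layout:
      rep == 0 starts a new row, rep == 1 continues the current row;
      def == 1 means a value is present, def == 0 means the row is empty.
    """
    rows: List[List[int]] = []
    value_iter = iter(values)
    for rep, dfn in zip(rep_levels, def_levels):
        if rep == 0:
            rows.append([])  # new row boundary
        if dfn == 1:
            rows[-1].append(next(value_iter))  # consume one stored value
    return rows


# Three rows: [1, 2], an empty list, and [3].
rows = assemble_one_level_lists(
    rep_levels=[0, 1, 0, 0],
    def_levels=[1, 1, 0, 1],
    values=[1, 2, 3],
)
```

Because the elements are required, no element-level nulls appear here; only the empty-list case needs a definition level below the maximum.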