Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add one-level list encoding support in parquet reader #9848

Merged
merged 14 commits into from
Dec 10, 2021

Conversation

PointKernel
Copy link
Member

Closes #9240

This PR added the one-level list encoding support in parquet reader. It also involved cleanups like removing the unused stream argument and fixing typos in docs/comments.

@PointKernel PointKernel added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue non-breaking Non-breaking change labels Dec 6, 2021
@PointKernel PointKernel requested a review from a team as a code owner December 6, 2021 22:14
@PointKernel PointKernel self-assigned this Dec 6, 2021
@PointKernel
Copy link
Member Author

This PR can pass the unit test provided in #9240. I just want to make sure it covers all corner cases. @revans2 can you please test this PR with a larger file or other data types (like decimal, string, etc.)?

@codecov
Copy link

codecov bot commented Dec 6, 2021

Codecov Report

Merging #9848 (6c720f0) into branch-22.02 (967a333) will decrease coverage by 0.06%.
The diff coverage is 4.92%.

❗ Current head 6c720f0 differs from pull request most recent head f626a89. Consider uploading reports for the commit f626a89 to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02    #9848      +/-   ##
================================================
- Coverage         10.49%   10.42%   -0.07%     
================================================
  Files               119      119              
  Lines             20305    20470     +165     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18336     +161     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <ø> (ø)
python/cudf/cudf/core/dataframe.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/groupby/groupby.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <ø> (ø)
python/cudf/cudf/core/indexed_frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/multiindex.py 0.00% <0.00%> (ø)
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05c209f...f626a89. Read the comment docs.

@revans2
Copy link
Contributor

revans2 commented Dec 7, 2021

@res-life you found this bug is there additional testing that you want done?

Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're looking for a way to add tests, you can either add the file provided in #9240 as binary to a gtest or add the file in python/cudf/cudf/tests/data and add a pytest.

cpp/src/io/parquet/parquet.hpp Outdated Show resolved Hide resolved
@github-actions github-actions bot added the Python Affects Python cuDF API. label Dec 7, 2021
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for the most part, thanks for taking care of this issue!

cpp/src/io/parquet/reader_impl.cu Show resolved Hide resolved
column_buffer element_col(element_dtype, schema_elem.repetition_type == OPTIONAL);
// store the index of this element
nesting.push_back(static_cast<int>(output_col.children.size()));
// TODO: not sure if we should assign a name or leave it blank
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the included in the output schema?
In ORC the names of nested columns are generated as the index in the parent's list of children. Gives a uniform way to access nested columns of lists/maps/structs. I don't know enough about Parquet to understand if the same logic can apply here.

Copy link
Member Author

@PointKernel PointKernel Dec 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is really little spec information I can find for this one-level list encoding, not to mention how to name the elements. I will check parquet-cpp to see how they handle it and remove TODO whenever I find a proper fix (which may take some certain time). Probably in a follow-up PR.

cpp/src/io/parquet/reader_impl.cu Outdated Show resolved Hide resolved
@res-life
Copy link
Contributor

res-life commented Dec 8, 2021

@res-life you found this bug is there additional testing that you want done?
Yes, I'll update the test cases after this is merged.

@PointKernel
Copy link
Member Author

PointKernel commented Dec 9, 2021

@res-life you found this bug is there additional testing that you want done?
Yes, I'll update the test cases after this is merged.

@res-life Just to be clear, the corresponding pytest has been already added in this PR. This file format is specific thus I'd like to get more tests before this PR get merged. Please let me know if you have other similar files on hand.

@res-life
Copy link
Contributor

res-life commented Dec 9, 2021

@PointKernel Please also check this:
index-outof-bound.zip

@PointKernel
Copy link
Member Author

rerun tests

@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Dec 9, 2021
@vuule
Copy link
Contributor

vuule commented Dec 10, 2021

rerun tests

@PointKernel
Copy link
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 9435945 into rapidsai:branch-22.02 Dec 10, 2021
@PointKernel PointKernel deleted the parquet-one-level-list branch March 31, 2022 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Reading some lists in parquet produces incorrect results
6 participants