Implement JNI for chunked Parquet reader #11961

Merged
211 commits merged into rapidsai:branch-22.12 from jni_parquet_reader
Nov 18, 2022

Conversation

@ttnghia (Contributor) commented Oct 20, 2022

This adds JNI bindings for the chunked Parquet reader. It depends on the chunked Parquet reader implementation PR (#11867).
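For readers unfamiliar with the pattern, the following is a minimal sketch of how a caller might drain a chunked reader: loop while more data remains and pull one size-limited chunk per iteration. Everything here is an assumption for illustration: `FakeChunkedReader`, its `hasNext`/`readChunk` methods, and the per-row byte sizes are hypothetical stand-ins running on plain Java arrays, not the cuDF JNI classes added by this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

public class ChunkedReadSketch {
    /** Hypothetical stand-in for a chunked reader; NOT the cuDF class. */
    static final class FakeChunkedReader implements AutoCloseable {
        private final int[] rowSizes;       // assumed byte size of each row
        private final long chunkByteLimit;  // soft limit on bytes per chunk
        private int next = 0;               // index of the first unread row

        FakeChunkedReader(long chunkByteLimit, int[] rowSizes) {
            this.chunkByteLimit = chunkByteLimit;
            this.rowSizes = rowSizes;
        }

        /** True while unread rows remain. */
        boolean hasNext() {
            return next < rowSizes.length;
        }

        /** Consumes the next chunk and returns how many rows it holds. */
        int readChunk() {
            if (!hasNext()) {
                throw new NoSuchElementException("no more chunks");
            }
            int start = next;
            long bytes = 0;
            // Always take at least one row so an oversized row still makes progress,
            // then keep adding rows while the chunk stays within the limit.
            do {
                bytes += rowSizes[next++];
            } while (next < rowSizes.length && bytes + rowSizes[next] <= chunkByteLimit);
            return next - start;
        }

        @Override
        public void close() {
            // A real reader would release native resources here.
        }
    }

    /** Reads every chunk and returns the row count of each one. */
    static List<Integer> readAll(long chunkByteLimit, int[] rowSizes) {
        List<Integer> chunkRowCounts = new ArrayList<>();
        try (FakeChunkedReader reader = new FakeChunkedReader(chunkByteLimit, rowSizes)) {
            while (reader.hasNext()) {
                chunkRowCounts.add(reader.readChunk());
            }
        }
        return chunkRowCounts;
    }

    public static void main(String[] args) {
        System.out.println(readAll(100, new int[] {40, 40, 40, 150, 10}));
    }
}
```

The limit is treated as soft in this sketch: a single row larger than the limit is still returned as its own chunk rather than causing the loop to stall.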

nvdbaranec and others added 30 commits September 23, 2022 10:59
…taining a mix of nested and non-nested types would result in incorrect row counts for the non-nested types. Also optimizes the preprocess path so that non-nested types do not end up getting visited by the kernel.

…ists. Fixed an additional issue in the decoding where flat column types underneath structs could end up ignoring skip_rows/num_rows.
Signed-off-by: Nghia Truong <[email protected]>

# Conflicts:
#	cpp/src/io/parquet/page_data.cu
#	cpp/src/io/parquet/reader_impl.cu
#	cpp/src/io/parquet/reader_impl.hpp
@ttnghia changed the title from "Implement JNI for chunked Parquet reader [skip ci]" to "Implement JNI for chunked Parquet reader" Nov 17, 2022
@ttnghia added the "3 - Ready for Review" label and removed the "2 - In Progress" label Nov 17, 2022
@ttnghia marked this pull request as ready for review November 18, 2022 03:06
@ttnghia requested a review from a team as a code owner November 18, 2022 03:06
@github-actions bot removed the "libcudf" label Nov 18, 2022
@codecov bot commented Nov 18, 2022

Codecov Report

Base: 87.47% // Head: 88.22% // Increases project coverage by +0.75% 🎉

Coverage data is based on head (23afc49) compared to base (f817d96).
Patch has no changes to coverable lines.

❗ Current head 23afc49 differs from pull request most recent head 6591704. Consider uploading reports for the commit 6591704 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11961      +/-   ##
================================================
+ Coverage         87.47%   88.22%   +0.75%     
================================================
  Files               133      137       +4     
  Lines             21826    22571     +745     
================================================
+ Hits              19093    19914     +821     
+ Misses             2733     2657      -76     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/interval.py 85.45% <0.00%> (-9.10%) ⬇️
python/strings_udf/strings_udf/__init__.py 75.80% <0.00%> (-8.51%) ⬇️
python/cudf/cudf/io/text.py 91.66% <0.00%> (-8.34%) ⬇️
python/cudf/cudf/core/_base_index.py 81.28% <0.00%> (-4.27%) ⬇️
python/cudf/cudf/io/json.py 92.06% <0.00%> (-2.68%) ⬇️
python/cudf/cudf/utils/utils.py 89.91% <0.00%> (-0.69%) ⬇️
python/cudf/cudf/core/column/timedelta.py 90.17% <0.00%> (-0.58%) ⬇️
python/cudf/cudf/core/column/datetime.py 89.21% <0.00%> (-0.51%) ⬇️
python/cudf/cudf/core/column/column.py 87.96% <0.00%> (-0.46%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.72% <0.00%> (-0.41%) ⬇️
... and 46 more


@mythrocks (Contributor) left a comment

Excellent stuff. A couple of minor nitpicks aside, this looks good to me.

@revans2 (Contributor) left a comment

Looks good. I ran through all of my tests with it and they all passed.

@ttnghia (Contributor, Author) commented Nov 18, 2022

@gpucibot merge

@ajschmidt8 merged commit 782fba3 into rapidsai:branch-22.12 Nov 18, 2022
@ttnghia deleted the jni_parquet_reader branch November 18, 2022 19:42
Labels
3 - Ready for Review: Ready for review by team
CMake: CMake build issue
feature request: New feature or request
Java: Affects Java cuDF API.
non-breaking: Non-breaking change
Spark: Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Split batches from parquet that are too large, and try to guess better before decompressing
7 participants