Implement chunked Parquet reader #11867
Conversation
…taining a mix of nested and non-nested types would result in incorrect row counts for the non-nested types. Also optimizes the preprocess path so that non-nested types do not end up getting visited by the kernel.
…ists. Fixed an additional issue in the decoding where flat column types underneath structs could end up ignoring skip_rows/num_rows.
Signed-off-by: Nghia Truong <[email protected]>
Looks solid, just a few more nitpick-y suggestions.
Awesome PR!
[[nodiscard]] bool has_next() const;

/**
 * @copydoc cudf::io::chunked_parquet_reader::read_chunk
 */
[[nodiscard]] table_with_metadata read_chunk() const;
Not sure if read_chunk should be const, but we can leave it for now.
Thanks for addressing most of the nitpicks!
We can address some remaining (riskier) cleanup ideas in separate PRs.
@gpucibot merge
This adds JNI bindings for the chunked Parquet reader. It depends on the chunked Parquet reader implementation PR (#11867).

Authors:
- https://github.com/nvdbaranec
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- MithunR (https://github.com/mythrocks)
- Robert (Bobby) Evans (https://github.com/revans2)
This adds the chunked Parquet reader, which reads a file iteratively rather than all at once. Each chunk can be limited to a size small enough that it does not exceed cudf's internal limits (2GB / 2 billion rows per column):
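A minimal usage sketch is below. It assumes the reader is constructed from a per-chunk byte limit plus the usual parquet_reader_options, and it relies on the has_next()/read_chunk() pair shown in the diff above; the exact constructor signature, the header path, and the 1 GB limit are illustrative rather than authoritative.

```cpp
#include <cudf/io/parquet.hpp>

#include <vector>

int main()
{
  // Build the usual Parquet reader options for the input file.
  auto const options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{"input.parquet"}).build();

  // Construct a chunked reader with a per-chunk byte limit (1 GB here, chosen for illustration).
  // Assumed constructor: (chunk_read_limit, options).
  auto reader = cudf::io::chunked_parquet_reader(std::size_t{1024} * 1024 * 1024, options);

  // Read the file chunk by chunk instead of all at once; each chunk stays
  // within the requested limit, so no single read exceeds cudf's internal bounds.
  std::vector<cudf::io::table_with_metadata> chunks;
  while (reader.has_next()) {
    chunks.emplace_back(reader.read_chunk());
  }

  return 0;
}
```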