Fix reading partition value columns larger than cudf column size limit #9230

Merged: 48 commits, Oct 17, 2023

@parthosa parthosa commented Sep 13, 2023

Fixes #9110.

This PR fixes a failure that occurs when a reader creates a partition value column that exceeds the cuDF column size limit. The fix checks the column size and splits the output into multiple batches when the limit would be exceeded.

Tasks:

  • Convert existing methods in MultiFileReaderUtils to an Iterator-based API.
  • Update next() and hasNext() methods in GpuColumnarBatchWithPartitionValuesIterator.
  • Implement logic to split partRows and partValues into partitions smaller than the cuDF limit.
  • Modify the code to split input of ColumnarBatch using the above partitions.
  • Implement similar partitioning logic for the case with single partition values.
  • Unify the structure for processing both the Single Partition Value and Multiple Partition Value cases.
  • Remove redundant separate processing for the Single Partition Value case.
  • Create a case class and iterator with pending queue to defer the merging of partition values until they are required.
  • Ensure correct use of reference counts, prevent memory leaks, and improve the splitting logic.
  • Make the cuDF column size limit configurable and add unit tests.
  • Add the retry framework.
  • Add unit tests in PySpark.
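
The splitting and deferred-merge tasks above can be illustrated with a rough sketch. It is not the PR's actual Scala code: the function name `split_partitions`, the byte-string partition values, and the size estimate (row count times value length) are all assumptions made for illustration. Python is used only to keep the example short.

```python
from typing import Iterator, List, Tuple

def split_partitions(
    part_rows: List[int],
    part_values: List[bytes],
    max_bytes: int,
) -> Iterator[List[Tuple[int, bytes]]]:
    """Lazily yield groups of (row_count, value) pairs.

    Hypothetical helper: each group's estimated partition column size,
    sum(row_count * len(value)), stays at or below max_bytes.
    """
    group: List[Tuple[int, bytes]] = []
    group_bytes = 0
    for rows, value in zip(part_rows, part_values):
        # Treat empty values as one byte to avoid dividing by zero.
        value_size = max(len(value), 1)
        remaining = rows
        while remaining > 0:
            # How many more copies of this value fit in the current group?
            capacity = (max_bytes - group_bytes) // value_size
            if capacity == 0:
                if not group:
                    raise ValueError("a single value exceeds max_bytes")
                # Current group is full: emit it and start a new one.
                yield group
                group, group_bytes = [], 0
                continue
            take = min(remaining, capacity)
            group.append((take, value))
            group_bytes += take * value_size
            remaining -= take
    if group:
        yield group

# Usage: 5 rows of a 4-byte value and 3 rows of a 2-byte value, 12-byte limit.
for g in split_partitions([5, 3], [b"aaaa", b"bb"], 12):
    print(g)
```

Because the function is a generator, each group is produced only when the consumer asks for it, which mirrors the idea of deferring the merge of partition values until a batch is actually needed.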

Related Issue:

Performance Metrics:

Details:

  • Workload Information -
    • Application: Basic Spark App that writes columns with variable-length strings as Parquet and reads it.
    • Spark Version: 3.1.1
    • Spark RAPIDS Version: 23.08
    • GPU: RTX A5000
  • cuDF Column Limit: 2GB
  • Size of ColumnVector = Num Rows * Size of Column Value

| # Partition Cols | # Rows (million) | Largest Column Value (bytes) | Largest Column Vector (GB) | CPU Time (sec) | GPU Time (sec) (23.08) | GPU Time (sec) (current branch) | Speed Up (CPU/GPU current branch) |
|---|---|---|---|---|---|---|---|
| 1 | 100 | 150 | 14.0 | 132.0 | cuDF error | 7.5 | 17.6 |
| 2 | 100 | 150 | 14.0 | 125.0 | cuDF error | 8.0 | 15.6 |
| 3 | 100 | 150 | 14.0 | 122.0 | cuDF error | 8.0 | 15.3 |
| 2 | 100 | 75 | 7.0 | 72.0 | cuDF error | 4.1 | 17.6 |
| 2 | 100 | 15 | 1.4 | 37.0 | 2.0 | 2.0 | 18.5 |
| 1 | 10 | 150 | 1.4 | 8.2 | 1.7 | 1.7 | 4.8 |
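
As a worked check of the size formula in the details above, the sketch below recomputes the "Largest Column Vector" figures and a lower bound on the number of batches needed. It assumes the 2GB limit means 2 * 1024^3 bytes and that the GB figures in the table are binary gigabytes; the function names are hypothetical, and the actual splitting in the PR may produce more batches than this minimum.

```python
# Assumption: "2GB" is interpreted as 2 * 1024**3 bytes.
CUDF_COLUMN_LIMIT_BYTES = 2 * 1024**3

def column_vector_bytes(num_rows: int, value_bytes: int) -> int:
    """Size of ColumnVector = Num Rows * Size of Column Value."""
    return num_rows * value_bytes

def min_batches(num_rows: int, value_bytes: int,
                limit: int = CUDF_COLUMN_LIMIT_BYTES) -> int:
    """Lower bound on batches needed so no column exceeds the limit."""
    size = column_vector_bytes(num_rows, value_bytes)
    return -(-size // limit)  # ceiling division

for rows, value_bytes in [(100_000_000, 150), (100_000_000, 75),
                          (100_000_000, 15), (10_000_000, 150)]:
    size_gb = column_vector_bytes(rows, value_bytes) / 1024**3
    print(f"{rows} rows x {value_bytes} B = {size_gb:.1f} GB, "
          f"at least {min_batches(rows, value_bytes)} batch(es)")
```

Under these assumptions, the rows that hit the cuDF error on 23.08 (14.0 GB and 7.0 GB) are exactly those that need more than one batch, while the 1.4 GB rows fit in a single column.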

@parthosa parthosa self-assigned this Sep 13, 2023
@sameerz sameerz added the bug Something isn't working label Sep 18, 2023

@tgravescs tgravescs left a comment


Got most of the way through it. Most of the comments are just about better documentation, some of it in places where I left it lacking.

Signed-off-by: Partho Sarthi <[email protected]>
@parthosa commented

> This looks good overall, but it's still draft. Does it need to be?
>
> It would be good to get some other eyes on this. @tgravescs @razajafri would you mind doing another round of review to see if I missed things?

Thank you @jlowe for the reviews. The implementation looks to be in good shape. Removing the PR from draft status.

@parthosa parthosa marked this pull request as ready for review October 10, 2023 15:32

@razajafri razajafri left a comment


I haven't gone over the whole thing; I will do another pass later.


@tgravescs tgravescs left a comment


I still need to look at a couple more functions; here are my comments so far.


@tgravescs tgravescs left a comment


Looks good, other than adding a couple more tests.

tgravescs previously approved these changes Oct 16, 2023
@tgravescs commented

build

jlowe commented Oct 16, 2023

build

@parthosa parthosa merged commit de032b3 into NVIDIA:branch-23.12 Oct 17, 2023
30 checks passed
@parthosa parthosa deleted the spark-rapids-9110 branch October 17, 2023 22:45
@parthosa parthosa restored the spark-rapids-9110 branch October 17, 2023 22:47
@parthosa parthosa deleted the spark-rapids-9110 branch October 17, 2023 22:47
Development

Successfully merging this pull request may close these issues.

[BUG] GPU Reader fails due to partition column creating column larger then cudf column size limit
5 participants