-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix reading partition value columns larger than cudf column size limit #9230
Conversation
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnarBatchIterator.scala
Outdated
Show resolved
Hide resolved
…size Signed-off-by: Partho Sarthi <[email protected]>
tests/src/test/scala/com/nvidia/spark/rapids/GpuMultiFileReaderSuite.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnarBatchIterator.scala
Outdated
Show resolved
Hide resolved
…e unit tests Signed-off-by: Partho Sarthi <[email protected]>
Signed-off-by: Partho Sarthi <[email protected]>
Signed-off-by: Partho Sarthi <[email protected]>
Signed-off-by: Partho Sarthi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got most of the way through it, most of comments are just on better documentation, some of where I left it lacking
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnarBatchIterator.scala
Show resolved
Hide resolved
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Partho Sarthi <[email protected]>
Signed-off-by: Partho Sarthi <[email protected]>
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Partho Sarthi <[email protected]>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Partho Sarthi <[email protected]>
Signed-off-by: Partho Sarthi <[email protected]>
…ns vs Single Value Partitions Signed-off-by: Partho Sarthi <[email protected]>
- Create seperate case class and iterator to handle split cases Signed-off-by: Partho Sarthi <[email protected]>
Signed-off-by: Partho Sarthi <[email protected]>
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnarBatchIterator.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuParquetScan.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SplitColumnarBatchProcessor.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SplitColumnarBatchProcessor.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SplitColumnarBatchProcessor.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SplitColumnarBatchProcessor.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SplitColumnarBatchProcessor.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Partho Sarthi <[email protected]>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnarBatchIterator.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuMultiFileReader.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/SplitColumnarBatchProcessor.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
...ugin/src/main/scala/com/nvidia/spark/rapids/ColumnarPartitionReaderWithPartitionValues.scala
Outdated
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/GpuMultiFileReaderSuite.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Partho Sarthi <[email protected]>
Thank You @jlowe for the reviews. The implementation looks in a good shape. Removing the PR from draft status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't gone over the whole thing, will do another pass later
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Show resolved
Hide resolved
Signed-off-by: Partho Sarthi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to look at a couple more functions here are comments so far
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuColumnarBatchIterator.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Partho Sarthi <[email protected]>
Signed-off-by: Partho Sarthi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, other then adding a couple more tests
integration_tests/src/main/python/col_size_exceeding_cudf_limit_test.py
Outdated
Show resolved
Hide resolved
tests/src/test/scala/com/nvidia/spark/rapids/BatchWithPartitionDataSuite.scala
Show resolved
Hide resolved
Signed-off-by: Partho Sarthi <[email protected]>
tests/src/test/scala/com/nvidia/spark/rapids/BatchWithPartitionDataSuite.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Partho Sarthi <[email protected]>
build |
sql-plugin/src/main/scala/com/nvidia/spark/rapids/BatchWithPartitionData.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Partho Sarthi <[email protected]>
build |
Fixes #9110.
This PR fixes the issue when readers create partition value column that exceeds the cudf column size limit. This is fixed by checking size and creating multiple batches.
Tasks:
next()
andhasNext()
methods inGpuColumnarBatchWithPartitionValuesIterator
.partRows
andpartValues
into partitions less thancuDF limit
.reference counts
, memory leaks and improve logic for splitting.Related Issue:
Performance Metrics:
Details:
Size of ColumnVector = Num Rows * Size of Column Value