
Add support for chunked parquet reading [databricks] #6934

Merged: 8 commits merged into NVIDIA:branch-22.12 on Nov 21, 2022

Conversation

@revans2 (Collaborator) commented Oct 27, 2022

This depends on rapidsai/cudf#11961.
This fixes the Parquet part of #4968; I need to file a separate follow-on issue for the ORC part.

In my tests with non-nested values there was no performance difference, and in general the chunked reader was able to avoid memory problems.

I still need to do some performance testing for nested types.

Signed-off-by: Robert (Bobby) Evans <[email protected]>
@revans2 revans2 self-assigned this Oct 27, 2022
@sameerz added the reliability label (Features to improve reliability or bugs that severely impact the reliability of the plugin) on Oct 29, 2022
@ttnghia ttnghia self-requested a review November 3, 2022 13:30
@revans2 (Collaborator, Author) commented Nov 3, 2022

@jlowe I think I have finished all of the rework you requested. Please take another look.

*
* @tparam T what it is that we are wrapping
*/
abstract class GpuDataProducer[T] extends AutoCloseable {
Member:

Suggested change:
- abstract class GpuDataProducer[T] extends AutoCloseable {
+ trait GpuDataProducer[T] extends AutoCloseable {

Collaborator:

Should we also make it sealed?
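
For readers without the full diff, here is a minimal sketch of the producer API under discussion, assuming the suggested trait change above and the foreach body quoted later in this thread; the doc comments and method bodies are illustrative, not the actual source:

/**
 * Produces a sequence of GPU-backed values that the caller must close.
 *
 * @tparam T what it is that we are wrapping
 */
trait GpuDataProducer[T] extends AutoCloseable {
  /** Is there another value to produce? */
  def hasNext: Boolean

  /** Produce the next value; the caller takes ownership and must close it. */
  def next: T

  /** Apply func to every remaining value, iterator-style. */
  def foreach(func: T => Unit): Unit =
    while (hasNext) {
      func(next)
    }
}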

override def next(): ColumnarBatch = throw new NoSuchElementException()
}

class SingleGpuColumnarBatchIterator(var batch: ColumnarBatch)
Member:

Suggested change:
- class SingleGpuColumnarBatchIterator(var batch: ColumnarBatch)
+ class SingleGpuColumnarBatchIterator(private var batch: ColumnarBatch)
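
A plausible shape for this single-batch iterator, given the private var in the suggestion; a sketch assuming it behaves like a standard Spark Iterator[ColumnarBatch] that owns the batch until it is handed out, not the actual implementation:

import org.apache.spark.sql.vectorized.ColumnarBatch

class SingleGpuColumnarBatchIterator(private var batch: ColumnarBatch)
    extends Iterator[ColumnarBatch] with AutoCloseable {
  // The one batch is pending until next() hands ownership to the caller.
  override def hasNext: Boolean = batch != null

  override def next(): ColumnarBatch = {
    if (!hasNext) {
      throw new NoSuchElementException()
    }
    val ret = batch
    batch = null // ownership transferred; close() must not touch it now
    ret
  }

  // Release the batch's GPU memory if nobody ever consumed it.
  override def close(): Unit = {
    if (batch != null) {
      batch.close()
      batch = null
    }
  }
}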

@@ -70,19 +70,29 @@ def read_parquet_sql(data_path):
original_parquet_file_reader_conf = {'spark.rapids.sql.format.parquet.reader.type': 'PERFILE'}
multithreaded_parquet_file_reader_conf = {'spark.rapids.sql.format.parquet.reader.type': 'MULTITHREADED'}
coalesce_parquet_file_reader_conf = {'spark.rapids.sql.format.parquet.reader.type': 'COALESCING'}
coalesce_parquet_file_reader_multithread_filter_chunked_conf = {'spark.rapids.sql.format.parquet.reader.type': 'COALESCING',
Collaborator:

I don't see any tests with different size limits here. Limit values ranging from small to large would exercise this better than only the default 2GB maximum.

Collaborator (Author):

The default limit in these tests is 512 MiB. None of the files we read would produce more than a single batch anyway, and I don't think we have any files larger than a page per row, so even if we tried to turn it on the output would be the same. I have done manual tests, because the file sizes needed do not really lend themselves to this type of test.

@@ -67,6 +67,8 @@ public static class ReadBuilder {
private Configuration conf = null;
private int maxBatchSizeRows = Integer.MAX_VALUE;
private long maxBatchSizeBytes = Integer.MAX_VALUE;
private long targetBatchSizeBytes = Integer.MAX_VALUE;
Collaborator:

I find it hard to differentiate between maxBatchSizeBytes and targetBatchSizeBytes.

Collaborator (Author):

Do you want a comment in here? The long-term plan is to eventually get rid of maxBatchSizeBytes and maxBatchSizeRows, but we can only do that once chunked reads are the only option for Parquet and ORC, and, depending on how things go, for Avro, CSV, and JSON too.

Collaborator:

Yes, please add a comment here to clarify how they are different.
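
Something along these lines, placed above the two fields in ReadBuilder, would capture the distinction; the wording is inferred from this discussion, not taken from the final patch:

// maxBatchSizeBytes (along with maxBatchSizeRows) is the hard cap used by
// the existing, non-chunked read path; the long-term plan is to remove both
// once chunked reading is the only option for Parquet and ORC.
// targetBatchSizeBytes is the size the chunked reader aims for when deciding
// where to split its output into batches.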

Comment on lines +55 to +57
while (hasNext) {
func(next)
}
Collaborator (@ttnghia) commented Nov 4, 2022:

In cudf we had an issue when next was not called before hasNext for an empty input file. In such cases the valid output should be a table having 0 rows but with all of the (0-row) columns in the file schema. We can only get such an output table if we call next before hasNext:

do {
  func(next)
} while (hasNext)

I'm not sure whether such an output table is also desired in Spark.
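
To make the concern concrete, here is a sketch against the hypothetical GpuDataProducer trait above: with an empty file, the while-form never invokes func, so the caller never sees the 0-row table carrying the schema, while the do-form invokes it exactly once.

// Java-iterator style: hasNext is false immediately for an empty file,
// so func never receives the 0-row table with the schema.
def drainJavaStyle[T](p: GpuDataProducer[T])(func: T => Unit): Unit =
  while (p.hasNext) {
    func(p.next)
  }

// The call order cudf needed at the time: next runs at least once, so the
// 0-row table comes through even when nothing remains to read after it.
def drainCudfStyle[T](p: GpuDataProducer[T])(func: T => Unit): Unit = {
  do {
    func(p.next)
  } while (p.hasNext)
}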

Collaborator (Author):

Wow, that is totally not the order I would expect the APIs to need to be called in. It is the opposite of all Java iterators, but OK. I will need to change this in a very different place, because GpuDataProducer is abstract and says that it should operate like an iterator.

Collaborator (Author):

Please make sure the JNI API is clearly documented, because, like I said, this is totally the opposite of Java semantics.

Member:

Is documenting this weird behavior the right answer? This seems like behavior that should be fixed in cudf. The entire point of hasNext is to be a predicate function that lets the calling code know it is valid to call next, as the method name implies. Needing to special-case the first call is bizarre and error-prone, IMO.

Collaborator (@ttnghia) commented Nov 4, 2022:

This is not just a specific issue with cudf, but a general logic issue. In the case of an empty file, what should hasNext return? It should always return false, since there is no data in the file (except metadata) to read. But what you expect is a table with empty columns, not a table with nothing, right? So if you check hasNext and see false, then next will never be called, and you never get the table with empty columns.

On the other hand, if hasNext returns true for an empty file, then it will always return true, and you can't know when to stop.

Collaborator (@ttnghia) commented Nov 4, 2022:

Talked with Bobby and decided to work around this in cudf: hasNext (cudf Java JNI) will return true at least once (it always returns true the first time).
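
The agreed behavior can be pictured as a thin wrapper; this is a sketch of the semantics only, with a made-up class name, not the actual cudf JNI code:

// Guarantees hasNext reports true at least once, so an ordinary
// while (hasNext) { next } loop still receives the 0-row, full-schema
// table that cudf produces for an empty file.
class HasNextAtLeastOnce[T](underlying: GpuDataProducer[T])
    extends GpuDataProducer[T] {
  private var firstCall = true

  override def hasNext: Boolean =
    if (firstCall) {
      firstCall = false
      true // report true the first time, even for an empty file
    } else {
      underlying.hasNext
    }

  override def next: T = underlying.next

  override def close(): Unit = underlying.close()
}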

Member (@jlowe) left a review:

Looks good to me, just a minor question on the single iterators.

Member (@jlowe) left a review:

lgtm, assuming iterator issues get worked out in the cudf Java bindings.

@revans2 (Collaborator, Author) commented Nov 18, 2022:

> lgtm, assuming iterator issues get worked out in the cudf Java bindings.

Yes, the latest patch has the iterator issues fixed on the Java side. Hopefully it gets merged soon.

@revans2 revans2 marked this pull request as ready for review November 21, 2022 15:53
@revans2 (Collaborator, Author) commented Nov 21, 2022:

build

1 similar comment:
@revans2 (Collaborator, Author) commented Nov 21, 2022:

build

@revans2 revans2 merged commit 74d5f1e into NVIDIA:branch-22.12 Nov 21, 2022
@revans2 revans2 deleted the read_chunked_parquet branch November 21, 2022 20:46
Labels
reliability: Features to improve reliability or bugs that severely impact the reliability of the plugin
Development

Successfully merging this pull request may close these issues:

[FEA] Split batches from parquet that are too large, and try to guess better before decompressing
4 participants