
Test parquet predicate pushdown for basic types and fields having dots in names [databricks] #9128

Merged
merged 12 commits into from
Sep 1, 2023

Conversation

thirtiseven
Collaborator

@thirtiseven thirtiseven commented Aug 29, 2023

Closes #9127
Closes #9094

This PR adds some tests to test parquet predicate pushdown for basic types and fields with dots in names. And also a follow on to orc's ppd test, using `assume' instead of commenting out failed cases.

For PPD testing, these cases are tested:

  • CPU write GPU read, GPU write CPU read, and GPU write GPU read.
  • Range partitioning on write and predicate is out of DataFrame range, if applicable.
  • Data source V1 and V2.

Some context: #9119
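For illustration, the write/read matrix described above can be enumerated as a small sketch. This is not the PR's actual test code; the helper name is hypothetical:

```python
from itertools import product

# Hypothetical sketch of the write/read test matrix described above:
# every (writeGpu, readGpu) combination except pure CPU write + CPU read,
# which Spark's own ParquetFilterSuite already covers.
def ppd_test_matrix():
    return [(w, r) for w, r in product([False, True], repeat=2)
            if w or r]

# Each entry drives one round-trip: write parquet with or without the
# plugin, then read it back with a pushed-down predicate and compare.
for write_gpu, read_gpu in ppd_test_matrix():
    print(f"writeGpu={write_gpu} readGpu={read_gpu}")
```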

@sameerz sameerz added test Only impacts tests reliability Features to improve reliability or bugs that severely impact the reliability of the plugin labels Aug 29, 2023
@thirtiseven
Collaborator Author

build

@thirtiseven thirtiseven self-assigned this Aug 30, 2023
@thirtiseven thirtiseven marked this pull request as ready for review August 30, 2023 08:26
withTempPath { path =>
withSQLConf(
SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> "TIMESTAMP_MICROS",
"spark.rapids.sql.test.enabled" -> "false",
Collaborator

val TEST_CONF = conf("spark.rapids.sql.test.enabled")
   .doc("Intended to be used by unit tests, if enabled all operations must run on the " +
     "GPU or an error happens.")
   .internal()
   .booleanConf
   .createWithDefault(false)

If the GPU is enabled, we should set this to true. Otherwise we may encounter this kind of case: the test passes, but it is silently running on the CPU while we expect it to run on the GPU.

The change would look like:

if (writeGpu || readGpu) {
  spark.rapids.sql.test.enabled = true
}

Collaborator

Actually I think it should be

"spark.rapids.sql.test.enabled" -> (!writeGpu).toString

or else, if we are reading on the GPU but writing on the CPU, we will run into a problem where we get an error for having things not be on the GPU.

Collaborator Author

Updated to

"spark.rapids.sql.test.enabled" -> writeGpu.toString

and

"spark.rapids.sql.test.enabled" -> readGpu.toString

to check that all operations run on the GPU during the GPU read and write phases separately.

Also removed the binary test because we don't support BinaryType right now.

@res-life
Collaborator

I also notice that ParquetFilterSuite has testing for the following operators:

      checkFilterPredicate(intAttr === 1, classOf[Eq[_]], resultFun(1))
      checkFilterPredicate(intAttr <=> 1, classOf[Eq[_]], resultFun(1))
      checkFilterPredicate(intAttr =!= 1, classOf[NotEq[_]],
        (2 to 4).map(i => Row.apply(resultFun(i))))

      checkFilterPredicate(intAttr < 2, classOf[Lt[_]], resultFun(1))
      checkFilterPredicate(intAttr > 3, classOf[Gt[_]], resultFun(4))
      checkFilterPredicate(intAttr <= 1, classOf[LtEq[_]], resultFun(1))
      checkFilterPredicate(intAttr >= 4, classOf[GtEq[_]], resultFun(4))

      checkFilterPredicate(Literal(1) === intAttr, classOf[Eq[_]], resultFun(1))
      checkFilterPredicate(Literal(1) <=> intAttr, classOf[Eq[_]], resultFun(1))
      checkFilterPredicate(Literal(2) > intAttr, classOf[Lt[_]], resultFun(1))
      checkFilterPredicate(Literal(3) < intAttr, classOf[Gt[_]], resultFun(4))
      checkFilterPredicate(Literal(1) >= intAttr, classOf[LtEq[_]], resultFun(1))
      checkFilterPredicate(Literal(4) <= intAttr, classOf[GtEq[_]], resultFun(4))

And I notice ParquetFilterSuite also tests V1 and V2 datasources.

Not sure if it's necessary to add the above tests.
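The flipped-literal cases in the snippet above (e.g. `Literal(2) > intAttr` pushing down as `Lt`) follow a simple mirroring rule when the literal moves to the left-hand side. A sketch of that mapping, with hypothetical helper names:

```python
# Parquet filter class pushed down for each attribute-vs-literal operator,
# as exercised by ParquetFilterSuite in the snippet above.
ATTR_OP_TO_FILTER = {
    "===": "Eq", "<=>": "Eq", "=!=": "NotEq",
    "<": "Lt", ">": "Gt", "<=": "LtEq", ">=": "GtEq",
}

# When the literal is on the left (Literal(2) > intAttr), the comparison
# is mirrored before pushdown: 2 > x is the same as x < 2, hence Lt.
MIRROR = {"<": ">", ">": "<", "<=": ">=", ">=": "<=",
          "===": "===", "<=>": "<=>", "=!=": "=!="}

def pushed_filter(op: str, literal_on_left: bool = False) -> str:
    if literal_on_left:
        op = MIRROR[op]
    return ATTR_OP_TO_FILTER[op]
```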

@res-life
Collaborator

Can we mark #9119 as invalid?

@thirtiseven
Collaborator Author

thirtiseven commented Aug 30, 2023

Thanks for the review @res-life

Not sure if it's necessary to add the above tests.

Yes, there are many PPD test cases in Spark; I will go through them and add more cases.

Can we mark #9119 as invalid?

Closed it.

}
}

def withAllParquetReaders(code: => Unit): Unit = {
Collaborator

nit: This only matters when reading on the CPU. The GPU ignores this.

withTempPath { path =>
withSQLConf(
SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> "TIMESTAMP_MICROS",
"spark.rapids.sql.test.enabled" -> "false",
SQLConf.PARQUET_FILTER_PUSHDOWN_DATE_ENABLED.key -> "true",
SQLConf.PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED.key -> "true",
SQLConf.PARQUET_FILTER_PUSHDOWN_DECIMAL_ENABLED.key -> "true",
"spark.rapids.sql.test.enabled" -> "false",
Collaborator

Same here we should update this one similarly to the comment above.

"spark.rapids.sql.test.enabled" -> (!readGpu).toString,

@res-life res-life changed the title Test parquet predicate pushdown for basic types and fields having dots in names Test parquet predicate pushdown for basic types and fields having dots in names [databricks] Aug 31, 2023
@res-life
Collaborator

Added [databricks] in the title to also test databricks.

revans2
revans2 previously approved these changes Aug 31, 2023
@revans2
Collaborator

revans2 commented Aug 31, 2023

build

@thirtiseven
Collaborator Author

thirtiseven commented Aug 31, 2023

Updated the timestamp test: the V2 source also needs spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS when reading to enable predicate pushdown. Also added a date test.

@thirtiseven
Collaborator Author

build

@thirtiseven thirtiseven merged commit df016cb into NVIDIA:branch-23.10 Sep 1, 2023
27 checks passed
@thirtiseven thirtiseven deleted the ppd_parquet_dotname branch September 1, 2023 05:17
Labels
reliability Features to improve reliability or bugs that severely impact the reliability of the plugin test Only impacts tests
4 participants