
Test parquet predicate pushdown for basic types and fields having dots in names [databricks] #9128

Merged
merged 12 commits into from
Sep 1, 2023

Conversation

thirtiseven
Collaborator

@thirtiseven thirtiseven commented Aug 29, 2023

Closes #9127
Closes #9094

This PR adds some tests to test parquet predicate pushdown for basic types and fields with dots in names. And also a follow on to orc's ppd test, using `assume' instead of commenting out failed cases.

For PPD testing, these cases are tested:

  • CPU write GPU read, GPU write CPU read, and GPU write GPU read.
  • Range partitioning on write and predicate is out of DataFrame range, if applicable.
  • Data source V1 and V2.

Some context: #9119
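For illustration, the write/read matrix described above can be enumerated as a small sketch. This is not the PR's actual test code; the helper name is hypothetical:

```python
from itertools import product

# Hypothetical sketch of the write/read test matrix described above:
# every (writeGpu, readGpu) combination except pure CPU write + CPU read,
# which Spark's own ParquetFilterSuite already covers.
def ppd_test_matrix():
    return [(w, r) for w, r in product([False, True], repeat=2)
            if w or r]

# Each entry drives one round-trip: write parquet with or without the
# plugin, then read it back with a pushed-down predicate and compare.
for write_gpu, read_gpu in ppd_test_matrix():
    print(f"writeGpu={write_gpu} readGpu={read_gpu}")
```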

@sameerz sameerz added test Only impacts tests reliability Features to improve reliability or bugs that severely impact the reliability of the plugin labels Aug 29, 2023
@thirtiseven
Collaborator Author

build

@thirtiseven thirtiseven self-assigned this Aug 30, 2023
@thirtiseven thirtiseven marked this pull request as ready for review August 30, 2023 08:26
withTempPath { path =>
withSQLConf(
SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> "TIMESTAMP_MICROS",
"spark.rapids.sql.test.enabled" -> "false",
Collaborator

val TEST_CONF = conf("spark.rapids.sql.test.enabled")
   .doc("Intended to be used by unit tests, if enabled all operations must run on the " +
     "GPU or an error happens.")
   .internal()
   .booleanConf
   .createWithDefault(false)

If the GPU is enabled, we should set this to true. Otherwise we may encounter this kind of case: the test passes, but it is silently running on the CPU while we expect it to run on the GPU.

The change would look like:

if (writeGpu || readGpu) {
  spark.rapids.sql.test.enabled = true
}

Collaborator

Actually I think it should be

"spark.rapids.sql.test.enabled" -> (!writeGpu).toString

or else, if we are reading on the GPU but writing on the CPU, we will run into a problem where we get an error for having things not be on the GPU.

Collaborator Author

Updated to

"spark.rapids.sql.test.enabled" -> writeGpu.toString

and

"spark.rapids.sql.test.enabled" -> readGpu.toString

to check that all operations run on the GPU during the GPU read and write phases separately.

Also removed the binary test because we don't support BinaryType right now.

@res-life
Collaborator

I also notice that ParquetFilterSuite has testing for the following operators:

      checkFilterPredicate(intAttr === 1, classOf[Eq[_]], resultFun(1))
      checkFilterPredicate(intAttr <=> 1, classOf[Eq[_]], resultFun(1))
      checkFilterPredicate(intAttr =!= 1, classOf[NotEq[_]],
        (2 to 4).map(i => Row.apply(resultFun(i))))

      checkFilterPredicate(intAttr < 2, classOf[Lt[_]], resultFun(1))
      checkFilterPredicate(intAttr > 3, classOf[Gt[_]], resultFun(4))
      checkFilterPredicate(intAttr <= 1, classOf[LtEq[_]], resultFun(1))
      checkFilterPredicate(intAttr >= 4, classOf[GtEq[_]], resultFun(4))

      checkFilterPredicate(Literal(1) === intAttr, classOf[Eq[_]], resultFun(1))
      checkFilterPredicate(Literal(1) <=> intAttr, classOf[Eq[_]], resultFun(1))
      checkFilterPredicate(Literal(2) > intAttr, classOf[Lt[_]], resultFun(1))
      checkFilterPredicate(Literal(3) < intAttr, classOf[Gt[_]], resultFun(4))
      checkFilterPredicate(Literal(1) >= intAttr, classOf[LtEq[_]], resultFun(1))
      checkFilterPredicate(Literal(4) <= intAttr, classOf[GtEq[_]], resultFun(4))

And I notice ParquetFilterSuite also tests V1 and V2 datasources.

Not sure if it's necessary to add the above tests.
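The flipped-literal cases in the snippet above (e.g. `Literal(2) > intAttr` pushing down as `Lt`) follow a simple mirroring rule when the literal moves to the left-hand side. A sketch of that mapping, with hypothetical helper names:

```python
# Parquet filter class pushed down for each attribute-vs-literal operator,
# as exercised by ParquetFilterSuite in the snippet above.
ATTR_OP_TO_FILTER = {
    "===": "Eq", "<=>": "Eq", "=!=": "NotEq",
    "<": "Lt", ">": "Gt", "<=": "LtEq", ">=": "GtEq",
}

# When the literal is on the left (Literal(2) > intAttr), the comparison
# is mirrored before pushdown: 2 > x is the same as x < 2, hence Lt.
MIRROR = {"<": ">", ">": "<", "<=": ">=", ">=": "<=",
          "===": "===", "<=>": "<=>", "=!=": "=!="}

def pushed_filter(op: str, literal_on_left: bool = False) -> str:
    if literal_on_left:
        op = MIRROR[op]
    return ATTR_OP_TO_FILTER[op]
```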

@res-life
Collaborator

Can we mark #9119 as invalid?

@thirtiseven
Collaborator Author

thirtiseven commented Aug 30, 2023

Thanks for the review @res-life

Not sure if it's necessary to add the above tests.

Yes, there are many PPD test cases in Spark; I will go through them and add more cases.

Can we mark #9119 as invalid?

Closed it.

}
}

def withAllParquetReaders(code: => Unit): Unit = {
Collaborator

nit: This only matters when reading on the CPU. The GPU ignores this.

withTempPath { path =>
withSQLConf(
SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> "TIMESTAMP_MICROS",
"spark.rapids.sql.test.enabled" -> "false",
SQLConf.PARQUET_FILTER_PUSHDOWN_DATE_ENABLED.key -> "true",
SQLConf.PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED.key -> "true",
SQLConf.PARQUET_FILTER_PUSHDOWN_DECIMAL_ENABLED.key -> "true",
"spark.rapids.sql.test.enabled" -> "false",
Collaborator

Same here we should update this one similarly to the comment above.

"spark.rapids.sql.test.enabled" -> (!readGpu).toString,

@res-life res-life changed the title Test parquet predicate pushdown for basic types and fields having dots in names Test parquet predicate pushdown for basic types and fields having dots in names [databricks] Aug 31, 2023
@res-life
Collaborator

Added [databricks] in the title to also test databricks.

revans2
revans2 previously approved these changes Aug 31, 2023
@revans2
Collaborator

revans2 commented Aug 31, 2023

build

@thirtiseven
Collaborator Author

thirtiseven commented Aug 31, 2023

Updated the timestamp test: the V2 source also needs spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS when reading to enable predicate pushdown. Also added a date test.

@thirtiseven
Collaborator Author

build

@thirtiseven thirtiseven merged commit df016cb into NVIDIA:branch-23.10 Sep 1, 2023
27 checks passed
@thirtiseven thirtiseven deleted the ppd_parquet_dotname branch September 1, 2023 05:17
Labels
reliability Features to improve reliability or bugs that severely impact the reliability of the plugin test Only impacts tests
4 participants