
Add test for selecting a single complex field array and its parent struct array [databricks] #9018

Merged

razajafri merged 6 commits into NVIDIA:branch-23.10 on Aug 18, 2023

Conversation

razajafri
Collaborator

Pushing this comprehensive patch after the revert of #8744.

This PR mimics the test in Spark's SchemaPruningSuite.

We create a table with complex fields and read only a subfield to check that column pruning works as intended, i.e. that we are not reading unnecessary columns. For example, if we have a complex type Contact with an array of friends, like so:

Contact {
   first_name: String
   middle_name: String
   last_name: String
   friends: Array[Contact]
}

Selecting spark.table("contacts").select(explode("friends").alias("friend")).select("friend.first_name") shouldn't also read the middle_name and last_name of the friends field.
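As a rough sketch of what pruning should produce for the Contact example above (a toy model in plain Python, not Spark's actual implementation; schemas are represented here as nested dicts):

```python
def prune_schema(schema, selected_paths):
    """Keep only the fields of a nested dict schema that appear in
    selected_paths (dot-separated, e.g. 'friends.first_name')."""
    pruned = {}
    for path in selected_paths:
        head, _, rest = path.partition(".")
        if head not in schema:
            continue
        child = schema[head]
        if rest and isinstance(child, dict):
            # recurse into the struct and merge the pruned subfields
            sub = prune_schema(child, [rest])
            pruned.setdefault(head, {}).update(sub)
        else:
            pruned[head] = child
    return pruned

contact = {
    "first_name": "string",
    "middle_name": "string",
    "last_name": "string",
    # friends is an array of Contact; model the element type as a dict
    "friends": {
        "first_name": "string",
        "middle_name": "string",
        "last_name": "string",
    },
}

# Selecting only friend.first_name after exploding friends should read
# just that one subfield, not middle_name or last_name.
print(prune_schema(contact, ["friends.first_name"]))
# → {'friends': {'first_name': 'string'}}
```

The test in this PR checks the analogous property on the real read path: the physical schema handed to the Parquet/ORC reader should contain only the selected subfields.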

fixes #8712
fixes #8713
fixes #8714
fixes #8715

@razajafri
Collaborator Author

Depends on #9013

package org.apache.spark.sql.rapids

import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper

Collaborator Author

I will add a comment here stating why this file is needed.

@razajafri
Collaborator Author

razajafri commented Aug 11, 2023

@gerashegalov @jlowe I have tested the jar on DB, and the parallel-worlds directory has two separate files for AdaptiveSparkPlanHelperImpl. Do I need to add this filename to any other txt files that we use in the de-dupe process?

@razajafri razajafri marked this pull request as draft August 11, 2023 19:46
@jlowe
Member

jlowe commented Aug 11, 2023

I have tested the jar on DB and the parallel-worlds has two separate files for AdaptiveSparkPlanHelperImpl

That is expected, because AdaptiveSparkPlanHelperImpl derives from AdaptiveSparkPlanHelper which is different across the Spark platforms.

@@ -414,4 +415,9 @@ object ShimLoader extends Logging {
def loadGpuColumnVector(): Class[_] = {
ShimReflectionUtils.loadClass("com.nvidia.spark.rapids.GpuColumnVector")
}

def newAdaptiveSparkPlanHelperShim(): AdaptiveSparkPlanHelperImpl =
Member

ShimLoader should be returning an unshimmed class here, otherwise an unshimmed class may attempt to load a shimmed class even before we get to the ShimLoader part, and if the classpath isn't parallel-worlds aware, that will fail.

@sameerz sameerz added the test Only impacts tests label Aug 14, 2023
@pytest.mark.parametrize('format', ["parquet", "orc"])
def test_select_complex_field(format, spark_tmp_path, query_and_expected_schemata, is_partitioned, spark_tmp_table_factory):
table_name = spark_tmp_table_factory.get()
query, expected_schemata = query_and_expected_schemata
Collaborator

nit: we can do this already in parametrize

@pytest.mark.parametrize('query,expected_schemata',
def test_select_complex_field(..., query, expected_schemata, ...)

Collaborator Author

For some reason, pytest doesn't like it when I mark one of the parameters, namely pytest.param(("select name.middle, address from {} where p=2", "struct<name:struct<middle:string>,address:string>"), marks=pytest.mark.skip(reason='https://github.com/NVIDIA/spark-rapids/issues/8788')),

Collaborator

It looks like you are passing a pair to pytest.param; it should be varargs: https://docs.pytest.org/en/7.1.x/reference/reference.html#pytest-param
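Concretely, the reviewer's point can be sketched like this (the second case and its skip URL are quoted from the comment above; the first case and the test body are hypothetical stand-ins for illustration):

```python
import pytest

# pytest.param takes the parameter values as *varargs*: each positional
# argument is bound to one name in the parametrize string, so a
# (query, expected_schema) pair must be passed as two arguments,
# not wrapped in a single tuple.
params = [
    # hypothetical extra case, for illustration only
    pytest.param("select friends.first from {}",
                 "struct<friends:array<struct<first:string>>>"),
    # the case quoted in the comment above, with the skip mark attached
    pytest.param("select name.middle, address from {} where p=2",
                 "struct<name:struct<middle:string>,address:string>",
                 marks=pytest.mark.skip(
                     reason='https://github.com/NVIDIA/spark-rapids/issues/8788')),
]

@pytest.mark.parametrize('query,expected_schemata', params)
def test_select_complex_field(query, expected_schemata):
    # stand-in body: the real test runs the query and compares the
    # schema actually read against expected_schemata
    assert isinstance(query, str) and isinstance(expected_schemata, str)
```

With the tuple wrapped as a single argument, pytest would try to bind it to query alone and leave expected_schemata unfilled, which is why the marked parameter fails to collect.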

@razajafri
Collaborator Author

build

@@ -209,4 +248,8 @@ class ExecutionPlanCaptureCallback extends QueryExecutionListener {

override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
captureIfNeeded(qe)
}

trait AdaptiveSparkPlanHelperShim {
Member

AdaptiveSparkPlanHelperShim.class needs to be added to dist/unshimmed-common-from-spark311.txt.

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

@gerashegalov @jlowe please take another look

@jlowe
Member

jlowe commented Aug 18, 2023

@razajafri any reason this is still a draft PR?

@razajafri razajafri marked this pull request as ready for review August 18, 2023 16:50
@razajafri
Collaborator Author

@razajafri any reason this is still a draft PR?

Thanks for the review.

I had put it in draft because I wanted to ensure the pre-merge passed after @gerashegalov's changes to the multi-shim jar.

@razajafri razajafri merged commit 66b1174 into NVIDIA:branch-23.10 Aug 18, 2023
26 of 27 checks passed
@razajafri razajafri deleted the SP-8712-schema-pruning-test branch August 18, 2023 16:52
@jlowe
Member

jlowe commented Aug 18, 2023

I had put it in draft because I wanted to ensure the pre-merge passed

Note that if the only reason for the draft is to make sure premerge passed, then IMO there's no reason for the PR to be draft. A failed premerge will prevent it from being merged even if it's not in draft. The main reason for draft is to prevent a PR from being merged even if it passes premerge and otherwise would be eligible for merging.
