
Add test for selecting a single complex field array and its parent struct array [databricks] #9018

Merged

razajafri merged 6 commits into NVIDIA:branch-23.10 on Aug 18, 2023

Conversation

razajafri
Collaborator

Pushing this comprehensive patch after the revert of #8744.

This PR mimics the test in Spark's SchemaPruningSuite.

We create a table with complex fields and read only a subfield to check that column pruning works as intended, i.e. that we are not reading unnecessary columns. For example, if we have a complex type Contact with an array of friends, like so:

Contact {
   first_name: String
   middle_name: String
   last_name: String
   friends: Array[Contact]
}

Selecting spark.table("contacts").select(explode("friends").alias("friend")).select("friend.first_name") shouldn't also read the middle_name and last_name of the friends field.
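As a rough sketch of what pruning should produce for the Contact example above (a toy model in plain Python, not Spark's actual implementation; schemas are represented here as nested dicts):

```python
def prune_schema(schema, selected_paths):
    """Keep only the fields of a nested dict schema that appear in
    selected_paths (dot-separated, e.g. 'friends.first_name')."""
    pruned = {}
    for path in selected_paths:
        head, _, rest = path.partition(".")
        if head not in schema:
            continue
        child = schema[head]
        if rest and isinstance(child, dict):
            # recurse into the struct and merge the pruned subfields
            sub = prune_schema(child, [rest])
            pruned.setdefault(head, {}).update(sub)
        else:
            pruned[head] = child
    return pruned

contact = {
    "first_name": "string",
    "middle_name": "string",
    "last_name": "string",
    # friends is an array of Contact; model the element type as a dict
    "friends": {
        "first_name": "string",
        "middle_name": "string",
        "last_name": "string",
    },
}

# Selecting only friend.first_name after exploding friends should read
# just that one subfield, not middle_name or last_name.
print(prune_schema(contact, ["friends.first_name"]))
# → {'friends': {'first_name': 'string'}}
```

The test in this PR checks the analogous property on the real read path: the physical schema handed to the Parquet/ORC reader should contain only the selected subfields.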

fixes #8712
fixes #8713
fixes #8714
fixes #8715

@razajafri
Collaborator Author

Depends on #9013

package org.apache.spark.sql.rapids

import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper

Collaborator Author

I will add a comment here stating why this file is needed.

@razajafri
Collaborator Author

razajafri commented Aug 11, 2023

@gerashegalov @jlowe I have tested the jar on DB, and the parallel-worlds directory has two separate files for AdaptiveSparkPlanHelperImpl. Do I need to add this filename to any other txt files that we use in the de-dupe process?

@razajafri razajafri marked this pull request as draft August 11, 2023 19:46
@jlowe
Member

jlowe commented Aug 11, 2023

I have tested the jar on DB and the parallel-worlds has two separate files for AdaptiveSparkPlanHelperImpl

That is expected, because AdaptiveSparkPlanHelperImpl derives from AdaptiveSparkPlanHelper which is different across the Spark platforms.

@@ -414,4 +415,9 @@ object ShimLoader extends Logging {
def loadGpuColumnVector(): Class[_] = {
ShimReflectionUtils.loadClass("com.nvidia.spark.rapids.GpuColumnVector")
}

def newAdaptiveSparkPlanHelperShim(): AdaptiveSparkPlanHelperImpl =
Member

ShimLoader should be returning an unshimmed class here, otherwise an unshimmed class may attempt to load a shimmed class even before we get to the ShimLoader part, and if the classpath isn't parallel-worlds aware, that will fail.

@sameerz sameerz added the test Only impacts tests label Aug 14, 2023
@pytest.mark.parametrize('format', ["parquet", "orc"])
def test_select_complex_field(format, spark_tmp_path, query_and_expected_schemata, is_partitioned, spark_tmp_table_factory):
table_name = spark_tmp_table_factory.get()
query, expected_schemata = query_and_expected_schemata
Collaborator

nit: we can do this already in parametrize

@pytest.mark.parametrize('query,expected_schemata',
def test_select_complex_field(..., query, expected_schemata, ...)

Collaborator Author

For some reason, pytest doesn't like it when I mark one of the parameters, namely pytest.param(("select name.middle, address from {} where p=2", "struct<name:struct<middle:string>,address:string>"), marks=pytest.mark.skip(reason='https://github.com/NVIDIA/spark-rapids/issues/8788')),

Collaborator

It looks like you are passing a pair to pytest.param; it should be varargs: https://docs.pytest.org/en/7.1.x/reference/reference.html#pytest-param
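Concretely, the reviewer's point can be sketched like this (the second case and its skip URL are quoted from the comment above; the first case and the test body are hypothetical stand-ins for illustration):

```python
import pytest

# pytest.param takes the parameter values as *varargs*: each positional
# argument is bound to one name in the parametrize string, so a
# (query, expected_schema) pair must be passed as two arguments,
# not wrapped in a single tuple.
params = [
    # hypothetical extra case, for illustration only
    pytest.param("select friends.first from {}",
                 "struct<friends:array<struct<first:string>>>"),
    # the case quoted in the comment above, with the skip mark attached
    pytest.param("select name.middle, address from {} where p=2",
                 "struct<name:struct<middle:string>,address:string>",
                 marks=pytest.mark.skip(
                     reason='https://github.com/NVIDIA/spark-rapids/issues/8788')),
]

@pytest.mark.parametrize('query,expected_schemata', params)
def test_select_complex_field(query, expected_schemata):
    # stand-in body: the real test runs the query and compares the
    # schema actually read against expected_schemata
    assert isinstance(query, str) and isinstance(expected_schemata, str)
```

With the tuple wrapped as a single argument, pytest would try to bind it to query alone and leave expected_schemata unfilled, which is why the marked parameter fails to collect.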

@razajafri
Collaborator Author

build

@@ -209,4 +248,8 @@ class ExecutionPlanCaptureCallback extends QueryExecutionListener {

override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
captureIfNeeded(qe)
}

trait AdaptiveSparkPlanHelperShim {
Member

AdaptiveSparkPlanHelperShim.class needs to be added to dist/unshimmed-common-from-spark311.txt.

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

@gerashegalov @jlowe please take another look

@jlowe
Member

jlowe commented Aug 18, 2023

@razajafri any reason this is still a draft PR?

@razajafri razajafri marked this pull request as ready for review August 18, 2023 16:50
@razajafri
Collaborator Author

@razajafri any reason this is still a draft PR?

Thanks for the review.

I had put it in draft because I wanted to ensure the pre-merge passed after @gerashegalov's changes to the multi-shim jar.

@razajafri razajafri merged commit 66b1174 into NVIDIA:branch-23.10 Aug 18, 2023
26 of 27 checks passed
@razajafri razajafri deleted the SP-8712-schema-pruning-test branch August 18, 2023 16:52
@jlowe
Member

jlowe commented Aug 18, 2023

I had put it in draft because I wanted to ensure the pre-merge passed

Note that if the only reason for the draft is to make sure premerge passed, then IMO there's no reason for the PR to be draft. A failed premerge will prevent it from being merged even if it's not in draft. The main reason for draft is to prevent a PR from being merged even if it passes premerge and otherwise would be eligible for merging.
