Add test for selecting a single complex field array and its parent struct array [databricks] #9018
Conversation
Signed-off-by: Raza Jafri <[email protected]>
Depends on #9013
package org.apache.spark.sql.rapids

import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
I will add a comment here stating why this file is needed.
@gerashegalov @jlowe I have tested the jar on DB and the parallel-worlds has two separate files for AdaptiveSparkPlanHelperImpl.
That is expected, because AdaptiveSparkPlanHelperImpl derives from AdaptiveSparkPlanHelper, which is different across the Spark platforms.
@@ -414,4 +415,9 @@ object ShimLoader extends Logging {
  def loadGpuColumnVector(): Class[_] = {
    ShimReflectionUtils.loadClass("com.nvidia.spark.rapids.GpuColumnVector")
  }

  def newAdaptiveSparkPlanHelperShim(): AdaptiveSparkPlanHelperImpl =
ShimLoader should be returning an unshimmed class here, otherwise an unshimmed class may attempt to load a shimmed class even before we get to the ShimLoader part, and if the classpath isn't parallel-worlds aware, that will fail.
@pytest.mark.parametrize('format', ["parquet", "orc"])
def test_select_complex_field(format, spark_tmp_path, query_and_expected_schemata, is_partitioned, spark_tmp_table_factory):
    table_name = spark_tmp_table_factory.get()
    query, expected_schemata = query_and_expected_schemata
nit: we can do this already in parametrize
@pytest.mark.parametrize('query,expected_schemata',
def test_select_complex_field(..., query, expected_schemata, ...)
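A minimal, self-contained sketch of that suggestion, using hypothetical queries and schemata rather than the ones in this PR: with 'query,expected_schemata' parametrized directly, pytest unpacks the pair into two arguments and the manual tuple unpacking in the test body goes away.

import pytest

# Hypothetical queries and schemata for illustration only.
@pytest.mark.parametrize('format', ["parquet", "orc"])
@pytest.mark.parametrize('query,expected_schemata', [
    ("select name.first from {}", "struct<name:struct<first:string>>"),
    ("select friends.first from {}", "struct<friends:array<struct<first:string>>>"),
])
def test_select_complex_field(format, query, expected_schemata):
    # Placeholder body: the real test creates a table in the given format,
    # runs the query, and checks the scan's read schema against expected_schemata.
    assert "{}" in query
    assert expected_schemata.startswith("struct<")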
For some reason, pytest doesn't like it when I mark one of the parameters, namely:
pytest.param(("select name.middle, address from {} where p=2", "struct<name:struct<middle:string>,address:string>"), marks=pytest.mark.skip(reason='https://github.com/NVIDIA/spark-rapids/issues/8788')),
It looks like you are passing a pair to pytest.param; it should be varargs: https://docs.pytest.org/en/7.1.x/reference/reference.html#pytest-param
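A hedged sketch of that fix: pytest.param takes the parameter values as positional varargs rather than a single tuple. Only the query and schema strings and the skip-reason URL come from the comment above; the surrounding test is a placeholder.

import pytest

params = [
    ("select name.first from {}", "struct<name:struct<first:string>>"),
    # Wrong: wrapping both values in one tuple makes pytest see a single parameter:
    #   pytest.param(("select name.middle, address from {} where p=2",
    #                 "struct<name:struct<middle:string>,address:string>"), marks=...)
    # Right: pass each value as its own positional argument.
    pytest.param("select name.middle, address from {} where p=2",
                 "struct<name:struct<middle:string>,address:string>",
                 marks=pytest.mark.skip(
                     reason='https://github.com/NVIDIA/spark-rapids/issues/8788')),
]

@pytest.mark.parametrize('query,expected_schemata', params)
def test_select_complex_field(query, expected_schemata):
    # Placeholder assertion; the skipped entry never runs.
    assert expected_schemata.startswith("struct<")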
build
@@ -209,4 +248,8 @@ class ExecutionPlanCaptureCallback extends QueryExecutionListener {

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    captureIfNeeded(qe)
}

trait AdaptiveSparkPlanHelperShim {
AdaptiveSparkPlanHelperShim.class needs to be added to dist/unshimmed-common-from-spark311.txt.
build
@gerashegalov @jlowe please take another look
@razajafri any reason this is still a draft PR?
Thanks for the review. I had put it in draft because I wanted to ensure the pre-merge passed after @gerashegalov's changes to the multi-shim jar.
Note that if the only reason for the draft is to make sure premerge passed, then IMO there's no reason for the PR to be draft. A failed premerge will prevent it from being merged even if it's not in draft. The main reason for draft is to prevent a PR from being merged even if it passes premerge and otherwise would be eligible for merging.
Pushing this comprehensive patch after the revert of #8744.
This PR tries to mimic the test in Spark's SchemaPruningSuite.
We create a table with complex fields and read only a subfield to see whether column pruning is working the way it's supposed to, i.e. that we are not reading unnecessary columns. For example, if we have a complex type contact with an array of friends, then selecting

spark.table("contacts").select(explode("friends").alias("friend")).select("friend.first_name")

shouldn't also read the middle_name and last_name of the friends field. A standalone sketch of this behavior follows the list of fixed issues below.

fixes #8712
fixes #8713
fixes #8714
fixes #8715
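For illustration, a minimal PySpark sketch of the pruning behavior described above; the table layout and column names follow the contacts/friends example rather than the actual data generators used in this PR's test.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

# Made-up data; column names follow the contacts/friends example above.
spark = (SparkSession.builder
         .master("local[1]")
         .appName("schema-pruning-sketch")
         .getOrCreate())

spark.createDataFrame(
    [(1, [("John", "Quincy", "Doe")])],
    "id int, friends array<struct<first_name:string,middle_name:string,last_name:string>>"
).write.mode("overwrite").format("parquet").saveAsTable("contacts")

df = (spark.table("contacts")
      .select(explode("friends").alias("friend"))
      .select("friend.first_name"))

# With nested schema pruning enabled (the default in recent Spark versions),
# the physical plan's ReadSchema should mention only friends.first_name,
# not middle_name or last_name.
df.explain()
df.show()
spark.stop()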