
[BUG] discrepancy in the plugin jar deployment in run_pyspark_from_build.sh depending on TEST_PARALLEL #5714

Closed
Tracked by #5757
gerashegalov opened this issue Jun 1, 2022 · 0 comments · Fixed by #6044
Assignees
Labels
bug Something isn't working test Only impacts tests

Comments


gerashegalov commented Jun 1, 2022

Describe the bug
Classloading bugs detected by CI may be unnecessarily hard to reproduce locally because we deploy jars differently depending on the TEST_PARALLEL call path. When the xdist call path is triggered, we add --jars:

exec "$SPARK_HOME"/bin/spark-submit --jars "${ALL_JARS// /,}" \

which may change the plugin's classloader.
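The `${ALL_JARS// /,}` expansion in that line is bash pattern substitution, turning the script's space-separated jar list into the comma-separated form `--jars` expects. A minimal sketch with a hypothetical jar list (the real `ALL_JARS` is assembled inside `run_pyspark_from_build.sh`):

```shell
# Hypothetical space-separated jar list standing in for the script's ALL_JARS
ALL_JARS="rapids.jar udf-examples.jar tests.jar"

# ${var// /,} replaces every space with a comma, yielding a --jars value
echo "${ALL_JARS// /,}"
# → rapids.jar,udf-examples.jar,tests.jar
```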

Steps/Code to reproduce bug
Compare the repro fixed in #5708. The reason that the TEST_PARALLEL=2 setting mentioned in #5703 is important can be highlighted in the pyspark REPL.

The bug was hidden with --jars:

pyspark --jars $PWD/dist/target/rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.rapids.force.caller.classloader=false \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin
>>> df=spark.createDataFrame([ ['2022-06-01 07:45'] ], 'a string').selectExpr('hour(a)')
>>> sc._jvm.com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan(df._jdf, "ALL")
'!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced\n  @Expression <Alias> hour(cast(a#0 as timestamp), Some(America/Los_Angeles)) AS hour(a)#2 could run on GPU\n    !Expression <Hour> hour(cast(a#0 as timestamp), Some(America/Los_Angeles)) cannot run on GPU because input expression Cast cast(a#0 as timestamp) (TimestampType is not supported when the JVM system timezone is set to America/Los_Angeles. Set the timezone to UTC to enable TimestampType support); Only UTC zone id is supported. Actual zone id: America/Los_Angeles\n      !Expression <Cast> cast(a#0 as timestamp) cannot run on GPU because the GPU only supports a subset of formats when casting strings to timestamps. Refer to the CAST documentation for more details. To enable this operation on the GPU, set spark.rapids.sql.castStringToTimestamp.enabled to true.; Cast from StringType to TimestampType is not supported; Parsing the full rage of supported years is not supported. If your years are limited to 4 positive digits set spark.rapids.sql.hasExtendedYearValues to false.\n        @Expression <AttributeReference> a#0 could run on GPU\n  ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec\n    @Expression <AttributeReference> a#0 could run on GPU\n

but was broken with extraClassPath:

pyspark --driver-class-path $PWD/dist/target/rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar \
   --conf spark.executor.extraClassPath=$PWD/dist/target/rapids-4-spark_2.12-22.06.0-SNAPSHOT-cuda11.jar \
   --conf spark.rapids.sql.enabled=true \
   --conf spark.rapids.force.caller.classloader=false \
   --conf spark.plugins=com.nvidia.spark.SQLPlugin
>>> df=spark.createDataFrame([ ['2022-06-01 07:45'] ], 'a string').selectExpr('hour(a)')
>>> sc._jvm.com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan(df._jdf, "ALL")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/c/Users/gshegalov/dist/spark-3.2.1-bin-hadoop3.2/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1321, in __call__
  File "/mnt/c/Users/gshegalov/dist/spark-3.2.1-bin-hadoop3.2/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/mnt/c/Users/gshegalov/dist/spark-3.2.1-bin-hadoop3.2/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan.
: java.lang.NoClassDefFoundError: com/nvidia/spark/rapids/GpuOverrides$
        at com.nvidia.spark.rapids.ExplainPlanImpl.explainPotentialGpuPlan(GpuOverrides.scala:4196)
        at com.nvidia.spark.rapids.ExplainPlan$.explainPotentialGpuPlan(ExplainPlan.scala:65)
        at com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan(ExplainPlan.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.GpuOverrides$
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        ... 15 more

Expected behavior
TEST_PARALLEL should not affect whether integration tests reveal plugin bugs, only the throughput of the integration tests.

Environment details (please complete the following information)

  • any

Additional context
#5703
