
[BUG] Customer failure 23.08: Cannot compute hash of a table with a LIST of STRUCT columns. #9010

Closed
tgravescs opened this issue Aug 11, 2023 · 3 comments
Labels: bug (Something isn't working)

Describe the bug
A customer job failed during shuffle with the 23.08 pre-release jar. It looks like we don't support shuffling a LIST of STRUCT column, and we aren't falling back to the CPU properly.

Job aborted due to stage failure: Task 9 in stage 515.0 failed 4 times, most recent failure: Lost task 9.3 in stage 515.0 (TID 2804) (10.13.13.147 executor 4): ai.rapids.cudf.CudfException: CUDF failure at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-170-cuda11/thirdparty/cudf/cpp/src/hash/spark_murmurhash3_x86_32.cu:381: Cannot compute hash of a table with a LIST of STRUCT columns.
	at ai.rapids.cudf.ColumnVector.hash(Native Method)
	at ai.rapids.cudf.ColumnVector.spark32BitMurmurHash3(ColumnVector.java:795)
	at org.apache.spark.sql.rapids.GpuMurmur3Hash$.$anonfun$compute$3(HashFunctions.scala:83)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:56)
	at org.apache.spark.sql.rapids.GpuMurmur3Hash$.$anonfun$compute$1(HashFunctions.scala:82)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at org.apache.spark.sql.rapids.GpuMurmur3Hash$.compute(HashFunctions.scala:77)
	at org.apache.spark.sql.rapids.GpuMurmur3Hash.columnarEval(HashFunctions.scala:95)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
	at com.nvidia.spark.rapids.GpuAlias.columnarEval(namedExpressions.scala:110)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
	at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(basicPhysicalOperators.scala:108)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:220)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:217)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:217)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(implicits.scala:252)
	at com.nvidia.spark.rapids.GpuProjectExec$.project(basicPhysicalOperators.scala:108)
	at com.nvidia.spark.rapids.GpuTieredProject.$anonfun$project$2(basicPhysicalOperators.scala:595)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuTieredProject.recurse$2(basicPhysicalOperators.scala:594)
	at com.nvidia.spark.rapids.GpuTieredProject.project(basicPhysicalOperators.scala:607)
	at com.nvidia.spark.rapids.GpuTieredProject.$anonfun$projectWithRetrySingleBatchInternal$2(basicPhysicalOperators.scala:538)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuTieredProject.$anonfun$projectWithRetrySingleBatchInternal$1(basicPhysicalOperators.scala:537)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:431)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:542)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:468)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:275)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:128)
	at com.nvidia.spark.rapids.GpuTieredProject.projectWithRetrySingleBatchInternal(basicPhysicalOperators.scala:536)
	at com.nvidia.spark.rapids.GpuTieredProject.projectAndCloseWithRetrySingleBatch(basicPhysicalOperators.scala:577)
	at com.nvidia.spark.rapids.GpuProjectExec.$anonfun$internalDoExecuteColumnar$2(basicPhysicalOperators.scala:377)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuProjectExec.$anonfun$internalDoExecuteColumnar$1(basicPhysicalOperators.scala:373)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:318)
	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:340)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:281)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:274)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:274)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:273)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:273)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
tgravescs added labels: bug (Something isn't working), ? - Needs Triage (Need team to review and classify) on Aug 11, 2023

revans2 commented Aug 11, 2023

This turns out to be a bug where a type-support check was implemented for hash partitioning but not for murmur3 directly. Here the hash is computed as part of a projection, so murmur3 is called directly and the unsupported LIST of STRUCT type reaches the GPU kernel without the check.


revans2 commented Aug 11, 2023

import spark.implicits._                        // for .toDF on a Seq
import org.apache.spark.sql.functions.col

val df = Seq(Array((1, 2), (3, 4)), Array((5, 6), (7, 8))).toDF
df.repartition(1).selectExpr("hash(value)").show()

reproduces the problem, but

df.repartition(col("value")).show()

falls back to the CPU properly.

I just need to write some tests now to make sure it is all working well.
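The difference between the two cases above comes down to whether a code path consults the type-support check before choosing the GPU. A minimal sketch of that pattern follows; every name here is hypothetical and illustrative, not the spark-rapids plugin's real API.

```python
# Hypothetical sketch of the fallback-check pattern described above; these
# names are illustrative and are NOT the spark-rapids plugin's real API.

def is_list_of_struct(dtype: str) -> bool:
    # Toy string representation of a Spark type, e.g. "array<struct<int,int>>"
    return dtype.startswith("array<struct<")

def plan_hash_expr(dtype: str) -> str:
    """Both the hash-partitioning path and a direct hash() in a projection
    must run this same check, or unsupported types reach the GPU kernel."""
    if is_list_of_struct(dtype):
        return "CPU"    # fall back instead of crashing inside the kernel
    return "GPU"
```

With the bug, only the partitioning path performed a check like this, which is why `repartition(col("value"))` fell back to the CPU while `hash(value)` inside a projection went to the GPU and crashed.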


sameerz commented Aug 13, 2023

Closing, as this did not auto-close since the default branch is branch-23.10.

@sameerz sameerz closed this as completed Aug 13, 2023