
[BUG] Check that keys are not null when creating a map #8984

Closed
abellina opened this issue Aug 10, 2023 · 1 comment · Fixed by #9237
Assignees
Labels
bug Something isn't working; reliability Features to improve reliability or bugs that severely impact the reliability of the plugin

Comments

@abellina
Collaborator

I was able to create a map on the GPU with a null key:

scala> spark.sql("select map(x, -1) from (select explode(array(1,null)) as x)").show()
23/08/10 15:00:25 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(map(x#1, -1) as string) AS map(x, -1)#5 will run on GPU
      *Expression <Cast> cast(map(x#1, -1) as string) will run on GPU
        *Expression <CreateMap> map(x#1, -1) will run on GPU
    *Exec <GenerateExec> will run on GPU
      *Expression <Explode> explode([1,null]) will run on GPU
      ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec

+------------+                                                                  
|  map(x, -1)|
+------------+
|   {1 -> -1}|
|{null -> -1}|
+------------+

That said, this is not allowed on the CPU, so we should prevent it from happening on the GPU as well. If we allowed maps with null keys, other parts of Spark could break in really odd ways.

CPU output example:

scala> spark.sql("select map(x, -1) from (select explode(array(1,null)) as x)").show()
23/08/10 15:00:41 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (executor 0): java.lang.RuntimeException: Cannot use null as map key!
	at org.apache.spark.sql.errors.QueryExecutionErrors$.nullAsMapKeyNotAllowedError(QueryExecutionErrors.scala:260)
	at org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder.put(ArrayBasedMapBuilder.scala:56)
	at org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder.putAll(ArrayBasedMapBuilder.scala:94)
	at org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder.from(ArrayBasedMapBuilder.scala:122)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Aug 10, 2023
@revans2
Collaborator

revans2 commented Aug 10, 2023

Yup, it looks like GpuCreateMap.createMapFromKeysValuesAsStructs is not checking for null values in the keys. We should add the check there unconditionally.
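The semantics the GPU path needs to match are the ones shown in the CPU stack trace above: map construction must fail fast as soon as a null key is encountered. Here is a minimal plain-Scala sketch of that check; NullKeyCheck and buildMap are illustrative names, not the plugin's actual API, and the real fix would operate on cudf columns rather than Scala collections.

```scala
// Hypothetical sketch of the null-key validation Spark performs on the CPU
// (see ArrayBasedMapBuilder.put in the stack trace above). The check runs
// unconditionally on every key before the map is materialized.
object NullKeyCheck {
  def buildMap[V](keys: Seq[Any], values: Seq[V]): Map[Any, V] = {
    require(keys.length == values.length, "keys and values must align")
    keys.zip(values).map { case (k, v) =>
      // Mirror the CPU error message from QueryExecutionErrors
      if (k == null) throw new RuntimeException("Cannot use null as map key!")
      k -> v
    }.toMap
  }
}
```

With this in place, buildMap(Seq(1, 2), Seq(-1, -1)) succeeds, while buildMap(Seq(1, null), Seq(-1, -1)) throws, matching the CPU repro above instead of silently producing {null -> -1}.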

@abellina abellina added test Only impacts tests, reliability Features to improve reliability or bugs that severely impact the reliability of the plugin, and removed test Only impacts tests labels Aug 10, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Aug 16, 2023
@revans2 revans2 self-assigned this Sep 13, 2023