[BUG] ScanJson and JsonToStructs cannot handle nested empty arrays/structs #10595

Open
revans2 opened this issue Mar 14, 2024 · 1 comment
Labels
bug Something isn't working

Comments

Collaborator

revans2 commented Mar 14, 2024

Describe the bug
This is with #10575

 Seq("""{"a":[]}""").toDF("json").repartition(1).selectExpr("from_json(json, 'a array<string>')").show()

results in an error like this:

Caused by: java.lang.AssertionError: Type conversion is not allowed from STRUCT(LIST(INT8)) to StructType(StructField(a,ArrayType(StringType,true),true)) expected STRUCT(LIST(STRING))
  at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:711)
  at com.nvidia.spark.rapids.GpuUnaryExpression.$anonfun$doItColumnar$1(GpuExpressions.scala:254)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
  at com.nvidia.spark.rapids.GpuUnaryExpression.doItColumnar(GpuExpressions.scala:250)
  at com.nvidia.spark.rapids.GpuUnaryExpression.$anonfun$columnarEval$1(GpuExpressions.scala:261)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
  at com.nvidia.spark.rapids.GpuUnaryExpression.columnarEval(GpuExpressions.scala:260)
  at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:35)

The error only appears if assertions are enabled.

Similarly,

Seq("""{"a":1,"b":"","c":[]}""").toDF("json").repartition(1).selectExpr("from_json(json, 'a int, b string, c array<string>')").show()

throws

Caused by: java.lang.AssertionError: Type conversion is not allowed from STRUCT(INT32,STRING,LIST(INT8)) to StructType(StructField(a,IntegerType,true),StructField(b,StringType,true),StructField(c,ArrayType(StringType,true),true)) expected STRUCT(INT32,STRING,LIST(STRING))
  at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:711)
  at com.nvidia.spark.rapids.GpuUnaryExpression.$anonfun$doItColumnar$1(GpuExpressions.scala:254)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
  at com.nvidia.spark.rapids.GpuUnaryExpression.doItColumnar(GpuExpressions.scala:250)
  at com.nvidia.spark.rapids.GpuUnaryExpression.$anonfun$columnarEval$1(GpuExpressions.scala:261)

It looks like CUDF ignores our request that the returned value be a LIST(STRING) and returns a LIST(INT8) instead. This feels like a bug in CUDF. We can probably work around it if we need to, but the workaround is not going to be simple.
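To make the shape of a possible workaround concrete, here is a minimal sketch in plain Scala. It does not use the cuDF or spark-rapids APIs: `Col`, `Expected`, and `EmptyListFixup.reconcile` are hypothetical names that only model a column's type tree and row counts. It also assumes the LIST(INT8) child is an empty placeholder (zero rows), which this issue suggests but does not confirm. The idea is to walk the returned column alongside the requested schema and swap any zero-row placeholder for an empty column of the requested type, while still rejecting mismatches that carry data.

```scala
// Hypothetical model of a column tree; NOT the cuDF or spark-rapids API.
sealed trait Col
final case class Leaf(dtype: String, rows: Long) extends Col
final case class ListCol(child: Col, rows: Long) extends Col
final case class StructCol(children: List[Col], rows: Long) extends Col

// Hypothetical model of the schema that was requested from the reader.
sealed trait Expected
final case class ELeaf(dtype: String) extends Expected
final case class EList(child: Expected) extends Expected
final case class EStruct(children: List[Expected]) extends Expected

object EmptyListFixup {
  // Build a zero-row column with exactly the requested type.
  def emptyOf(e: Expected): Col = e match {
    case ELeaf(d)    => Leaf(d, 0L)
    case EList(c)    => ListCol(emptyOf(c), 0L)
    case EStruct(cs) => StructCol(cs.map(emptyOf), 0L)
  }

  // Walk the returned column and the requested schema together.
  // A zero-row leaf holds no data, so replacing it cannot lose anything.
  // Any other mismatch is reported instead of being patched over.
  def reconcile(col: Col, expected: Expected): Either[String, Col] =
    (col, expected) match {
      case (Leaf(d, _), ELeaf(e)) if d == e =>
        Right(col)
      case (Leaf(_, 0L), e) =>
        Right(emptyOf(e))
      case (ListCol(c, n), EList(e)) =>
        reconcile(c, e).map(fixed => ListCol(fixed, n))
      case (StructCol(cs, n), EStruct(es)) if cs.length == es.length =>
        val fixed = cs.zip(es).map { case (c, e) => reconcile(c, e) }
        fixed.collectFirst { case Left(msg) => msg } match {
          case Some(msg) => Left(msg)
          case None      => Right(StructCol(fixed.collect { case Right(c) => c }, n))
        }
      case _ =>
        Left(s"Type conversion is not allowed from $col to $expected")
    }
}
```

Under this model, a one-row STRUCT(INT32,STRING,LIST(INT8)) whose list child has zero rows reconciles to STRUCT(INT32,STRING,LIST(STRING)), while a LIST(INT8) child that holds rows is still rejected. A real fix would have to rebuild device columns rather than rewrite a type tree, which is likely part of why the workaround is not simple.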

@revans2 revans2 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Mar 14, 2024
Collaborator Author

revans2 commented Mar 14, 2024

I should add that an empty struct results in a different error.

Seq("""{"a":1,"b":"","c":{}}""").toDF("json").repartition(1).selectExpr("from_json(json, 'a int, b string, c struct<a string>')").show()
Caused by: java.lang.NullPointerException
  at ai.rapids.cudf.Table.gatherJSONColumns(Table.java:1105)
  at ai.rapids.cudf.Table.gatherJSONColumns(Table.java:1225)
  at ai.rapids.cudf.Table.readJSON(Table.java:1391)
  at org.apache.spark.sql.rapids.GpuJsonToStructs.$anonfun$doColumnar$2(GpuJsonToStructs.scala:180)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
  at org.apache.spark.sql.rapids.GpuJsonToStructs.$anonfun$doColumnar$1(GpuJsonToStructs.scala:178)
  at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
  at org.apache.spark.sql.rapids.GpuJsonToStructs.doColumnar(GpuJsonToStructs.scala:176)

This looks almost identical to reading a list with only empty top-level structs.

@revans2 revans2 mentioned this issue Mar 14, 2024
62 tasks
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Mar 26, 2024