
[BUG] GPU ORC reader fails with errors when specifying columns that do not exist in the file schema #3058

Closed
firestarman opened this issue Jul 28, 2021 · 6 comments · Fixed by #3393
Labels: bug (Something isn't working), P0 (Must have for release)

firestarman (Collaborator) commented Jul 28, 2021

Describe the bug
The CPU ORC reader in Spark produces nulls for columns that do not exist in the file schema, but the GPU ORC reader fails to read the file.
GPU:

scala> import org.apache.spark.sql.types._
scala> val rs = StructType(Seq(StructField("_c0",StructType(Seq(StructField("child0",ByteType,true), StructField("no_child",ByteType,true)))), StructField("no_int", IntegerType)))
scala> spark.read.schema(rs).orc("/data/tmp/test.log").show

21/07/28 08:28:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.sql.execution.QueryExecutionException: Incompatible schemas for ORC file at file:///data/tmp/test.log
 file schema: struct<_c0:struct<child0:tinyint,child1:int,child2:bigint,child3:double,child4:boolean,child5:date,child6:timestamp>>
 read schema: struct<_c0:struct<child0:tinyint,no_child:tinyint>,no_int:int>
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.$anonfun$checkSchemaCompatibility$1(GpuOrcScan.scala:1099)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.checkSchemaCompatibility(GpuOrcScan.scala:1092)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.getOrcPartitionReaderContext(GpuOrcScan.scala:894)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.$anonfun$filterStripes$3(GpuOrcScan.scala:760)

CPU:

scala> spark.read.schema(rs).orc("/data/tmp/test.log").show
+------------+------+
|         _c0|no_int|
+------------+------+
|        null|  null|
| {-21, null}|  null|
| {112, null}|  null|
| {-50, null}|  null|
| {-49, null}|  null|
| {-55, null}|  null|
|{null, null}|  null|
| {-61, null}|  null|
|        null|  null|
| {126, null}|  null|
|  {88, null}|  null|
| {-18, null}|  null|
|{-121, null}|  null|
+------------+------+

Steps/Code to reproduce bug
Read the attached ORC file (test.log) on the GPU with the read schema below.

val readSchema = StructType(Seq(
    StructField("_c0",StructType(Seq(
        StructField("child0",ByteType,true),
        StructField("no_child",ByteType,true)))),
    StructField("no_int", IntegerType)))

Expected behavior
The GPU ORC reader should output the same data as the CPU ORC reader.

firestarman added the labels bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) on Jul 28, 2021
firestarman (Collaborator, Author) commented Jul 28, 2021

When ANSI mode is on, the show action fails on the CPU due to a cast exception, but collect works.

scala> spark.conf.set("spark.sql.ansi.enabled", "true")
scala> spark.read.schema(rs).orc("/data/tmp/test.log").show
org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`_c0` AS STRING)' due to data type mismatch:
 cannot cast struct<child0:tinyint,no_child:tinyint> to string with ANSI mode on.
 If you have to cast struct<child0:tinyint,no_child:tinyint> to string, you can set spark.sql.ansi.enabled as false.
;

scala> spark.read.schema(rs).orc("/data/tmp/test.log").collect
res16: Array[org.apache.spark.sql.Row] = Array([null,null], [[-21,null],null], [[112,null],null], [[-50,null],null], [[-49,null],null], [[-55,null],null], [[null,null],null], [[-61,null],null], [null,null], [[126,null],null], [[88,null],null], [[-18,null],null], [[-121,null],null])

So the CPU ORC reader does not throw exceptions for this case even when ANSI mode is on; the show failure above comes from casting the struct to a string for display, not from the read itself.

sameerz (Collaborator) commented Aug 3, 2021

We should decide whether this should be handled at a lower layer in cuIO. Related issue rapidsai/cudf#5447

Salonijain27 added the label P0 (Must have for release) and removed ? - Needs Triage (Need team to review and classify) on Aug 10, 2021
firestarman (Collaborator, Author) commented Aug 25, 2021

Looks like the feature of adding new columns here is covered by the issue rapidsai/cudf#5447.

However, adding new columns could be supported quickly by implementing it on the plugin side.

Parquet already partially supports this feature on the plugin side: it adds new columns only at the top level. We can do a little more in ORC to also support this for nested columns.

So, personally, I think we can implement just this feature in the plugin before cuDF supports it, and remove the logic once cuDF has it done.
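A minimal sketch of the idea at the Spark API level (hypothetical helper names align and readWithEvolution; not the plugin's actual code, which would work on the decoded GPU table instead):

import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.StructType

// Build one Column per field of the requested read schema. `get` fetches an
// existing child column (top level: col(name); nested: parent.getField(name)).
def align(read: StructType, file: StructType, get: String => Column): Seq[Column] =
  read.fields.toSeq.map { f =>
    file.fields.find(_.name == f.name) match {
      case None =>
        // Column missing from the file: fill with a typed null.
        lit(null).cast(f.dataType).as(f.name)
      case Some(existing) =>
        (f.dataType, existing.dataType) match {
          case (r: StructType, e: StructType) =>
            // Recurse so missing nested fields also become nulls. Caveat: this
            // turns a null parent struct into a struct of nulls, so a real
            // implementation must preserve struct-level nulls separately.
            val parent = get(f.name)
            struct(align(r, e, n => parent.getField(n)): _*).as(f.name)
          case _ =>
            get(f.name).as(f.name)
        }
    }
  }

// Read with the schema the file actually has, then project to the requested
// schema, filling missing (including nested) columns with nulls.
def readWithEvolution(spark: SparkSession, path: String,
    fileSchema: StructType, readSchema: StructType): DataFrame =
  spark.read.schema(fileSchema).orc(path)
    .select(align(readSchema, fileSchema, col): _*)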

@revans2 @sameerz hi, any suggestions?

revans2 (Collaborator) commented Aug 31, 2021

@firestarman

rapidsai/cudf#5447 was filed under a mistaken assumption. I saw that the Java ORC code had schema evolution built into it, and I assumed that it was part of the standard. That turned out not to be true. It is very likely that rapidsai/cudf#5447 will never be done on the cudf side. If you want me to push on that issue so we come to some kind of resolution I can, but personally I think we just need to make this happen on our own and close the cudf issue.

Fundamentally it comes down to two different operations after lining up the names/positions of the columns accordingly.

  1. Adding in null columns where needed. As you said:

     "We can do a little more in ORC to also support this for nested columns."

If you need help with this please let me know.

  1. "CAST"ing columns from one data type to another.

This is harder because our cast implementations are not great in all cases, so as we add this type of support we need to be sure that we test corner cases and look at the ORC code to see what corner cases there might be. In the worst case we might need a flag like isAnsi, but isOrc, so we know if we have to do something special for the ORC cases.
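A minimal sketch of what operation 2 could look like at the Spark level (hypothetical helper; a plain Spark cast stands in for whatever conversion rules ORC actually requires):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

// Cast a column that exists in the file to the type the read schema asks for.
// A plain Spark cast is assumed here; ORC's schema evolution permits a
// different set of conversions, so corner cases (overflow, timestamps,
// decimals) need dedicated tests, as noted above.
def castToReadType(name: String, fileType: DataType, readType: DataType): Column =
  if (fileType == readType) col(name)
  else col(name).cast(readType).as(name)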

@sameerz if you agree with me on this I will close the cudf feature request.

sameerz (Collaborator) commented Aug 31, 2021

@revans2 agreed, we can close the cudf feature request and focus on this in the spark-rapids plugin.

firestarman (Collaborator, Author) commented:

OK, I will work on the null columns first.
