
[BUG] GPU ORC reader fails with errors when specifying columns that do not exist in the file schema #3058

Closed
firestarman opened this issue Jul 28, 2021 · 6 comments · Fixed by #3393
Labels: bug (Something isn't working), P0 (Must have for release)

firestarman (Collaborator) commented Jul 28, 2021

Describe the bug
The CPU ORC reader in Spark produces nulls for columns that do not exist in the file schema, but the GPU ORC reader fails to read the file.
GPU:

scala> import org.apache.spark.sql.types._
scala> val rs = StructType(Seq(StructField("_c0",StructType(Seq(StructField("child0",ByteType,true), StructField("no_child",ByteType,true)))), StructField("no_int", IntegerType)))
scala> spark.read.schema(rs).orc("/data/tmp/test.log").show

21/07/28 08:28:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.sql.execution.QueryExecutionException: Incompatible schemas for ORC file at file:///data/tmp/test.log
 file schema: struct<_c0:struct<child0:tinyint,child1:int,child2:bigint,child3:double,child4:boolean,child5:date,child6:timestamp>>
 read schema: struct<_c0:struct<child0:tinyint,no_child:tinyint>,no_int:int>
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.$anonfun$checkSchemaCompatibility$1(GpuOrcScan.scala:1099)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.checkSchemaCompatibility(GpuOrcScan.scala:1092)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler$GpuOrcPartitionReaderUtils.getOrcPartitionReaderContext(GpuOrcScan.scala:894)
	at com.nvidia.spark.rapids.GpuOrcFileFilterHandler.$anonfun$filterStripes$3(GpuOrcScan.scala:760)

CPU:

scala> spark.read.schema(rs).orc("/data/tmp/test.log").show
+------------+------+
|         _c0|no_int|
+------------+------+
|        null|  null|
| {-21, null}|  null|
| {112, null}|  null|
| {-50, null}|  null|
| {-49, null}|  null|
| {-55, null}|  null|
|{null, null}|  null|
| {-61, null}|  null|
|        null|  null|
| {126, null}|  null|
|  {88, null}|  null|
| {-18, null}|  null|
|{-121, null}|  null|
+------------+------+

Steps/Code to reproduce bug
Read the attached ORC file (test.log) on the GPU with the read schema below.

val readSchema = StructType(Seq(
    StructField("_c0",StructType(Seq(
        StructField("child0",ByteType,true),
        StructField("no_child",ByteType,true)))),
    StructField("no_int", IntegerType)))

Expected behavior
The GPU ORC reader should output the same data as the CPU ORC reader.

firestarman added the labels bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) on Jul 28, 2021
firestarman (Collaborator, Author) commented Jul 28, 2021

When ANSI mode is on, the show action fails on the CPU due to a cast exception, but collect works.

scala> spark.conf.set("spark.sql.ansi.enabled", "true")
scala> spark.read.schema(rs).orc("/data/tmp/test.log").show
org.apache.spark.sql.AnalysisException: cannot resolve 'CAST(`_c0` AS STRING)' due to data type mismatch:
 cannot cast struct<child0:tinyint,no_child:tinyint> to string with ANSI mode on.
 If you have to cast struct<child0:tinyint,no_child:tinyint> to string, you can set spark.sql.ansi.enabled as false.
;

scala> spark.read.schema(rs).orc("/data/tmp/test.log").collect
res16: Array[org.apache.spark.sql.Row] = Array([null,null], [[-21,null],null], [[112,null],null], [[-50,null],null], [[-49,null],null], [[-55,null],null], [[null,null],null], [[-61,null],null], [null,null], [[126,null],null], [[88,null],null], [[-18,null],null], [[-121,null],null])

So the CPU ORC reader does not throw exceptions for this case even when ANSI mode is on; the show failure above comes from casting the struct to a string for display, not from the read itself.

sameerz (Collaborator) commented Aug 3, 2021

We should decide whether this should be handled at a lower layer in cuIO. Related issue rapidsai/cudf#5447

Salonijain27 added the label P0 (Must have for release) and removed ? - Needs Triage (Need team to review and classify) on Aug 10, 2021
firestarman (Collaborator, Author) commented Aug 25, 2021

Looks like the feature of adding new columns here is covered by the issue rapidsai/cudf#5447.

However, adding new columns could be supported quickly by implementing it on the plugin side.

Parquet already partially supports this feature on the plugin side: it adds new columns only at the top level. We can do a little more in ORC to also support this for nested columns.

So, personally, I think we can implement just this feature in the plugin before cuDF supports it, and remove the logic once cuDF has it done.
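A minimal sketch of the idea at the Spark API level (hypothetical helper names align and readWithEvolution; not the plugin's actual code, which would work on the decoded GPU table instead):

import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.StructType

// Build one Column per field of the requested read schema. `get` fetches an
// existing child column (top level: col(name); nested: parent.getField(name)).
def align(read: StructType, file: StructType, get: String => Column): Seq[Column] =
  read.fields.toSeq.map { f =>
    file.fields.find(_.name == f.name) match {
      case None =>
        // Column missing from the file: fill with a typed null.
        lit(null).cast(f.dataType).as(f.name)
      case Some(existing) =>
        (f.dataType, existing.dataType) match {
          case (r: StructType, e: StructType) =>
            // Recurse so missing nested fields also become nulls. Caveat: this
            // turns a null parent struct into a struct of nulls, so a real
            // implementation must preserve struct-level nulls separately.
            val parent = get(f.name)
            struct(align(r, e, n => parent.getField(n)): _*).as(f.name)
          case _ =>
            get(f.name).as(f.name)
        }
    }
  }

// Read with the schema the file actually has, then project to the requested
// schema, filling missing (including nested) columns with nulls.
def readWithEvolution(spark: SparkSession, path: String,
    fileSchema: StructType, readSchema: StructType): DataFrame =
  spark.read.schema(fileSchema).orc(path)
    .select(align(readSchema, fileSchema, col): _*)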

@revans2 @sameerz hi, any suggestions?

revans2 (Collaborator) commented Aug 31, 2021

@firestarman

rapidsai/cudf#5447 was filed under a mistaken assumption. I saw that the Java ORC code had schema evolution built into it, and I assumed that it was part of the standard. That turned out not to be true. It is very likely that rapidsai/cudf#5447 will never be done on the cudf side. If you want me to push on that issue so we come to some kind of resolution I can, but personally I think we just need to make this happen on our own and close the cudf issue.

Fundamentally it comes down to two different operations after lining up the names/positions of the columns accordingly.

  1. Adding in null columns where needed. As you said:

     "We can do a little more in ORC to also support this for nested columns."

If you need help with this please let me know.

  1. "CAST"ing columns from one data type to another.

This is harder because our cast implementations are not great in all cases, so as we add this type of support we need to be sure that we test corner cases and look at the ORC code to see what corner cases there might be. In the worst case we might need a flag like isAnsi, but isOrc, so we know if we have to do something special for the ORC cases.
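A minimal sketch of what operation 2 could look like at the Spark level (hypothetical helper; a plain Spark cast stands in for whatever conversion rules ORC actually requires):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

// Cast a column that exists in the file to the type the read schema asks for.
// A plain Spark cast is assumed here; ORC's schema evolution permits a
// different set of conversions, so corner cases (overflow, timestamps,
// decimals) need dedicated tests, as noted above.
def castToReadType(name: String, fileType: DataType, readType: DataType): Column =
  if (fileType == readType) col(name)
  else col(name).cast(readType).as(name)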

@sameerz if you agree with me on this I will close the cudf feature request.

sameerz (Collaborator) commented Aug 31, 2021

@revans2 agreed, we can close the cudf feature request and focus on this in the spark-rapids plugin.

firestarman (Collaborator, Author) commented:

OK, I will work on the null columns first.
