[BUG] ORC file count statistic for nested type is wrong #13837
Labels: 0 - Backlog, bug, cuIO, libcudf
Describe the bug
The file statistics of an ORC file written on the GPU report a wrong count for a nested type column, while the statistics of the same data written on the CPU are correct.
GPU file shows different counts for nested type:
GPU:
CPU:
The data in both files are:
Steps/Code to reproduce bug
Generate GPU file
Read the GPU file
$SPARK_HOME/bin/pyspark
spark.read.orc("/tmp/test-count-for-nested-type-gpu.orc").show()
+------------+
|    struct_s|
+------------+
|{null, null}|
|      {1, 1}|
|{null, null}|
|      {3, 3}|
|{null, null}|
|      {5, 5}|
|{null, null}|
|      {7, 7}|
+------------+
Generate CPU file
$SPARK_HOME/bin/pyspark
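The original snippet for this step is not shown here, so the following is only a sketch of how the same eight rows could be written with plain Spark (no GPU plugin). The field names c1 and c2, the helper names, and the CPU output path are assumptions made for illustration; only the column name struct_s and the row values come from the output above.

```python
# Sketch only: field names (c1, c2) and the CPU output path are assumptions.
CPU_PATH = "/tmp/test-count-for-nested-type-cpu.orc"  # assumed, mirrors the GPU path
SCHEMA = "struct_s struct<c1:int,c2:int>"             # hypothetical field names


def build_rows(n=8):
    """Rows matching the table above: even rows hold {null, null}, odd rows {i, i}."""
    return [((None, None),) if i % 2 == 0 else ((i, i),) for i in range(n)]


def write_cpu_orc(spark, path=CPU_PATH):
    """Write the rows as a single ORC file using an existing SparkSession."""
    df = spark.createDataFrame(build_rows(), schema=SCHEMA)
    df.coalesce(1).write.mode("overwrite").orc(path)
```

Inside the pyspark shell, where a session named spark already exists, this would be called as write_cpu_orc(spark).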
Print the count statistic for the GPU file
Print the count statistic for the CPU file
Expected behavior
All statistics should be correct, including hasNull; refer to this issue.
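As a rough cross-check, the values one would expect can be derived from the eight rows shown above. This sketch assumes ORC's convention that the count statistic counts non-null values and that hasNull is true when at least one null is present. It also assumes the struct itself is never null, since every row displays as {…} rather than as a null row. The helper name expected_stats is hypothetical.

```python
# Rows as displayed by spark.read.orc(...).show(); None marks a null field.
rows = [(None, None), (1, 1), (None, None), (3, 3),
        (None, None), (5, 5), (None, None), (7, 7)]


def expected_stats(values):
    """Return (count of non-null values, whether any null is present)."""
    non_null = sum(v is not None for v in values)
    return non_null, non_null < len(values)


# Each row tuple stands for a present (non-null) struct.
struct_stats = expected_stats(rows)
# Transpose the rows to get one sequence of values per child field.
child_stats = [expected_stats(col) for col in zip(*rows)]

print("struct_s:", struct_stats)   # (8, False)
print("children:", child_stats)    # [(4, True), (4, True)]
```

Under these assumptions the struct column should report a count of 8 with no nulls, and each child field a count of 4 with hasNull set.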
Environment details
cuDF 23.08 branch
Spark 3.3.0
orc-core-1.7.4.jar
Additional context