[BUG] ORC file count statistic for nested type is wrong #13837

res-life · 2023-08-09T05:57:28Z

Describe the bug
The file statistics of the GPU-written ORC file show a wrong count for the nested type, while the statistics of the CPU-written ORC file are correct.

The GPU file shows a different count (and hasNull) for the nested column:
GPU:

File Statistics:
  Column 0: count: 8 hasNull: true
  Column 1: count: 1 hasNull: true

CPU:

File Statistics:
  Column 0: count: 8 hasNull: false
  Column 1: count: 8 hasNull: false

The data in both files are:

+------------+
|    struct_s|
+------------+
|{null, null}|
|      {1, 1}|
|{null, null}|
|      {3, 3}|
|{null, null}|
|      {5, 5}|
|{null, null}|
|      {7, 7}|
+------------+

Steps/Code to reproduce bug

Generate GPU file

TEST_F(OrcWriterTest, NestedColumnSelection)
{
  auto const num_rows  = 8;
  std::vector<int> child_col1_data(num_rows);
  std::vector<int> child_col2_data(num_rows);
  for (int i = 0; i < num_rows; ++i) {
    child_col1_data[i] = i;
    child_col2_data[i] = i;
  }

  // Validity mask: odd rows are valid, even rows are null.
  auto validity = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i % 2; });
  int32_col child_col1{child_col1_data.begin(), child_col1_data.end(), validity};
  int32_col child_col2{child_col2_data.begin(), child_col2_data.end(), validity};
  struct_col s_col{child_col1, child_col2};
  cudf::table_view expected({s_col});

  cudf::io::table_input_metadata expected_metadata(expected);
  expected_metadata.column_metadata[0].set_name("struct_s");
  expected_metadata.column_metadata[0].child(0).set_name("field_a");
  expected_metadata.column_metadata[0].child(1).set_name("field_b");

  auto filepath = "/tmp/test-count-for-nested-type-gpu.orc";
  cudf::io::orc_writer_options out_opts =
    cudf::io::orc_writer_options::builder(cudf::io::sink_info{filepath}, expected)
      .metadata(std::move(expected_metadata));
  cudf::io::write_orc(out_opts);
}

Read the GPU file
$SPARK_HOME/bin/pyspark

spark.read.orc("/tmp/test-count-for-nested-type-gpu.orc").show()
+------------+
|    struct_s|
+------------+
|{null, null}|
|      {1, 1}|
|{null, null}|
|      {3, 3}|
|{null, null}|
|      {5, 5}|
|{null, null}|
|      {7, 7}|
+------------+

Generate CPU file

$SPARK_HOME/bin/pyspark

from pyspark.sql.types import *
schema = StructType([StructField("struct_s",
    StructType([
        StructField("field_a", IntegerType()),
        StructField("field_b", IntegerType()),
]))])

def get_value(i):
  if i % 2 == 0:
    return None
  else:
    return i

data = [
    ({ 'field_a': get_value(i), 'field_b': get_value(i) }, ) for i in range(0, 8)
]
df = spark.createDataFrame(
        SparkContext.getOrCreate().parallelize(data, numSlices=1),
        schema)

path = '/tmp/test-count-for-nested-type-cpu.orc'
df.coalesce(1).write.mode("overwrite").orc(path)
spark.read.orc(path).show()

+------------+
|    struct_s|
+------------+
|{null, null}|
|      {1, 1}|
|{null, null}|
|      {3, 3}|
|{null, null}|
|      {5, 5}|
|{null, null}|
|      {7, 7}|
+------------+

Print count statistics for the GPU file

$ orc-tool meta test-count-for-nested-type-gpu.orc
Processing data file test-count-for-nested-type-gpu.orc [length: 360]
Structure for test-count-for-nested-type-gpu.orc
File Version: 0.12 with ORIGINAL by ORC Java 
Rows: 8
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<struct_s:struct<field_a:int,field_b:int>>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 8 hasNull: true
    Column 1: count: 1 hasNull: true
    Column 2: count: 4 hasNull: true min: 1 max: 7 sum: 16
    Column 3: count: 4 hasNull: true min: 1 max: 7 sum: 16

File Statistics:
  Column 0: count: 8 hasNull: true
  Column 1: count: 1 hasNull: true
  Column 2: count: 4 hasNull: true min: 1 max: 7 sum: 16
  Column 3: count: 4 hasNull: true min: 1 max: 7 sum: 16

Stripes:
  Stripe: offset: 3 data: 24 rows: 8 tail: 92 index: 70
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 11
    Stream: column 2 section ROW_INDEX start: 21 length 26
    Stream: column 3 section ROW_INDEX start: 47 length 26
    Stream: column 2 section PRESENT start: 73 length 5
    Stream: column 2 section DATA start: 78 length 7
    Stream: column 3 section PRESENT start: 85 length 5
    Stream: column 3 section DATA start: 90 length 7
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2

File length: 360 bytes
Padding length: 0 bytes
Padding ratio: 0%

Print count statistics for the CPU file

$ orc-tool meta /tmp/test-count-for-nested-type-cpu.orc 
Processing data file file:/tmp/test-count-for-nested-type-cpu.orc/part-00000-6b490836-0c65-4355-9d0e-fbaff96aec33-c000.snappy.orc [length: 388]
Structure for file:/tmp/test-count-for-nested-type-cpu.orc/part-00000-6b490836-0c65-4355-9d0e-fbaff96aec33-c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.4
Rows: 8
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<struct_s:struct<field_a:int,field_b:int>>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 8 hasNull: false
    Column 1: count: 8 hasNull: false
    Column 2: count: 4 hasNull: true bytesOnDisk: 12 min: 1 max: 7 sum: 16
    Column 3: count: 4 hasNull: true bytesOnDisk: 12 min: 1 max: 7 sum: 16

File Statistics:
  Column 0: count: 8 hasNull: false
  Column 1: count: 8 hasNull: false
  Column 2: count: 4 hasNull: true bytesOnDisk: 12 min: 1 max: 7 sum: 16
  Column 3: count: 4 hasNull: true bytesOnDisk: 12 min: 1 max: 7 sum: 16

Stripes:
  Stripe: offset: 3 data: 24 rows: 8 tail: 71 index: 76
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 11
    Stream: column 2 section ROW_INDEX start: 25 length 27
    Stream: column 3 section ROW_INDEX start: 52 length 27
    Stream: column 2 section PRESENT start: 79 length 5
    Stream: column 2 section DATA start: 84 length 7
    Stream: column 3 section PRESENT start: 91 length 5
    Stream: column 3 section DATA start: 96 length 7
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2

File length: 388 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
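
The two `orc-tool meta` dumps above can be compared mechanically. The following is a minimal sketch, not part of the original report: it assumes the "File Statistics:" section has the line layout shown above, and the helper names `file_statistics` and `diff_statistics` are hypothetical. The sample text is copied from the two dumps.

```python
import re
from typing import Dict, Tuple

# Matches lines such as "  Column 1: count: 8 hasNull: false ..."
STAT_RE = re.compile(r"^\s*Column (\d+): count: (\d+) hasNull: (true|false)")


def file_statistics(meta_text: str) -> Dict[int, Tuple[int, bool]]:
    """Extract {column id: (count, hasNull)} from the 'File Statistics:' section."""
    stats: Dict[int, Tuple[int, bool]] = {}
    in_section = False
    for line in meta_text.splitlines():
        if line.strip() == "File Statistics:":
            in_section = True
            continue
        if in_section:
            match = STAT_RE.match(line)
            if match is None:
                break  # first non-statistics line ends the section
            stats[int(match.group(1))] = (int(match.group(2)), match.group(3) == "true")
    return stats


def diff_statistics(gpu, cpu):
    """Return {column id: (gpu stats, cpu stats)} for columns whose stats differ."""
    return {c: (gpu[c], cpu[c]) for c in sorted(gpu.keys() & cpu.keys()) if gpu[c] != cpu[c]}


GPU_META = """
File Statistics:
  Column 0: count: 8 hasNull: true
  Column 1: count: 1 hasNull: true
  Column 2: count: 4 hasNull: true min: 1 max: 7 sum: 16
  Column 3: count: 4 hasNull: true min: 1 max: 7 sum: 16
"""

CPU_META = """
File Statistics:
  Column 0: count: 8 hasNull: false
  Column 1: count: 8 hasNull: false
  Column 2: count: 4 hasNull: true bytesOnDisk: 12 min: 1 max: 7 sum: 16
  Column 3: count: 4 hasNull: true bytesOnDisk: 12 min: 1 max: 7 sum: 16
"""

mismatches = diff_statistics(file_statistics(GPU_META), file_statistics(CPU_META))
for col, (gpu_stat, cpu_stat) in mismatches.items():
    print(f"Column {col}: GPU={gpu_stat} CPU={cpu_stat}")
```

Only columns 0 and 1 differ: the leaf columns 2 and 3 agree on count and hasNull, so the discrepancy is limited to the root and the struct column.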

Expected behavior
All statistics should be correct, including hasNull; refer to this issue.
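
As an illustration of what "correct" means for this data, here is a small sketch that derives the expected statistics from the eight rows. It assumes that the ORC `count` statistic is the number of non-null values in a column and that `hasNull` is true when at least one value is null, which matches the CPU output in this report. The helper `orc_count_stats` is hypothetical and not part of any library.

```python
from typing import Optional, Sequence, Tuple


def orc_count_stats(values: Sequence[Optional[object]]) -> Tuple[int, bool]:
    """Return (count, hasNull) for one column; count is the number of non-null values."""
    count = sum(v is not None for v in values)
    return count, count < len(values)


# Rows from the issue: struct_s itself is never null; its fields are null on even rows.
rows = [
    {"field_a": i if i % 2 else None, "field_b": i if i % 2 else None}
    for i in range(8)
]

expected = {
    "Column 1 (struct_s)": orc_count_stats(rows),
    "Column 2 (field_a)": orc_count_stats([r["field_a"] for r in rows]),
    "Column 3 (field_b)": orc_count_stats([r["field_b"] for r in rows]),
}

for name, (count, has_null) in expected.items():
    print(f"{name}: count: {count} hasNull: {str(has_null).lower()}")
```

Under that assumption the struct column should report count 8 with hasNull false, and each child should report count 4 with hasNull true, as in the CPU file.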

Environment details
cuDF 23.08 branch
Spark 3.3.0
orc-core-1.7.4.jar

Additional context


res-life added the labels bug (Something isn't working), Needs Triage (Need team to review and classify), and cuIO (cuIO issue) Aug 9, 2023

github-project-automation bot added this to cuDF/Dask/Numba/UCX Aug 9, 2023

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Aug 9, 2023

This was referenced Aug 9, 2023

Statistics tests for ORC files written by GPU NVIDIA/spark-rapids#8763

Closed

[BUG] ORC statistics are wrong when a double column is all NULL. #13793

Closed

GregoryKimball added this to the ORC continuous improvement milestone Aug 18, 2023

GregoryKimball added the labels 0 - Backlog (In queue waiting for assignment) and libcudf (Affects libcudf (C++/CUDA) code.), and removed the label Needs Triage (Need team to review and classify) Aug 18, 2023

vuule mentioned this issue Dec 20, 2023

[FEA] Improve exception message when unknown Parquet page encoding detected #14209

Closed
