[BUG] ORC file count statistic for nested type is wrong #13837

Open
res-life opened this issue Aug 9, 2023 · 0 comments
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.


res-life commented Aug 9, 2023

Describe the bug
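The file statistics in an ORC file written by the GPU (libcudf) report the wrong count and hasNull for a nested (struct) column, while the file written by the CPU (Spark) reports the correct values.

For the same data, the two files show different statistics for the root column (Column 0) and the struct column (Column 1):
GPU: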

File Statistics:
  Column 0: count: 8 hasNull: true
  Column 1: count: 1 hasNull: true

CPU:

File Statistics:
  Column 0: count: 8 hasNull: false
  Column 1: count: 8 hasNull: false

Both files contain the same data:

+------------+
|    struct_s|
+------------+
|{null, null}|
|      {1, 1}|
|{null, null}|
|      {3, 3}|
|{null, null}|
|      {5, 5}|
|{null, null}|
|      {7, 7}|
+------------+
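
The CPU statistics above are consistent with reading count as the number of non-null values in a column and hasNull as whether any value in it is null. A minimal Python sketch under that assumption, with the rows hard-coded from the table above (the helper name `column_stats` is hypothetical, and this is not how either writer computes its statistics):

```python
# Rows of struct_s as shown in the table: the struct itself is never null,
# but both int fields are null on even rows.
rows = [
    {"field_a": i if i % 2 else None, "field_b": i if i % 2 else None}
    for i in range(8)
]


def column_stats(values):
    """Return (count, has_null); count is the number of non-null values.

    Assumption: this mirrors the meaning of count/hasNull in the ORC
    statistics printed by orc-tool.
    """
    count = sum(v is not None for v in values)
    return count, count < len(values)


expected = {
    "Column 1 (struct_s)": column_stats(rows),
    "Column 2 (field_a)": column_stats([r["field_a"] for r in rows]),
    "Column 3 (field_b)": column_stats([r["field_b"] for r in rows]),
}

for name, (count, has_null) in expected.items():
    print(f"{name}: count: {count} hasNull: {str(has_null).lower()}")
```

Under this reading, the struct column should report count: 8 hasNull: false and each child should report count: 4 hasNull: true, which matches the CPU file.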

Steps/Code to reproduce bug

Generate GPU file
TEST_F(OrcWriterTest, NestedColumnSelection)
{
  auto const num_rows = 8;
  std::vector<int> child_col1_data(num_rows);
  std::vector<int> child_col2_data(num_rows);
  for (int i = 0; i < num_rows; ++i) {
    child_col1_data[i] = i;
    child_col2_data[i] = i;
  }

  // Odd rows are valid and even rows are null in both children.
  auto validity = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i % 2; });
  int32_col child_col1{child_col1_data.begin(), child_col1_data.end(), validity};
  int32_col child_col2{child_col2_data.begin(), child_col2_data.end(), validity};
  // The struct column itself is built without a validity mask, so it has no null rows.
  struct_col s_col{child_col1, child_col2};
  cudf::table_view expected({s_col});

  // Name the columns so the schema matches the CPU file.
  cudf::io::table_input_metadata expected_metadata(expected);
  expected_metadata.column_metadata[0].set_name("struct_s");
  expected_metadata.column_metadata[0].child(0).set_name("field_a");
  expected_metadata.column_metadata[0].child(1).set_name("field_b");

  // Write the table to an ORC file so its statistics can be inspected with orc-tool.
  auto filepath = "/tmp/test-count-for-nested-type-gpu.orc";
  cudf::io::orc_writer_options out_opts =
    cudf::io::orc_writer_options::builder(cudf::io::sink_info{filepath}, expected)
      .metadata(std::move(expected_metadata));
  cudf::io::write_orc(out_opts);
}

Read the GPU file
SPARK_HOME/bin/pyspark

spark.read.orc("/tmp/test-count-for-nested-type-gpu.orc").show()
+------------+
|    struct_s|
+------------+
|{null, null}|
|      {1, 1}|
|{null, null}|
|      {3, 3}|
|{null, null}|
|      {5, 5}|
|{null, null}|
|      {7, 7}|
+------------+

Generate CPU file

SPARK_HOME/bin/pyspark

from pyspark.sql.types import *
schema = StructType([StructField("struct_s",
    StructType([
        StructField("field_a", IntegerType()),
        StructField("field_b", IntegerType()),
]))])

def get_value(i):
  if i % 2 == 0:
    return None
  else:
    return i

data = [
    ({ 'field_a': get_value(i), 'field_b': get_value(i) }, ) for i in range(0, 8)
]
df = spark.createDataFrame(
        SparkContext.getOrCreate().parallelize(data, numSlices=1),
        schema)

path = '/tmp/test-count-for-nested-type-cpu.orc'
df.coalesce(1).write.mode("overwrite").orc(path)
spark.read.orc(path).show()
+------------+
|    struct_s|
+------------+
|{null, null}|
|      {1, 1}|
|{null, null}|
|      {3, 3}|
|{null, null}|
|      {5, 5}|
|{null, null}|
|      {7, 7}|
+------------+

Print count statistics for the GPU file
$ orc-tool meta test-count-for-nested-type-gpu.orc
Processing data file test-count-for-nested-type-gpu.orc [length: 360]
Structure for test-count-for-nested-type-gpu.orc
File Version: 0.12 with ORIGINAL by ORC Java 
Rows: 8
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<struct_s:struct<field_a:int,field_b:int>>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 8 hasNull: true
    Column 1: count: 1 hasNull: true
    Column 2: count: 4 hasNull: true min: 1 max: 7 sum: 16
    Column 3: count: 4 hasNull: true min: 1 max: 7 sum: 16

File Statistics:
  Column 0: count: 8 hasNull: true
  Column 1: count: 1 hasNull: true
  Column 2: count: 4 hasNull: true min: 1 max: 7 sum: 16
  Column 3: count: 4 hasNull: true min: 1 max: 7 sum: 16

Stripes:
  Stripe: offset: 3 data: 24 rows: 8 tail: 92 index: 70
    Stream: column 0 section ROW_INDEX start: 3 length 7
    Stream: column 1 section ROW_INDEX start: 10 length 11
    Stream: column 2 section ROW_INDEX start: 21 length 26
    Stream: column 3 section ROW_INDEX start: 47 length 26
    Stream: column 2 section PRESENT start: 73 length 5
    Stream: column 2 section DATA start: 78 length 7
    Stream: column 3 section PRESENT start: 85 length 5
    Stream: column 3 section DATA start: 90 length 7
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2

File length: 360 bytes
Padding length: 0 bytes
Padding ratio: 0%

Print count statistics for the CPU file
$ orc-tool meta /tmp/test-count-for-nested-type-cpu.orc 
Processing data file file:/tmp/test-count-for-nested-type-cpu.orc/part-00000-6b490836-0c65-4355-9d0e-fbaff96aec33-c000.snappy.orc [length: 388]
Structure for file:/tmp/test-count-for-nested-type-cpu.orc/part-00000-6b490836-0c65-4355-9d0e-fbaff96aec33-c000.snappy.orc
File Version: 0.12 with ORC_14 by ORC Java 1.7.4
Rows: 8
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<struct_s:struct<field_a:int,field_b:int>>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 8 hasNull: false
    Column 1: count: 8 hasNull: false
    Column 2: count: 4 hasNull: true bytesOnDisk: 12 min: 1 max: 7 sum: 16
    Column 3: count: 4 hasNull: true bytesOnDisk: 12 min: 1 max: 7 sum: 16

File Statistics:
  Column 0: count: 8 hasNull: false
  Column 1: count: 8 hasNull: false
  Column 2: count: 4 hasNull: true bytesOnDisk: 12 min: 1 max: 7 sum: 16
  Column 3: count: 4 hasNull: true bytesOnDisk: 12 min: 1 max: 7 sum: 16

Stripes:
  Stripe: offset: 3 data: 24 rows: 8 tail: 71 index: 76
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 11
    Stream: column 2 section ROW_INDEX start: 25 length 27
    Stream: column 3 section ROW_INDEX start: 52 length 27
    Stream: column 2 section PRESENT start: 79 length 5
    Stream: column 2 section DATA start: 84 length 7
    Stream: column 3 section PRESENT start: 91 length 5
    Stream: column 3 section DATA start: 96 length 7
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2

File length: 388 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.3.0
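
To compare the two dumps without reading them by eye, the "File Statistics" sections can be parsed and diffed. The following is a rough Python sketch that assumes the `orc-tool meta` text layout shown above. The function names `file_statistics` and `diff_statistics` are hypothetical, and the sample strings are copied from the two outputs in this issue:

```python
import re

# Matches the leading part of a statistics line, e.g.
# "  Column 1: count: 8 hasNull: false ..."
STAT_LINE = re.compile(r"^\s*Column (\d+): count: (\d+) hasNull: (true|false)")


def file_statistics(meta_text):
    """Map column id -> (count, hasNull) from the 'File Statistics:' section."""
    stats = {}
    in_section = False
    for line in meta_text.splitlines():
        if line.strip() == "File Statistics:":
            in_section = True
        elif in_section:
            match = STAT_LINE.match(line)
            if match is None:
                break  # the first non-statistics line ends the section
            col, count, has_null = match.groups()
            stats[int(col)] = (int(count), has_null == "true")
    return stats


def diff_statistics(gpu_text, cpu_text):
    """Return {column: (gpu_stats, cpu_stats)} for columns that differ."""
    gpu, cpu = file_statistics(gpu_text), file_statistics(cpu_text)
    return {
        col: (gpu.get(col), cpu.get(col))
        for col in sorted(gpu.keys() | cpu.keys())
        if gpu.get(col) != cpu.get(col)
    }


GPU_META = """File Statistics:
  Column 0: count: 8 hasNull: true
  Column 1: count: 1 hasNull: true
  Column 2: count: 4 hasNull: true min: 1 max: 7 sum: 16
  Column 3: count: 4 hasNull: true min: 1 max: 7 sum: 16
"""

CPU_META = """File Statistics:
  Column 0: count: 8 hasNull: false
  Column 1: count: 8 hasNull: false
  Column 2: count: 4 hasNull: true bytesOnDisk: 12 min: 1 max: 7 sum: 16
  Column 3: count: 4 hasNull: true bytesOnDisk: 12 min: 1 max: 7 sum: 16
"""

for col, (gpu, cpu) in diff_statistics(GPU_META, CPU_META).items():
    print(f"Column {col}: GPU={gpu} CPU={cpu}")
```

Only Column 0 (hasNull) and Column 1 (count and hasNull) differ. The child columns 2 and 3 agree on count and hasNull in both files.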

Expected behavior
All statistics should be correct, including hasNull; refer to this issue

Environment details
cuDF 23.08 branch
Spark 3.3.0
orc-core-1.7.4.jar

Additional context

@res-life res-life added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue labels Aug 9, 2023
@github-project-automation github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Aug 9, 2023
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 18, 2023