[Parquet] When reading struct-type data without an id in iceberg-parquet, it returns null values. #11214

FlechazoW · 2024-09-26T11:40:16Z

Apache Iceberg version

main (development)

Query engine

None

Please describe the bug 🐞

For nested struct types, when group.field.getId returns null (the Parquet field carries no field ID), iField is null and, as a result, the ParquetValueReader for that field is also null. The struct column therefore cannot be read and comes back as null values.
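
To make the failure mode concrete, here is a minimal, self-contained sketch of ID-based field matching. It does not use Iceberg's actual classes: MissingIdSketch, FileField and readStruct are hypothetical names that only model the behaviour described above, under the assumption that struct fields are matched to the expected schema purely by field ID.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; these are NOT Iceberg's real classes.
final class MissingIdSketch {

  // A Parquet-like struct field. The id is null when the writer recorded none.
  static final class FileField {
    final String name;
    final Integer id;
    final Object value;

    FileField(String name, Integer id, Object value) {
      this.name = name;
      this.id = id;
      this.value = value;
    }
  }

  // Reads one struct value by matching file fields to expected fields by id only.
  static Map<String, Object> readStruct(
      Map<Integer, String> expectedNamesById, List<FileField> fileFields) {
    Map<String, Object> row = new LinkedHashMap<>();
    // Every expected field starts out as null (no reader found yet).
    for (String expectedName : expectedNamesById.values()) {
      row.put(expectedName, null);
    }
    for (FileField field : fileFields) {
      if (field.id == null) {
        continue; // no id, so no expected field matches and no value is read
      }
      String expectedName = expectedNamesById.get(field.id);
      if (expectedName != null) {
        row.put(expectedName, field.value);
      }
    }
    return row;
  }
}
```

In this model, a file whose s1 and s2 fields have null IDs yields a row in which both values are null, even though the data is present in the file. The same call with IDs 1 and 2 set on the file fields returns the stored values.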

Willingness to contribute

I can contribute a fix for this bug independently
I would be willing to contribute a fix for this bug with guidance from the Iceberg community
I cannot contribute a fix for this bug at this time

FlechazoW · 2024-09-26T11:42:05Z

I am not sure if this is a bug or an issue caused by improper usage. If it is a bug, please let me know, and I can help fix it, thanks.

nastra · 2024-09-26T13:57:27Z

@FlechazoW do you have a reproducible example where this happens?

FlechazoW · 2024-09-29T03:01:56Z

This is the metadata of the Parquet file:

Schema:
message schema {
  optional boolean col_boolean;
  optional int32 col_tinyint (INTEGER(8,true));
  optional int32 col_smallint (INTEGER(16,true));
  optional int32 col_int;
  optional int64 col_bigint;
  optional float col_float;
  optional double col_double;
  optional fixed_len_byte_array(16) col_decimal (DECIMAL(38,18));
  optional binary col_string (STRING);
  optional binary col_varchar (STRING);
  optional binary col_binary;
  optional int64 col_timestamp (TIMESTAMP(MICROS,true));
  optional int64 col_datetime (TIMESTAMP(MICROS,true));
  optional group col_array (LIST) {
    repeated group list {
      optional group element (MAP) {
        repeated group key_value {
          required int64 key;
          optional int64 value;
        }
      }
    }
  }
  optional group col_array_int (LIST) {
    repeated group list {
      optional int64 element;
    }
  }
  optional group col_array_double (LIST) {
    repeated group list {
      optional double element;
    }
  }
  optional group col_array_string (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
  optional group col_map (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional group value (LIST) {
        repeated group list {
          optional int64 element;
        }
      }
    }
  }
  optional group col_struct {
    optional binary s1 (STRING);
    optional int64 s2;
  }
  optional int64 col_map_bigint;
  optional group col_map_int (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional int32 value;
    }
  }
  optional int32 col_date (DATE);
  optional binary col_json (STRING);
}


Row group 0:  count: 2  1.129 kB records  start: 4  total(compressed): 2.258 kB total(uncompressed):1.823 kB 
--------------------------------------------------------------------------------
                                        type      encodings count     avg size   nulls   min / max
col_boolean                             BOOLEAN   Z   _     2         19.50 B    0       "true" / "true"
col_tinyint                             INT32     Z _ R     2         39.00 B    0       "1" / "2"
col_smallint                            INT32     Z _ R     2         39.00 B    0       "2" / "3"
col_int                                 INT32     Z _ R     2         39.00 B    0       "3" / "4"
col_bigint                              INT64     Z _ R     2         43.00 B    0       "4" / "5"
col_float                               FLOAT     Z _ R     2         39.00 B    0       "5.0" / "6.0"
col_double                              DOUBLE    Z _ R     2         43.00 B    0       "6.0" / "7.0"
col_decimal                             FIXED[16] Z _ R     2         48.50 B    0       "7.123000000000000000" / "8.122999999999999000"
col_string                              BINARY    Z _ R     2         44.50 B    0       "字符串示例" / "字符串示例"
col_varchar                             BINARY    Z _ R     2         43.50 B    0       "varchar示例" / "varchar示例"
col_binary                              BINARY    Z _ R     2         50.00 B    0       "0x5B42403636353138353863" / "0x5B42403765616330393937"
col_timestamp                           INT64     Z _ R     2         39.00 B    0       "2023-04-01T04:00:00.00000..." / "2023-04-01T04:00:00.00000..."
col_datetime                            INT64     Z _ R     2         39.00 B    0       "2023-04-01T04:00:00.00000..." / "2023-04-01T04:00:00.00000..."
col_array.list.element.key_value.key    INT64     Z _ R     4         23.25 B            
col_array.list.element.key_value.value  INT64     Z _ R     4         23.25 B            
col_array_int.list.element              INT64     Z _ R     6         16.17 B            
col_array_double.list.element           DOUBLE    Z _ R     4         23.00 B            
col_array_string.list.element           BINARY    Z _ R     6         16.33 B            
col_map.key_value.key                   BINARY    Z _ R     4         23.00 B            
col_map.key_value.value.list.element    INT64     Z _ R     8         12.38 B            
col_struct.s1                           BINARY    Z _ R     2         41.00 B    0       "s1的值" / "s1的值"
col_struct.s2                           INT64     Z _ R     2         39.00 B    0       "1" / "1"
col_map_bigint                          INT64     Z _ R     2         39.00 B    0       "8" / "8"
col_map_int.key_value.key               BINARY    Z _ R     4         23.00 B            
col_map_int.key_value.value             INT32     Z _ R     4         21.00 B            
col_date                                INT32     Z _ R     2         36.50 B    0       "2017-11-11" / "2017-11-11"
col_json                                BINARY    Z _ R     2         54.50 B    0       "123" / "{"id":11,"name":"Lakehouse"}"

FlechazoW · 2024-09-29T03:03:50Z

@nastra Do you need any additional information?

FlechazoW · 2024-09-29T03:46:48Z

FlechazoW · 2024-09-29T03:57:19Z

FlechazoW · 2024-09-29T03:59:18Z

ashokvengala1990 · 2024-10-01T17:40:48Z

I see a similar issue with struct columns.

First, the code checks whether the file schema (of the Parquet file) has IDs. If it does not, it assigns an ID to each top-level column, starting from ordinal = 1. However, the fields inside a struct column are not assigned IDs. This discrepancy causes problems when building the record reader for the struct column.

Additionally, for a struct column, the method recursively calls visitFields with the struct column type. During this recursion it cannot find IDs for the fields inside the struct, so the record reader object for them is null. Consequently, the struct column returns null.
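
The two steps above can be sketched as follows. This is a simplified model under the assumptions stated in this comment, not Iceberg's implementation: FallbackIdSketch, Node, assignTopLevelIds and fieldsWithoutId are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a file schema tree; NOT Iceberg's or Parquet's real types.
final class FallbackIdSketch {

  static final class Node {
    final String name;
    Integer id; // null means "no field id"
    final List<Node> children = new ArrayList<>();

    Node(String name) {
      this.name = name;
    }

    Node child(Node c) {
      children.add(c);
      return this;
    }
  }

  // Assigns ordinals 1..n to the top-level columns only, as described above.
  // Nested fields are deliberately left untouched.
  static void assignTopLevelIds(Node root) {
    int ordinal = 1;
    for (Node column : root.children) {
      column.id = ordinal++;
    }
  }

  // Collects the dotted paths of all fields that still have no id.
  static List<String> fieldsWithoutId(Node node, String prefix) {
    List<String> missing = new ArrayList<>();
    for (Node c : node.children) {
      String path = prefix.isEmpty() ? c.name : prefix + "." + c.name;
      if (c.id == null) {
        missing.add(path);
      }
      missing.addAll(fieldsWithoutId(c, path));
    }
    return missing;
  }
}
```

For a schema with a primitive column and col_struct { s1, s2 }, calling assignTopLevelIds and then fieldsWithoutId(root, "") reports col_struct.s1 and col_struct.s2 as the fields left without IDs. Those are the fields for which no reader can be resolved.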

I will share a unit test case soon.

Fokko · 2024-12-03T12:20:14Z

Looking at this issue, it seems that a writer produced Parquet files but didn't correctly write out the struct fields. Looking at #11378 I don't think that's a safe way of handling this. Instead, I would see if you could unbrick the table using name-mapping and see if you can rewrite the table so the Field-IDs match up.
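
As a rough illustration of the name-mapping idea, the sketch below assigns IDs to fields by their dotted name path when the file carries none. It is a stdlib-only model, not Iceberg's name-mapping API: NameMappingSketch, Node and applyMapping are hypothetical names, and the flat path-to-ID map is an assumed simplification of a real mapping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of resolving field ids by name; NOT Iceberg's real classes.
final class NameMappingSketch {

  static final class Node {
    final String name;
    Integer id; // null means the file carries no id for this field
    final List<Node> children = new ArrayList<>();

    Node(String name) {
      this.name = name;
    }

    Node child(Node c) {
      children.add(c);
      return this;
    }
  }

  // Walks the tree and fills in missing ids from a "dotted path -> id" mapping.
  // Ids that are already present in the file are kept as they are.
  static void applyMapping(Node node, String prefix, Map<String, Integer> idsByPath) {
    for (Node c : node.children) {
      String path = prefix.isEmpty() ? c.name : prefix + "." + c.name;
      Integer mapped = idsByPath.get(path);
      if (c.id == null && mapped != null) {
        c.id = mapped;
      }
      applyMapping(c, path, idsByPath);
    }
  }
}
```

With a mapping that covers col_struct, col_struct.s1 and col_struct.s2, every nested field ends up with an ID, so an ID-based reader can resolve it. Rewriting the data afterwards would then persist matching field IDs in the new files.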

FlechazoW added the bug label Sep 26, 2024

joyCurry30 mentioned this issue Oct 23, 2024

Fix when reading struct-type data without an id in iceberg-parquet #11378

Open
