[Parquet] When reading struct-type data without an id in iceberg-parquet, it returns null values. #11214

Open
2 of 3 tasks
FlechazoW opened this issue Sep 26, 2024 · 9 comments
Labels
bug Something isn't working

@FlechazoW

Apache Iceberg version

main (development)

Query engine

None

Please describe the bug 🐞

image image

For nested struct types, when a field of the Parquet group has no field ID (group.field.getId returns null), the matching Iceberg field iField is null as well. As a result, the ParquetValueReader for that field is also null, and the struct column's data cannot be read: its values come back as null.
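
The chain described above (no ID, so no matching field, so no reader) can be illustrated with a small self-contained model. This is only a sketch: `FileField`, `StructReaderSketch`, and `buildReaders` are hypothetical names invented for illustration, not Iceberg or parquet-java classes, and the IDs used in the usage note are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Parquet field: a name plus an optional field ID.
record FileField(String name, Integer id) {}

final class StructReaderSketch {
  // Returns one "reader" label per file field, or null when the field
  // cannot be matched to an expected (table schema) field by ID.
  static List<String> buildReaders(List<FileField> fileFields, Map<Integer, String> expectedById) {
    List<String> readers = new ArrayList<>();
    for (FileField field : fileFields) {
      // Without an ID there is no key to look up the expected field with.
      String expected = field.id() == null ? null : expectedById.get(field.id());
      // No expected field means no reader, so the value is later read as null.
      readers.add(expected == null ? null : "reader(" + expected + ")");
    }
    return readers;
  }
}
```

With an expected struct of `{20: "s1", 21: "s2"}`, file fields `s1` and `s2` that carry no IDs both map to a null reader in this model, while a field `s1` carrying ID 20 gets a reader.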

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
FlechazoW added the bug label Sep 26, 2024
@FlechazoW
Author

I am not sure if this is a bug or an issue caused by improper usage. If it is a bug, please let me know, and I can help fix it, thanks.

@nastra
Contributor

nastra commented Sep 26, 2024

@FlechazoW do you have a reproducible example where this happens?

@FlechazoW
Author

This is the metadata of the Parquet file:

Schema:
message schema {
  optional boolean col_boolean;
  optional int32 col_tinyint (INTEGER(8,true));
  optional int32 col_smallint (INTEGER(16,true));
  optional int32 col_int;
  optional int64 col_bigint;
  optional float col_float;
  optional double col_double;
  optional fixed_len_byte_array(16) col_decimal (DECIMAL(38,18));
  optional binary col_string (STRING);
  optional binary col_varchar (STRING);
  optional binary col_binary;
  optional int64 col_timestamp (TIMESTAMP(MICROS,true));
  optional int64 col_datetime (TIMESTAMP(MICROS,true));
  optional group col_array (LIST) {
    repeated group list {
      optional group element (MAP) {
        repeated group key_value {
          required int64 key;
          optional int64 value;
        }
      }
    }
  }
  optional group col_array_int (LIST) {
    repeated group list {
      optional int64 element;
    }
  }
  optional group col_array_double (LIST) {
    repeated group list {
      optional double element;
    }
  }
  optional group col_array_string (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }
  optional group col_map (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional group value (LIST) {
        repeated group list {
          optional int64 element;
        }
      }
    }
  }
  optional group col_struct {
    optional binary s1 (STRING);
    optional int64 s2;
  }
  optional int64 col_map_bigint;
  optional group col_map_int (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional int32 value;
    }
  }
  optional int32 col_date (DATE);
  optional binary col_json (STRING);
}


Row group 0:  count: 2  1.129 kB records  start: 4  total(compressed): 2.258 kB total(uncompressed):1.823 kB 
--------------------------------------------------------------------------------
                                        type      encodings count     avg size   nulls   min / max
col_boolean                             BOOLEAN   Z   _     2         19.50 B    0       "true" / "true"
col_tinyint                             INT32     Z _ R     2         39.00 B    0       "1" / "2"
col_smallint                            INT32     Z _ R     2         39.00 B    0       "2" / "3"
col_int                                 INT32     Z _ R     2         39.00 B    0       "3" / "4"
col_bigint                              INT64     Z _ R     2         43.00 B    0       "4" / "5"
col_float                               FLOAT     Z _ R     2         39.00 B    0       "5.0" / "6.0"
col_double                              DOUBLE    Z _ R     2         43.00 B    0       "6.0" / "7.0"
col_decimal                             FIXED[16] Z _ R     2         48.50 B    0       "7.123000000000000000" / "8.122999999999999000"
col_string                              BINARY    Z _ R     2         44.50 B    0       "字符串示例" / "字符串示例"
col_varchar                             BINARY    Z _ R     2         43.50 B    0       "varchar示例" / "varchar示例"
col_binary                              BINARY    Z _ R     2         50.00 B    0       "0x5B42403636353138353863" / "0x5B42403765616330393937"
col_timestamp                           INT64     Z _ R     2         39.00 B    0       "2023-04-01T04:00:00.00000..." / "2023-04-01T04:00:00.00000..."
col_datetime                            INT64     Z _ R     2         39.00 B    0       "2023-04-01T04:00:00.00000..." / "2023-04-01T04:00:00.00000..."
col_array.list.element.key_value.key    INT64     Z _ R     4         23.25 B            
col_array.list.element.key_value.value  INT64     Z _ R     4         23.25 B            
col_array_int.list.element              INT64     Z _ R     6         16.17 B            
col_array_double.list.element           DOUBLE    Z _ R     4         23.00 B            
col_array_string.list.element           BINARY    Z _ R     6         16.33 B            
col_map.key_value.key                   BINARY    Z _ R     4         23.00 B            
col_map.key_value.value.list.element    INT64     Z _ R     8         12.38 B            
col_struct.s1                           BINARY    Z _ R     2         41.00 B    0       "s1的值" / "s1的值"
col_struct.s2                           INT64     Z _ R     2         39.00 B    0       "1" / "1"
col_map_bigint                          INT64     Z _ R     2         39.00 B    0       "8" / "8"
col_map_int.key_value.key               BINARY    Z _ R     4         23.00 B            
col_map_int.key_value.value             INT32     Z _ R     4         21.00 B            
col_date                                INT32     Z _ R     2         36.50 B    0       "2017-11-11" / "2017-11-11"
col_json                                BINARY    Z _ R     2         54.50 B    0       "123" / "{"id":11,"name":"Lakehouse"}"

@FlechazoW
Author

@nastra Do you need any additional information?

@FlechazoW
Author

image

@FlechazoW
Author

image

@FlechazoW
Author

image

@ashokvengala1990

ashokvengala1990 commented Oct 1, 2024

I see a similar issue with struct columns.

First, the code checks whether the file schema (of the Parquet file) has IDs. If it does not, it assigns an ID to each column, starting from ordinal 1. However, the fields inside a struct column are not assigned IDs, and this discrepancy causes problems when building the record reader for the struct column.

Additionally, for a struct column, the method recursively calls visitFields with the struct column's type. During this process it cannot find IDs for the fields inside the struct, which leads to a null record reader. Consequently, the struct column returns null.

I will share a unit test case soon.
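
The behavior described in this comment can be sketched with a simplified model: IDs are assigned to top-level columns by ordinal, while the children of a struct are left untouched. `Node`, `FallbackIdSketch`, and `assignTopLevelIds` are hypothetical names used for illustration only; this is not the Iceberg or parquet-java code, just a model of what the comment describes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Parquet type: name, optional ID, nested children.
record Node(String name, Integer id, List<Node> children) {
  static Node leaf(String name) {
    return new Node(name, null, List.of());
  }
}

final class FallbackIdSketch {
  // Assigns IDs 1, 2, 3, ... to top-level fields only.
  // Children are copied unchanged, so nested fields keep their null IDs.
  static List<Node> assignTopLevelIds(List<Node> topLevel) {
    List<Node> result = new ArrayList<>();
    int ordinal = 1;
    for (Node field : topLevel) {
      result.add(new Node(field.name(), ordinal, field.children()));
      ordinal += 1;
    }
    return result;
  }
}
```

Applied to a schema with `col_int` and `col_struct { s1, s2 }`, the model gives `col_int` ID 1 and `col_struct` ID 2, but `s1` and `s2` still have no ID, which is the state in which the nested lookup fails.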

@Fokko
Contributor

Fokko commented Dec 3, 2024

Looking at this issue, it seems that a writer produced Parquet files but didn't correctly write out the field IDs of the struct fields. Having looked at #11378, I don't think that's a safe way of handling this. Instead, I would see if you could unbrick the table using name-mapping, and then see if you can rewrite the table so the field IDs match up.
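
To make the name-mapping idea concrete: a name mapping lets a reader resolve field IDs by column name when a file carries no IDs, and Iceberg keeps it in the table property `schema.name-mapping.default`. The sketch below is a simplified model of that resolution for nested fields, not Iceberg's API: `Col`, `NameMappingSketch`, `applyMapping`, the dotted-path map, and the IDs in the usage note are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a column: name, optional ID, nested children.
record Col(String name, Integer id, List<Col> children) {}

final class NameMappingSketch {
  // Recursively attaches IDs looked up by dotted path (e.g. "col_struct.s1").
  // Fields without an entry in the mapping keep a null ID.
  static List<Col> applyMapping(List<Col> fields, String prefix, Map<String, Integer> idsByPath) {
    List<Col> result = new ArrayList<>();
    for (Col field : fields) {
      String path = prefix.isEmpty() ? field.name() : prefix + "." + field.name();
      // Resolve the children first so nested struct fields also receive IDs.
      List<Col> children = applyMapping(field.children(), path, idsByPath);
      result.add(new Col(field.name(), idsByPath.get(path), children));
    }
    return result;
  }
}
```

With a mapping such as `{"col_struct": 19, "col_struct.s1": 20, "col_struct.s2": 21}`, the nested fields `s1` and `s2` receive IDs by name, which is what allows them to be matched against the table schema in this model.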
