Incorrect Metrics Calculation for Iceberg Table Due to Column Name Transformation with Special Characters #10115
Comments
There are two potential solutions. First, in ParquetUtil.java, use the fieldId rather than the fieldName to determine the MetricsMode, so that the correct MetricsMode for the column is retrieved regardless of any name transformation. Second, in MetricsUtil.metricsMode, preprocess the fieldName with AvroSchemaUtil.makeCompatibleName so that the name being looked up matches the one stored in the Parquet file. Either change should fix the bug and yield correct metrics for the event_time.string column.
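A minimal sketch of the first idea, assuming the metrics configuration is keyed by the original column name and that the table schema can map a field ID back to that name. The class, method, and the mode values below are hypothetical stand-ins for illustration, not Iceberg's actual API or the change made in the PR:

```java
import java.util.Map;

// Hypothetical model of solution one: resolve the metrics mode through the
// field ID, so the sanitized name stored in the Parquet file is never used.
class FieldIdLookupSketch {

  // tableColumnsById: field ID -> original column name (as in the table schema).
  // columnModes: original column name -> configured metrics mode.
  static String modeByFieldId(
      Map<Integer, String> tableColumnsById,
      Map<String, String> columnModes,
      int fieldId,
      String defaultMode) {
    String originalName = tableColumnsById.get(fieldId);
    if (originalName == null) {
      return defaultMode; // unknown field: fall back to the default mode
    }
    return columnModes.getOrDefault(originalName, defaultMode);
  }

  public static void main(String[] args) {
    // Assumed configuration: the column is explicitly set to "full".
    Map<Integer, String> tableColumnsById = Map.of(1, "event_time.string");
    Map<String, String> columnModes = Map.of("event_time.string", "full");

    // The lookup goes ID -> original name -> mode, so the configured mode is found.
    System.out.println(modeByFieldId(tableColumnsById, columnModes, 1, "truncate(16)"));
  }
}
```

Because the lookup never touches the name written to the file, it does not depend on whether a given file format sanitizes column names.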
@szehon-ho @stevenzwu Can you help me take a look?
I see, is it because of the dot character? I haven't looked deeply at the problem, but both solutions make sense. I think the first one may be preferable because it has less chance of side effects. For example, I am not sure whether we process the metrics key with makeCompatibleName for non-Parquet files, so the second approach might break those files. Is my understanding correct? It will probably be clearer once there is a PR for it.
Related: #10120
@szehon-ho I have submitted a PR for solution one.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' |
Apache Iceberg version
1.3.1
Query engine
Spark
Please describe the bug 🐞
When a table is created with the provided DDL statement, the metrics for the column event_time.string are calculated incorrectly. The cause is the transformation applied to column names when they are stored in Parquet files: AvroSchemaUtil.makeCompatibleName(originalName) converts event_time.string to event_time_x2Estring. As a result, when ParquetUtil.java fetches the mode with MetricsMode metricsMode = MetricsUtil.metricsMode(fileSchema, metricsConfig, fieldId), it gets the wrong MetricsMode, because the configured field name no longer matches the name stored in the Parquet file.
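The mismatch can be reproduced in isolation with a small sketch. The sanitize method below is a simplified stand-in for AvroSchemaUtil.makeCompatibleName that only covers the case described here (a disallowed character becomes _x followed by its uppercase hex code). The map-based configuration and the mode values "full" and "truncate(16)" are assumptions for illustration, since the DDL is not shown:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reproduction of the name mismatch; not Iceberg's actual code.
class NameMismatchSketch {

  // Simplified stand-in for AvroSchemaUtil.makeCompatibleName: keeps letters,
  // underscores, and non-leading digits; escapes everything else as _x<HEX>.
  static String sanitize(String name) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < name.length(); i++) {
      char c = name.charAt(i);
      boolean allowed = Character.isLetter(c) || c == '_' || (i > 0 && Character.isDigit(c));
      if (allowed) {
        sb.append(c);
      } else {
        sb.append("_x").append(Integer.toHexString(c).toUpperCase());
      }
    }
    return sb.toString();
  }

  // Looks up a column's mode by name, falling back to the default when absent.
  static String modeFor(Map<String, String> columnModes, String columnName, String defaultMode) {
    return columnModes.getOrDefault(columnName, defaultMode);
  }

  public static void main(String[] args) {
    // Assumed configuration, keyed by the original column name.
    Map<String, String> columnModes = new HashMap<>();
    columnModes.put("event_time.string", "full");

    String nameInFile = sanitize("event_time.string");
    System.out.println(nameInFile); // event_time_x2Estring

    // Lookup by the original name finds the configured mode.
    System.out.println(modeFor(columnModes, "event_time.string", "truncate(16)")); // full
    // Lookup by the name stored in the file misses and silently gets the default.
    System.out.println(modeFor(columnModes, nameInFile, "truncate(16)")); // truncate(16)
  }
}
```

The lookup does not fail loudly; it falls back to the default mode, which is why the symptom shows up as wrong metrics rather than an error.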