Incorrect Metrics Calculation for Iceberg Table Due to Column Name Transformation with Special Characters #10115

Closed
lintingbin opened this issue Apr 10, 2024 · 7 comments
Labels
bug Something isn't working stale

Comments

@lintingbin
Contributor

Apache Iceberg version

1.3.1

Query engine

Spark

Please describe the bug 🐞

CREATE TABLE tmp.iceberg_test3 (
  `log_type.string` STRING,
  `event_time.string` STRING,
  `version.string` STRING,
  `version.bigint` BIGINT)
USING iceberg
PARTITIONED BY (truncate(10, `event_time.string`), `log_type.string`)
TBLPROPERTIES (
  'write.metadata.metrics.column.event_time.string' = 'truncate(16)',
  'write.metadata.metrics.default' = 'none');

Creating a table with the DDL statement above produces incorrect metrics for the column event_time.string. The cause is the column-name transformation applied when data is stored in Parquet files: AvroSchemaUtil.makeCompatibleName(originalName) converts the column name event_time.string to event_time_x2Estring. As a result, when ParquetUtil.java fetches the mode with the statement MetricsMode metricsMode = MetricsUtil.metricsMode(fileSchema, metricsConfig, fieldId), the configured column name no longer matches the name stored in the Parquet file, and an incorrect MetricsMode is returned.
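
The mismatch described above can be reproduced in isolation with a minimal sketch. It does not use Iceberg itself: sanitize is a simplified stand-in for AvroSchemaUtil.makeCompatibleName (it only covers the replacement of special characters), and metricsMode is a hypothetical map lookup standing in for the real metrics configuration. The class and method names are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MetricsNameMismatch {
    // Simplified stand-in for AvroSchemaUtil.makeCompatibleName (assumption, not the
    // Iceberg source): every character other than a letter, digit, or '_' is replaced
    // by "_x" followed by its uppercase hex code.
    static String sanitize(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (Character.isLetterOrDigit(c) || c == '_') {
                sb.append(c);
            } else {
                sb.append("_x").append(Integer.toHexString(c).toUpperCase(Locale.ROOT));
            }
        }
        return sb.toString();
    }

    // Hypothetical lookup: the per-column mode if one is configured, else the default.
    static String metricsMode(Map<String, String> columnModes, String defaultMode, String columnName) {
        return columnModes.getOrDefault(columnName, defaultMode);
    }

    public static void main(String[] args) {
        // Mirrors the TBLPROPERTIES in the DDL above.
        Map<String, String> columnModes = new HashMap<>();
        columnModes.put("event_time.string", "truncate(16)");
        String defaultMode = "none";

        String parquetName = sanitize("event_time.string");
        System.out.println(parquetName);                                         // event_time_x2Estring
        System.out.println(metricsMode(columnModes, defaultMode, "event_time.string")); // truncate(16)
        System.out.println(metricsMode(columnModes, defaultMode, parquetName));  // none
    }
}
```

Looking up the transformed name falls through to the default mode (none), which is the symptom reported here.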

@lintingbin lintingbin added the bug Something isn't working label Apr 10, 2024
@lintingbin
Contributor Author

To resolve this issue, there are two potential solutions:

1. Utilize fieldId instead of fieldName to determine MetricsMode in the ParquetUtil.java file. By doing so, the correct MetricsMode associated with the column can be retrieved irrespective of any transformations.

2. Alternatively, within the MetricsUtil.metricsMode function, preprocess fieldName using AvroSchemaUtil.makeCompatibleName to ensure consistency between the provided field name and the one stored in Parquet files. This approach guarantees that the correct MetricsMode is determined based on the transformed field name.

Implementing either of these solutions will rectify the bug and ensure accurate metrics calculation for the event_time.string column within Iceberg tables.
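
The two approaches can be compared in a rough sketch. This is not the Iceberg implementation or the submitted patch: the maps, method names, and the field ID 2 used for event_time.string are all hypothetical, sanitize is a simplified stand-in for AvroSchemaUtil.makeCompatibleName, and modeBySanitizedName is only one possible reading of the second approach (it applies the transformation to the configured names before comparing).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MetricsModeFixes {
    // Simplified stand-in for AvroSchemaUtil.makeCompatibleName (assumption).
    static String sanitize(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (Character.isLetterOrDigit(c) || c == '_') {
                sb.append(c);
            } else {
                sb.append("_x").append(Integer.toHexString(c).toUpperCase(Locale.ROOT));
            }
        }
        return sb.toString();
    }

    // Approach 1 (sketch): resolve the original column name from the field ID,
    // so the name stored in the data file never takes part in the lookup.
    static String modeByFieldId(Map<Integer, String> idToOriginalName,
                                Map<String, String> columnModes,
                                String defaultMode,
                                int fieldId) {
        String originalName = idToOriginalName.get(fieldId);
        if (originalName == null) {
            return defaultMode;
        }
        return columnModes.getOrDefault(originalName, defaultMode);
    }

    // Approach 2 (sketch): apply the same name transformation to each configured
    // column name before comparing it with the name found in the data file.
    static String modeBySanitizedName(Map<String, String> columnModes,
                                      String defaultMode,
                                      String fileColumnName) {
        for (Map.Entry<String, String> entry : columnModes.entrySet()) {
            if (sanitize(entry.getKey()).equals(fileColumnName)) {
                return entry.getValue();
            }
        }
        return defaultMode;
    }

    public static void main(String[] args) {
        Map<String, String> columnModes = new HashMap<>();
        columnModes.put("event_time.string", "truncate(16)");

        Map<Integer, String> idToOriginalName = new HashMap<>();
        idToOriginalName.put(2, "event_time.string"); // field ID 2 is illustrative only

        System.out.println(modeByFieldId(idToOriginalName, columnModes, "none", 2));
        System.out.println(modeBySanitizedName(columnModes, "none", "event_time_x2Estring"));
    }
}
```

Both calls in main resolve to truncate(16) in this sketch. The first approach leaves name handling untouched, while the second changes how configured names are matched for every caller of the lookup.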

@lintingbin
Contributor Author

@szehon-ho @stevenzwu Can you help me take a look?

@szehon-ho
Collaborator

szehon-ho commented Apr 16, 2024

I see, is it because of the dot character? I haven't looked deeply at the problem, but both solutions make sense. I think the first one may be preferable because there is less chance of side effects (for example, I am not sure whether we process the metrics key with makeCompatibleName for non-Parquet files, so in that case we might break those kinds of files). Is my understanding correct? It will probably be clearer once there is a PR for it.

@Fokko
Contributor

Fokko commented Apr 16, 2024

Related: #10120

@lintingbin
Contributor Author

@szehon-ho I have submitted a PR for solution one.


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Oct 29, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

@github-actions github-actions bot closed this as not planned Nov 13, 2024
3 participants