AWS: provide option to hide old fields in Glue table #7584

Open
LucasRoesler opened this issue May 11, 2023 · 6 comments

@LucasRoesler

Feature Request / Improvement

In #3888 the Glue schema generation was adjusted so that all old fields are included in the schema. The original reasoning was:

so that people know what were the columns that were already used in the past and avoid adding the same name column.

In my organization, many users of these tables via Athena are not the data engineers who own the schema. They have no idea about the old schema, they are not editing the schema, and their default use case is querying the current data. They report that it is confusing for the schema to show a field that does not exist and that produces errors if they attempt to use it.

Neither Athena nor Glue seems to have any support for displaying these old fields as non-active or deprecated, or for hiding them. Therefore, it would be nice to have a configuration option to disable including non-current fields in the schema.

Query engine

Athena

@dertodestod

I also don't quite understand the current behavior in Athena/Glue when a column is dropped. I can see that a new schema without the column is created in the metadata file, and that in Glue the column moves to the end of the table and gets an "iceberg.field.current": "false" setting. However, the column still shows up for consumers in the Athena web console (but not in a DESCRIBE of the table), which has led to some confusion in our business.

I couldn't check if the column appears via JDBC (because of some errors) but I guess the column won't be listed because I see in Athena that a DESCRIBE query is used to retrieve that information. Can someone confirm that?

I personally think that Athena should not show the deleted columns (neither in the web nor via JDBC). Is there perhaps a way to keep track of the dropped column(s) without showing them in Athena? If not, it would be great if one could be created.
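
To illustrate the "iceberg.field.current" flag described above, here is a minimal Python sketch of how a consumer might hide dropped columns on the client side. It assumes the column list has the shape of the `StorageDescriptor.Columns` entries returned by Glue's `get_table` call (dicts with `Name`, `Type`, and `Parameters`). The helper name `current_columns` and the sample data are hypothetical, and this is not something Iceberg, Glue, or Athena provides.

```python
from typing import Any

# Column parameter mentioned in the comment above; a value of "false"
# marks a field that is no longer part of the current schema.
ICEBERG_FIELD_CURRENT = "iceberg.field.current"


def current_columns(columns: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return only the columns that are not flagged as non-current.

    Columns without the parameter are kept, on the assumption that a
    missing flag means the column was never marked as dropped.
    """
    kept = []
    for col in columns:
        params = col.get("Parameters") or {}
        flag = str(params.get(ICEBERG_FIELD_CURRENT, "true")).lower()
        if flag != "false":
            kept.append(col)
    return kept


# Hypothetical sample shaped like Glue's StorageDescriptor.Columns.
# With boto3 this list would come from something like:
#   glue.get_table(DatabaseName=..., Name=...)["Table"]["StorageDescriptor"]["Columns"]
sample_columns = [
    {"Name": "id", "Type": "bigint",
     "Parameters": {"iceberg.field.current": "true"}},
    {"Name": "amount", "Type": "double",
     "Parameters": {"iceberg.field.current": "true"}},
    {"Name": "legacy_code", "Type": "string",
     "Parameters": {"iceberg.field.current": "false"}},
]

visible = [col["Name"] for col in current_columns(sample_columns)]
print(visible)
```

This only changes what a custom tool displays; it does not affect what the Athena web console shows.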

@LucasRoesler changed the title from "AWS: provide option to hid old fields in Glue table" to "AWS: provide option to hide old fields in Glue table" on May 19, 2023
@wojciechjak

Also the same issue when renaming columns.

@pdehaansbp

Same issue. Curious to read what @jackye1995 and @yyanyy think about it.

@tcassou

tcassou commented Dec 30, 2023

Hello! Our organization is facing the same problem. In particular, the Glue API returns columns that cannot be resolved in the source data, causing queries to fail. We've been using dynamically created Presto views, and they break every time a column is dropped.

Technically, schema versioning is meant to solve this challenge:

so that people know what were the columns that were already used in the past and avoid adding the same name column.

The latest schema of a table should be aligned with the data, and previous versions will keep track of historical modifications.
Could we think of publishing new schema versions in Glue instead of this workaround that introduced bugs/defects?
Or at the very least making this newly introduced behavior optional?
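
As a rough sketch of how dynamically generated views could avoid unresolved columns in the meantime, the Python snippet below builds a view statement from only those Glue columns that are not flagged with "iceberg.field.current": "false" (the flag reported earlier in this thread). The function name `build_view_sql`, the view and table names, and the sample columns are all hypothetical, and the `CREATE OR REPLACE VIEW` form is assumed to match the Presto/Athena dialect in use.

```python
def build_view_sql(view_name: str, table_name: str, columns: list) -> str:
    """Build a CREATE OR REPLACE VIEW statement over current columns only.

    `columns` is assumed to look like Glue's StorageDescriptor.Columns:
    dicts with "Name" and an optional "Parameters" mapping.
    """
    names = [
        col["Name"]
        for col in columns
        if (col.get("Parameters") or {}).get("iceberg.field.current") != "false"
    ]
    if not names:
        raise ValueError("no current columns found for " + table_name)
    # Double-quote identifiers and escape embedded quotes.
    select_list = ", ".join('"' + n.replace('"', '""') + '"' for n in names)
    return f"CREATE OR REPLACE VIEW {view_name} AS SELECT {select_list} FROM {table_name}"


# Hypothetical column metadata with one dropped field.
glue_columns = [
    {"Name": "order_id", "Parameters": {"iceberg.field.current": "true"}},
    {"Name": "total", "Parameters": {"iceberg.field.current": "true"}},
    {"Name": "old_status", "Parameters": {"iceberg.field.current": "false"}},
]

view_sql = build_view_sql("analytics.orders_v", "lake.orders", glue_columns)
print(view_sql)
```

This is a client-side mitigation under the stated assumptions, not a substitute for the optional behavior requested in this issue.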


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Aug 30, 2024
@tcassou

tcassou commented Aug 31, 2024

Hi there!
This is still an issue, and the only workaround we found is to build a custom Iceberg jar without the faulty commit, which is of course not really sustainable.
Any chance this could get prioritized, or even just acknowledged to start with?

Thanks a lot!
